Archive for the ‘LOD’ Category

Linked Data 2011 – 2014

Friday, July 25th, 2014

One of the better-known visualizations of the Linked Data Cloud has been updated for 2014. For comparison, I have included the 2011 version as well.

LOD Cloud 2011


“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.”

LOD Cloud 2014


From Adoption of Linked Data Best Practices in Different Topical Domains by Max Schmachtenberg, Heiko Paulheim, and Christian Bizer.

How would you characterize the differences between the two?

Is it partially a question of how to use large graph displays? Even though the originals (in some versions) are interactive, how often is an overview of related linked data sets actually required?

I first saw this in a tweet by Juan Sequeda.

A Methodology for Empirical Analysis of LOD Datasets

Friday, June 6th, 2014

A Methodology for Empirical Analysis of LOD Datasets by Vit Novacek.


CoCoE stands for Complexity, Coherence and Entropy, and presents an extensible methodology for empirical analysis of Linked Open Data (i.e., RDF graphs). CoCoE can offer answers to questions like: Is dataset A better than B for knowledge discovery since it is more complex and informative?, Is dataset X better than Y for simple value lookups due to its flatter structure?, etc. In order to address such questions, we introduce a set of well-founded measures based on complementary notions from distributional semantics, network analysis and information theory. These measures are part of a specific implementation of the CoCoE methodology that is available for download. Last but not least, we illustrate CoCoE by its application to selected biomedical RDF datasets. (emphasis in original)
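The paper's actual measures are more elaborate, but the entropy component can be illustrated with a toy sketch. The function below is my own construction, not CoCoE's formula: it computes the Shannon entropy of the predicate distribution in a set of triples, a crude proxy for how varied and informative a graph's structure is.

```python
import math
from collections import Counter

def predicate_entropy(triples):
    """Shannon entropy (bits) of the predicate distribution in an RDF graph.

    A flat, repetitive graph (few predicates) scores low; a richer,
    more varied graph scores higher.
    """
    counts = Counter(p for (_, p, _) in triples)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Two toy graphs as (subject, predicate, object) tuples.
flat = [("s%d" % i, "rdfs:label", "x") for i in range(8)]
varied = [("s1", "rdfs:label", "x"), ("s1", "dc:creator", "y"),
          ("s2", "foaf:knows", "s1"), ("s2", "dc:date", "2014")]

print(predicate_entropy(flat))    # 0.0 -- a single predicate carries no surprise
print(predicate_entropy(varied))  # 2.0 -- four equally likely predicates
```

A "flatter" graph in the paper's sense of being better for simple lookups would tend toward the low-entropy end of a measure like this.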

A deeply interesting work on the formal characteristics of LOD datasets, but as we learned in Community detection in networks:…, a relationship between a topology (another formal characteristic) and some hidden fact(s) may or may not exist.

Or to put it another way, formal characteristics are useful for rough evaluation of data sets but cannot replace a grounded actor considering their meaning. That would be you.

I first saw this in a tweet by Marin Dimitrov.

Workload Matters: Why RDF Databases Need a New Design

Saturday, May 17th, 2014

Workload Matters: Why RDF Databases Need a New Design by Güneş Aluç, M. Tamer Özsu, and Khuzaima Daudjee.


The Resource Description Framework (RDF) is a standard for conceptually describing data on the Web, and SPARQL is the query language for RDF. As RDF is becoming widely utilized, RDF data management systems are being exposed to more diverse and dynamic workloads. Existing systems are workload-oblivious, and are therefore unable to provide consistently good performance. We propose a vision for a workload-aware and adaptive system. To realize this vision, we re-evaluate relevant existing physical design criteria for RDF and address the resulting set of new challenges.

The authors establish that RDF data management systems are in need of better processing models. However, they mention a “prototype” only in their conclusion and offer no evidence for their proposed alternatives to current RDF processing.

I don’t doubt the need for better RDF processing but I would think the first step would be to determine the goals of RDF processing, separate and apart from the RDF model.

Simply because we conceptualize data as being encoded in “triples,” does not mean that computers must process them as “triples.” They can if it is advantageous but not if there are better processing models.
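To make that concrete, here is a hypothetical sketch of one well-known alternative physical design, the "property table" layout: the same conceptual triples are stored as one row per subject with one column per predicate, so subject lookups avoid scanning a triple list. Names and data are illustrative.

```python
# The same data as a triple table vs. a property-table layout.
triples = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:age", "30"),
    ("ex:bob",   "foaf:name", "Bob"),
]

# Property-table layout: one row per subject, one column per predicate.
# A lookup by subject becomes a single dict access instead of a scan.
table = {}
for s, p, o in triples:
    table.setdefault(s, {})[p] = o

print(table["ex:alice"]["foaf:name"])  # Alice
```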

I first saw this in a tweet by Olaf Hartig.

OCLC Preview 194 Million…

Tuesday, February 25th, 2014

OCLC Preview 194 Million Open Bibliographic Work Descriptions by Richard Wallis.

From the post:

I have just been sharing a platform, at the OCLC EMEA Regional Council Meeting in Cape Town South Africa, with my colleague Ted Fons. A great setting for a great couple of days of the OCLC EMEA membership and others sharing thoughts, practices, collaborative ideas and innovations.

Ted and I presented our continuing insight into The Power of Shared Data, and the evolving data strategy for the bibliographic data behind WorldCat. If you want to see a previous view of these themes you can check out some recordings we made late last year on YouTube, from Ted – The Power of Shared Data – and me – What the Web Wants.

Today, demonstrating on-going progress towards implementing the strategy, I had the pleasure to preview two upcoming significant announcements on the WorldCat data front:

  1. The release of 194 Million Linked Data Bibliographic Work descriptions
  2. The WorldCat Linked Data Explorer interface

A preview release to be sure but one worth following!

Particularly with 194 million bibliographic work descriptions!

See Richard’s post for the details.

JSON-LD Is A W3C Recommendation

Thursday, January 16th, 2014

JSON-LD Is A W3C Recommendation

From the post:

The RDF Working Group has published two Recommendations today:

  • JSON-LD 1.0. JSON is a useful data serialization and messaging format. This specification defines JSON-LD, a JSON-based format to serialize Linked Data. The syntax is designed to easily integrate into deployed systems that already use JSON, and provides a smooth upgrade path from JSON to JSON-LD. It is primarily intended to be a way to use Linked Data in Web-based programming environments, to build interoperable Web services, and to store Linked Data in JSON-based storage engines.
  • JSON-LD 1.0 Processing Algorithms and API. This specification defines a set of algorithms for programmatic transformations of JSON-LD documents. Restructuring data according to the defined transformations often dramatically simplifies its usage. Furthermore, this document proposes an Application Programming Interface (API) for developers implementing the specified algorithms.
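The heart of JSON-LD is the @context, which maps short terms onto full IRIs. The sketch below is a drastically simplified, illustrative version of expansion; the real spec algorithm handles nesting, @id, type coercion, keyword aliases and much more.

```python
def expand(doc):
    """Very simplified JSON-LD-style expansion: rewrite keys to full IRIs
    using the document's @context. Illustration only, not the spec algorithm."""
    ctx = doc.get("@context", {})
    return {ctx.get(k, k): v for k, v in doc.items() if k != "@context"}

doc = {
    "@context": {"name": "http://xmlns.com/foaf/0.1/name",
                 "homepage": "http://xmlns.com/foaf/0.1/homepage"},
    "name": "Manu Sporny",
    "homepage": "http://manu.sporny.org/",
}
print(expand(doc))
```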

It would make a great question on a markup exam to ask whether JSON reminds you more of the “Multicode Basic Concrete Syntax” or a “Variant Concrete Syntax.” For either answer, explain.

In any event, you will be encountering JSON-LD so these recommendations will be helpful.

SKOSsy – Thesauri on the fly!

Tuesday, January 14th, 2014

SKOSsy – Thesauri on the fly!

From the webpage:

SKOSsy extracts data from LOD sources like DBpedia (and basically from any RDF based knowledge base you like) and works well for automatic text mining and whenever a seed thesaurus should be generated for a certain domain, organisation or a project.

If automatically generated thesauri are loaded into an editor like PoolParty Thesaurus Manager (PPT) you can start to enrich the knowledge model with additional concepts, relations and links to other LOD sources. With SKOSsy, thesaurus projects don’t have to be started from scratch anymore. See also how SKOSsy is integrated into PPT.

  • SKOSsy makes heavy use of Linked Data sources, especially DBpedia
  • SKOSsy can generate SKOS thesauri for virtually any domain within a few minutes
  • Such thesauri can be improved, curated and extended to one’s individual needs but they usually serve as “good-enough” knowledge models for any semantic search application you like
  • SKOSsy thesauri serve as a basis for domain specific text extraction and knowledge enrichment
  • SKOSsy based semantic search usually outperforms search algorithms based on pure statistics since the thesauri contain high-quality information about relations, labels and disambiguation
  • SKOSsy works perfectly together with PoolParty product family
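For a sense of what such a generated seed thesaurus looks like, here is a hypothetical sketch in SKOS terms: concepts with preferred labels, broader links, and an exactMatch back to the DBpedia resource they came from. The helper and URIs are my own illustration, not SKOSsy's actual output.

```python
def skos_triples(concept, label, broader=None, dbpedia=None):
    """Emit the SKOS triples for one concept in a seed thesaurus."""
    uri = "ex:" + concept
    triples = [(uri, "rdf:type", "skos:Concept"),
               (uri, "skos:prefLabel", label)]
    if broader:
        triples.append((uri, "skos:broader", "ex:" + broader))
    if dbpedia:  # link back to the LOD source the concept was extracted from
        triples.append((uri, "skos:exactMatch", dbpedia))
    return triples

graph = (skos_triples("Jazz", "Jazz", broader="Music",
                      dbpedia="http://dbpedia.org/resource/Jazz")
         + skos_triples("Music", "Music"))
for t in graph:
    print(t)
```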

DBpedia is probably closer to some users’ vocabularies than most formal ones. 😉

I have the sense that rather than asking experts for their semantics (and how to represent them), we are about to turn to users to ask about their semantics (and choose simple ways to represent them).

If results that are useful to the average user are the goal, it is a move in the right direction.

Creating Knowledge out of Interlinked Data…

Thursday, November 7th, 2013

Creating Knowledge out of Interlinked Data – STATISTICAL OFFICE WORKBENCH by Bert Van Nuffelen and Karel Kremer.

From the slides:

LOD2 is a large-scale integrating project co-funded by the European Commission within the FP7 Information and Communication Technologies Work Programme. This 4-year project comprises leading Linked Open Data technology researchers, companies, and service providers. Coming from across 12 countries the partners are coordinated by the Agile Knowledge Engineering and Semantic Web Research Group at the University of Leipzig, Germany.

LOD2 will integrate and syndicate Linked Data with existing large-scale applications. The project shows the benefits in the scenarios of Media and Publishing, Corporate Data intranets and eGovernment.

LOD2 Stack Release 3.0 overview

Connecting the dots: Workbench for Statistical Office

In case you are interested:

LOD2 homepage

Ubuntu 12.04 Repository

VM User / Password: lod2demo / lod2demo

LOD2 blog

The LOD2 project expires in August of 2014.

Linked Data is going to be around, one way or the other, for quite some time.

My suggestion: Grab the last VM from LOD2 and a copy of its OS, and store them in a location that migrates as data systems change.

Shortfall of Linked Data

Tuesday, June 18th, 2013

Preparing a presentation I stumbled upon a graphic illustration of why we need better semantic techniques for the average author:

Linked Data in 2011:


Versus the WWW:


This must be why you don’t see any updated linked data clouds. The comparison is too shocking.

Particularly when you remember the WWW itself is only part of a much larger data cloud. (Ask the NSA about the percentages.)

Data is being produced every day, pushing us further and further behind with regard to its semantics. (And making the linked data cloud an even smaller percentage of all data.)

Authors have semantics in mind when they write.

The question is: how do we capture those semantics in machine-readable form, nearly as seamlessly as authors write?


The Project With No Name

Thursday, April 4th, 2013

Fujitsu Labs And DERI To Offer Free, Cloud-Based Platform To Store And Query Linked Open Data by Jennifer Zaino.

From the post:

The Semantic Web Blog reported last year about a relationship formed between the Digital Enterprise Research Institute (DERI) and Fujitsu Laboratories Ltd. in Japan, focused on a project to build a large-scale RDF store in the cloud capable of processing hundreds of billions of triples. At the time, Dr. Michael Hausenblas, who was then a DERI research fellow, discussed Fujitsu Lab’s research efforts related to the cloud, its huge cloud infrastructure, and its identification of Big Data as an important trend, noting that “Linked Data is involved with answering at least two of the three Big Data questions” – that is, how to deal with volume and variety (velocity is the third).

This week, the DERI and Fujitsu Lab partners have announced a new data storage technology that stores and queries interconnected Linked Open Data, to be available this year, free of charge, on a cloud-based platform. According to a press release about the announcement, the data store technology collects and stores Linked Open Data that is published across the globe, and facilitates search processing through the development of a caching structure that is specifically adapted to LOD.

Typically, search performance deteriorates when searching for common elements that are linked together within data because of requirements around cross-referencing of massive data sets, the release says. The algorithm it has developed — which takes advantage of links in LOD link structures typically being concentrated in only a portion of server nodes, and of past usage frequency — caches only the data that is heavily accessed in cross-referencing to reduce disk accesses, and so accelerate searching.
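Fujitsu's algorithm is not published in detail in the release, but the general pattern it describes, caching only keys whose past access frequency is high, can be sketched like this (a toy stand-in, not their implementation):

```python
from collections import Counter

class FrequencyCache:
    """Toy frequency-based cache: only keys whose access count crosses a
    threshold are kept in memory; everything else falls through to the
    (slow) backing store."""
    def __init__(self, backing_store, threshold=3):
        self.store = backing_store      # simulates disk / remote server nodes
        self.cache = {}
        self.hits = Counter()
        self.threshold = threshold
        self.disk_reads = 0

    def get(self, key):
        self.hits[key] += 1
        if key in self.cache:
            return self.cache[key]
        self.disk_reads += 1
        value = self.store[key]
        if self.hits[key] >= self.threshold:  # hot key: promote to cache
            self.cache[key] = value
        return value

store = {"ex:Berlin": "triples about Berlin", "ex:rare": "rarely used"}
c = FrequencyCache(store)
for _ in range(5):
    c.get("ex:Berlin")   # becomes hot, cached after the 3rd access
c.get("ex:rare")
print(c.disk_reads)      # 4: three misses for Berlin, one for rare
```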

Not sure what it means for the project between DERI and Fujitsu to have no name. Or at least no name in the press releases.

Until that changes, may I suggest: DERI and Fujitsu Project With No Name (DFPWNN)? 😉

With or without a name I was glad for DERI because, well, I like research and they do it quite well.

DFPWNN’s better query technology for LOD will demonstrate, in my opinion, the same semantic diversity found at Swoogle.

Linking up semantically diverse content means just that, a lot of semantically diverse content, linked up.

The bill for leaving semantic diversity as a problem to address “later” is about to come due.

Beacons of Availability

Sunday, March 17th, 2013

From Records to a Web of Library Data – Pt3 Beacons of Availability by Richard Wallis.

Beacons of Availability

As I indicated in the first of this series, there are descriptions of a broader collection of entities, than just books, articles and other creative works, locked up in the Marc and other records that populate our current library systems. By mining those records it is possible to identify those entities, such as people, places, organisations, formats and locations, and model & describe them independently of their source records.

As I discussed in the post that followed, the library domain has often led in the creation and sharing of authoritative datasets for the description of many of these entity types. Bringing these two together, using URIs published by the Hubs of Authority, to identify individual relationships within bibliographic metadata published as RDF by individual library collections (for example the British National Bibliography, and WorldCat) is creating Library Linked Data openly available on the Web.

Why do we catalogue? is a question, I often ask, with an obvious answer – so that people can find our stuff. How does this entification, sharing of authorities, and creation of a web of library linked data help us in that goal. In simple terms, the more libraries can understand what resources each other hold, describe, and reference, the more able they are to guide people to those resources. Sounds like a great benefit and mission statement for libraries of the world but unfortunately not one that will nudge the needle on making library resources more discoverable for the vast majority of those that can benefit from them.

I have lost count of the number of presentations and reports I have seen telling us that upwards of 80% of visitors to library search interfaces start in Google. A similar weight of opinion can be found that complains how bad Google, and the other search engines, are at representing library resources. You will get some balancing opinion, supporting how good Google Book Search and Google Scholar are at directing students and others to our resources. Yet I am willing to bet that again we have another 80-20 equation or worse about how few, of the users that libraries want to reach, even know those specialist Google services exist. A bit of a sorry state of affairs when the major source of searching for our target audience, is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!

Library linked data helps solve both the problem of better description and findability of library resources in the major search engines. Plus it can help with the problem of identifying where a user can gain access to that resource to loan, download, view via a suitable license, or purchase, etc.

I am an ardent sympathizer with helping people find “our stuff.”

I don’t disagree with the description of Google as: “…the major source of searching for our target audience, is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!”

But in all fairness to Google, I would remind you of Drabenstott’s research that found for the Library of Congress subject headings:

Overall percentages of correct meanings for subject headings in the original order of subdivisions were as follows:

children 32%
adults 40%
reference 53%
technical services librarians 56%

The Library of Congress subject classification has been around for more than a century and just over half of the librarians can use it correctly.

Let’s not wait more than a century to test the claim:*

Library linked data helps solve both the problem of better description and findability of library resources in the major search engines.

* By “test” I don’t mean the sort of study, “…we recruited twelve LIS students but one had to leave before the study was complete….”

I am using “test” in the sense of a well designed and organized social science project with professional assistance from social scientists, UI test designers and the like.

I think OCLC is quite sincere in its promotion of linked data, but effectiveness is an empirical question, not one of sincerity.

From Records to a Web of Library Data – Pt2 Hubs of Authority

Saturday, March 16th, 2013

From Records to a Web of Library Data – Pt2 Hubs of Authority by Richard Wallis.

From the post:

Hubs of Authority

Libraries, probably because of their natural inclination towards cooperation, were ahead of the game in data sharing for many years. The moment computing technology became practical, in the late sixties, cooperative cataloguing initiatives started all over the world either in national libraries or cooperative organisations. Two from personal experience come to mind, BLCMP started in Birmingham, UK in 1969 eventually evolved in to the leading Semantic Web organisation Talis, and in 1967 Dublin, Ohio saw the creation of OCLC. Both in their own way having had significant impact on the worlds of libraries, metadata, and the web (and me!).

One of the obvious impacts of inter-library cooperation over the years has been the authorities, those sources of authoritative names for key elements of bibliographic records. A large number of national libraries have such lists of agreed formats for author and organisational names. The Library of Congress has in addition to its name authorities, subjects, classifications, languages, countries etc. Another obvious success in this area is VIAF, the Virtual International Authority File, which currently aggregates over thirty authority files from all over the world – well used and recognised in library land, and increasingly across the web in general as a source of identifiers for people & organisations.

These, Linked Data enabled, sources of information are developing importance in their own right, as a natural place to link to, when asserting the thing, person, or concept you are identifying in your data. As Sir Tim Berners-Lee’s fourth principle of Linked Data tells us to “Include links to other URIs. so that they can discover more things”. VIAF in particular is becoming such a trusted, authoritative, source of URIs that there is now a VIAFbot responsible for interconnecting Wikipedia and VIAF to surface hundreds of thousands of relevant links to each other. A great hat-tip to Max Klein, OCLC Wikipedian in Residence, for his work in this area.

I don’t deny that VIAF is a very useful tool, but if you search for the personal name “Marilyn Monroe,” it returns:

1. Miller, Arthur, 1915-2005
National Library of Australia National Library of the Czech Republic National Diet Library (Japan) Deutsche Nationalbibliothek RERO (Switzerland) SUDOC (France) Library and Archives Canada National Library of Israel (Latin) National Library of Sweden NUKAT Center (Poland) Bibliothèque nationale de France Biblioteca Nacional de España Library of Congress/NACO

Miller, Arthur (Arthur Asher), 1915-2005
National Library of the Netherlands-test

Miller, Arthur, 1915-
Vatican Library Biblioteca Nacional de Portugal

ميلر، ارثر، 1915-2005 م.
Bibliotheca Alexandrina (Egypt)

Miller, Arthur
Wikipedia (en)-test

מילר, ארתור, 1915-2005
National Library of Israel (Hebrew)

2. Monroe, Marilyn, 1926-1962
National Library of Israel (Latin) National Library of the Czech Republic National Diet Library (Japan) Deutsche Nationalbibliothek SUDOC (France) Library and Archives Canada National Library of Australia National Library of Sweden NUKAT Center (Poland) Bibliothèque nationale de France Biblioteca Nacional de España Library of Congress/NACO

Monroe, Marilyn
National Library of the Netherlands-test Wikipedia (en)-test RERO (Switzerland)

Monroe, Marilyn American actress, model, and singer, 1926-1962
Getty Union List of Artist Names

Monroe, Marilyn, pseud.
Biblioteca Nacional de Portugal

3. DiMaggio, Joe, 1914-1999
Library of Congress/NACO Bibliothèque nationale de France

Di Maggio, Joe 1914-1999
Deutsche Nationalbibliothek

Di Maggio, Joseph Paul, 1914-1999
National Diet Library (Japan)

DiMaggio, Joe, 1914-
National Library of Australia

Dimaggio, Joseph Paul, 1914-1999
SUDOC (France)

DiMaggio, Joe (Joseph Paul), 1914-1999
National Library of the Netherlands-test

Dimaggio, Joe
Wikipedia (en)-test

4. Monroe, Marilyn
Deutsche Nationalbibliothek

5. Hurst-Monroe, Marlene
Library of Congress/NACO

6. Wolf, Marilyn Monroe
Deutsche Nationalbibliothek

Maybe Sir Tim is right, users “…can discover more things.”

Some of them are related, some of them are not.

BBC …To Explore Linked Data Technology [Instead of hand-curated content management]

Friday, February 1st, 2013

BBC News Lab to Explore Linked Data Technology by Angela Guess.

From the post:

Matt Shearer of the BBC recently reported that the BBC’s News Lab team will begin exploring linked data technologies. He writes, “Hi I’m Matt Shearer, delivery manager for Future Media News. I manage the delivery of the News Product and I also lead on BBC News Labs. BBC News Labs is an innovation project which was started during 2012 to help us harness the BBC’s wider expertise to explore future opportunities. Generally speaking BBC News believes in allowing creative technologists to innovate and influence the direction of the News product. For example the delivery of BBC News’ responsive design mobile service started in 2011 when we made space for a multidiscipline project to explore responsive design opportunities for BBC News. With this in mind the BBC News team setup News Labs to explore linked data technologies.”

Shearer goes on, “The BBC has been making use of linked data technologies in its internal content production systems since 2011. As explained by Jem Rayfield this enabled the publishing of news aggregation pages ‘per athlete’, ‘per sport’ and ‘per event’ for the 2012 Olympics – something that would not have been possible with hand-curated content management. Linked data is being rolled out on BBC News from early 2013 to enrich the connections between BBC News stories, content assets, the wider BBC website and the World Wide Web. We framed each challenge/opportunity for the News Lab in terms of a clear ‘problem space’ (as opposed to a set of requirements that may limit options) supported by research findings, audience needs, market needs, technology opportunities and framed with the BBC News Strategy.”

Read more here.

(emphasis added)

Apologies for the long quote but I wanted to capture, in context, the BBC’s comparison of linked data to hand-curated content management.

I never dreamed the BBC was still using “hand-curated content management” as a measure of modern IT systems.

Quite remarkable.

On the other hand, perhaps they were being kind to the linked data experiment by using a measure that enables it to excel.

If you know which one, please comment.


Is Linked Data the future of data integration in the enterprise?

Tuesday, January 15th, 2013

Is Linked Data the future of data integration in the enterprise? by John Walker.

From the post:

Following the basic Linked Data principles we have assigned HTTP URIs as names for things (resources) providing an unambiguous identifier. Next up we have converted data from a variety of sources (XML, CSV, RDBMS) into RDF.

One of the key features of RDF is the ability to easily merge data about a single resource from multiple sources into a single “supergraph” providing a more complete description of the resource. By loading the RDF into a graph database, it is possible to make an endpoint available which can be queried using the SPARQL query language. We are currently using Dydra as their cloud-based database-as-a-service model provides an easy entry route to using RDF without requiring a steep learning curve (basically load your RDF and you’re away), but there are plenty of other options like Apache Jena and OpenRDF Sesame. This has made it very easy for us to answer complex questions requiring data from multiple sources, moreover we can stand up APIs providing access to this data in minutes.

By using a Linked Data Platform such as Graphity we can make our identifiers (HTTP URIs) dereferencable. In layman’s terms when someone plugs the URI into a browser, we provide a description of the resource in HTML. Using content negotiation we are able to provide this data in one of the standard machine-readable XML, JSON or Turtle formats. Graphity uses Java and XSLT 2.0 which our developers already have loads of experience with and provides powerful mechanisms with which we will be able to develop some great web apps.
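The CSV-to-RDF conversion step the post describes can be sketched in a few lines. The column-to-predicate mapping, base URI, and data below are my assumptions, not the author's actual scheme:

```python
import csv
import io

raw = "id,name,country\n1,Acme,NL\n2,Bolt,DE\n"
base = "http://example.org/company/"

# Each row becomes a resource; each remaining column becomes a predicate.
triples = []
for row in csv.DictReader(io.StringIO(raw)):
    subject = base + row.pop("id")
    for column, value in row.items():
        triples.append((subject, base + "prop/" + column, value))

for t in triples:
    print(t)
```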

What do you make of:

One of the key features of RDF is the ability to easily merge data about a single resource from multiple sources into a single “supergraph” providing a more complete description of the resource.


I suppose if by some accident we all use the same URI as an identifier, that would be the case. But that hardly requires URIs, Linked Data or RDF.

Scientific conferences on digital retrieval in the 1950s worried about diversity of nomenclature being a barrier to discovery of resources. If we haven’t addressed the semantic diversity issue in sixty (60) years of talking about it, it isn’t clear how creating another set of diverse names is going to help.

There may be other reasons for using URIs but seamless merging doesn’t appear to be one of them.
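A quick demonstration of the point: merging RDF graphs is just set union of triples, and the “supergraph” only forms when both sources happen to use the identical URI. URIs and data here are illustrative.

```python
# Two sources describing the same person, using the same URI.
source_a = {("ex:alice", "foaf:name", "Alice"),
            ("ex:alice", "foaf:mbox", "mailto:alice@example.org")}
source_b = {("ex:alice", "foaf:age", "30")}

supergraph = source_a | source_b
print(len(supergraph))  # 3 facts about one resource

# If source B had minted its own identifier, the union silently keeps
# two disconnected descriptions: nothing actually merges.
source_b2 = {("other:a-smith", "foaf:age", "30")}
merged = source_a | source_b2
subjects = {s for (s, _, _) in merged}
print(len(subjects))    # 2 subjects; the "supergraph" never formed
```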

Moreover, how do I know what you have identified with a URI?

You can return one or more properties for a URI, but which ones matter for the identity of the subject it identifies?

I first saw this at Linked Data: The Future of Data Integration by Angela Guess.

@AMS Webinars on Linked Data

Wednesday, January 9th, 2013

@AMS Webinars on Linked Data

From the website:

The traditional approach of sharing data within silos seems to have reached its end. From governments and international organizations to local cities and institutions, there is a widespread effort of opening up and interlinking their data. Linked Data, a term coined by Tim Berners-Lee in his design note regarding the Semantic Web architecture, refers to a set of best practices for publishing, sharing, and interlinking structured data on the Web.

Linked Open Data (LOD), a concept that has leapt onto the scene in the last years, is Linked Data distributed under an open license that allows its reuse for free. Linked Open Data becomes a key element to achieve interoperability and accessibility of data, harmonisation of metadata and multilinguality.

There are four remaining seminars in this series:

Webinar in French | 22nd January 2013 – 11:00am Rome time
Clarifiez le sens de vos données publiques grâce au Web de données
Christophe Guéret, Royal Netherlands Academy of Arts and Sciences, Data Archiving and Networked Services (DANS)

Webinar in Chinese | 29th January 2013 – 02:00am Rome time
基于网络的研讨会 “题目:理解和利用关联数据 --图情档博(LAM)作为关联数据的提供者和使用者”
Marcia Zeng, School of Library and Information Science, Kent State University

Webinar in Russian | 5th February 2013 – 11:00am Rome time
Введение в концепцию связанных открытых данных
Irina Radchenko, Centre of Semantic Technologies, Higher School of Economics

Webinar in Arabic | 12th February 2013 – 11:00am Rome time
Ibrahim Elbadawi, UAE Federal eGovernment

See Mark your agenda! New Free Webinars @ AIMS on Linked Open Data for registration and more details.

Callimachus Version 1.0

Friday, January 4th, 2013

Callimachus Version 1.0 by Eric Franzon.

From the post:

The Callimachus Project has announced that the latest release of the Open Source version of Callimachus is available for immediate download.

Callimachus began as a linked data management system in 2009 and is an Open Source system for navigating, managing, visualizing and building applications on Linked Data.

Version 1.0 introduces several new features, including:

  • Built-in support for most types of Persistent URLs (PURLs), including Active PURLs.
  • Scripted HTTP content type conversions via XProc pipelines.
  • Ability to access remote Linked Data via SPARQL SERVICE keyword and XProc pipelines.
  • Named Queries can now have a custom view page. The view page can be a template for the resources in the query result.
  • Authorization can now be performed based on IP addresses or the DNS domain of the client.

Library Hi Tech Journal seeks papers on LOV & LOD

Saturday, December 8th, 2012

Library Hi Tech Journal seeks papers on LOV & LOD

From the post:

Library Hi Tech (LHT) seeks papers about new works, initiatives, trends and research in the field of linking and opening vocabularies. This call for papers is inspired by the 2012 LOV Symposium: Linking and Opening Vocabularies symposium and SKOS-2-HIVE —Helping Interdisciplinary Vocabulary Engineering workshop—held at the Universidad Carlos III de Madrid (UC3M).

This Library Hi Tech special issue might include papers delivered at the UC3M-LOV events and other original works related with this subject, not yet published.

Topics: LOV & LOD

Papers specifically addressing research and development activities, implementation challenges and solutions, and educative aspects of Linked Open Vocabularies (LOV) and/or in a broader sense Linked Open Data, are of particular interest.

Those interested in submitting an article should send papers before 30 January 2013. Full articles should be between 4,000 and 8,000 words. References should use the Harvard style. Please submit completed articles via the Scholar One online submission system. All final submissions will be peer reviewed.

On the style for references, you may find the Author Guidelines at LHT useful.

More generally, see Harvard System, posted by the University Library of Anglia Ruskin University.

Linked Data Platform 1.0

Friday, October 26th, 2012

Linked Data Platform 1.0

From the working draft:

A set of best practices and simple approach for a read-write Linked Data architecture, based on HTTP access to web resources that describe their state using RDF.

Just in case you are keeping up with the Linked Data effort.
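The read-write pattern the draft describes, HTTP verbs operating on RDF resource descriptions, can be caricatured with an in-memory stand-in. This is a sketch of the idea only, not a conforming LDP implementation.

```python
class LinkedDataStore:
    """Toy read-write store: each URI maps to the set of RDF triples
    describing that resource's state."""
    def __init__(self):
        self.resources = {}          # URI -> set of (s, p, o) triples

    def get(self, uri):              # HTTP GET: read the description
        return self.resources.get(uri)

    def put(self, uri, triples):     # HTTP PUT: replace the description
        self.resources[uri] = set(triples)

    def delete(self, uri):           # HTTP DELETE: remove the resource
        self.resources.pop(uri, None)

store = LinkedDataStore()
store.put("ex:doc", {("ex:doc", "dc:title", "Linked Data Platform 1.0")})
print(store.get("ex:doc"))
store.delete("ex:doc")
print(store.get("ex:doc"))  # None
```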

I first saw this at

An Introduction to Linked Open Data in Libraries, Archives & Museums

Thursday, August 9th, 2012

An Introduction to Linked Open Data in Libraries, Archives & Museums by Jon Voss.

From the description:

According to a definition, “The term Linked Data refers to a set of best practices for publishing and connecting structured data on the web.” This has enormous implications for discoverability and interoperability for libraries, archives, and museums, not to mention a dramatic shift in the World Wide Web as we know it. In this introductory presentation, we’ll explore the fundamental elements of Linked Open Data and discover how rapidly growing access to metadata within the world’s libraries, archives and museums is opening exciting new possibilities for understanding our past, and may help in predicting our future.

Be forewarned that Jon thinks “mashing up” music tracks has a good result.

And you will encounter advocates for Linked Data in libraries.

You should be prepared to encounter both while topic mapping.

What is Linked Data

Tuesday, July 10th, 2012

What is Linked Data by John Goodwin.

From the post:

In the early 1990s there began to emerge a new way of using the internet to link documents together. It was called the World Wide Web. What the Web did that was fundamentally new was that it enabled people to publish documents on the internet and link them such that you could navigate from one document to another.

Part of Sir Tim Berners-Lee’s original vision of the Web was that it should also be used to publish, share and link data. This aspect of Sir Tim’s original vision has gained a lot of momentum over the last few years and has seen the emergence of the Linked Data Web.

The Linked Data Web is not just about connecting datasets, but about linking information at the level of a single statement or fact. The idea behind the Linked Data Web is to use URIs (these are like the URLs you type into your browser when going to a particular website) to identify resources such as people, places and organisations, and to then use web technology to provide some meaningful and useful information when these URIs are looked up. This ‘useful information’ can potentially be returned in a number of different encodings or formats, but the standard way for the linked data web is to use something called RDF (Resource Description Framework).
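The "single statement or fact" granularity John describes can be illustrated with triples as plain tuples. This is a toy stand-in, not a real RDF store (in practice you would use a library such as rdflib), and the URIs are made up:

```python
# Each fact is a (subject, predicate, object) triple; subjects and
# predicates are URIs, objects are URIs or literal values.
triples = [
    ("http://example.org/id/london", "rdf:type", "Place"),
    ("http://example.org/id/london", "rdfs:label", "London"),
    ("http://example.org/id/ada",    "foaf:name", "Ada Lovelace"),
    ("http://example.org/id/ada",    "foaf:based_near", "http://example.org/id/london"),
]

def describe(uri, data):
    """Return every statement about a resource -- roughly what a linked
    data server returns when the URI is dereferenced."""
    return [(p, o) for s, p, o in data if s == uri]

print(describe("http://example.org/id/ada", triples))
```

Because the object of `foaf:based_near` is itself a URI, looking it up in turn yields the statements about London: that is the "linked" in Linked Data.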

An introductory overview of the rise and use of linked data.

John is involved in efforts to provide open access to governmental data, and one form of that delivery will be linked data.

You will be encountering linked data, both as a current and legacy format so it is worth your time to learn it now.

I first saw this at

Cascading map-side joins over HBase for scalable join processing

Sunday, July 1st, 2012

Cascading map-side joins over HBase for scalable join processing by Martin Przyjaciel-Zablocki, Alexander Schätzle, Thomas Hornung, Christopher Dorner, and Georg Lausen.


One of the major challenges in large-scale data processing with MapReduce is the smart computation of joins. Since Semantic Web datasets published in RDF have increased rapidly over the last few years, scalable join techniques become an important issue for SPARQL query processing as well. In this paper, we introduce the Map-Side Index Nested Loop Join (MAPSIN join) which combines scalable indexing capabilities of NoSQL storage systems like HBase, that suffer from an insufficient distributed processing layer, with MapReduce, which in turn does not provide appropriate storage structures for efficient large-scale join processing. While retaining the flexibility of commonly used reduce-side joins, we leverage the effectiveness of map-side joins without any changes to the underlying framework. We demonstrate the significant benefits of MAPSIN joins for the processing of SPARQL basic graph patterns on large RDF datasets by an evaluation with the LUBM and SP2Bench benchmarks. For most queries, MAPSIN join based query execution outperforms reduce-side join based execution by an order of magnitude.

Some topic map applications include Linked Data/RDF processing capabilities.

The salient comment here being: “For most queries, MAPSIN join based query execution outperforms reduce-side join based execution by an order of magnitude.”
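The intuition behind a map-side join is simple: if one side of the join sits in an indexed store, each mapper can look up its matches directly, and the expensive shuffle/reduce phase disappears. A toy in-memory sketch (a Python dict stands in for the HBase index; names and data are invented):

```python
# Triples matching a first pattern, e.g. ?x knows ?y
people = [("ada", "knows", "alan"), ("alan", "knows", "kurt")]
# Indexed side, e.g. ?x livesIn ?city -- keyed by the shared join variable
cities = {"ada": "london", "alan": "manchester", "kurt": "vienna"}

def map_side_join(triples, index):
    """Join each triple against the index inside the 'map' call itself,
    with no shuffle or reduce step."""
    out = []
    for s, p, o in triples:
        if s in index:                 # direct index lookup per record
            out.append((s, o, index[s]))
    return out

print(map_side_join(people, cities))
# [('ada', 'alan', 'london'), ('alan', 'kurt', 'manchester')]
```

MAPSIN's contribution is making this pattern work at scale for SPARQL basic graph patterns, with HBase supplying the index; the sketch above only shows the shape of the idea.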

Joint International Workshop on Entity-oriented and Semantic Search

Thursday, May 31st, 2012

1st Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012

Important Dates:

  • Submissions Due: July 2, 2012
  • Notification of Acceptance: July 23, 2012
  • Camera Ready: August 1, 2012
  • Workshop date: August 16th, 2012

Located at the 35th ACM SIGIR Conference, Portland, Oregon, USA, August 12–16, 2012.

From the homepage of the workshop:

About the Workshop:

The workshop encompasses various tasks and approaches that go beyond the traditional bag-of-words paradigm and incorporate an explicit representation of the semantics behind information needs and relevant content. This kind of semantic search, based on concepts, entities and relations between them, has attracted attention both from industry and from the research community. The workshop aims to bring people from different communities (IR, SW, DB, NLP, HCI, etc.) and backgrounds (both academics and industry practitioners) together, to identify and discuss emerging trends, tasks and challenges. This joint workshop is a sequel of the Entity-oriented Search and Semantic Search Workshop series held at different conferences in previous years.


The workshop aims to gather all works that discuss entities along three dimensions: tasks, data and interaction. Tasks include entity search (search for entities or documents representing entities), relation search (search entities related to an entity), as well as more complex tasks (involving multiple entities, spatio-temporal relations inclusive, involving multiple queries). In the data dimension, we consider (web/enterprise) documents (possibly annotated with entities/relations), Linked Open Data (LOD), as well as user generated content. The interaction dimension gives room for research into user interaction with entities, also considering how to display results, as well as whether to aggregate over multiple entities to construct entity profiles. The workshop especially encourages submissions on the interface of IR and other disciplines, such as the Semantic Web, Databases, Computational Linguistics, Data Mining, Machine Learning, or Human Computer Interaction. Examples of topic of interest include (but are not limited to):

  • Data acquisition and processing (crawling, storage, and indexing)
  • Dealing with noisy, vague and incomplete data
  • Integration of data from multiple sources
  • Identification, resolution, and representation of entities (in documents and in queries)
  • Retrieval and ranking
  • Semantic query modeling (detecting, modeling, and understanding search intents)
  • Novel entity-oriented information access tasks
  • Interaction paradigms (natural language, keyword-based, and hybrid interfaces) and result representation
  • Test collections and evaluation methodology
  • Case studies and applications

We particularly encourage formal evaluation of approaches using previously established evaluation benchmarks: Semantic Search Challenge 2010, Semantic Search Challenge 2011, TREC Entity Search Track.

All workshops are special to someone. This one sounds more special than most. Collocated with the ACM SIGIR 2012 meeting. Perhaps that’s the difference.

Nature Publishing Group releases linked data platform

Sunday, April 8th, 2012

Nature Publishing Group releases linked data platform

From the post:

Nature Publishing Group (NPG) today is pleased to join the linked data community by opening up access to its publication data via a linked data platform. NPG’s Linked Data Platform is available at

The platform includes more than 20 million Resource Description Framework (RDF) statements, including primary metadata for more than 450,000 articles published by NPG since 1869. In this first release, the datasets include basic citation information (title, author, publication date, etc) as well as NPG specific ontologies. These datasets are being released under an open metadata license, Creative Commons Zero (CC0), which permits maximal use/re-use of this data.

NPG’s platform allows for easy querying, exploration and extraction of data and relationships about articles, contributors, publications, and subjects. Users can run web-standard SPARQL Protocol and RDF Query Language (SPARQL) queries to obtain and manipulate data stored as RDF. The platform uses standard vocabularies such as Dublin Core, FOAF, PRISM, BIBO and OWL, and the data is integrated with existing public datasets including CrossRef and PubMed.

More information about NPG’s Linked Data Platform is available at Sample queries can be found at
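A SPARQL query over a platform like NPG's is, at bottom, pattern matching over triples: variables bind to whatever fills their position. A toy matcher for a single triple pattern (invented data and prefixes, no consistency check for repeated variables; a real endpoint does far more):

```python
# Made-up article metadata in subject/predicate/object form.
triples = [
    ("npg:art1", "dc:title",   "On RDF at Scale"),
    ("npg:art1", "dc:creator", "npg:contrib7"),
    ("npg:art2", "dc:title",   "Linked Citations"),
]

def match(pattern, data):
    """Return variable bindings ('?'-prefixed terms) for one triple pattern."""
    results = []
    for triple in data:
        binding = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                binding[want] = have
            elif want != have:
                break
        else:
            results.append(binding)
    return results

# Roughly: SELECT ?a ?t WHERE { ?a dc:title ?t }
print(match(("?a", "dc:title", "?t"), triples))
```

Against 20 million statements the interesting problems are indexing and join optimization, but the query model is exactly this.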

You may find it odd that I would cite such a resource on the same day as penning Technology speedup graph where I speak so harshly about the Semantic Web.

On the contrary, disagreement about the success/failure of the Semantic Web and its retreat to Linked Data is itself an example of conflicting semantics, and conflicting semantics are not a “feature” of the Semantic Web.

Besides, Nature is a major science publisher and their experience with Linked Data is instructive.

Such as the NPG specific ontologies. 😉 Not what you were expecting?

This is a very useful resource and the Nature Publishing Group is to be commended for it.

The creation of metadata about the terms used within articles and the relationships between those terms as well as other publications, will make it more useful still.

Linked Data Basic Profile 1.0

Wednesday, April 4th, 2012

Linked Data Basic Profile 1.0

A group of W3C members including IBM, DERI, EMC, Oracle, Red Hat, and Tasktop have made a submission to the W3C with the title: Linked Data Basic Profile 1.0.

The submission consists of:

Linked Data Basic Profile 1.0

Linked Data Basic Profile Use Cases and Requirements

Linked Data Basic Profile RDF Schema

Interesting proposal.

Doesn’t try to do everything. The old 303/TBL is relegated to pagination. Probably a good use for it.


I-CHALLENGE 2012 : Linked Data Cup

Friday, January 6th, 2012

I-CHALLENGE 2012 : Linked Data Cup


When Sep 5, 2012 – Sep 7, 2012
Where Graz, Austria
Submission Deadline Apr 2, 2012
Notification Due May 7, 2012
Final Version Due Jun 4, 2012

From the call for submissions:

The yearly organised Linked Data Cup (formerly Triplification Challenge) awards prizes to the most promising innovation involving linked data. Four different technological topics are addressed: triplification, interlinking, cleansing, and application mash-ups. The Linked Data Cup invites scientists and practitioners to submit novel and innovative (5 star) linked data sets and applications built on linked data technology.

Although more and more data is triplified and published as RDF and linked data, the question arises how to evaluate the usefulness of such approaches. The Linked Data Cup therefore requires all submissions to include a concrete use case and problem statement alongside a solution (triplified data set, interlinking/cleansing approach, linked data application) that showcases the usefulness of linked data. Submissions that can provide measurable benefits of employing linked data over traditional methods are preferred.
Note that the call is not limited to any domain or target group. We accept submissions ranging from value-added business intelligence use cases to scientific networks to the longest tail of information domains. The only strict requirement is that the employment of linked data is very well motivated and also justified (i.e. we rank approaches higher that provide solutions, which could not have been realised without linked data, even if they lack technical or scientific brilliance). (emphasis added)

I don’t know what the submissions are going to look like but the conference organizers should get high marks for academic honesty. I don’t think I have ever seen anyone say:

we rank approaches higher that provide solutions, which could not have been realised without linked data, even if they lack technical or scientific brilliance

We have all seen challenges with qualifying requirements but I don’t recall any that would privilege lesser work because of a greater dependence on a requirement. Or at least that would publicly claim that was the contest policy. Have there been complaints from technically or scientifically brilliant approaches about judging in the past?

I will have to watch the submissions and results to see if technically or scientifically brilliant approaches get passed over in favor of lesser ones. If so, that will be a signal to first-rate competitors to seek recognition elsewhere.

Linked Data Paradigm Can Fuel Linked Cities

Thursday, December 29th, 2011

Linked Data Paradigm Can Fuel Linked Cities

The small city of Cluj in Romania, of some half-million inhabitants, is responsible for a 2.5 million triple store, as part of a Recognos-led project to develop a “Linked City” community portal. The project was submitted for this year’s ICT Call – SME initiative on Digital Content and Languages, FP7-ICT-2011-SME-DCL. While it didn’t receive funding from that competition, Recognos semantic web researcher Dia Miron, is hopeful of securing help from alternate sources in the coming year to expand the project, including potentially bringing the concept of linked cities to other communities in Romania or elsewhere in Europe.

The idea was to publish information from sources such as local businesses about their services and products, as well as data related to the local government and city events, points of interest and projects, using the Linked Data paradigm, says Miron. Data would also be geolocated. “So we take all the information we can get about a city so that people can exploit it in a uniform manner,” she says.

The first step was to gather the data and publish it in a standard format using RDF and OWL; the next phase, which hasn’t taken place yet (it’s funding-dependent), is to build exportation channels for the data. “First we wanted a simple query engine that will exploit the data, and then we wanted to build a faceted search mechanism for those who don’t know the data structure to exploit and navigate through the data,” she says. “We wanted to make it easier for someone not very acquainted with the models. Then we wanted also to provide some kind of SMS querying because people may not always be at their desks. And also the final query service was an augmented reality application to be used to explore the city or to navigate through the city to points of interest or business locations.”

Local Cluj authorities don’t have the budgets to support the continuation of the project on their own, but Miron says the applications will be very generic and can easily be transferred to support other cities, if they’re interested in helping to support the effort. Other collaborators on the project include Ontotext and STI Innsbruck, as well as the local Cluj council.

I don’t doubt this would be useful information for users but is this the delivery model that is going to work for users, assuming it is funded? Here or elsewhere?

How hard do users work with searches? See Keyword and Search Engines Statistics to get an idea by country.

Some users can be trained to perform fairly complex searches but I suspect that is a distinct minority. And the type of searches that need to be performed vary by domain.

For example, earlier today, I was searching for information on “spectral graph theory,” which I suspect has different requirements than searching for 24-hour sushi bars within a given geographic area.

I am not sure how to isolate those different requirements, much less test how close any approach is to satisfying them, but I do think both areas merit serious investigation.

A Look Into Linking Government Data

Saturday, November 19th, 2011

A Look Into Linking Government Data

From the post:

Due out next month from Springer Publishing is Linking Government Data, a book that highlights some of the leading-edge applications of Linked Data to problems of government operations and transparency. David Wood, CTO of 3RoundStones and co-chair of the W3C RDF Working Group, writes and edits the volume, which includes contributions from others exploring the intersection of government and the Semantic Web.


Some of the hot spots for this are down under, in Australia and New Zealand. The U.K., of course, also has done extremely well, with the portal an acknowledged leader in LOD efforts – and certainly comfortably ahead of the U.S. site.

He also thinks it’s worth noting that, just because you might not see a country openly publishing its data as Linked Data, it doesn’t mean that it’s not there. Often someone, somewhere – even if it’s just at one government agency — is using Linked Data principles, experimentally or in real projects. “Like commercial organizations, governments often use them internally and not publish externally,” he notes. “The spectrum of adoption can be individual or a trans-government mandate or everything in between.”

OK, but you would think if there were some major adoption, it would be mentioned in a post promoting the book. Australia, New Zealand and Nixon’s “Silent Majority” in the U.S. are using linked data. Can’t see them but they are there. That’s like RIAA music piracy estimates, just sheer fiction for all but true believers.

But as far as the U.S.A., the rhetoric shifts from tangible benefit to “can lead to,” “can save money,” etc.:

The economics of the Linked Data approach, Wood says, show unambiguous benefit. Think of how it can save money in the U.S. on current expenditures for data warehousing. And think of the time- and productivity-savings, for example, of having government information freely available on the web in a standard format in a way that can be reused and recombined with other data. In the U.S., “government employees wouldn’t have to divert their daily work to answer Freedom of Information requests because the information is proactively published,” he says. It can lead to better policy decisions because government researchers wouldn’t have to spend enormous amounts of time trying to integrate data from multiple agencies in varying formats to combine it and find connections between, for example, places where people live and certain kinds of health problems that may be prevalent there.

And democracy and society also profit when it’s easy for citizens to access published information on where the government is spending their money, or when it’s easy for scientists and researchers to get data the government collects around scientific efforts so that it can be reused for purposes not originally envisioned.

“Unambiguous benefit” means that we have two systems, one using practice X and the other using practice Y, and that when they are compared (assuming the systems are comparable), there is a clear difference of Z% on some measurable metric that can be attributed to the different practices.


Personally I think linked data can be beneficial but that is subject to measurement and demonstration in some particular context.

As soon as this work is released, I would appreciate pointers to unambiguous benefit shown by comparison of agencies in the U.S.A. doing comparable work with some metric that makes that demonstration. But that has to be more than speculation or “can.”

LDIF – Linked Data Integration Framework Version 0.3

Friday, October 7th, 2011

LDIF – Linked Data Integration Framework Version 0.3

From the email announcement:

The LDIF – Linked Data Integration Framework can be used within Linked Data applications to translate heterogeneous data from the Web of Linked Data into a clean local target representation while keeping track of data provenance. LDIF provides an expressive mapping language for translating data from the various vocabularies that are used on the Web into a consistent, local target vocabulary. LDIF includes an identity resolution component which discovers URI aliases in the input data and replaces them with a single target URI based on user-provided matching heuristics. For provenance tracking, the LDIF framework employs the Named Graphs data model.

Compared to the previous release 0.2, the new LDIF release provides:

  • data access modules for gathering data from the Web via file download, crawling and accessing SPARQL endpoints. Web data is cached locally for further processing.
  • a scheduler for launching data import and integration jobs as well as for regularly updating the local cache with data from remote sources.
  • a second use case that shows how LDIF is used to gather and integrate data from several music-related Web data sources.

More information about LDIF, concrete usage examples and performance details are available at

Over the next months, we plan to extend LDIF along the following lines:

  1. Implement a Hadoop Version of the Runtime Environment in order to be able to scale to really large amounts of input data. Processes and data will be distributed over a cluster of machines.
  2. Add a Data Quality Evaluation and Data Fusion Module which allows Web data to be filtered according to different data quality assessment policies and provides for fusing Web data according to different conflict resolution methods.

Uses SILK (SILK – Link Discovery Framework Version 2.5) identity resolution semantics.
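The identity-resolution step described in the announcement, discovering URI aliases and replacing them with a single target URI, can be sketched as a simple rewrite once the matching heuristics have done their work. A toy version (the alias table and URIs are invented; LDIF's actual heuristics are user-configurable):

```python
# Suppose matching heuristics have decided these URIs name the same city.
aliases = {
    "http://dbpedia.org/resource/Vienna": "http://example.org/id/vienna",
    "http://sws.geonames.org/2761369/":   "http://example.org/id/vienna",
}

def canonical(uri):
    """Map an alias to its target URI; unknown URIs pass through unchanged."""
    return aliases.get(uri, uri)

def resolve(triples):
    """Rewrite subjects and objects so only target URIs remain."""
    return [(canonical(s), p, canonical(o)) for s, p, o in triples]

data = [("http://dbpedia.org/resource/Vienna", "rdfs:label", "Wien")]
print(resolve(data))
# [('http://example.org/id/vienna', 'rdfs:label', 'Wien')]
```

The hard part, of course, is deciding which URIs are aliases in the first place; that is what the SILK matching machinery is for.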

Efficient Multidimensional Blocking for Link Discovery without losing Recall

Tuesday, October 4th, 2011

Efficient Multidimensional Blocking for Link Discovery without losing Recall

Jack Park did due diligence on the SILK materials before I did and forwarded a link to this paper.


Over the last three years, an increasing number of data providers have started to publish structured data according to the Linked Data principles on the Web. The resulting Web of Data currently consists of over 28 billion RDF triples. As the Web of Data grows, there is an increasing need for link discovery tools which scale to very large datasets. In record linkage, many partitioning methods have been proposed which substantially reduce the number of required entity comparisons. Unfortunately, most of these methods either lead to a decrease in recall or only work on metric spaces. We propose a novel blocking method called Multi-Block which uses a multidimensional index in which similar objects are located near each other. In each dimension the entities are indexed by a different property increasing the efficiency of the index significantly. In addition, it guarantees that no false dismissals can occur. Our approach works on complex link specifications which aggregate several different similarity measures. MultiBlock has been implemented as part of the Silk Link Discovery Framework. The evaluation shows a speedup factor of several 100 for large datasets compared to the full evaluation without losing recall.

From deeper in the paper:

If the similarity between two entities exceeds a threshold $\theta$, a link between these two entities is generated. $sim$ is computed by evaluating a link specification $s$ (in record linkage typically called linkage decision rule [23]) which specifies the conditions two entities must fulfill in order to be interlinked.

If I am reading this paper correctly, there isn’t a requirement (as in record linkage) that we normalize the data to a common format before writing the rule for comparisons. That in and of itself is a major boon. To say nothing of the other contributions of this paper.
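The linkage decision rule quoted above, link two entities when an aggregated similarity meets a threshold, has a simple shape. A toy sketch with invented measures, weights, and data (this is not Silk's actual link specification language):

```python
def token_jaccard(a, b):
    """Stand-in string similarity: overlap of lower-cased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def link(e1, e2, theta=0.5):
    """Aggregate several per-property similarities; link iff sim >= theta."""
    sims = [
        token_jaccard(e1["label"], e2["label"]),
        1.0 if e1["year"] == e2["year"] else 0.0,
    ]
    return sum(sims) / len(sims) >= theta   # simple average as the aggregation

a = {"label": "Alan Turing", "year": 1912}
b = {"label": "Alan M. Turing", "year": 1912}
print(link(a, b))  # True
```

MultiBlock's job is to avoid evaluating `link` for every possible pair by indexing entities so that only nearby candidates are compared, without dismissing any true match.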

SILK – Link Discovery Framework Version 2.5 released

Tuesday, October 4th, 2011

SILK – Link Discovery Framework Version 2.5 released

I was quite excited to see under “New Data Transformations”…”Merge Values of different inputs.”

But the documentation for Transformation must be lagging behind or I have a different understanding of what it means to “Merge Values of different inputs.”

Perhaps I should ask: What does SILK mean by “Merge Values of different inputs?”

Picking out an issue that is of particular interest to me is not meant to be a negative comment on the project. An impressive bit of work for any EU funded project.

Another question: Has anyone looked at the SILK- Link Specification Language (SILK-LSL) as an input into declaring equivalence/processing for arbitrary data objects? Just curious.

Robert Isele posted this announcement about SILK on October 3, 2011:

we are happy to announce version 2.5 of the Silk Link Discovery Framework for the Web of Data.

The Silk framework is a tool for discovering relationships between data items within different Linked Data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. Using the declarative Silk – Link Specification Language (Silk-LSL), developers can specify the linkage rules data items must fulfill in order to be interlinked. These linkage rules may combine various similarity metrics and can take the graph around a data item into account, which is addressed using an RDF path language.

Linkage rules can either be written manually or developed using the Silk Workbench. The Silk Workbench, is a web application which guides the user through the process of interlinking different data sources.

Version 2.5 includes the following additions to the last major release 2.4:

(1) Silk Workbench now includes a function to learn linkage rules from the reference links. The learning function is based on genetic programming and capable of learning complex linkage rules. Similar to a genetic algorithm, genetic programming starts with a randomly created population of linkage rules. From that starting point, the algorithm iteratively transforms the population into a population with better linkage rules by applying a number of genetic operators. As soon as either a linkage rule with a full f-Measure has been found or a specified maximum number of iterations is reached, the algorithm stops and the user can select a linkage rule.

(2) A new sampling tab allows for fast creation of the reference link set. It can be used to bootstrap the learning algorithm by generating a number of links which are then rated by the user either as correct or incorrect. In this way positive and negative reference links are defined which in turn can be used to learn a linkage rule. If a previous learning run has already been executed, the sampling tries to generate links which contain features which are not yet covered by the current reference link set.

(3) The new help sidebar provides the user with a general description of the current tab as well as with suggestions for the next steps in the linking process. As new users are usually not familiar with the steps involved in interlinking two data sources, the help sidebar currently provides basic guidance to the user and will be extended in future versions.

(4) Introducing per-comparison thresholds:

  • On popular request, thresholds can now be specified on each comparison.
  • Backwards-compatible: Link specifications using a global threshold can still be executed.

(5) New distance measures:

  • Jaccard Similarity
  • Dice’s coefficient
  • DateTime Similarity
  • Tokenwise Similarity, contributed by Florian Kleedorfer, Research Studios Austria

(6) New data transformations:

  • RemoveEmptyValues
  • Tokenizer
  • Merge Values of multiple inputs

(7) New DataSources and Outputs:

  • In addition to reading from SPARQL endpoints, Silk now also supports reading from RDF dumps in all common formats. Currently the data set is held in memory and it is not available in the Workbench yet, but future versions will improve this.
  • New SPARQL/Update Output: In addition to writing the links to a file, Silk now also supports writing directly to a triple store using SPARQL/Update.

(8) Various improvements and bugfixes


More information about the Silk Link Discovery Framework is available at:

The Silk framework is provided under the terms of the Apache License, Version 2.0 and can be downloaded from:

The development of Silk was supported by Vulcan Inc. as part of its Project Halo and by the EU FP7 project LOD2 – Creating Knowledge out of Interlinked Data (Ref. No. 257943).

Thanks to Christian Becker, Michal Murawicki and Andrea Matteini for contributing to the Silk Workbench.
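The Jaccard Similarity and Dice’s coefficient listed among the new distance measures are both set-overlap measures, easy to state exactly. A quick sketch over token sets (my own minimal implementations for illustration, not Silk's code):

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| -- 1.0 for identical sets, 0.0 for disjoint ones."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def dice(a, b):
    """2|A ∩ B| / (|A| + |B|) -- weights the overlap more heavily."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 1.0

tokens1 = "the silk link discovery framework".split()
tokens2 = "silk link discovery".split()
print(jaccard(tokens1, tokens2))  # 0.6
print(dice(tokens1, tokens2))     # 0.75
```

Dice always scores at least as high as Jaccard on the same pair, which matters when the per-comparison thresholds mentioned above are being tuned.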