Archive for the ‘Semantic Web’ Category

IPLD (Interplanetary Linked Data)

Thursday, June 1st, 2017

IPLD (Interplanetary Linked Data)

IPLD is the data model of the content-addressable web. It allows us to treat all hash-linked data structures as subsets of a unified information space, unifying all data models that link data with hashes as instances of IPLD.


A data model for interoperable protocols.

Content addressing through hashes has become a widely-used means of connecting data in distributed systems, from the blockchains that run your favorite cryptocurrencies, to the commits that back your code, to the web’s content at large. Yet, whilst all of these tools rely on some common primitives, their specific underlying data structures are not interoperable.

Enter IPLD: IPLD is a single namespace for all hash-inspired protocols. Through IPLD, links can be traversed across protocols, allowing you to explore data regardless of the underlying protocol.
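The mechanics behind "hash-linked" are worth seeing once. Here is a minimal Python sketch, not real IPLD (which uses multihash-based CIDs and canonical codecs such as DAG-CBOR): each block's address is a hash of a canonical encoding of its bytes, and one block links to another by embedding that hash.

```python
import hashlib
import json

def address(obj):
    """Derive a content address (a hash) from a canonical encoding of obj."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return "sha256:" + hashlib.sha256(canonical).hexdigest()

# Two blocks of data; the second links to the first *by hash*, not by location.
leaf = {"name": "README", "size": 124}
leaf_addr = address(leaf)

root = {"kind": "directory", "entries": [{"link": leaf_addr}]}
root_addr = address(root)

# The same bytes always yield the same address, wherever they are stored.
assert address({"size": 124, "name": "README"}) == leaf_addr
print(leaf_addr != root_addr)
```

Because the address is derived from content, any protocol that links by hash (git, blockchains, IPFS) can in principle be traversed with the same machinery, which is the interoperability claim IPLD makes.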

The webpage is annoyingly vague, so you will need to visit the IPLD spec GitHub page and consider this whitepaper: IPFS – Content Addressed, Versioned, P2P File System (DRAFT 3) by Juan Benet.

As you read, ask whether annotation of “links” can avoid confusing addresses with identifiers.

We’ve seen that before and the inability to acknowledge/correct the mistake was deadly.

HOBBIT – Holistic Benchmarking of Big Linked Data

Saturday, December 26th, 2015

HOBBIT – Holistic Benchmarking of Big Linked Data

From the “about” page:

HOBBIT is driven by the needs of the European industry. Thus, the project objectives were derived from the needs of the European industry (represented by our industrial partners) in combination with the results of prior and ongoing efforts including BIG, BigDataEurope, LDBC Council and many more. The main objectives of HOBBIT are:

  1. Building a family of industry-relevant benchmarks,
  2. Implementing a generic evaluation platform for the Big Linked Data value chain,
  3. Providing periodic benchmarking results including diagnostics to further the improvement of BLD processing tools,
  4. (Co-)Organizing challenges and events to gather benchmarking results as well as industry-relevant KPIs and datasets,
  5. Supporting companies and academics during the creation of new challenges or the evaluation of tools.

As we found in Avoiding Big Data: More Business Intelligence Than You Would Think, 3/4 of businesses cannot extract value from data they already possess, making any investment in “big data” a sure loser for them.

Which makes me wonder what “big data” the HOBBIT project intends to use for benchmarking “Big Linked Data.”

Then I saw on the homepage:

The HOBBIT partners such as TomTom, USU, AGT and others will provide more than 25 trillions of sensor data to be bechmarked [sic] within the HOBBIT project.

“…25 trillions of sensor data….?” sounds odd until you realize that TomTom is:

TomTom, founded in 1991, is a world leader in in-car location and navigation products.

OK, so the “Big Linked Data” in question isn’t random “linked data,” but a specialized kind of “linked data.”

That’s less risky than building a human brain with no clear idea of where to start, but it addresses a narrow window on linked data.

The HOBBIT Kickoff meeting Luxembourg 18-19 January 2016 announcement still lacks a detailed agenda.

‘Linked data can’t be your goal. Accomplish something’

Friday, December 18th, 2015

Tim Strehle points to his post: Jonathan Rochkind: Linked Data Caution, which is a collection of quotes from Linked Data Caution (Jonathan Rochkind).

In the process, Tim creates his own quote, inspired by Rochkind:

‘Linked data can’t be your goal. Accomplish something’

Which is easy to generalize to:

‘***** can’t be your goal. Accomplish something’

Whether your hobby horse is linked data, graphs, noSQL, big data, or even topic maps, technological artifacts are just and only that, artifacts.

Unless and until such artifacts accomplish something, they are curios, relics venerated by pockets of the faithful.

Perhaps marketers in 2016 should be told:

Skip the potential benefits of your technology. Show me what it has accomplished (past tense) for users similar to me.

With that premise, you could weed through four or five vendors in a morning. 😉

Linked Data Repair and Certification

Saturday, June 27th, 2015

1st International Workshop on Linked Data Repair and Certification (ReCert 2015) is a half-day workshop at the 8th International Conference on Knowledge Capture (K-CAP 2015).

I know, not nearly as interesting as talking about Raquel Welch, but someone has to. 😉

From the post:

In recent years, we have witnessed a big growth of the Web of Data due to the enthusiasm shown by research scholars, public sector institutions and some private companies. Nevertheless, no rigorous processes for creating or mapping data have been systematically followed in most cases, leading to uneven quality among the different datasets available. Though low quality datasets might be adequate in some cases, these gaps in quality in different datasets sometimes hinder the effective exploitation, especially in industrial and production settings.

In this context, there are ongoing efforts in the Linked Data community to define the different quality dimensions and metrics to develop quality assessment frameworks. These initiatives have mostly focused on spotting errors as part of independent research efforts, sometimes lacking a global vision. Further, up to date, no significant attention has been paid to the automatic or semi-automatic repair of Linked Data, i.e., the use of unattended algorithms or supervised procedures for the correction of errors in linked data. Repaired data is susceptible of receiving a certification stamp, which together with reputation metrics of the sources can lead to having trusted linked data sources.

The goal of the Workshop on Linked Data Repair and Certification is to raise the awareness of dataset repair and certification techniques for Linked Data and to promote approaches to assess, monitor, maintain, improve, and certify Linked Data quality.

There is a call for papers with the following deadlines:

Paper submission: Monday, July 20, 2015

Acceptance Notification: Monday August 3, 2015

Camera-ready version: Monday August 10, 2015

Workshop: Monday October 7, 2015

Now that linked data exists, someone has to undertake the task of maintaining it. You could make links in linked data into topics in a topic map and add properties that would make them easier to match and maintain. Just a thought.
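A minimal sketch of that thought, with hypothetical URIs: each link becomes a topic carrying a set of identifiers plus maintenance properties, and topics merge when they share an identifier. (This is a single-pass toy, not transitive merging.)

```python
# Each "link" from a linked data set becomes a topic: a bag of identifiers
# plus properties that make it easier to match and maintain.
def merge_topics(topics):
    """Merge topics that share at least one identifier (single-pass sketch)."""
    merged = []
    for topic in topics:
        for existing in merged:
            if existing["ids"] & topic["ids"]:
                existing["ids"] |= topic["ids"]
                existing["props"].update(topic["props"])
                break
        else:
            merged.append({"ids": set(topic["ids"]), "props": dict(topic["props"])})
    return merged

a = {"ids": {"http://example.org/doc/1"}, "props": {"status": "live"}}
b = {"ids": {"http://example.org/doc/1", "http://example.com/d1"},
     "props": {"last-checked": "2015-06-27"}}
c = {"ids": {"http://example.net/other"}, "props": {}}

print(len(merge_topics([a, b, c])))  # a and b merge on the shared identifier
```

The point of the extra properties is exactly maintenance: a “last-checked” date or a status flag travels with the merged topic rather than being lost in a bare URI.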

As far as “trusted linked data sources,” I think the correct phrasing is: “less untrusted data sources than others.”

You know the phrase: “In God we trust, all others pay cash.”

Same is true for data. It may be a “trusted” source, but verify the data first, then trust.

URLs Are Porn Vulnerable

Monday, June 22nd, 2015

Graham Cluley reports in Heinz takes the heat over saucy porn QR code that some bottles of Heinz Hot Ketchup provide more than “hot” ketchup. The QR code on the bottle leads to a porn site. (It is hard to put a “prize” in a ketchup bottle.)

Graham observes that a domain registration for Heinz lapsed and that the new owner wasn’t in the same line of work.

Are you presently maintaining every domain you have ever registered?

The lesson here is that URLs (as identifiers) are porn vulnerable.

Web Page Structure, Without The Semantic Web

Saturday, May 30th, 2015

Could a Little Startup Called Diffbot Be the Next Google?

From the post:

Diffbot founder and CEO Mike Tung started the company in 2009 to fix a problem: there was no easy, automated way for computers to understand the structure of a Web page. A human looking at a product page on an e-commerce site, or at the front page of a newspaper site, knows right away which part is the headline or the product name, which part is the body text, which parts are comments or reviews, and so forth.

But a Web-crawler program looking at the same page doesn’t know any of those things, since these elements aren’t described as such in the actual HTML code. Making human-readable Web pages more accessible to software would require, as a first step, a consistent labeling system. But the only such system to be seriously proposed, Tim Berners-Lee’s Semantic Web, has long floundered for lack of manpower and industry cooperation. It would take a lot of people to do all the needed markup, and developers around the world would have to adhere to the Resource Description Framework prescribed by the World Wide Web Consortium.

Tung’s big conceptual leap was to dispense with all that and attack the labeling problem using computer vision and machine learning algorithms—techniques originally developed to help computers make sense of edges, shapes, colors, and spatial relationships in the real world. Diffbot runs virtual browsers in the cloud that can go to a given URL; suck in the page’s HTML, scripts, and style sheets; and render it just as it would be shown on a desktop monitor or a smartphone screen. Then edge-detection algorithms and computer-vision routines go to work, outlining and measuring each element on the page.

Using machine-learning techniques, this geometric data can then be compared to frameworks or “ontologies”—patterns distilled from training data, usually by humans who have spent time drawing rectangles on Web pages, painstakingly teaching the software what a headline looks like, what an image looks like, what a price looks like, and so on. The end result is a marked-up summary of a page’s important parts, built without recourse to any Semantic Web standards.

The irony here, of course, is that much of the information destined for publication on the Web starts out quite structured. The WordPress content-management system behind Xconomy’s site, for example, is built around a database that knows exactly which parts of this article should be presented as the headline, which parts should look like body text, and (crucially, to me) which part is my byline. But these elements get slotted into a layout designed for human readability—not for parsing by machines. Given that every content management system is different and that every site has its own distinctive tags and styles, it’s hard for software to reconstruct content types consistently based on the HTML alone.

There are several themes here that are relevant to topic maps.

First, it is true that most data starts with some structure, styles if you will, before it is presented for user consumption. Imagine an authoring application that automatically, and unknown to its user, captures metadata that can then provide semantics for its data.

Second, the recognition-of-structure approach being used by Diffbot is promising in the large but should be promising in the small as well. Local documents of a particular type are unlikely to have the variance of documents across the web, meaning that with far less effort you can build recognition systems that enable more powerful searching of local document repositories.
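To make the “in the small” point concrete, here is a toy sketch (invented thresholds, not Diffbot’s algorithms) that classifies page elements from rendered features such as position, font size and text length rather than from HTML tags. In a real system these rules would be learned from labeled training data.

```python
# Classify what role a page element plays from rendered features,
# not from its HTML markup. All thresholds are invented for illustration.
def classify(element):
    if element["top"] < 150 and element["font_size"] >= 24:
        return "headline"
    if element["char_count"] > 400:
        return "body"
    if element["char_count"] < 40 and "$" in element["text"]:
        return "price"
    return "other"

elements = [
    {"top": 60,  "font_size": 32, "char_count": 15,  "text": "Widget Pro 3000"},
    {"top": 300, "font_size": 14, "char_count": 950, "text": "Long description ..."},
    {"top": 200, "font_size": 18, "char_count": 6,   "text": "$19.99"},
]

print([classify(e) for e in elements])  # ['headline', 'body', 'price']
```

For a local repository of one document type, a handful of such rules may cover most cases, which is the “far less effort” argument above.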

Third, and perhaps most importantly, while the results may not be 100% accurate, the question for any such project should be: how much accuracy is required? If I am mining social commentary blogs, a 5% error rate on recognition of speakers might be acceptable, because for popular threads or speakers, those errors are going to be quickly corrected. If unpopular threads or authors are never followed, does that come under no harm/no foul?

Highly recommended for reading/emulation.

Controlled Vocabularies and the Semantic Web

Wednesday, February 18th, 2015

Controlled Vocabularies and the Semantic Web Journal of Library Metadata – Special Issue Call for Papers

From the webpage:

Ranging from large national libraries to small and medium-sized institutions, many cultural heritage organizations, including libraries, archives, and museums, have been working with controlled vocabularies in linked data and semantic web contexts.  Such work has included transforming existing vocabularies, thesauri, subject heading schemes, authority files, term and code lists into SKOS and other machine-consumable linked data formats. 

This special issue of the Journal of Library Metadata welcomes articles from a wide variety of types and sizes of organizations on a wide range of topics related to controlled vocabularies, ontologies, and models for linked data and semantic web deployment, whether theoretical, experimental, or actual. 

Topics include, but are not restricted to the following:

  • Converting existing vocabularies into SKOS and/or other linked data formats.
  • Publishing local vocabularies as linked data in online repositories such as the Open Metadata Registry.
  • Development or use of special tools, platforms and interfaces that facilitate the creation and deployment of vocabularies as linked data.
  • Working with Linked Data / Semantic Web W3C standards such as RDF, RDFS, SKOS, and OWL.
  • Work with the BIBFRAME, Europeana, DPLA, CIDOC-CRM, or other linked data / semantic web models, frameworks, and ontologies.
  • Challenges in transforming existing vocabularies and models into linked data and semantic web vocabularies and models.
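The first of these topics, converting an existing vocabulary into SKOS, can be sketched quite simply: terms map to skos:Concept, labels to skos:prefLabel, and hierarchy to skos:broader. A minimal sketch with illustrative URIs (a production conversion would use an RDF library rather than string building):

```python
# Turn a flat term list with broader relations into SKOS Turtle.
# The scheme URI and term ids are hypothetical.
def to_skos(scheme_uri, terms):
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."]
    for term in terms:
        uri = f"<{scheme_uri}/{term['id']}>"
        lines.append(f"{uri} a skos:Concept ;")
        lines.append(f'    skos:prefLabel "{term["label"]}"@en' +
                     (" ;" if term.get("broader") else " ."))
        if term.get("broader"):
            lines.append(f"    skos:broader <{scheme_uri}/{term['broader']}> .")
    return "\n".join(lines)

vocab = [
    {"id": "maps", "label": "Maps"},
    {"id": "topo", "label": "Topographic maps", "broader": "maps"},
]
print(to_skos("http://example.org/vocab", vocab))
```

The hard part the CFP alludes to is not this serialization step but deciding which legacy thesaurus relations actually mean skos:broader versus something looser.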


Researchers and practitioners are invited to submit a proposal (approximately 500 words) including a problem statement, problem significance, objectives, methodology, and conclusions (or tentative conclusions for work in progress). Proposals must be received by March 1, 2015. Full manuscripts (4000-7000 words) are expected to be submitted by June 1, 2015. All submitted manuscripts will be reviewed on a double-blind review basis.

Please forward inquiries and proposal submissions electronically to the guest editors at:

Proposal Deadline: March 1, 2015.

The Journal of Library Metadata is online. Unfortunately it is one of those journals where authors have to pay for their work to be accessible to others. The interface makes it look like you are going to have access until you attempt to view a particular article. I didn’t stumble across any that were accessible, but I only tried four (4) or five (5) of them.

Interesting journal if you have access to it or if you are willing to pay $40.00 per article for viewing. I worked for an academic publisher for a number of years and have an acute sense of the value-add publishers bring to the table. Volunteer authors, volunteer editors, etc.

SPARQLES: Monitoring Public SPARQL Endpoints

Sunday, February 15th, 2015

SPARQLES: Monitoring Public SPARQL Endpoints by Pierre-Yves Vandenbussche, Jürgen Umbrich, Aidan Hogan, and Carlos Buil-Aranda.


We describe SPARQLES: an online system that monitors the health of public SPARQL endpoints on the Web by probing them with custom-designed queries at regular intervals. We present the architecture of SPARQLES and the variety of analytics that it runs over public SPARQL endpoints, categorised by availability, discoverability, performance and interoperability. To motivate the system, we give examples of some key questions about the health and maturation of public SPARQL endpoints that can be answered by the data it has collected in the past year(s). We also detail the interfaces that the system provides for human and software agents to learn more about the recent history and current state of an individual SPARQL endpoint or about overall trends concerning the maturity of all endpoints monitored by the system.

I started to pass on this article since it dates from 2009, but am now glad that I didn’t. The service is still active and can be found at:

The discoverability of SPARQL endpoints is reported to be low. (The article includes a chart breaking discoverability down, in line with the VoID and Service Description mechanisms discussed below.)

From the article:

[VoID Description:] The Vocabulary of Interlinked Data-sets (VoID) [2] has become the de facto standard for describing RDF datasets (in RDF). The vocabulary allows for specifying, e.g., an OpenSearch description, the number of triples a dataset contains, the number of unique subjects, a list of properties and classes used, number of triples associated with each property (used as predicate), number of instances of a given class, number of triples used to describe all instances of a given class, predicates used to describe class instances, and so forth. Likewise, the description of the dataset is often enriched using external vocabulary, such as for licensing information.

[SD Description:] Endpoint capabilities – such as supported SPARQL version, query and update features, I/O formats, custom functions, and/or entailment regimes – can be described in RDF using the SPARQL 1.1 Service Description (SD) vocabulary, which became a W3C Recommendation in March 2013 [21]. Such descriptions, if made widely available, could help a client find public endpoints that support the features it needs (e.g., find SPARQL 1.1 endpoints)
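The statistics a VoID description reports are straightforward to compute. Below is a toy sketch over hypothetical (subject, predicate, object) triples; a real VoID description would publish these numbers as RDF using the void: vocabulary rather than printing a dict.

```python
# Compute the dataset statistics a VoID description typically reports.
def void_stats(triples):
    return {
        "void:triples": len(triples),
        "void:distinctSubjects": len({s for s, _, _ in triples}),
        "void:properties": len({p for _, p, _ in triples}),
    }

triples = [
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob",   "foaf:name", '"Bob"'),
]
print(void_stats(triples))
# {'void:triples': 3, 'void:distinctSubjects': 2, 'void:properties': 2}
```

That the numbers are this cheap to compute makes their general absence from public endpoints all the more telling.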

No, I’m not calling your attention to this to pick on SPARQL especially, but the lack of discoverability raises a serious issue for any information retrieval system that hopes to do better than dumb-luck searching.

Clearly SPARQL has the capability to increase discoverability; whether those mechanisms would be effective cannot be answered due to lack of use. So my first question is: why aren’t the mechanisms of SPARQL being used to increase discoverability?

Or perhaps better, having gone to the trouble to construct a SPARQL endpoint, why aren’t people taking the next step to make them more discoverable?

Is it because discoverability benefits some remote and faceless user instead of those being called upon to make the endpoint more discoverable? In other words, is it a lack of positive feedback for the person tasked with increasing discoverability?

I ask because if we can’t find the key to motivating people to increase the discoverability of information (SPARQL or not), then we are in serious trouble as big data continues to grow. The amount of data keeps increasing while discoverability keeps going down. That can’t be a happy circumstance for anyone interested in discovering information.


I first saw this in a tweet by Ruben Verborgh.

Review of Large-Scale RDF Data Processing in MapReduce

Wednesday, January 7th, 2015

Review of Large-Scale RDF Data Processing in MapReduce by Ke Hou, Jing Zhang and Xing Fang.


Resource Description Framework (RDF) is an important data presenting standard of the semantic web, and how to process the increasing RDF data is a key problem for the development of the semantic web. MapReduce is a widely-used parallel programming model which can provide a solution to large-scale RDF data processing. This study reviews the recent literature on RDF data processing in the MapReduce framework in aspects of the forward-chaining reasoning, the simple querying and the storage mode determined by the related querying method. Finally, it is proposed that the future research direction of RDF data processing should aim at the scalable, increasing and complex RDF data query.

I count twenty-nine (29) projects with two- to three-sentence summaries of each one. A great starting point for an in-depth review of RDF data processing using MapReduce.
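The shared pattern in the surveyed systems can be sketched with a single RDFS rule (rdfs9: if x rdf:type C and C rdfs:subClassOf D, then x rdf:type D). Below is a toy, in-memory imitation of the map and reduce phases; real systems run this on Hadoop and iterate until a fixpoint is reached.

```python
from collections import defaultdict

# Map: key each relevant triple by the class it mentions.
def map_phase(triples):
    for s, p, o in triples:
        if p == "rdf:type":
            yield (o, ("type", s))          # key on the class
        elif p == "rdfs:subClassOf":
            yield (s, ("subclass", o))      # key on the subclass

# Reduce: join instances and superclasses that met at the same key.
def reduce_phase(grouped):
    for cls, values in grouped.items():
        instances = [v for tag, v in values if tag == "type"]
        supers = [v for tag, v in values if tag == "subclass"]
        for x in instances:
            for d in supers:
                yield (x, "rdf:type", d)

triples = [
    ("ex:fido", "rdf:type", "ex:Dog"),
    ("ex:Dog", "rdfs:subClassOf", "ex:Animal"),
]
grouped = defaultdict(list)
for key, value in map_phase(triples):
    grouped[key].append(value)

print(list(reduce_phase(grouped)))  # [('ex:fido', 'rdf:type', 'ex:Animal')]
```

Keying both triple patterns on the class is what turns the rule’s join into a shuffle-and-group, which is why MapReduce fits this workload at all.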

I first saw this in a tweet by Marin Dimitrov.

Google’s Secretive DeepMind Startup Unveils a “Neural Turing Machine”

Wednesday, December 31st, 2014

Google’s Secretive DeepMind Startup Unveils a “Neural Turing Machine”

From the post:

One of the great challenges of neuroscience is to understand the short-term working memory in the human brain. At the same time, computer scientists would dearly love to reproduce the same kind of memory in silico.

Today, Google’s secretive DeepMind startup, which it bought for $400 million earlier this year, unveils a prototype computer that attempts to mimic some of the properties of the human brain’s short-term working memory. The new computer is a type of neural network that has been adapted to work with an external memory. The result is a computer that learns as it stores memories and can later retrieve them to perform logical tasks beyond those it has been trained to do.

Of particular interest to topic mappers and folks looking for realistic semantic solutions for big data is the concept of “recoding,” which is how the human brain collapses multiple chunks of data into one chunk for easier access/processing.

It sounds close to referential transparency to me, but with the transparency being optional. That is, you don’t have to look unless you need the details.

The full article will repay the time to read it and then some:

Neural Turing Machines by Alex Graves, Greg Wayne, Ivo Danihelka.


We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.

The paper was revised on 10 December 2014 so if you read an earlier version, you may want to read it again. Whether Google cracks this aspect of the problem of intelligence or not, it sounds like an intriguing technique with applications in topic map/semantic processing.

Linked Open Data Visualization Revisited: A Survey

Saturday, December 20th, 2014

Linked Open Data Visualization Revisited: A Survey by Oscar Peña, Unai Aguilera and Diego López-de-Ipiña.


Mass adoption of the Semantic Web’s vision will not become a reality unless the benefits provided by data published under the Linked Open Data principles are understood by the majority of users. As technical and implementation details are far from being interesting for lay users, the ability of machines and algorithms to understand what the data is about should provide smarter summarisations of the available data. Visualization of Linked Open Data proposes itself as a perfect strategy to ease the access to information by all users, in order to save time learning what the dataset is about and without requiring knowledge on semantics.

This article collects previous studies from the Information Visualization and the Exploratory Data Analysis fields in order to apply the lessons learned to Linked Open Data visualization. Datatype analysis and visualization tasks proposed by Ben Shneiderman are also added in the research to cover different visualization features.

Finally, an evaluation of the current approaches is performed based on the dimensions previously exposed. The article ends with some conclusions extracted from the research.

I would like to see a version of this article after it has had several good editing passes. From the abstract alone, “…benefits provided by data…” and “…without requiring knowledge on semantics…” strike me as extremely problematic.

Data, accessible or not, does not provide benefits. The results of processing data may, which could explain the lack of enthusiasm when large data dumps are made web accessible. In and of itself, such a dump is just another large pile of data. The results of processing that data may be very useful, but that is another step in the process.

I don’t think “…without requiring knowledge on semantics…” is in line with the rest of the article. I suspect the authors meant that the semantics of data sets could be conveyed to users without their researching them prior to using the data set. I think that is problematic, but it has the advantage of being plausible.

The various theories of visualization and datatypes (pages 3-8) don’t seem to advance the discussion and I would either drop that content or tie it into the actual visualization suites discussed. It’s educational but its relationship to the rest of the article is tenuous.

The coverage of visualization suites is encouraging and useful, but with an overall tighter focus, more time could be spent on each one and their entries being correspondingly longer.

Hopefully we will see a later, edited version of this paper as a good summary/guide to visualization tools for linked data would be a useful addition to the literature.

I first saw this in a tweet by Marin Dimitrov.

clj-turtle: A Clojure Domain Specific Language (DSL) for RDF/Turtle

Tuesday, November 11th, 2014

clj-turtle: A Clojure Domain Specific Language (DSL) for RDF/Turtle by Frédérick Giasson.

From the post:

Some of my recent work led me to use Clojure heavily to develop all kinds of new capabilities for Structured Dynamics. Those who know us know that everything we do is related to RDF and OWL ontologies. All this work with Clojure is no exception.

Recently, while developing a Domain Specific Language (DSL) for using the Open Semantic Framework (OSF) web service endpoints, I did some research to try to find a simple Clojure DSL that I could use to generate RDF data (in any well-known serialization). After some time, I figured out that no such thing currently existed in the Clojure ecosystem, so I chose to create my own simple DSL for creating RDF data.

The primary goal of this new project was to have a DSL that users could use to create RDF data that could be fed to OSF web service endpoints such as the CRUD: Create or CRUD: Update endpoints.

What I chose to do is create a new project called clj-turtle that generates RDF/Turtle code from Clojure code. The Turtle code that is produced by this DSL is currently quite verbose. This means that all the URIs are expanded, that triple quotes are used and that the triples are fully described.

This new DSL is meant to be a really simple and easy way to create RDF data. It could even be used by non-Clojure coders to create RDF/Turtle-compatible data using the DSL. New services could easily be created that take the DSL code as input and output the RDF/Turtle code. That way, no Clojure environment would be required to use the DSL for generating RDF data.
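clj-turtle itself is Clojure, but the idea transfers. Here is a hypothetical Python analogue of such a DSL, producing the same kind of verbose Turtle (full URIs, triple-quoted literals, fully described triples); the URIs are illustrative.

```python
# A tiny DSL whose expressions serialize to verbose Turtle.
class Resource:
    def __init__(self, uri):
        self.uri = uri
    def __call__(self, *po_pairs):
        # Each (predicate, object) pair becomes one predicate-object clause.
        parts = " ;\n    ".join(f"<{p.uri}> {fmt(o)}" for p, o in po_pairs)
        return f"<{self.uri}> {parts} ."

def fmt(o):
    """Resources serialize as <uri>; everything else as a triple-quoted literal."""
    return f"<{o.uri}>" if isinstance(o, Resource) else f'"""{o}"""'

foaf_name = Resource("http://xmlns.com/foaf/0.1/name")
foaf_knows = Resource("http://xmlns.com/foaf/0.1/knows")
alice = Resource("http://example.org/alice")
bob = Resource("http://example.org/bob")

print(alice((foaf_name, "Alice"), (foaf_knows, bob)))
```

As with clj-turtle, the output is deliberately verbose: no prefixes, no abbreviations, so any Turtle parser downstream has nothing to resolve.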

I mention Frédérick’s DSL for RDF despite my doubts about RDF. Good or not, RDF has achieved the status of legacy data.

How To Build Linked Data APIs…

Wednesday, October 15th, 2014

This is the second high signal-to-noise presentation I have seen this week! I am sure that streak won’t last, but I will enjoy it as long as it does.

Resources for after you see the presentation: Hydra: Hypermedia-Driven Web APIs, JSON for Linking Data, and, JSON-LD 1.0.

Near the end of the presentation, Marcus quotes Phil Archer, W3C Data Activity Lead:

Archer on Semantic Web

Which is an odd statement considering that JSON-LD 1.0 Section 7 Data Model, reads in part:

JSON-LD is a serialization format for Linked Data based on JSON. It is therefore important to distinguish between the syntax, which is defined by JSON in [RFC4627], and the data model which is an extension of the RDF data model [RDF11-CONCEPTS]. The precise details of how JSON-LD relates to the RDF data model are given in section 9. Relationship to RDF.

And section 9. Relationship to RDF reads in part:

JSON-LD is a concrete RDF syntax as described in [RDF11-CONCEPTS]. Hence, a JSON-LD document is both an RDF document and a JSON document and correspondingly represents an instance of an RDF data model. However, JSON-LD also extends the RDF data model to optionally allow JSON-LD to serialize Generalized RDF Datasets. The JSON-LD extensions to the RDF data model are:…

Is JSON-LD “…a concrete RDF syntax…” where you can ignore RDF?

Not that I was ever a fan of RDF but standards should be fish or fowl and not attempt to be something in between.
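The “both an RDF document and a JSON document” claim is easy to see with a toy example. The sketch below is not a conformant JSON-LD processor; it only handles a flat node object whose keys are already full IRIs, mapping each key/value pair to a triple.

```python
import json

# A minimal JSON-LD node object: plain JSON, but every key is an IRI
# and @id names the subject. (Illustrative URIs.)
doc = json.loads("""
{
  "@id": "http://example.org/book/1",
  "http://purl.org/dc/terms/title": "Hamlet",
  "http://purl.org/dc/terms/creator": {"@id": "http://example.org/shakespeare"}
}
""")

def to_triples(node):
    """Toy mapping of a flat node object to (subject, predicate, object) triples."""
    subject = node["@id"]
    for key, value in node.items():
        if key == "@id":
            continue
        obj = value["@id"] if isinstance(value, dict) else value
        yield (subject, key, obj)

for triple in to_triples(doc):
    print(triple)
```

So you can treat the document as plain JSON and never think about RDF, which is presumably Archer’s point, but the RDF data model is still there the moment anyone asks for it.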

Sir Tim Berners-Lee speaks out on data ownership

Thursday, October 9th, 2014

Sir Tim Berners-Lee speaks out on data ownership by Alex Hern.

From the post:

The data we create about ourselves should be owned by each of us, not by the large companies that harvest it, Tim Berners-Lee, the inventor of the world wide web, said today.

Berners-Lee told the IPExpo Europe in London’s Excel Centre that the potential of big data will be wasted as its current owners use it to serve ever more “queasy” targeted advertising.

Berners-Lee, who wrote the first memo detailing the idea of the world wide web 25 years ago this year, while working for physics lab Cern in Switzerland, told the conference that the value of “merging” data was under-appreciated in many areas.

Speaking to public data providers, he said: “I’m not interested in your data; I’m interested in merging your data with other data. Your data will never be as exciting as what I can merge it with.

No disagreement with: …the value of “merging” data was under-appreciated in many areas. 😉

Considerable disagreement on how best to accomplish that merging, but it will be an empirical question when people wake up to the value of “merging” data.

Berners-Lee may be right about who “should” own data about ourselves, but that isn’t in fact who owns it now. Changing property laws means taking rights away from those with them under the current regime and creating new rights for others in a new system. Property laws have changed before but it requires more than slogans and wishful thinking to make it so.

Life Is Random: Biologists now realize that “nature vs. nurture” misses the importance of noise

Tuesday, September 16th, 2014

Life Is Random: Biologists now realize that “nature vs. nurture” misses the importance of noise by Cailin O’Connor.

From the post:

Is our behavior determined by genetics, or are we products of our environments? What matters more for the development of living things—internal factors or external ones? Biologists have been hotly debating these questions since shortly after the publication of Darwin’s theory of evolution by natural selection. Charles Darwin’s half-cousin Francis Galton was the first to try to understand this interplay between “nature and nurture” (a phrase he coined) by studying the development of twins.

But are nature and nurture the whole story? It seems not. Even identical twins brought up in similar environments won’t really be identical. They won’t have the same fingerprints. They’ll have different freckles and moles. Even complex traits such as intelligence and mental illness often vary between identical twins.

Of course, some of this variation is due to environmental factors. Even when identical twins are raised together, there are thousands of tiny differences in their developmental environments, from their position in the uterus to preschool teachers to junior prom dates.

But there is more to the story. There is a third factor, crucial to development and behavior, that biologists overlooked until just the past few decades: random noise.

In recent years, noise has become an extremely popular research topic in biology. Scientists have found that practically every process in cells is inherently, inescapably noisy. This is a consequence of basic chemistry. When molecules move around, they do so randomly. This means that cellular processes that require certain molecules to be in the right place at the right time depend on the whims of how molecules bump around. (bold emphasis added)

Is another word for “noise” chaos?

The sort of randomness that impacts our understanding of natural languages? That leads us to use different words for the same thing and the same word for different things?

The next time you see a semantically deterministic system be sure to ask if they have accounted for the impact of noise on the understanding of people using the system. 😉

To be fair, no system can, but the pretense that noise doesn’t exist in some semantic environments (think description logic, RDF) is more than a little annoying.

You might want to start following the work of Cailin O’Connor (University of California, Irvine, Logic and Philosophy of Science).

Disclosure: I have always had a weakness for philosophy of science, so your mileage may vary. This is real philosophy of science and not the strained cries of “science” you see in most mailing list discussions.

I first saw this in a tweet by John Horgan.

Exploring a SPARQL endpoint

Monday, August 25th, 2014

Exploring a SPARQL endpoint by Bob DuCharme.

From the post:

In the second edition of my book Learning SPARQL, a new chapter titled “A SPARQL Cookbook” includes a section called “Exploring the Data,” which features useful queries for looking around a dataset that you know little or nothing about. I was recently wondering about the data available at the SPARQL endpoint, so to explore it I put several of the queries from this section of the book to work.

An important lesson here is how easy SPARQL and RDF make it to explore a dataset that you know nothing about. If you don’t know about the properties used, or whether any schema or schemas were used and how much they were used, you can just query for this information. Most hypertext links below will execute the queries they describe using the endpoint’s SNORQL interface.
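Since the query links above are stripped, here is the shape of one such exploratory query, sketched as a small Python helper. The exact wording and the LIMIT are my choices, not taken from Bob’s book: the query asks an unknown dataset which predicates it uses and how often.

```python
# Sketch of an exploratory SPARQL query of the kind Bob describes:
# list every predicate in an unknown dataset with its usage count.
# Paste the output into any endpoint's query form (e.g. a SNORQL UI).

def predicate_census_query(limit=20):
    """Build a SPARQL query that surveys which predicates a dataset uses."""
    return """
        SELECT ?p (COUNT(*) AS ?uses)
        WHERE { ?s ?p ?o }
        GROUP BY ?p
        ORDER BY DESC(?uses)
        LIMIT %d
    """ % limit

print(predicate_census_query())
```

Running this against a new endpoint is usually the fastest way to learn which vocabularies (dc:, foaf:, skos:, …) the publisher actually used.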

Bob’s ease at using SPARQL reminds me of a story about an ex-spy who was going through customs for the first time in years. As part of that process, he accused a customs officer of having memorized print that was too small to read easily. To which the officer replied, “I am familiar with it.” 😉

Bob’s book on SPARQL and his blog will help you become a competent SPARQL user.

I don’t suppose SPARQL is any worse off semantically than SQL, which has been in use for decades. It is troubling that I can discover dc:title but have no way to investigate how it was used by a particular content author.

Oh, to be sure, the term dc:title makes sense to me, but that is a smoothing function as a reader and may or may not be the same “sense” as occurs to the person who completed such a term.

You can read data sets using your own understanding of tokens but I would do so with a great deal of caution.

The Truth About Triplestores [Opaqueness]

Friday, August 22nd, 2014

The Truth About Triplestores

A vendor “truth” document from Ontotext. Not that being from a vendor is a bad thing, but you should always consider the source of a document when evaluating its claims.

Quite naturally I jumped to: “6. Data Integration & Identity Resolution: Identifying the same entity across disparate data sources.”

With so many different databases and systems existing inside any single organization, how do companies integrate all of their data? How do they recognize that an entity in one database is the same entity in a completely separate database?

Resolving identities across disparate sources can be tricky. First, they need to be identified and then linked.

To do this effectively, you need two things. Earlier, we mentioned that through the use of text analysis, the same entity spelled differently can be recognized. Once this happens, the references to entities need to be stored correctly in the triplestore. The triplestore needs to support predicates that can declare two different Universal Resource Indicators (URIs) as one in the same. By doing this, you can align the same real-world entity used in different data sources. The most standard and powerful predicate used to establish mappings between multiple URIs of a single object is owl:sameAs. In turn, this allows you to very easily merge information from multiple sources including linked open data or proprietary sources. The ability to recognize entities across multiple sources holds great promise helping to manage your data more effectively and pinpointing connections in your data that may be masked by slightly different entity references. Merging this information produces more accurate results, a clearer picture of how entities are related to one another and the ability to improve the speed with which your organization operates.

In case you are unfamiliar with owl:sameAs, here is an example from the OWL Web Ontology Language Reference:

<rdf:Description rdf:about="#William_Jefferson_Clinton">
  <owl:sameAs rdf:resource="#BillClinton"/>
</rdf:Description>

The owl:sameAs in this case is opaque because there is no way to express why an author thought #William_Jefferson_Clinton and #BillClinton were about the same subject. You could argue that any prostitute in Colombia would recognize that mapping so let’s try a harder case.

<rdf:Description rdf:about="#United_States_of_America">
  <owl:sameAs rdf:resource="#الولايات المتحدة الأمريكية"/>
</rdf:Description>

Less confident than you were about the first one?

The problem with owl:sameAs is its opaqueness. You don’t know why an author used owl:sameAs. You don’t know what property or properties they saw that caused them to use one of the various understandings of owl:sameAs.

Without knowing those properties, accepting any owl:sameAs mapping is buying a pig in a poke. Not a proposition that interests me. You?
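One way out of the opaqueness is to record the evidence alongside the mapping. Nothing in owl:sameAs supports this today; the sketch below is a hypothetical shape for a mapping that carries its own justification, with the matched property values invented for illustration.

```python
# Sketch: an owl:sameAs-style mapping that carries its own justification.
# Instead of a bare pair of URIs, record which properties matched, so a
# later reader can audit *why* the two references were merged.
# The property names and values here are illustrative, not a standard.

def assert_same(subject_a, subject_b, evidence):
    """Record a mapping together with the properties that justify it."""
    return {
        "left": subject_a,
        "right": subject_b,
        "matched_properties": evidence,  # the 'why' that owl:sameAs omits
    }

mapping = assert_same(
    "#William_Jefferson_Clinton",
    "#BillClinton",
    {"birthDate": "1946-08-19", "office": "42nd President of the USA"},
)
print(sorted(mapping["matched_properties"]))
```

A consumer can then accept or reject the merge based on whether those properties match its own notion of identity, instead of taking the mapping on faith.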

I first saw this in a tweet by graphityhq.


RDFUnit – an RDF Unit-Testing suite

Tuesday, July 15th, 2014

RDFUnit – an RDF Unit-Testing suite

From the post:

RDFUnit is a test driven data-debugging framework that can run automatically generated (based on a schema) and manually generated test cases against an endpoint. All test cases are executed as SPARQL queries using a pattern-based transformation approach.

For more information on our methodology please refer to our report:

Test-driven evaluation of linked data quality. Dimitris Kontokostas, Patrick Westphal, Sören Auer, Sebastian Hellmann, Jens Lehmann, Roland Cornelissen, and Amrapali J. Zaveri in Proceedings of the 23rd International Conference on World Wide Web.

RDFUnit in a Nutshell

  • Test case: a data constraint that involves one or more triples. We use SPARQL as a test definition language.
  • Test suite: a set of test cases for testing a dataset
  • Status: Success, Fail, Timeout (complexity) or Error (e.g. network). A Fail can be an actual error, a warning or a notice
  • Data Quality Test Pattern (DQTP): Abstract test cases that can be instantiated into concrete test cases using pattern bindings
  • Pattern Bindings: valid replacements for a DQTP variable
  • Test Auto Generators (TAGs): Converts RDFS/OWL axioms into concrete test cases
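The DQTP idea is easy to sketch: an abstract pattern with variables becomes a concrete SPARQL test case once bindings are supplied. The placeholder syntax and the dbo:height example below are mine, not RDFUnit’s own notation.

```python
from string import Template

# Sketch of a Data Quality Test Pattern: an abstract query with
# variables ($P1, $V1) is instantiated into a concrete test case by a
# pattern binding. A non-empty result set means the test Fails.

DQTP = Template(
    "SELECT ?s WHERE { ?s $P1 ?v . "
    "FILTER (?v < $V1) }"  # matches resources violating the constraint
)

# Pattern binding: no resource may have a negative dbo:height.
test_case = DQTP.substitute(P1="dbo:height", V1="0")
print(test_case)
```

One pattern plus many bindings yields a whole test suite, which is what makes the auto-generation from RDFS/OWL axioms tractable.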

If you are working with RDF data, this will certainly be helpful.

BTW, don’t miss the publications further down on the homepage for RDFUnit.

I first saw this in a tweet by Marin Dimitrov.

Linked Data Guidelines (Australia)

Tuesday, July 15th, 2014

First Version of Guidelines for Publishing Linked Data released by Allan Barger.

From the post:

The Australian Government Linked Data Working group (AGLDWG) is pleased to announce the release of a first version of a set of guidelines for the publishing of Linked Datasets on at:

The “URI Guidelines for publishing Linked Datasets on” document provides a set of general guidelines aimed at helping Australian Government agencies to define and manage URIs for Linked Datasets and the resources described within that are published on The Australian Government Linked Data Working group has developed the report over the last two years while the first datasets under the sub-domain have been published following the patterns defined in this document.

Thought you might find this useful in mapping linked data sets from the Australian government to:

  • non-Australian government linked data sets
  • non-government linked data sets
  • non-linked data data sets (all sources)
  • pre-linked data data sets (all sources)
  • post-linked data data sets (all sources)


JSON-LD for software discovery…

Monday, June 16th, 2014

JSON-LD for software discovery, reuse and credit by Arfon Smith.

From the post:

JSON-LD is a way of describing data with additional context (or semantics if you like) so that for a JSON record like this:

{ "name" : "Arfon" }

when there’s an entity called name you know that it means the name of a person and not a place.

If you haven’t heard of JSON-LD then there are some great resources here and an excellent short screencast on YouTube here.

One of the reasons JSON-LD is particularly exciting is that it’s a lightweight way of organising JSON-formatted data and giving it semantic meaning without having to care about things like RDF data models, XML and the (note the capitals) Semantic Web. Being much more succinct than XML and JavaScript-native, JSON has over the past few years become the way to expose data through a web-based API. JSON-LD offers a way for API providers (and consumers) to share data more easily with little or no ambiguity about the data they’re describing.
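The mechanism can be illustrated with a hand-rolled sketch of what a @context does during expansion. Real JSON-LD processors handle far more than this, and the FOAF IRI below is just a plausible choice, not taken from Arfon’s post.

```python
import json

# Hand-rolled sketch of what a JSON-LD @context buys you: the short key
# "name" expands to a full IRI, so two parties can check whether they
# mean the same property. Real processors implement the full JSON-LD
# expansion algorithm; this only handles flat string-valued contexts.

doc = json.loads("""
{
  "@context": {"name": "http://xmlns.com/foaf/0.1/name"},
  "name": "Arfon"
}
""")

def expand(document):
    """Replace context-defined short keys with their full IRIs."""
    context = document.get("@context", {})
    return {context.get(k, k): v
            for k, v in document.items() if not k.startswith("@")}

print(expand(doc))
```

After expansion, two documents that used different short keys but the same context IRI are recognizably talking about the same property, which is the whole point of the exercise.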

The YouTube video “What is JSON-LD?” by Manu Sporny makes an interesting point about the “ambiguity problem,” that is, do you mean by “name” what I mean by “name” as a property?

At about time mark 5:36, Manu addresses the “ambiguity problem.”

The resolution of the ambiguity is to use a hyperlink as an identifier, the implication being that if we use the same identifier, we are talking about the same thing. (That isn’t true in real life, cf. the many meanings of owl:sameAs, but for simplicity’s sake, let’s leave that to one side.)

OK, what is the difference between both of us using the string “name” and both of us using the string “”? Both of them are opaque strings that either match or don’t. This just kicks the semantic can a little bit further down the road.

Let me use a better example from

{
  "@context": "",
  "@id": "",
  "name": "John Lennon",
  "born": "1940-10-09",
  "spouse": ""
}

If you follow the @context link you will obtain a 2.4k JSON-LD file that contains (in part):

“Person”: “

Following that link results in a webpage that reads in part:

The Person class represents people. Something is a Person if it is a person. We don’t nitpic about whether they’re alive, dead, real, or imaginary. The Person class is a sub-class of the Agent class, since all people are considered ‘agents’ in FOAF.

and it is said to be:

Disjoint With: Project, Organization

Ambiguity jumps back to the fore with: Something is a Person if it is a person.

What is that, solipsism? Tautology?

There is no opportunity to say what properties are necessary to qualify as a “person” in the sense defined by FOAF.

You may think that is nit-picking but without the ability to designate properties required to be a “person,” it isn’t possible to talk about 42 U.S.C. § 1983 civil rights actions, where municipalities are held to be “persons” within the meaning of that law. That’s just one example. There are numerous variations on “person” for legal purposes.

You could argue that JSON-LD is for superficial or bubble-gum semantics but it is too useful a syntax for that fate.

Rather I would like to see JSON-LD make ambiguity “manageable” by its users. True, you could define a “you know what I mean” document like FOAF, if that suits your purposes. On the other hand, you should be able to define required key/value pairs for any subject and for any key or value to extend an existing definition.
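What “required key/value pairs” might look like can be sketched quickly. Everything here is hypothetical: the sense names, the required properties, and the sample records are all invented to show the shape of the idea, not any existing vocabulary.

```python
# Sketch of 'manageable ambiguity': each sense of "person" declares the
# properties an instance must carry, so the basis for a match is explicit
# and auditable. All names and properties below are invented.

SENSES = {
    "foaf:Person": set(),  # FOAF: "a Person if it is a person" -- no test
    "legal:Person": {"jurisdiction", "legal_status"},  # e.g. a 1983 action
}

def qualifies(record, sense):
    """True if the record carries every property the sense requires."""
    return SENSES[sense].issubset(record)

municipality = {"name": "Springfield", "jurisdiction": "US",
                "legal_status": "municipal corporation"}

print(qualifies(municipality, "legal:Person"))
print(qualifies({"name": "Arfon"}, "legal:Person"))
print(qualifies({"name": "Arfon"}, "foaf:Person"))
```

Under this scheme a municipality can be a “person” for legal purposes without being one in the FOAF sense, and a reader can see exactly why.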

How far you need to go is on a case by case basis. For apps that display “AI” by tracking you and pushing more ads your way, FOAF may well be sufficient. For those of us with non-advertising driven interests, other diversions may await.

Elasticsearch, RethinkDB and the Semantic Web

Wednesday, June 11th, 2014

Elasticsearch, RethinkDB and the Semantic Web by Michel Dumontier.

From the post:

Everyone is handling big data nowadays, or at least, so it seems. Hadoop is very popular among the Big Data wranglers and it is often mentioned as the de facto solution. I have dabbled into working with Hadoop over the past years and found that: yes, it is very suitable for certain kinds of data mining/analysis and for those it provides high data crunching throughput, but, no, it cannot answer queries quickly and you cannot port every algorithm into Hadoop’s map/reduce paradigm. I have since turned to Elasticsearch and more recently to RethinkDB. It is a joy to work with the latter and it performs faceting just as well as Elasticsearch for the benchmark data that I used, but still permits me to carry out more complex data mining and analysis too.

The story here describes the data that I am working with a bit, it shows how it can be turned into a data format that both Elasticsearch and RethinkDB understand, how the data is being loaded and indexed, and finally, how to get some facets out of the systems.

Interesting post on biomedical data in RDF N-Quads format which is converted into JSON and then processed with ElasticSearch and RethinkDB.

I first saw this in a tweet by Joachim Baran.

Data as Code. Code as Data:…

Tuesday, May 27th, 2014

Data as Code. Code as Data: Tighter Semantic Web Development Using Clojure by Frédérick Giasson.

From the post:

I have been professionally working in the field of the Semantic Web for more than 7 years now. I have been developing all kind of Ontologies. I have been integrating all kind of datasets from various sources. I have been working with all kind of tools and technologies using all kind of technologies stacks. I have been developing services and user interfaces of all kinds. I have been developing a set of 27 web services packaged as the Open Semantic Framework and re-implemented the core Drupal modules to work with RDF data as I wanted them to. I did write hundreds of thousands of lines of code with one goal in mind: leveraging the ideas and concepts of the Semantic Web to make me, other developers, ontologists and data-scientists work more accurately and efficiently with any kind of data.

However, even after doing all that, I was still feeling a void: a disconnection between how I was thinking about data and how I was manipulating it using the programming languages I was using, the libraries I was leveraging and the web services that I was developing. Everything is working, and is working really well; I did gain a lot of productivity in all these years. However, I was still feeling that void, that disconnection between the data and the programming language.

Frédérick promises to walk us through serializing RDF data into Clojure code.

Doesn’t that sound interesting?

Hmmm, will we find that data has semantics? And subjects that the data represents?

Can’t say, don’t know. But I am very interested in finding out how far Frédérick will go with “Data as Code. Code as Data.”

Crossing the Chasm…

Tuesday, May 27th, 2014

Crossing the Chasm with Semantic Technology by Marin Dimitrov.

From the description:

After more than a decade of active efforts towards establishing Semantic Web, Linked Data and related standards, the verdict of whether the technology has delivered its promise and has proven itself in the enterprise is still unclear, despite the numerous existing success stories.

Every emerging technology and disruptive innovation has to overcome the challenge of “crossing the chasm” between the early adopters, who are just eager to experiment with the technology potential, and the majority of the companies, who need a proven technology that can be reliably used in mission critical scenarios and deliver quantifiable cost savings.

Succeeding with a Semantic Technology product in the enterprise is a challenging task involving both top quality research and software development practices, but most often the technology adoption challenges are not about the quality of the R&D but about successful business model generation and understanding the complexities and challenges of the technology adoption lifecycle by the enterprise.

This talk will discuss topics related to the challenge of “crossing the chasm” for a Semantic Technology product and provide examples from Ontotext’s experience of successfully delivering Semantic Technology solutions to enterprises.

I differ from Dimitrov on some of the details but a solid +1! for slides 29 and 30.

I think you will recognize immediate similarity, at least on slide 29, to some of the promotions for topic maps.

Of course, the next question is how to get to slide 30 isn’t it?

Workload Matters: Why RDF Databases Need a New Design

Saturday, May 17th, 2014

Workload Matters: Why RDF Databases Need a New Design by Güneş Aluç, M. Tamer Özsu, and Khuzaima Daudjee.

Abstract:

The Resource Description Framework (RDF) is a standard for conceptually describing data on the Web, and SPARQL is the query language for RDF. As RDF is becoming widely utilized, RDF data management systems are being exposed to more diverse and dynamic workloads. Existing systems are workload-oblivious, and are therefore unable to provide consistently good performance. We propose a vision for a workload-aware and adaptive system. To realize this vision, we re-evaluate relevant existing physical design criteria for RDF and address the resulting set of new challenges.

The authors establish that RDF data management systems are in need of better processing models. However, they mention a “prototype” only in their conclusion and offer no evidence concerning their possible alternatives for RDF processing.

I don’t doubt the need for better RDF processing but I would think the first step would be to determine the goals of RDF processing, separate and apart from the RDF model.

Simply because we conceptualize data as being encoded in “triples,” does not mean that computers must process them as “triples.” They can if it is advantageous but not if there are better processing models.

I first saw this in a tweet by Olaf Hartig.

Is That An “Entity” On Your Webpage?

Sunday, March 30th, 2014

How To Tell Search Engines What “Entities” Are On Your Web Pages by Barbara Starr.

From the post:

Search engines have increasingly been incorporating elements of semantic search to improve some aspect of the search experience — for example, using markup to create enhanced displays in SERPs (as in Google’s rich snippets).

Elements of semantic search are now present at almost all stages of the search process, and the Semantic Web has played a key role. Read on for more detail and to learn how to take advantage of this opportunity to make your web pages more visible in this evolution of search.

The identifications are fairly coarse, that is, you get a pointer (URL) that identifies a subject but no idea why someone picked that URL.

But, we all know how well coarse pointers, document level pointers, have worked for the WWW.

Kinda surprising because we have had sub-document indexing for centuries.

Odd how simply pointing to a text blob suddenly became acceptable.

Think of the efforts by Google and as an attempt to recover indexing as it existed in the centuries before the advent of the WWW.

ZooKeys 50 (2010) Special Issue

Wednesday, January 29th, 2014

Taxonomy shifts up a gear: New publishing tools to accelerate biodiversity research by Lyubomir Penev, et al.

From the editorial:

The principles of Open Access greatly facilitate dissemination of information through the Web where it is freely accessed, shared and updated in a form that is accessible to indexing and data mining engines using Web 2.0 technologies. Web 2.0 turns the taxonomic information into a global resource well beyond the taxonomic community. A significant bottleneck in naming species is the requirement by the current Codes of biological nomenclature ruling that new names and their associated descriptions must be published on paper, which can be slow, costly and render the new information difficult to find. In order to make progress in documenting the diversity of life, we must remove the publishing impediment in order to move taxonomy “from a cottage industry into a production line” (Lane et al. 2008), and to make best use of new technologies warranting the fastest and widest distribution of these new results.

In this special edition of ZooKeys we present a practical demonstration of such a process. The issue opens with a forum paper from Penev et al. (doi: 10.3897/zookeys.50.538) that presents the landscape of semantic tagging and text enhancements in taxonomy. It describes how the content of the manuscript is enriched by semantic tagging and marking up of four exemplar papers submitted to the publisher in three different ways: (i) written in Microsoft Word and submitted as non-tagged manuscript (Stoev et al., doi: 10.3897/zookeys.50.504); (ii) generated from Scratchpads (Blagoderov et al., doi: 10.3897/zookeys.50.506 and Brake and Tschirnhaus, doi: 10.3897/zookeys.50.505); (iii) generated from an author’s database (Taekul et al., doi: 10.3897/zookeys.50.485). The latter two were submitted as XML-tagged manuscript. These examples demonstrate the suitability of the workflow to a range of possibilities that should encompass most current taxonomic efforts. To implement the aforementioned routes for XML mark up in prospective taxonomic publishing, a special software tool (Pensoft Mark Up Tool, PMT) was developed and its features were demonstrated in the current issue. The XML schema used was version #123 of TaxPub, an extension to the Document Type Definitions (DTD) of the US National Library of Medicine (NLM).

A second forum paper from Blagoderov et al. (doi: 10.3897/zookeys.50.539) sets out a workflow that describes the assembly of elements from a Scratchpad taxon page to export a structured XML file. The publisher receives the submission, automatically renders the file into the journal’s layout style as a PDF and transmits it to a selection of referees, based on the key words in the manuscript and the publisher’s database. Several steps, from the author’s decision to submit the manuscript to final publication and dissemination, are automatic. A journal editor first spends time on the submission when the referees’ reports are received, making the decision to publish, modify or reject the manuscript. If the decision is to publish, then PDF proofs are sent back to the author and, when verified, the paper is published both on paper and on-line, in PDF, HTML and XML formats. The original information is also preserved on the original Scratchpad where it may, in due course, be updated. A visitor arriving at the web site by tracing the original publication will be able to jump forward to the current version of the taxon page.

This sounds like the promise of SGML/XML made real doesn’t it?

See the rest of the editorial or ZooKeys 50 for a very good example of XML and semantics in action.

This is a long way from the “related” or “recent” article citations in most publisher interfaces. Thoughts on how to make that change?

JSON-LD and Why I Hate the Semantic Web

Tuesday, January 28th, 2014

JSON-LD and Why I Hate the Semantic Web by Manu Sporny.

From the post:

JSON-LD became an official Web Standard last week. This is after exactly 100 teleconferences typically lasting an hour and a half, fully transparent with text minutes and recorded audio for every call. There were 218+ issues addressed, 2,000+ source code commits, and 3,102+ emails that went through the JSON-LD Community Group. The journey was a fairly smooth one with only a few jarring bumps along the road. The specification is already deployed in production by companies like Google, the BBC, Yandex, Yahoo!, and Microsoft. There is a quickly growing list of other companies that are incorporating JSON-LD. We’re off to a good start.

In the previous blog post, I detailed the key people that brought JSON-LD to where it is today and gave a rough timeline of the creation of JSON-LD. In this post I’m going to outline the key decisions we made that made JSON-LD stand out from the rest of the technologies in this space.

I’ve heard many people say that JSON-LD is primarily about the Semantic Web, but I disagree, it’s not about that at all. JSON-LD was created for Web Developers that are working with data that is important to other people and must interoperate across the Web. The Semantic Web was near the bottom of my list of “things to care about” when working on JSON-LD, and anyone that tells you otherwise is wrong. :P

TL;DR: The desire for better Web APIs is what motivated the creation of JSON-LD, not the Semantic Web. If you want to make the Semantic Web a reality, stop making the case for it and spend your time doing something more useful, like actually making machines smarter or helping people publish data in a way that’s useful to them.


Something to get your blood pumping early in the week.

Although, I don’t think it is healthy for Manu to hold back so much. 😉

Read the comments to the post as well.

Provenance Reconstruction Challenge 2014

Thursday, January 23rd, 2014

Provenance Reconstruction Challenge 2014

Dates:

  • February 17, 2014 Test Data released
  • May 18, 2014 Last day to register for participation
  • May 19, 2014 Challenge Data released
  • June 13, 2014 Provenance Reconstruction Challenge Event at Provenance Week – Cologne Germany

From the post:

While the use of version control systems, workflow engines, provenance aware filesystems and databases is growing, there is still a plethora of data that lacks associated data provenance. To help solve this problem, a number of research groups have been looking at reconstructing the provenance of data using the computational environment in which it resides. This research, however, is still very new in the community. Thus, the aim of the Provenance Reconstruction Challenge is to help spur research into the reconstruction of provenance by providing a common task and datasets for experimentation.

The Challenge

Challenge participants will receive an open data set and corresponding provenance graphs (in W3C PROV format). They will then have several months to work with the data, trying to reconstruct the provenance graphs from the open data set. Three weeks before the challenge face-2-face event the participants will receive a new data set and a gold standard provenance graph. Participants are asked to register before the challenge dataset is released and to prepare a short description of their system to be placed online after the event.

The Event

At the event, we will have presentations of the results and the systems as well as a group conversation around the techniques used. The event will result in a joint report about techniques for reproducing provenance and paths forward.

For further information on the W3C PROV format:

Provenance Working Group

PROV at Semantic Web Wiki.

PROV Implementation Report (60 implementations as of 30 April 2013)

I first saw this in a tweet by Paul Groth.

Hash-URIs for Verifiable, Immutable, and Permanent Digital Artifacts

Thursday, January 23rd, 2014

Hash-URIs for Verifiable, Immutable, and Permanent Digital Artifacts by Tobias Kuhn and Michel Dumontier.

Abstract:

To make digital resources on the web verifiable, immutable, and permanent, we propose a technique to include cryptographic hash values in URIs. We show how such hash-URIs can be used for approaches like nanopublications to make not only specific resources but their entire reference trees verifiable. Digital resources can be identified not only on the byte level but on more abstract levels, which means that resources keep their hash values even when presented in a different format. Our approach sticks to the core principles of the web, namely openness and decentralized architecture, is fully compatible with existing standards and protocols, and can therefore be used right away. Evaluation of our reference implementations shows that these desired properties are indeed accomplished by our approach, and that it remains practical even for very large files.

I rather like the authors’ summary of their approach:

our proposed approach boils down to the idea that references can be made completely unambiguous and verifiable if they contain a hash value of the referenced digital artifact.

Hash-URIs (assuming proper generation) would be completely unambiguous and verifiable for digital artifacts.
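The mechanics are simple enough to sketch with the standard library. The URI prefix and naming scheme below are my own placeholders, not the paper’s exact syntax.

```python
import hashlib

# Sketch of the core idea: append a cryptographic hash of an artifact's
# bytes to a URI, so the reference both names the resource and lets any
# holder verify the retrieved content. The "example.org" prefix is a
# placeholder, not the scheme proposed in the paper.

def hash_uri(content: bytes, prefix="http://example.org/r/"):
    """Mint a hash-URI for a digital artifact."""
    return prefix + hashlib.sha256(content).hexdigest()

def verify(uri, content: bytes, prefix="http://example.org/r/"):
    """Re-hash the retrieved bytes and compare with the URI's suffix."""
    return uri == hash_uri(content, prefix)

uri = hash_uri(b"some immutable artifact")
print(verify(uri, b"some immutable artifact"))
print(verify(uri, b"tampered artifact"))
```

Note that the verification needs nothing but the URI and the bytes, which is what makes the reference self-certifying and decentralization-friendly.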

However, the authors fail to notice two important issues with Hash-URIs:

  1. Hash-URIs are not human readable.
  2. Not being human readable means that mappings between Hash-URIs and other references to digital artifacts will be fragile and hard to maintain.

For example,

In prose an author will not say, “As found by “” (from the article).

In some publishing styles, authors will say: “…as a new way of scientific publishing [8].”

In other styles, authors will say: “Computable functions are therefore those “calculable by finite means” (Turing, 1936: 230).”

That is to say, of necessity, there will be a mapping between the unambiguous and verifiable references (UVRs) and the ones used by human authors/readers.

Moreover, should the mapping between UVRs and their human consumable equivalents be lost, recovery is possible but time consuming.

The authors go to some lengths to demonstrate the use of Hash-URIs with RDF files. RDF is one approach among many to digital artifacts.

If the mapping issues between Hash-URIs and other identifiers can be addressed, a more general approach to digital artifacts would make this proposal more viable.

I first saw this in a tweet by Tobias Kuhn.

ISWC, Sydney 2013 (videos)

Tuesday, December 3rd, 2013

12th International Semantic Web Conference (ISWC), Sydney 2013

From the webpage:

ISWC 2013 is the premier international forum for the Semantic Web / Linked Data community. Here, scientists, industry specialists, and practitioners meet to discuss the future of practical, scalable, user-friendly, and game changing solutions.

Detailed information can be found at the ISWC 2013 website.

I count thirty-six (36) videos (including two tutorials).

Some of them are fairly short so suitable for watching while standing in checkout lines. 😉