Archive for the ‘RDF’ Category

PubChemRDF

Friday, January 31st, 2014

PubChemRDF

From the webpage:

Semantic Web technologies are emerging as an increasingly important approach to distribute and integrate scientific data. These technologies include the trio of the Resource Description Framework (RDF), Web Ontology Language (OWL), and SPARQL query language. The PubChemRDF project provides RDF formatted information for the PubChem Compound, Substance, and Bioassay databases.

This document provides detailed technical information (release notes) about the PubChemRDF project. Downloadable RDF data is available on the PubChemRDF FTP Site. Past presentations on the PubChemRDF project are available giving a PubChemRDF introduction and on the PubChemRDF details. The PubChem Blog may provide most recent updates on the PubChemRDF project. Please note that the PubChemRDF is evolving as a function of time. However, we intend for such enhancements to be backwards compatible by adding additional information and annotations.

A Twitter post noted that there are 59 billion triples.

Nothing to sneeze at but I was more impressed with the types of connections at page 8 of ftp://ftp.ncbi.nlm.nih.gov/pubchem/presentations/pubchem_rdf_details.pdf.

I am sure there are others but just on that slide:

  • sio:has_component
  • sio:is_stereoisomer_of
  • sio:is_isotopologue_of
  • sio:has_same_connectivity_as
  • sio:similar_to_by_PubChem_2D_similarity_algorithm
  • sio:similar_to_by_PubChem_3D_similarity_algorithm

Using such annotations, the user could decide on what basis to consider compounds “similar” or not.
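A minimal sketch of that choice, assuming the data has been loaded as plain (subject, predicate, object) tuples. The compound identifiers are hypothetical placeholders; only the predicate names come from the slide. The function name `similar_to` is made up for illustration.

```python
# Hypothetical triples: the compound identifiers are invented for illustration.
# Only the predicate names are taken from the slide.
TRIPLES = [
    ("compound:A", "sio:is_stereoisomer_of", "compound:B"),
    ("compound:A", "sio:is_isotopologue_of", "compound:C"),
    ("compound:A", "sio:has_same_connectivity_as", "compound:B"),
    ("compound:A", "sio:similar_to_by_PubChem_2D_similarity_algorithm", "compound:D"),
    ("compound:B", "sio:similar_to_by_PubChem_3D_similarity_algorithm", "compound:D"),
]


def similar_to(subject, accepted_predicates, triples=TRIPLES):
    """Return compounds linked to `subject` by any predicate the user accepts."""
    return sorted({o for s, p, o in triples
                   if s == subject and p in accepted_predicates})


# A strict reading of "similar": stereoisomers only.
strict = similar_to("compound:A", {"sio:is_stereoisomer_of"})

# A looser reading: stereoisomers, isotopologues, and 2D similarity.
loose = similar_to("compound:A", {
    "sio:is_stereoisomer_of",
    "sio:is_isotopologue_of",
    "sio:similar_to_by_PubChem_2D_similarity_algorithm",
})

print(strict)
print(loose)
```

The same data answers both questions; only the set of accepted predicates changes.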

True, it is non-obvious how I would offer an alternative vocabulary for isotopologue but in this domain, that may not be a requirement.

That we can offer alternative vocabularies for any domain does not mean there is a requirement for alternative vocabularies in any particular domain.

A great source of data!

I first saw this in a tweet by Paul Groth.

Storing and querying RDF in Neo4j

Sunday, January 26th, 2014

Storing and querying RDF in Neo4j by Bob DuCharme.

From the post:

In the typical classification of NoSQL databases, the “graph” category is one that was not covered in the “NoSQL Databases for RDF: An Empirical Evaluation” paper that I described in my last blog entry. (Several were “column-oriented” databases, which I always thought sounded like triple stores—the “table” part of they way people describe these always sounded to me like a stretched metaphor designed to appeal to relational database developers.) A triplestore is a graph database, and Brazilian software developer Paulo Roberto Costa Leite has developed a SPARQL plugin for Neo4j, the most popular of the NoSQL graph databases. This gave me enough incentive to install Neo4j and play with it and the SPARQL plugin.

As Bob points out, the plugin isn’t ready for prime time but I mention it in case you are interested in yet another storage solution for RDF.

Stardog 2.1 Hits Scalability Breakthrough

Thursday, January 23rd, 2014

Stardog 2.1 Hits Scalability Breakthrough

From the post:

The new release (2.1) of Stardog, a leading RDF database, hits new scalability heights with a 50-fold increase over previous versions. Using commodity server hardware at the $10,000 price point, Stardog can manage, query, search, and reason over datasets as large as 50B RDF triples.

The new scalability increases put Stardog into contention for the largest semantic technology, linked data, and other graph data enterprise projects. Stardog’s unique feature set, including reasoning and integrity constraint validation, at large scale means it will increasingly serve as the basis for complex software projects.

A 50-fold increase in scalability! That’s impressive!

The post points to Kendall Clark’s blog for the details.

As you may be guessing, better hashing and memory usage were some of the major keys to the speedup.

Hash-URIs for Verifiable, Immutable, and Permanent Digital Artifacts

Thursday, January 23rd, 2014

Hash-URIs for Verifiable, Immutable, and Permanent Digital Artifacts by Tobias Kuhn and Michel Dumontier.

Abstract:

To make digital resources on the web verifiable, immutable, and permanent, we propose a technique to include cryptographic hash values in URIs. We show how such hash-URIs can be used for approaches like nanopublications to make not only specific resources but their entire reference trees verifiable. Digital resources can be identified not only on the byte level but on more abstract levels, which means that resources keep their hash values even when presented in a different format. Our approach sticks to the core principles of the web, namely openness and decentralized architecture, is fully compatible with existing standards and protocols, and can therefore be used right away. Evaluation of our reference implementations shows that these desired properties are indeed accomplished by our approach, and that it remains practical even for very large files.

I rather like the authors’ summary of their approach:

our proposed approach boils down to the idea that references can be made completely unambiguous and verifiable if they contain a hash value of the referenced digital artifact.

Hash-URIs (assuming proper generation) would be completely unambiguous and verifiable for digital artifacts.
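To make the idea concrete, here is a rough sketch that hashes the raw bytes of an artifact and appends the result to a URI. It is not the authors' algorithm (the paper also covers hashing at more abstract levels than bytes). The function names, the choice of SHA-256, and the URL-safe Base64 encoding are assumptions for illustration only.

```python
import base64
import hashlib


def make_hash_uri(base_uri, content):
    """Append a URL-safe Base64 SHA-256 digest of `content` (bytes) to `base_uri`."""
    digest = hashlib.sha256(content).digest()
    token = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return f"{base_uri}.{token}"


def verify_hash_uri(uri, content):
    """Recompute the hash from `content` and compare it with the one in `uri`."""
    # The token alphabet has no "." so the last dot separates base from hash.
    base_uri, _, _token = uri.rpartition(".")
    return make_hash_uri(base_uri, content) == uri


artifact = b"some digital artifact"
uri = make_hash_uri("http://example.org/r1", artifact)
print(uri)
print(verify_hash_uri(uri, artifact))          # content matches the URI
print(verify_hash_uri(uri, artifact + b"!"))   # any change breaks verification
```

Anyone holding the URI and the bytes can check the reference without trusting the server that delivered them.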

However, the authors fail to notice two important issues with Hash-URIs:

  1. Hash-URIs are not human readable.
  2. Not being human readable means that mappings between Hash-URIs and other references to digital artifacts will be fragile and hard to maintain.

For example,

In prose, an author will not say, “As found by http://example.org/r1.RA5AbXdpz5DcaYXCh9l3eI9ruBosiL5XDU3rxBbBaUO70” (the URI is from the article).

In some publishing styles, authors will say: “…as a new way of scientific publishing [8].”

In other styles, authors will say: “Computable functions are therefore those “calculable by finite means” (Turing, 1936: 230).”

That is to say, of necessity there will be a mapping between the unambiguous and verifiable reference (UVR) and the ones used by human authors/readers.

Moreover, should the mapping between UVRs and their human consumable equivalents be lost, recovery is possible but time consuming.

The authors go to some lengths to demonstrate the use of Hash-URIs with RDF files. RDF is one approach among many to digital artifacts.

If the mapping issues between Hash-URIs and other identifiers can be addressed, a more general approach to digital artifacts would make this proposal more viable.

I first saw this in a tweet by Tobias Kuhn.

JSON-LD Is A W3C Recommendation

Thursday, January 16th, 2014

JSON-LD Is A W3C Recommendation

From the post:

The RDF Working Group has published two Recommendations today:

  • JSON-LD 1.0. JSON is a useful data serialization and messaging format. This specification defines JSON-LD, a JSON-based format to serialize Linked Data. The syntax is designed to easily integrate into deployed systems that already use JSON, and provides a smooth upgrade path from JSON to JSON-LD. It is primarily intended to be a way to use Linked Data in Web-based programming environments, to build interoperable Web services, and to store Linked Data in JSON-based storage engines.
  • JSON-LD 1.0 Processing Algorithms and API. This specification defines a set of algorithms for programmatic transformations of JSON-LD documents. Restructuring data according to the defined transformations often dramatically simplifies its usage. Furthermore, this document proposes an Application Programming Interface (API) for developers implementing the specified algorithms.

It would make a great question on a markup exam to ask whether JSON reminded you more of the “Multicode Basic Concrete Syntax” or a “Variant Concrete Syntax?” For either answer, explain.

In any event, you will be encountering JSON-LD so these recommendations will be helpful.
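For anyone who has not seen one, here is a small JSON-LD document built as an ordinary Python dictionary. The example.org identifiers are hypothetical. The `expand_terms` helper is a naive illustration of what the `@context` is for; it is not the expansion algorithm defined in the Processing Algorithms and API Recommendation.

```python
import json

# An ordinary JSON object plus an "@context" that maps short keys to IRIs.
doc = {
    "@context": {
        "name": "http://schema.org/name",
        "homepage": {"@id": "http://schema.org/url", "@type": "@id"},
    },
    "@id": "http://example.org/people/alice",  # hypothetical identifier
    "name": "Alice",
    "homepage": "http://example.org/alice/",
}


def expand_terms(document):
    """Naively swap context terms for their IRIs (NOT the W3C expansion algorithm)."""
    context = document.get("@context", {})
    expanded = {}
    for key, value in document.items():
        if key == "@context":
            continue
        mapping = context.get(key, key)  # keywords like "@id" pass through unchanged
        iri = mapping["@id"] if isinstance(mapping, dict) else mapping
        expanded[iri] = value
    return expanded


print(json.dumps(expand_terms(doc), indent=2))
```

Systems that ignore the `@context` still see plain JSON, which is the "smooth upgrade path" the announcement mentions.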

rrdf 2.0: Updates, some fixes, and a preprint

Sunday, January 5th, 2014

rrdf 2.0: Updates, some fixes, and a preprint by Egon Willighagen.

From the post:

It all started 3.5 years ago with a question on BioStar: how can one import RDF into R and because lack of an answer, I hacked up rrdf. Previously, I showed two examples and a vignette. Apparently, it was a niche, and I received good feedback. And it is starting to get cited in literature, e.g. by Vissoci et al.  Furthermore, I used it in the ropenphacts package so when I write that up, I like to have something to refer people to for detail about the used rrdf package.

Thus, during the x-mas holidays I wrote up what I had in my mind, resulting in this preprint on the PeerJ PrePrints server, for you to comment on.

An RDF extraction tool for your toolkit.

RDF Nostalgia for the Holidays

Tuesday, December 17th, 2013

Three RDF First Public Working Drafts Published

From the announcement:

  • RDF 1.1 Primer, which explains how to use this language for representing information about resources in the World Wide Web.
  • RDF 1.1: On Semantics of RDF Datasets, which presents some issues to be addressed when defining a formal semantics for datasets, as they have been discussed in the RDF Working Group, and specify several semantics in terms of model theory, each corresponding to a certain design choice for RDF datasets.
  • What’s New in RDF 1.1

The drafts, which may someday become W3C Notes, could be helpful.

There isn’t as much RDF loose in the world as COBOL but what RDF does exist will keep the need for RDF instructional materials alive.

Scalable Property and Hypergraphs in RDF

Wednesday, November 6th, 2013

From the description:

There is a misconception that Triple Stores are not ‘true’ graph databases because they supposedly do not support Property Graphs and Hypergraphs.

We will demonstrate that Property and Hypergraphs are not only natural to Triple Stores and RDF but allow for potentially even more powerful graph models than non-RDF approaches.

AllegroGraph defends their implementation of Triple Stores as both property and hypergraphs.

This is the second story I have heard in two days (see also A Letter Regarding Native Graph Databases) that is based upon an unnamed vendor trash talking other graph databases.

Are graph databases catching on enough for that kind of marketing effort?

BTW, AllegroGraph does have a Free Server Edition Download.

Limited to 5 million triples but that should capture your baseball card collection or home recipe book. ;-)

Implementations of RDF 1.1?

Tuesday, November 5th, 2013

W3C Invites Implementations of five Candidate Recommendations for version 1.1 of the Resource Description Framework (RDF)

From the post:

The RDF Working Group today published five Candidate Recommendations for version 1.1 of the Resource Description Framework (RDF), a widespread and stable technology for data interoperability:

  • RDF 1.1 Concepts and Abstract Syntax defines the basics which underlie all RDF syntaxes and systems. It provides for general data interoperability.
  • RDF 1.1 Semantics defines the precise semantics of RDF data, supporting use with a wide range of “semantic” or “knowledge” technologies.
  • RDF 1.1 N-Triples defines a simple line-oriented syntax for serializing RDF data. N-Triples is a minimalist subset of Turtle.
  • RDF 1.1 TriG defines an extension to Turtle (aligned with SPARQL) for handling multiple RDF Graphs in a single document.
  • RDF 1.1 N-Quads defines an extension to N-Triples for handling multiple RDF Graphs in a single document.

All of these technologies are now stable and ready to be widely implemented. Each specification (except Concepts) has an associated Test Suite and includes a link to an Implementation Report showing how various software currently fares on the tests. If you maintain RDF software, please review these specifications, update your software if necessary, and (if relevant) send in test results as explained in the Implementation Report.

RDF 1.1 is a refinement to the 2004 RDF specifications, designed to simplify and improve RDF without breaking existing deployments.
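To show how close the two line-oriented syntaxes are, here is a simplified sketch that writes one statement as N-Triples and as N-Quads. It handles only IRIs and plain string literals; blank nodes, datatypes, language tags, and full character escaping are left out. The function names and the example.org IRIs are assumptions for illustration.

```python
def nt_term(term):
    """Render a term given as ('iri', value) or ('literal', value)."""
    kind, value = term
    if kind == "iri":
        return f"<{value}>"
    # Escape the characters that would break a quoted literal on one line.
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n")
                    .replace("\r", "\\r"))
    return f'"{escaped}"'


def to_ntriples(s, p, o):
    """One triple per line, terminated by ' .'"""
    return f"{nt_term(s)} {nt_term(p)} {nt_term(o)} ."


def to_nquads(s, p, o, graph):
    """Same as N-Triples with a fourth term naming the graph."""
    return f"{nt_term(s)} {nt_term(p)} {nt_term(o)} {nt_term(graph)} ."


s = ("iri", "http://example.org/book1")
p = ("iri", "http://purl.org/dc/terms/title")
o = ("literal", 'RDF "1.1"')
g = ("iri", "http://example.org/graph1")

print(to_ntriples(s, p, o))
print(to_nquads(s, p, o, g))
```

The only difference between the two outputs is the graph name before the final period.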

In case you are curious about where this lies in the W3C standards process, see: W3C Technical Report Development Process.

I never had a reason to ferret it out before but now that I did, I wanted to write it down.

Has it been ten (10) years since RDF 1.0?

I think it has been fifteen (15) since the Semantic Web was going to teach the web to sing or whatever the slogan was.

Like all legacy data, RDF will never go away; later systems will have to account for it.

Like COBOL I suppose.

DynamiTE: Parallel Materialization of Dynamic RDF Data

Wednesday, October 23rd, 2013

DynamiTE: Parallel Materialization of Dynamic RDF Data by Jacopo Urbani, Alessandro Margara, Ceriel Jacobs, Frank van Harmelen, Henri Bal.

Abstract:

One of the main advantages of using semantically annotated data is that machines can reason on it, deriving implicit knowledge from explicit information. In this context, materializing every possible implicit derivation from a given input can be computationally expensive, especially when considering large data volumes.

Most of the solutions that address this problem rely on the assumption that the information is static, i.e., that it does not change, or changes very infrequently. However, the Web is extremely dynamic: online newspapers, blogs, social networks, etc., are frequently changed so that outdated information is removed and replaced with fresh data. This demands for a materialization that is not only scalable, but also reactive to changes.

In this paper, we consider the problem of incremental materialization, that is, how to update the materialized derivations when new data is added or removed. To this purpose, we consider the df RDFS fragment [12], and present a parallel system that implements a number of algorithms to quickly recalculate the derivation. In case new data is added, our system uses a parallel version of the well-known semi-naive evaluation of Datalog. In case of removals, we have implemented two algorithms, one based on previous theoretical work, and another one that is more efficient since it does not require a complete scan of the input.

We have evaluated the performance using a prototype system called DynamiTE, which organizes the knowledge bases with a number of indices to facilitate the query process and exploits parallelism to improve the performance. The results show that our methods are indeed capable to recalculate the derivation in a short time, opening the door to reasoning on much more dynamic data than is currently possible.

Not “lite” reading but refreshing to see the dynamic nature of information being taken as a starting point.

Echoes of re-merging on the entry or deletion of information in a topic map. Yes?
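For readers who have not met semi-naive evaluation, here is a sequential toy version. It covers only two RDFS-style rules (subclass transitivity and type inheritance) and only additions. It is not DynamiTE's code, is not parallel, and does not attempt either removal algorithm. The names `derive` and `materialize` and the sample facts are invented for illustration.

```python
SUB = "rdfs:subClassOf"
TYPE = "rdf:type"


def derive(delta, facts):
    """Apply both rules, requiring at least one premise to come from `delta`."""
    new = set()
    for (s, p, o) in delta:
        for (s2, p2, o2) in facts:
            # (a sub b), (b sub c) => (a sub c), with the delta triple on either side.
            if p == SUB and p2 == SUB:
                if o == s2:
                    new.add((s, SUB, o2))
                if o2 == s:
                    new.add((s2, SUB, o))
            # (x type a), (a sub b) => (x type b), with the delta triple on either side.
            if p == TYPE and p2 == SUB and o == s2:
                new.add((s, TYPE, o2))
            if p == SUB and p2 == TYPE and o2 == s:
                new.add((s2, TYPE, o))
    return new - facts


def materialize(facts, added):
    """Semi-naive loop: only triples derived in the last round feed the next one."""
    facts = set(facts)
    delta = set(added) - facts
    facts |= delta
    while delta:
        delta = derive(delta, facts)
        facts |= delta
    return facts


base = materialize(set(), {
    ("Cat", SUB, "Mammal"),
    ("Mammal", SUB, "Animal"),
    ("tom", TYPE, "Cat"),
})
# Incremental update: one new triple, no recomputation from scratch.
updated = materialize(base, {("Animal", SUB, "LivingThing")})
print(len(base), len(updated))
```

The second call starts from the already materialized set and joins only the new triple (and whatever it produces) against it, which is the point of the incremental approach.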

Source code online: https://github.com/jrbn/dynamite (Java)

I first saw this in a tweet by Marin Dimitrov.

MarkLogic Rolls Out the Red Carpet for…

Thursday, October 10th, 2013

MarkLogic Rolls Out the Red Carpet for Semantic Triples by Alex Woodie.

From the post:

You write a query with great care, and excitedly hit the “enter” button, only to see a bunch of gobbledygook spit out on the screen. MarkLogic says the chances of this happening will decrease thanks to the new RDF Triple Store feature that it formally introduced today with the launch of version 7 of its eponymous NoSQL database.

The capability to store and search semantic triples in MarkLogic 7 is one of the most compelling new features of the new NoSQL database. The concept of semantic triples is central to the Resource Description Framework (RDF) way of storing and searching for information. Instead of relating information in a database using an “entity-relationship” or “class diagram” model, the RDF framework enables links between pieces of data to be searched using the “subject-predicate-object” concept, which more closely corresponds to the way humans think and communicate.

The real power of this approach becomes evident when one considers the hugely disparate nature of information on the Internet. An RDF powered application can build links between different pieces of data, and effectively “learn” from the connections created by the semantic triples. This is the big (and as yet unrealized) pipe dream of the semantic Web.

RDF has been around for a while, and while you probably wouldn’t call it mainstream, there are a handful of applications using this approach. What makes MarkLogic’s approach unique is that it’s storing the semantic triples–the linked data–right inside the main NoSQL database, where it can make use of all the rich data and metadata stored in documents and other semi-structured files that NoSQL databases like MarkLogic are so good at storing.

This approach puts semantic triples right where it can do the most good. “Until now there has been a disconnect between the incredible potential of semantics and the value organizations have been able to realize,” states MarkLogic’s senior vice president of product strategy, Joe Pasqua.

“Managing triples in dedicated triple stores allowed people to see connections, but the original source of that data was disconnected, ironically losing context,” he continues. “By combining triples with a document store that also has built-in querying and APIs for delivery, organizations gain the insights of triples while connecting the data to end users who can search documents with the context of all the facts at their fingertips.”

A couple of things caught my eye in this post.

First, the comment that:

RDF has been around for a while, and while you probably wouldn’t call it mainstream, there are a handful of applications using this approach.

I can’t disagree so why would MarkLogic make RDF support a major feature of this release?

Second, the next sentence reads:

What makes MarkLogic’s approach unique is that it’s storing the semantic triples–the linked data–right inside the main NoSQL database, where it can make use of all the rich data and metadata stored in documents and other semi-structured files that NoSQL databases like MarkLogic are so good at storing.

I am reading that to mean that if you store all the documents in which triples appear, along with the triples, you have more context. Yes?

Trivially true but I am not sure how significant an advantage that would be. Shouldn’t all that “contextual” metadata be included with the triples?
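To make my reading concrete, here is a toy sketch in which each triple carries the identifier of the document it came from, so the source text can be pulled up alongside the fact. This says nothing about how MarkLogic 7 actually does it; the data, the `SourcedTriple` type, and the `with_context` helper are all hypothetical.

```python
from collections import namedtuple

# A triple plus a pointer back to its source document (hypothetical structure).
SourcedTriple = namedtuple("SourcedTriple", "subject predicate object doc_id")

# Made-up document store and triple, for illustration only.
DOCUMENTS = {
    "doc-1": "Quarterly report: Acme acquired Widgets Inc. in March.",
}
TRIPLES = [
    SourcedTriple("Acme", "acquired", "Widgets Inc.", "doc-1"),
]


def with_context(triple, documents=DOCUMENTS):
    """Return the bare triple together with the text of its source document."""
    return {
        "triple": (triple.subject, triple.predicate, triple.object),
        "context": documents.get(triple.doc_id),
    }


for t in TRIPLES:
    print(with_context(t))
```

Whether the context lives next to the triple or is reached through a pointer like `doc_id`, the question stands: what does the document add that the triples could not carry themselves?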

But I haven’t gotten a copy of version 7 so that’s all speculation on my part.

If you have a copy of MarkLogic 7, care to comment?

Thanks!

Stardog 2.0.0 (26 September 2013)

Friday, September 27th, 2013

Stardog 2.0.0 (26 September 2013)

From the docs page:

Introducing Stardog

Stardog is a graph database—fast, lightweight, pure Java storage for mission-critical apps—that supports:

  • the RDF data model
  • SPARQL 1.1 query language
  • HTTP and SNARL protocols for remote access and control
  • OWL 2 and rules for inference and data analytics
  • Java, JavaScript, Ruby, Python, .Net, Groovy, Spring, etc.

New features in 2.0:

I was amused to read in Stardog Rules Syntax:

Stardog supports two different syntaxes for defining rules. The first is native Stardog Rules syntax and is based on SPARQL, so you can re-use what you already know about SPARQL to write rules. Unless you have specific requirements otherwise, you should use this syntax for user-defined rules in Stardog. The second, the de facto standard RDF/XML syntax for SWRL. It has the advantage of being supported in many tools; but it’s not fun to read or to write. You probably don’t want to use it. Better: don’t use this syntax! (emphasis in the original)

Install and play with it over the weekend. It’s a good way to experience RDF and SPARQL.

DBpedia 3.9 released…

Monday, September 23rd, 2013

DBpedia 3.9 released, including wider infobox coverage, additional type statements, and new YAGO and Wikidata links by Christopher Sahnwaldt.

From the post:

we are happy to announce the release of DBpedia 3.9.

The most important improvements of the new release compared to DBpedia 3.8 are:

1. the new release is based on updated Wikipedia dumps dating from March / April 2013 (the 3.8 release was based on dumps from June 2012), leading to an overall increase in the number of concepts in the English edition from 3.7 to 4.0 million things.

2. the DBpedia ontology is enlarged and the number of infobox to ontology mappings has risen, leading to richer and cleaner concept descriptions.

3. we extended the DBpedia type system to also cover Wikipedia articles that do not contain an infobox.

4. we provide links pointing from DBpedia concepts to Wikidata concepts and updated the links pointing at YAGO concepts and classes, making it easier to integrate knowledge from these sources.

The English version of the DBpedia knowledge base currently describes 4.0 million things, out of which 3.22 million are classified in a consistent Ontology, including 832,000 persons, 639,000 places (including 427,000 populated places), 372,000 creative works (including 116,000 music albums, 78,000 films and 18,500 video games), 209,000 organizations (including 49,000 companies and 45,000 educational institutions), 226,000 species and 5,600 diseases.

We provide localized versions of DBpedia in 119 languages. All these versions together describe 24.9 million things, out of which 16.8 million overlap (are interlinked) with the concepts from the English DBpedia. The full DBpedia data set features labels and abstracts for 12.6 million unique things in 119 different languages; 24.6 million links to images and 27.6 million links to external web pages; 45.0 million external links into other RDF datasets, 67.0 million links to Wikipedia categories, and 41.2 million YAGO categories.

Altogether the DBpedia 3.9 release consists of 2.46 billion pieces of information (RDF triples) out of which 470 million were extracted from the English edition of Wikipedia, 1.98 billion were extracted from other language editions, and about 45 million are links to external data sets.

Detailed statistics about the DBpedia data sets in 24 popular languages are provided at Dataset Statistics.

The main changes between DBpedia 3.8 and 3.9 are described below. For additional, more detailed information please refer to the Change Log.

Almost like an early holiday present isn’t it? ;-)

I continue to puzzle over the notion of “extraction.”

Not that I have an alternative but extracting data only kicks the data can one step down the road.

When someone wants to use my extracted data, they are going to extract data from my extraction. And so on.

That seems incredibly wasteful and error-prone.

Enough money is spent doing the ETL shuffle every year that research on ETL avoidance should be a viable proposition.

Context Aware Searching

Thursday, September 19th, 2013

Scaling Up Personalized Query Results for Next Generation of Search Engines

From the post:

North Carolina State University researchers have developed a way for search engines to provide users with more accurate, personalized search results. The challenge in the past has been how to scale this approach up so that it doesn’t consume massive computer resources. Now the researchers have devised a technique for implementing personalized searches that is more than 100 times more efficient than previous approaches.

At issue is how search engines handle complex or confusing queries. For example, if a user is searching for faculty members who do research on financial informatics, that user wants a list of relevant webpages from faculty, not the pages of graduate students mentioning faculty or news stories that use those terms. That’s a complex search.

“Similarly, when searches are ambiguous with multiple possible interpretations, traditional search engines use impersonal techniques. For example, if a user searches for the term ‘jaguar speed,’ the user could be looking for information on the Jaguar supercomputer, the jungle cat or the car,” says Dr. Kemafor Anyanwu, an assistant professor of computer science at NC State and senior author of a paper on the research. “At any given time, the same person may want information on any of those things, so profiling the user isn’t necessarily very helpful.”

Anyanwu’s team has come up with a way to address the personalized search problem by looking at a user’s “ambient query context,” meaning they look at a user’s most recent searches to help interpret the current search. Specifically, they look beyond the words used in a search to associated concepts to determine the context of a search. So, if a user’s previous search contained the word “conservation” it would be associated with concepts likes “animals” or “wildlife” and even “zoos.” Then, a subsequent search for “jaguar speed” would push results about the jungle cat higher up in the results — and not the automobile or supercomputer. And the more recently a concept has been associated with a search, the more weight it is given when ranking results of a new search.

I rather like the contrast of ambiguous searches being resolved with “impersonal techniques.”
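A toy sketch of the "ambient query context" idea as the press release describes it: recent searches contribute concepts, and more recent ones weigh more. It is not the SKI system from the paper, which relies on indexing. The concept tables, the `decay` parameter, and the function name `rank` are all assumptions for illustration.

```python
# Hypothetical associations between earlier search terms and concepts.
CONCEPTS = {
    "conservation": {"animals", "wildlife", "zoos"},
    "benchmark": {"computing"},
}

# Hypothetical interpretations of the ambiguous query "jaguar speed".
INTERPRETATIONS = {
    "jungle cat": {"wildlife"},
    "automobile": {"cars"},
    "supercomputer": {"computing"},
}


def rank(history, decay=0.5):
    """Rank interpretations using `history` (earlier search terms, oldest first)."""
    weights = {}
    for age, term in enumerate(reversed(history)):
        for concept in CONCEPTS.get(term, ()):
            # The most recent search has age 0 and therefore full weight.
            weights[concept] = max(weights.get(concept, 0.0), decay ** age)
    scores = {name: sum(weights.get(c, 0.0) for c in concepts)
              for name, concepts in INTERPRETATIONS.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))


print(rank(["benchmark", "conservation"]))   # "conservation" was searched last
print(rank(["conservation", "benchmark"]))   # "benchmark" was searched last
```

Swapping the order of the two earlier searches swaps which interpretation comes out on top, with no user profile involved.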

The paper, Scaling Concurrency of Personalized Semantic Search over Large RDF Data by Haizhou Fu, Hyeongsik Kim, and Kemafor Anyanwu, has this abstract:

Recent keyword search techniques on Semantic Web are moving away from shallow, information retrieval-style approaches that merely find “keyword matches” towards more interpretive approaches that attempt to induce structure from keyword queries. The process of query interpretation is usually guided by structures in data, and schema and is often supported by a graph exploration procedure. However, graph exploration-based interpretive techniques are impractical for multi-tenant scenarios for large database because separate expensive graph exploration states need to be maintained for different user queries. This leads to significant memory overhead in situations of large numbers of concurrent requests. This limitation could negatively impact the possibility of achieving the ultimate goal of personalizing search. In this paper, we propose a lightweight interpretation approach that employs indexing to improve throughput and concurrency with much less memory overhead. It is also more amenable to distributed or partitioned execution. The approach is implemented in a system called “SKI” and an experimental evaluation of SKI’s performance on the DBPedia and Billion Triple Challenge datasets show orders-of-magnitude performance improvement over existing techniques.

If you are interested in scaling issues for topic maps, note the use of indexing as opposed to graph exploration techniques in this paper.

Also consider mining “discovered” contexts that lead to “better” results from the viewpoint of users. Those could be the seeds for serializing those contexts as topic maps.

Perhaps even directly applicable to work by researchers, librarians, intelligence analysts.

Seasoned searchers use richer contexts in searching than the average user, and if those contexts are captured, they could enrich the search contexts of the average user.

Three RDFa Recommendations Published

Thursday, August 22nd, 2013

Three RDFa Recommendations Published

From the announcement:

  • HTML+RDFa 1.1, which defines rules and guidelines for adapting the RDFa Core 1.1 and RDFa Lite 1.1 specifications for use in HTML5 and XHTML5. The rules defined in this specification not only apply to HTML5 documents in non-XML and XML mode, but also to HTML4 and XHTML documents interpreted through the HTML5 parsing rules.
  • The group also published two Second Editions for RDFa Core 1.1 and XHTML+RDFa 1.1, folding in the errata reported by the community since their publication as Recommendations in June 2012; all changes were editorial.
  • The group also updated the RDFa 1.1 Primer.

The deeper I get into HTML+RDFa 1.1, the more I think a random RDFa generator would be an effective weapon against government snooping.

Something copies some percentage of your text, places it in a comment, and generates random RDFa 1.1 markup for it, thus: <!-- your content + RDFa -->.
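In the same tongue-in-cheek spirit, a rough sketch of such a generator. The `vocab`, `typeof`, and `property` attributes are real RDFa attributes; the function name, the term lists, and the 30% default are made up for illustration, and the output is deliberately meaningless.

```python
import html
import random
import re

# Assumed term lists, chosen arbitrarily for the noise.
PROPERTIES = ["name", "description", "url", "keywords"]
TYPES = ["Thing", "CreativeWork", "Person", "Place"]


def random_rdfa_comment(text, fraction=0.3, rng=None):
    """Wrap a random sample of words from `text` in random RDFa, inside a comment."""
    rng = rng or random.Random()
    words = text.split()
    if not words:
        return "<!-- -->"
    count = max(1, int(len(words) * fraction))
    spans = []
    for word in rng.sample(words, count):
        # Escape markup and collapse "--", which may not appear inside a comment.
        safe = re.sub(r"-{2,}", "-", html.escape(word))
        prop = rng.choice(PROPERTIES)
        spans.append(f'<span property="{prop}">{safe}</span>')
    kind = rng.choice(TYPES)
    body = " ".join(spans)
    return f'<!-- <div vocab="http://schema.org/" typeof="{kind}">{body}</div> -->'


print(random_rdfa_comment("the quick brown fox jumps over the lazy dog again"))
```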

Improves the stats for the usage of RDFa 1.1 and if the government tries to follow all the RDFa 1.1 rules, well, let’s just say they will have less time for other mischief. ;-)

Wikidata RDF export available [And a tale of “part of.”]

Tuesday, August 13th, 2013

Wikidata RDF export available by Markus Krötzsch.

From the post:

I am happy to report that an initial, yet fully functional RDF export for Wikidata is now available. The exports can be created using the wda-export-data.py script of the wda toolkit [1]. This script downloads recent Wikidata database dumps and processes them to create RDF/Turtle files. Various options are available to customize the output (e.g., to export statements but not references, or to export only texts in English and Wolof). The file creation takes a few (about three) hours on my machine depending on what exactly is exported.

Wikidata (homepage)

WikiData:Database download.

I read an article about combining data released under different licenses earlier today. No problems here because the data is released under the Creative Commons CC0 License. Watch for content in other namespaces, though. Different licensing may apply.

To run the Python script wda-export-data.py I had to install Python-bitarray. I mention it in case you get an error message saying it is missing.

Use the data with caution.

The entry for Wikipedia reports in part:

part of     List of Wikimedia projects

If you follow “part of” you will find:

this item is a part of that item

Also known as:

section of
system of
subsystem of
subassembly of
sub-system of
sub-assembly of
merged into
contained within
assembly of
within a set

“[P]art of” covers enough semantic range to return Google-like results (bad).

Not to mention that as a subject, I think “Wikipedia” is a bit more than an entry in a list.

Don’t you?

Fast Graph Kernels for RDF

Tuesday, July 30th, 2013

Fast Graph Kernels for RDF

From the post:

As a complement to two papers that we will present at the ECML/PKDD 2013 conference in Prague in September we created a webpage with additional material.

The first paper: “A Fast Approximation of the Weisfeiler-Lehman Graph Kernel for RDF Data” was accepted into the main conference and the second paper: “A Fast and Simple Graph Kernel for RDF” was accepted at the DMoLD workshop.

We include links to the papers, to the software and to the datasets used in the experiments, which are stored in figshare. Furthermore, we explain how to rerun the experiments from the papers using a precompiled JAR file, to make the effort required as minimal as possible.

Kudos to the authors for enabling others to duplicate their work! https://github.com/Data2Semantics/d2s-tools

Interesting to think of processing topics as sub-graphs consisting only of the subject identity properties. Deferring processing of other properties until the topic is requested.

The Problem with RDF and Nuclear Power

Tuesday, June 25th, 2013

The Problem with RDF and Nuclear Power by Manu Sporny.

Manu starts his post:

Full disclosure: I am the chair of the RDFa Working Group, the JSON-LD Community Group, a member of the RDF Working Group, as well as other Semantic Web initiatives. I believe in this stuff, but am critical about the path we’ve been taking for a while now.

(…)

RDF shares a number of these similarities with nuclear power. RDF is one of the best data modeling mechanisms that humanity has created. Looking into the future, there is no equally-powerful, viable alternative. So, why has progress been slow on this very exciting technology? There was no public mis-information campaign, so where did this negative view of RDF come from?

In short, RDF/XML was the Semantic Web’s 3 Mile Island incident. When it was released, developers confused RDF/XML (bad) with the RDF data model (good). There weren’t enough people and time to counter-act the negative press that RDF was receiving as a result of RDF/XML and thus, we are where we are today because of this negative perception of RDF. Even Wikipedia’s page on the matter seems to imply that RDF/XML is RDF. Some purveyors of RDF think that the public perception problem isn’t that bad. I think that when developers hear RDF, they think: “Not in my back yard”.

The solution to this predicament: Stop mentioning RDF and the Semantic Web. Focus on tools for developers. Do more dogfooding.

Over the years I have become more and more agnostic towards data models.

The real question for any data model is whether it fits your requirements. What other test would you have?

When merging data held in different data models, or in data models that don’t recognize the same subject identified differently, subject identity and its management come into play.

Subject identity and its management is not an area that has only one answer for any particular problem.

Manu does have concrete suggestions for how to advance topic maps, either as a practice of subject identity or a particular data model:

  1. The message shouldn’t be about the technology. It should be about the problems we have today and a concrete solution on how to address those problems.
  2. Demonstrate real value. Stop talking about the beauty of RDF, theoretical value, or design. Deliver production-ready, open-source software tools.
  3. Build a network of believers by spending more of your time working with Web developers and open-source projects to convince them to publish Linked Data. Dogfood our work.

A topic map version of those suggestions:

  1. The message shouldn’t be about the technology. It should be about the problems we have today and a concrete solution on how to address those problems.
  2. Demonstrate real value. Stop talking about the beauty of topic maps, theoretical value, or design. Deliver high quality content from merging diverse data sources. (Tools will take care of themselves if the content is valuable enough.)
  3. Build a network of customers by spending more of your time using topic maps to distinguish your content from content from the average web sewer.

As an information theorist I should be preaching to myself. Yes?

;-)

As the semantic impedance of the “Semantic Web,” “big data,” “NSA Data Cloud,” increases, the opportunities for competitive, military, industrial advantage from reliable semantic integration will increase.

Looking for showcase opportunities.

Suggestions?

Glimmer

Thursday, June 20th, 2013

Glimmer: An RDF Search Engine

New RDF search engine from Yahoo, built on Hadoop (0.23) and MG4J.

I first saw this in a tweet by Yves Raimond.

The best part was being pointed to the MG4J project, which I haven’t looked at in a year or more.

More news on that tomorrow!

Hafslund Sesam — an archive on semantics

Thursday, June 13th, 2013

Hafslund Sesam — an archive on semantics by Lars Marius Garshol and Axel Borge.

Abstract:

Sesam is an archive system developed for Hafslund, a Norwegian energy company. It achieves the often-sought but rarely-achieved goal of automatically enriching metadata by using semantic technologies to extract and integrate business data from business applications. The extracted data is also indexed with a search engine together with the archived documents, allowing true enterprise search.

A curious paper that requires careful reading.

Since the paper makes technology choices, it’s only appropriate to start with the requirements:

The system must handle 1000 users, although not necessarily simultaneously.

Initial calculations of data size assumed 1.4 million customers and 1 million electric meters with 30-50 properties each. Including various other data gave a rough estimate on the order of 100 million statements.

The archive must be able to receive up to 2 documents per second over an interval of many hours, in order to handle about 100,000 documents a day during peak periods. The documents would mostly be paper forms recording electric meter readings.

To inherit metadata tags automatically requires running queries to achieve transitive closure. Assuming on average 10 queries for each document, the system must be able to handle 20 queries per second on 100 million statements.
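A quick back-of-the-envelope check of those figures, as a sketch. The variable names are mine, and treating customers and meters as the only entities is an assumption (the paper says “various other data” was included as well).

```python
# Figures quoted in the requirements above.
customers = 1_400_000
meters = 1_000_000
props_low, props_high = 30, 50
docs_per_second = 2
queries_per_doc = 10
docs_per_peak_day = 100_000

# Statement count from customers and meters alone (assumption: one
# statement per property; other data is ignored here).
entities = customers + meters
print(entities * props_low, entities * props_high)  # 72000000 120000000

# Query load implied by the ingest rate.
print(docs_per_second * queries_per_doc)            # 20

# Hours of sustained ingest needed for a peak day.
print(round(docs_per_peak_day / docs_per_second / 3600, 1))  # 13.9
```

So the “order of 100 million statements” and “20 queries per second” figures follow directly from the stated inputs, and a peak day means roughly fourteen hours of sustained ingest.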

In the next section, the authors concede that the fourth requirement, “RDF data integration,” was unrealistic, so it was dropped:

The canonical approach to RDF data integration is currently query federation of SPARQL queries against a set of heterogeneous data sources, often using R2RML. Given the size of the data set, the generic nature of the transitive closure queries, and the number of data sources to be supported, we considered achieving 20 queries per second with query federation unrealistic.

Which leaves only:

The system must handle 1000 users, although not necessarily simultaneously.

Initial calculations of data size assumed 1.4 million customers and 1 million electric meters with 30-50 properties each. Including various other data gave a rough estimate on the order of 100 million statements.

The archive must be able to receive up to 2 documents per second over an interval of many hours, in order to handle about 100,000 documents a day during peak periods. The documents would mostly be paper forms recording electric meter readings.

as the requirements to be met.

I mention that because of the following technology choice statement:

To write generic code we must use a schemaless data representation, which must also be standards-based. The only candidates were Topic Maps [ISO13250-2] and RDF. The available Topic Maps implementations would not be able to handle the query throughput at the data sizes required. Testing of the Virtuoso triple store indicated that it could handle the workload just fine. RDF thus appeared to be the only suitable technology.

But there is no query throughput requirement. At least not for the storage mechanism. For deduplication in the ERP system (section 3.5), the authors choose to follow neither topic maps nor RDF but a much older technology, record linkage.

The other query mechanism is a Recommind search engine, which is reported to be unable to index and search at the same time (section 4.1).

If I am reading the paper correctly, data is stored as received from the various sources and owl:sameAs statements are used to map data to the archive’s schema.

I puzzle at that point because RDF is simply a format and OWL a means to state a mapping statement.

Given the semantic vagaries of owl:sameAs (Semantic Drift and Linked Data/Semantic Web), I have to wonder about the longer term maintenance of owl:sameAs mappings.

There is no expression of a reason for “sameAs,” a reason that might prompt a future maintainer of the system to follow, or not, some particular “sameAs.”
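One way to picture what is missing, as a sketch only. The paper does not describe any such structure, and the `Mapping` class, its field names and the sample identifiers below are all hypothetical. Each mapping carries the basis on which it was asserted, so a later maintainer can choose which ones to follow.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Mapping:
    source: str   # identifier in the source system (made up)
    target: str   # identifier in the archive (made up)
    reason: str   # why the two are considered the same


mappings: List[Mapping] = [
    Mapping("erp:customer/42", "archive:customer/42",
            "matching customer number"),
    Mapping("crm:person/7", "archive:customer/42",
            "name and address similarity"),
]


def follow(all_mappings: List[Mapping],
           accepted_reasons: Set[str]) -> List[Mapping]:
    """Keep only mappings whose stated reason the maintainer accepts."""
    return [m for m in all_mappings if m.reason in accepted_reasons]


trusted = follow(mappings, {"matching customer number"})
print([m.source for m in trusted])  # ['erp:customer/42']
```

A bare owl:sameAs statement has nowhere to put the `reason` field, which is the maintenance problem in a nutshell.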

Still, the project was successful and that counts for more than using any single technology to the exclusion of all others.

The comments on the performance of topic maps options do make me mindful of the lack of benchmark data sets for topic maps.

Rya: A Scalable RDF Triple Store for the Clouds

Tuesday, June 11th, 2013

Rya: A Scalable RDF Triple Store for the Clouds by Roshan Punnoose, Adina Crainiceanu, and David Rapp.

Abstract:

Resource Description Framework (RDF) was designed with the initial goal of developing metadata for the Internet. While the Internet is a conglomeration of many interconnected networks and computers, most of today’s best RDF storage solutions are confined to a single node. Working on a single node has significant scalability issues, especially considering the magnitude of modern day data. In this paper we introduce a scalable RDF data management system that uses Accumulo, a Google Bigtable variant. We introduce storage methods, indexing schemes, and query processing techniques that scale to billions of triples across multiple nodes, while providing fast and easy access to the data through conventional query mechanisms such as SPARQL. Our performance evaluation shows that in most cases, our system outperforms existing distributed RDF solutions, even systems much more complex than ours.

Based on Accumulo (open-source NoSQL database by the NSA).

Interesting re-thinking of indexing of triples.
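As I read the paper, the core of it is storing each triple three times, under SPO, POS and OSP orderings, so that any pattern with a bound term becomes a range scan over sorted keys. The sketch below imitates that with sorted Python lists. The separator, the function names and the sample data are my own assumptions, and nothing here touches Accumulo.

```python
from bisect import bisect_left

SEP = "\x00"  # key delimiter assumed for this sketch


def build_indexes(triples):
    """Store every triple under three sorted key orderings."""
    return {
        "spo": sorted(SEP.join((s, p, o)) for s, p, o in triples),
        "pos": sorted(SEP.join((p, o, s)) for s, p, o in triples),
        "osp": sorted(SEP.join((o, s, p)) for s, p, o in triples),
    }


def prefix_scan(index, *parts):
    """Range scan: return key tuples starting with one or two bound terms."""
    prefix = SEP.join(parts) + SEP
    start = bisect_left(index, prefix)
    found = []
    for key in index[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so the matching range has ended
        found.append(tuple(key.split(SEP)))
    return found


idx = build_indexes([
    ("alice", "knows", "bob"),
    ("alice", "likes", "tea"),
    ("carol", "knows", "bob"),
])

# "Who knows bob?" has predicate and object bound, so scan the POS index.
print([key[2] for key in prefix_scan(idx["pos"], "knows", "bob")])
# ['alice', 'carol']
```

The trade is three times the storage in exchange for never needing a full scan when at least one term is bound.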

Future work includes owl:sameAs, owl:inverseOf and other inferencing rules.

Certainly a project to watch.

Coming soon: new, expanded edition of “Learning SPARQL”

Monday, June 3rd, 2013

Coming soon: new, expanded edition of “Learning SPARQL” by Bob DuCharme.

From the post:

55% more pages! 23% fewer mentions of the semantic web!

I’m very pleased to announce that O’Reilly will make the second, expanded edition of my book Learning SPARQL available sometime in late June or early July. The early release “raw and unedited” version should be available this week.

I wonder if Bob is going to start an advertising trend with “fewer mentions of the semantic web?”

;-)

Looking forward to the update!

Not that I care about SPARQL all that much but I’ll learn something.

Besides, I have liked Bob’s writing style from back in the SGML days.

4Store

Saturday, June 1st, 2013

4Store

From the about page:

4store is a database storage and query engine that holds RDF data. It has been used by Garlik as their primary RDF platform for three years, and has proved itself to be robust and secure.

4store’s main strengths are its performance, scalability and stability. It does not provide many features over and above RDF storage and SPARQL queries, but if you are looking for a scalable, secure, fast and efficient RDF store, then 4store should be on your shortlist.

This was mentioned in a report by Bryan Thompson so I wanted to capture a link to the original site.

The latest tarball is dated 10 July 2012.

Big Data RDF Store Benchmarking Experiences

Friday, May 31st, 2013

Big Data RDF Store Benchmarking Experiences by Peter Boncz.

From the post:

Recently we were able to present new BSBM results, testing the RDF triple stores Jena TDB, BigData, BIGOWLIM and Virtuoso on various data sizes. These results extend the state-of-the-art in various dimensions:

  • scale: this is the first time that RDF store benchmark results on such a large size have been published. The previous published BSBM results published were on 200M triples, the 150B experiments thus mark a 750x increase in scale.
  • workload: this is the first time that results on the Business Intelligence (BI) workload are published. In contrast to the Explore workload, which features short-running “transactional” queries, the BI workload consists of queries that go through possibly billions of triples, grouping and aggregating them (using the respective functionality, new in SPARQL1.1).
  • architecture: this is the first time that RDF store technology with cluster functionality has been publicly benchmarked.

Clusters are great but also difficult to use.

Peter’s post is one of those rare ones that exposes the second half of that statement.

Impressive hardware and results.

Given the hardware and effort required, are we pursuing “big data” for the sake of “big data?”

Not just where RDF is concerned but in general?

Shouldn’t the first question always be: What is the relevant data?

If you can’t articulate the relevant data, isn’t that a commentary on your understanding of the problem?

A Trillion Triples in Perspective

Saturday, May 18th, 2013

Mozart Meets MapReduce by Isaac Lopez.

From the post:

Big data has been around since the beginning of time, says Thomas Paulmichl, founder and CEO of Sigmaspecto, who says that what has changed is how we process the information. In a talk during Big Data Week, Paulmichl encouraged people to open up their perspective on what big data is, and how it can be applied.

During the talk, he admonished people to take a human element into big data. Paulmichl demonstrated this by examining the work of musical prodigy, Mozart – who Paulmichl noted is appreciated greatly by both music scientists, as well as the common music listener.

“When Mozart makes choices on writing a piece of work, the number of choices that he has and the kind of neural algorithms that his brain goes through to choose things is infinitesimally higher that what we call big data – it’s really small data in comparison,” he said.

Taking Mozart’s The Magic Flute as an example, Paulmichl, discussed the framework that Mozart used to make his choices by examining a music sheet outlining the number of bars, the time signature, the instrument and singer voicing.

“So from his perspective, he sits down, and starts to make what we as data scientists call quantitative choices,” explained Paulmichl. “Do I put a note here, down here, do I use a different instrument; do I use a parallel voicing for different violins – so these are all metrics that his brain has to decide.”

Exploring the mathematics of the music, Paulmichl concluded that in looking at The Magic Flute, Mozart had 4.72391E+21 creative variations (and then some) that he could have taken with the direction of it over the course of the piece. “We’re not talking about a trillion dataset; we’re talking about a sextillion or more,” he says adding that this is a very limited cut of the quantitative choice that his brain makes at every composition point.

“[A] sextillion or more…” puts the question of processing a trillion triples into perspective.
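For scale, a two-line calculation, using only the figure quoted above and taking a trillion as 10^12:

```python
mozart_variations = 4.72391e21   # figure quoted from the talk
trillion_triples = 1e12

ratio = mozart_variations / trillion_triples
print(f"{ratio:.2e}")  # 4.72e+09
```

That is, several billion trillion-triple stores to cover one opera.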

Another musical analogy?

Triples are the one finger version of Jingle Bells*:

*The gap is greater than the video represents but it is still amusing.

Does your analysis/data have one finger subtlety?

A self-updating road map of The Cancer Genome Atlas

Friday, May 17th, 2013

A self-updating road map of The Cancer Genome Atlas by David E. Robbins, Alexander Grüneberg, Helena F. Deus, Murat M. Tanik and Jonas S. Almeida. (Bioinformatics (2013) 29 (10): 1333-1340. doi: 10.1093/bioinformatics/btt141)

Abstract:

Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.

Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.

Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.

Curious how the granularity of required semantics and the uniformity of the underlying data set impact the choice of semantic approaches?

And does access to data files present different challenges than say access to research publications in the same field?

Non-Adoption of Semantic Web, Reason #1002

Monday, May 13th, 2013

Kingsley Idehen offers yet another explanation/excuse for non-adoption of the semantic web in On Hybrid Relational Databases. Interview with Kingsley Uyi Idehen by Roberto V. Zicari.

The highlight of this interview reads:

The only obstacle to Semantic Web technologies in the enterprise lies in better articulation of the value proposition in a manner that reflects the concerns of enterprises. For instance, the non disruptive nature of Semantic Web technologies with regards to all enterprise data integration and virtualization initiatives has to be the focal point

You may recall Kingsley’s demonstration of the non-complexity of authoring for the Semantic Web in The Semantic Web Is Failing — But Why? (Part 3).

Could it be users sense the “lock-in” of RDF/Semantic Web?

Q14. Big Data Analysis: could you connect Virtuoso with Hadoop? How does Virtuoso relate to commercial data analytics platforms, e.g Hadapt, Vertica?

Kingsley Uyi Idehen: You can integrate data managed by Hadoop based ETL workflows via ODBC or Web Services driven by Hadoop clusters that expose RESTful interaction patterns for data access. As for how Virtuoso relates to the likes of Vertica re., analytics, this is about Virtuoso being the equivalent of Vertica plus the added capability of RDF based data management, Linked Data Deployment, and share-nothing clustering. There is no job that Vertica performs that Virtuoso can’t perform.

There are several jobs that Virtuoso can perform that Vertica, VoltDB, Hadapt, and many other NoSQL and NewSQL simply cannot perform with regards to scalable, high-performance RDF data management and Linked Data deployment. Remember, RDF based Linked Data is all about data management and data access without any kind of platform lock-in. Virtuoso locks you into a value proposition (performance and scale) not the platform itself. (emphasis added to last sentence)

It’s comforting to know RDF/Semantic Web “lock-in” has our best interest at heart.

See Kingsley dodging the next question on Virtuoso’s ability to scale:

Q15. Do you also benchmark loading trillion of RDF triples? Do you have current benchmark results? How much time does it take to querying them?

Kingsley Uyi Idehen: As per my earlier responses, there is no shortage of benchmark material for Virtuoso.

The benchmarks are also based on realistic platform configurations unlike the RDBMS patterns of the past which compromised the utility of TPC benchmarks.

Full Disclosure: I haven’t actually counted all of Kingsley’s reasons for non-adoption of the Semantic Web. The number I assign here may be high or low.

The ChEMBL database as linked open data

Thursday, May 9th, 2013

The ChEMBL database as linked open data by Egon L Willighagen, Andra Waagmeester, Ola Spjuth, Peter Ansell, Antony J Williams, Valery Tkachenko, Janna Hastings, Bin Chen and David J Wild. (Journal of Cheminformatics 2013, 5:23 doi:10.1186/1758-2946-5-23).

Abstract:

Background Making data available as Linked Data using Resource Description Framework (RDF) promotes integration with other web resources. RDF documents can natively link to related data, and others can link back using Uniform Resource Identifiers (URIs). RDF makes the data machine-readable and uses extensible vocabularies for additional information, making it easier to scale up inference and data analysis.

Results This paper describes recent developments in an ongoing project converting data from the ChEMBL database into RDF triples. Relative to earlier versions, this updated version of ChEMBL-RDF uses recently introduced ontologies, including CHEMINF and CiTO; exposes more information from the database; and is now available as dereferencable, linked data. To demonstrate these new features, we present novel use cases showing further integration with other web resources, including Bio2RDF, Chem2Bio2RDF, and ChemSpider, and showing the use of standard ontologies for querying.

Conclusions We have illustrated the advantages of using open standards and ontologies to link the ChEMBL database to other databases. Using those links and the knowledge encoded in standards and ontologies, the ChEMBL-RDF resource creates a foundation for integrated semantic web cheminformatics applications, such as the presented decision support.

You already know about the fragility of ontologies so no need to repeat that rant here.

Having material encoded with an ontology, on the other hand, after vetting, can be a source that you wrap with a topic map.

So all that effort isn’t lost.

5 heuristics for writing better SPARQL queries

Wednesday, April 3rd, 2013

5 heuristics for writing better SPARQL queries by Paul Groth.

From the post:

In the context of the Open PHACTS and the Linked Data Benchmark Council projects, Antonis Loizou and I have been looking at how to write better SPARQL queries. In the Open PHACTS project, we’ve been writing super complicated queries to integrate multiple data sources and from experience we realized that different combinations and factors can dramatically impact performance. With this experience, we decided to do something more systematic and test how different techniques we came up with mapped to database theory and worked in practice. We just submitted a paper for review on the outcome. You can find a preprint (On the Formulation of Performant SPARQL Queries) on arxiv.org at http://arxiv.org/abs/1304.0567. The abstract is below. The fancy graphs are in the paper.

Paper Abstract:

The combination of the flexibility of RDF and the expressiveness of SPARQL provides a powerful mechanism to model, integrate and query data. However, these properties also mean that it is nontrivial to write performant SPARQL queries. Indeed, it is quite easy to create queries that tax even the most optimised triple stores. Currently, application developers have little concrete guidance on how to write “good” queries. The goal of this paper is to begin to bridge this gap. It describes 5 heuristics that can be applied to create optimised queries. The heuristics are informed by formal results in the literature on the semantics and complexity of evaluating SPARQL queries, which ensures that queries following these rules can be optimised effectively by an underlying RDF store. Moreover, we empirically verify the efficacy of the heuristics using a set of openly available datasets and corresponding SPARQL queries developed by a large pharmacology data integration project. The experimental results show improvements in performance across 6 state-of-the-art RDF stores.

Just in case you have to query RDF stores as part of your topic map work.

Be aware that: The effectiveness of your SPARQL query will vary based on the RDF Store.

Or as the authors say:

SPARQL, due to its expressiveness, provides a plethora of different ways to express the same constraints, thus, developers need to be aware of the performance implications of the combination of query formulation and RDF Store. This work provides empirical evidence that can help developers in designing queries for their selected RDF Store. However, this raises questions about the effectiveness of writing complex generic queries that work across open SPARQL endpoints available in the Linked Open Data Cloud. We view the optimisation of queries independent of underlying RDF Store technology as a critical area of research to enable the most effective use of these endpoints. (page 21)

I hope their research is successful.

Varying performance, especially as reported in their paper, doesn’t bode well for cross-RDF Store queries.
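To make the point about formulation concrete, here is a toy sketch in Python that is not taken from the paper. It evaluates two orderings of the same pair of triple patterns naively, left to right, over a small in-memory list. The data, the `match` helper and the counting of intermediate bindings are all invented for illustration. Real stores reorder patterns with their own optimisers, which is exactly why results vary from store to store.

```python
# 1000 people, only one of whom works for "acme" (made-up data).
triples = [(f"person{i}", "type", "Person") for i in range(1000)]
triples += [("person7", "worksFor", "acme")]


def match(pattern, binding):
    """Yield extended bindings for one (s, p, o) pattern; '?x' terms are variables."""
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield new


def evaluate(patterns):
    """Naive left-to-right join; returns (results, intermediate binding count)."""
    bindings, intermediate = [{}], 0
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(pattern, b)]
        intermediate += len(bindings)
    return bindings, intermediate


broad_first = [("?p", "type", "Person"), ("?p", "worksFor", "acme")]
selective_first = [("?p", "worksFor", "acme"), ("?p", "type", "Person")]

r1, cost1 = evaluate(broad_first)
r2, cost2 = evaluate(selective_first)
print(r1 == r2, cost1, cost2)  # True 1001 2
```

Same answer, a 500-fold difference in intermediate work, and that is with two patterns over a thousand triples.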