Archive for the ‘RDF’ Category

Wikidata RDF export available [And a tale of "part of."]

Tuesday, August 13th, 2013

Wikidata RDF export available by Markus Krötzsch.

From the post:

I am happy to report that an initial, yet fully functional RDF export for Wikidata is now available. The exports can be created using the wda-export-data.py script of the wda toolkit [1]. This script downloads recent Wikidata database dumps and processes them to create RDF/Turtle files. Various options are available to customize the output (e.g., to export statements but not references, or to export only texts in English and Wolof). The file creation takes a few (about three) hours on my machine depending on what exactly is exported.

Wikidata (homepage)

WikiData:Database download.

I read an article earlier today about combining data released under different licenses. No problems here, because the data is released under the Creative Commons CC0 license. What about content in other namespaces? Different licensing may apply.

To run the Python script wda-export-data.py I had to install python-bitarray, so keep that in mind if you get an error message saying it is missing.
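Before worrying about the semantics, it helps to get a quick feel for what the export actually contains. Here is a minimal rdflib sketch that tallies predicates in a small sample of the Turtle output; the file name is hypothetical, and for a full dump you would want a streaming parser instead.

```python
# Minimal sketch: tally predicates in a (small!) sample of the Turtle export.
# "wikidata-sample.ttl" is a hypothetical file name -- point it at whatever
# wda-export-data.py produced on your machine.
from collections import Counter

from rdflib import Graph

g = Graph()
g.parse("wikidata-sample.ttl", format="turtle")

predicate_counts = Counter(p for _, p, _ in g)
for predicate, count in predicate_counts.most_common(10):
    print(count, predicate)
```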

Use the data with caution.

The entry for Wikipedia reports in part:

part of     List of Wikimedia projects

If you follow “part of” you will find:

this item is a part of that item

Also known as:

section of
system of
subsystem of
subassembly of
sub-system of
sub-assembly of
merged into
contained within
assembly of
within a set

“[P]art of” covers enough semantic range to return Google-like results (bad).

Not to mention that as a subject, I think “Wikipedia” is a bit more than an entry in a list.

Don’t you?

Fast Graph Kernels for RDF

Tuesday, July 30th, 2013

Fast Graph Kernels for RDF

From the post:

As a complement to two papers that we will present at the ECML/PKDD 2013 conference in Prague in September we created a webpage with additional material.

The first paper: “A Fast Approximation of the Weisfeiler-Lehman Graph Kernel for RDF Data” was accepted into the main conference and the second paper: “A Fast and Simple Graph Kernel for RDF” was accepted at the DMoLD workshop.

We include links to the papers, to the software and to the datasets used in the experiments, which are stored in figshare. Furthermore, we explain how to rerun the experiments from the papers using a precompiled JAR file, to make the effort required as minimal as possible.

Kudos to the authors for enabling others to duplicate their work! https://github.com/Data2Semantics/d2s-tools

It is interesting to think of processing topics as sub-graphs consisting only of the subject identity properties, deferring processing of other properties until the topic is requested.
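As a rough sketch of that idea (mine, not the authors’), restricting an rdflib graph to a hand-picked set of identity properties takes only a few lines. The property URIs below are placeholders for whatever you treat as carrying subject identity.

```python
# Minimal sketch: keep only the "subject identity" triples, defer the rest.
from rdflib import Graph, URIRef

IDENTITY_PROPERTIES = {
    URIRef("http://www.w3.org/2002/07/owl#sameAs"),
    URIRef("http://www.w3.org/2004/02/skos/core#exactMatch"),
}

def identity_subgraph(g: Graph) -> Graph:
    """Return a new graph holding only the identity-bearing triples."""
    sub = Graph()
    for s, p, o in g:
        if p in IDENTITY_PROPERTIES:
            sub.add((s, p, o))
    return sub
```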

The Problem with RDF and Nuclear Power

Tuesday, June 25th, 2013

The Problem with RDF and Nuclear Power by Manu Sporny.

Manu starts his post:

Full disclosure: I am the chair of the RDFa Working Group, the JSON-LD Community Group, a member of the RDF Working Group, as well as other Semantic Web initiatives. I believe in this stuff, but am critical about the path we’ve been taking for a while now.

(…)

RDF shares a number of these similarities with nuclear power. RDF is one of the best data modeling mechanisms that humanity has created. Looking into the future, there is no equally-powerful, viable alternative. So, why has progress been slow on this very exciting technology? There was no public mis-information campaign, so where did this negative view of RDF come from?

In short, RDF/XML was the Semantic Web’s 3 Mile Island incident. When it was released, developers confused RDF/XML (bad) with the RDF data model (good). There weren’t enough people and time to counter-act the negative press that RDF was receiving as a result of RDF/XML and thus, we are where we are today because of this negative perception of RDF. Even Wikipedia’s page on the matter seems to imply that RDF/XML is RDF. Some purveyors of RDF think that the public perception problem isn’t that bad. I think that when developers hear RDF, they think: “Not in my back yard”.

The solution to this predicament: Stop mentioning RDF and the Semantic Web. Focus on tools for developers. Do more dogfooding.

Over the years I have become more and more agnostic towards data models.

The real question for any data model is whether it fits your requirements. What other test would you have?

For merging data held in different data models, or in data models that don’t recognize the same subject identified differently, subject identity and its management comes into play.

Subject identity and its management is not an area with only one answer for any particular problem.

Manu does have concrete suggestions for how to advance topic maps, either as a practice of subject identity or a particular data model:

  1. The message shouldn’t be about the technology. It should be about the problems we have today and a concrete solution on how to address those problems.
  2. Demonstrate real value. Stop talking about the beauty of RDF, theoretical value, or design. Deliver production-ready, open-source software tools.
  3. Build a network of believers by spending more of your time working with Web developers and open-source projects to convince them to publish Linked Data. Dogfood our work.

A topic map version of those suggestions:

  1. The message shouldn’t be about the technology. It should be about the problems we have today and a concrete solution on how to address those problems.
  2. Demonstrate real value. Stop talking about the beauty of topic maps, theoretical value, or design. Deliver high quality content from merging diverse data sources. (Tools will take care of themselves if the content is valuable enough.)
  3. Build a network of customers by spending more of your time using topic maps to distinguish your content from content from the average web sewer.

As an information theorist I should be preaching to myself. Yes?

;-)

As the semantic impedance of the “Semantic Web,” “big data,” “NSA Data Cloud,” increases, the opportunities for competitive, military, industrial advantage from reliable semantic integration will increase.

Looking for showcase opportunities.

Suggestions?

Glimmer

Thursday, June 20th, 2013

Glimmer: An RDF Search Engine

New RDF search engine from Yahoo, built on Hadoop (0.23) and MG4j.

I first saw this in a tweet by Yves Raimond.

The best part being pointed to the MG4j project, which I haven’t looked at in a year or more.

More news on that tomorrow!

Hafslund Sesam — an archive on semantics

Thursday, June 13th, 2013

Hafslund Sesam — an archive on semantics by Lars Marius Garshol and Axel Borge.

Abstract:

Sesam is an archive system developed for Hafslund, a Norwegian energy company. It achieves the often-sought but rarely-achieved goal of automatically enriching metadata by using semantic technologies to extract and integrate business data from business applications. The extracted data is also indexed with a search engine together with the archived documents, allowing true enterprise search.

A curious paper that requires careful reading.

Since the paper makes technology choices, it’s only appropriate to start with the requirements:

The system must handle 1000 users, although not necessarily simultaneously.

Initial calculations of data size assumed 1.4 million customers and 1 million electric meters with 30-50 properties each. Including various other data gave a rough estimate on the order of 100 million statements.

The archive must be able to receive up to 2 documents per second over an interval of many hours, in order to handle about 100,000 documents a day during peak periods. The documents would mostly be paper forms recording electric meter readings.

To inherit metadata tags automatically requires running queries to achieve transitive closure. Assuming on average 10 queries for each document, the system must be able to handle 20 queries per second on 100 million statements.

In the next section, the authors concede that the fourth requirement, “RDF data integration,” was unrealistic, so the fourth requirement was dropped:

The canonical approach to RDF data integration is currently query federation of SPARQL queries against a set of heterogeneous data sources, often using R2RML. Given the size of the data set, the generic nature of the transitive closure queries, and the number of data sources to be supported, we considered achieving 20 queries per second with query federation unrealistic.

Which leaves only:

The system must handle 1000 users, although not necessarily simultaneously.

Initial calculations of data size assumed 1.4 million customers and 1 million electric meters with 30-50 properties each. Including various other data gave a rough estimate on the order of 100 million statements.

The archive must be able to receive up to 2 documents per second over an interval of many hours, in order to handle about 100,000 documents a day during peak periods. The documents would mostly be paper forms recording electric meter readings.

as the requirements to be met.

I mention that because of the following technology choice statement:

To write generic code we must use a schemaless data representation, which must also be standards-based. The only candidates were Topic Maps [ISO13250-2] and RDF. The available Topic Maps implementations would not be able to handle the query throughput at the data sizes required. Testing of the Virtuoso triple store indicated that it could handle the workload just fine. RDF thus appeared to be the only suitable technology.

But there is no query throughput requirement. At least not for the storage mechanism. For deduplication in the ERP system (section 3.5), the authors choose to follow neither topic maps nor RDF but a much older technology, record linkage.

The other query mechanism is a Recommind search engine, which is reported to not be able to index and search at the same time (section 4.1).

If I am reading the paper correctly, data are stored as received from the various sources and owl:sameAs statements are used to map the data to the archive’s schema.

I puzzle at that point because RDF is simply a format and OWL a means of stating a mapping.

Given the semantic vagaries of owl:sameAs (Semantic Drift and Linked Data/Semantic Web), I have to wonder about the longer-term maintenance of owl:sameAs mappings.

There is no expression of a reason for “sameAs,” a reason that might prompt a future maintainer of the system to follow, or not follow, some particular “sameAs.”
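For what it is worth, recording such a reason is not hard even in plain RDF. A minimal sketch using RDF reification (my own illustration, not how Sesam does it; all URIs are hypothetical):

```python
# Minimal sketch: keep the owl:sameAs mapping, plus a reified copy carrying
# the reason and provenance a future maintainer would want to see.
from rdflib import BNode, Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, OWL, RDF

g = Graph()
erp_customer = URIRef("http://example.com/erp/customer/42")  # hypothetical
crm_person = URIRef("http://example.com/crm/person/42")      # hypothetical

g.add((erp_customer, OWL.sameAs, crm_person))

stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, erp_customer))
g.add((stmt, RDF.predicate, OWL.sameAs))
g.add((stmt, RDF.object, crm_person))
g.add((stmt, DCTERMS.description, Literal("Matched on meter id and postal address")))
g.add((stmt, DCTERMS.creator, Literal("record-linkage batch, 2013-06-01")))
```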

Still, the project was successful and that counts for more than using any single technology to the exclusion of all others.

The comments on the performance of topic map options do make me mindful of the lack of benchmark data sets for topic maps.

Rya: A Scalable RDF Triple Store for the Clouds

Tuesday, June 11th, 2013

Rya: A Scalable RDF Triple Store for the Clouds by Roshan Punnoose, Adina Crainiceanu, and David Rapp.

Abstract:

Resource Description Framework (RDF) was designed with the initial goal of developing metadata for the Internet. While the Internet is a conglomeration of many interconnected networks and computers, most of today’s best RDF storage solutions are confined to a single node. Working on a single node has significant scalability issues, especially considering the magnitude of modern day data. In this paper we introduce a scalable RDF data management system that uses Accumulo, a Google Bigtable variant. We introduce storage methods, indexing schemes, and query processing techniques that scale to billions of triples across multiple nodes, while providing fast and easy access to the data through conventional query mechanisms such as SPARQL. Our performance evaluation shows that in most cases, our system outperforms existing distributed RDF solutions, even systems much more complex than ours.

Based on Accumulo (open-source NoSQL database by the NSA).

Interesting re-thinking of indexing of triples.
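The core of that re-thinking, as I read the paper, is permutation indexing: store each triple under several key orders in a sorted key-value store, so that common triple patterns become prefix scans. A toy in-memory sketch (a dict-and-bisect stand-in for Accumulo, not Rya’s code):

```python
# Toy sketch of permutation indexing: each triple is stored under SPO, POS and
# OSP key orders, so a pattern with a bound subject, predicate or object is a
# prefix scan on the matching index.
import bisect

SEP = "\x00"

class TripleIndex:
    def __init__(self):
        self.indexes = {"spo": [], "pos": [], "osp": []}

    def add(self, s, p, o):
        for name, key in (("spo", (s, p, o)), ("pos", (p, o, s)), ("osp", (o, s, p))):
            bisect.insort(self.indexes[name], SEP.join(key))

    def scan(self, name, *prefix_parts):
        """Yield keys of index `name` starting with the prefix, in that index's field order."""
        prefix = SEP.join(prefix_parts)
        if len(prefix_parts) < 3:
            prefix += SEP  # don't match longer terms sharing the same prefix
        keys = self.indexes[name]
        i = bisect.bisect_left(keys, prefix)
        while i < len(keys) and keys[i].startswith(prefix):
            yield keys[i].split(SEP)
            i += 1

idx = TripleIndex()
idx.add("alice", "knows", "bob")
idx.add("alice", "knows", "carol")
for s, p, o in idx.scan("spo", "alice"):   # all triples with subject "alice"
    print(s, p, o)
```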

Future work includes owl:sameAs, owl:inverseOf and other inferencing rules.

Certainly a project to watch.

Coming soon: new, expanded edition of “Learning SPARQL”

Monday, June 3rd, 2013

Coming soon: new, expanded edition of “Learning SPARQL” by Bob DuCharme.

From the post:

55% more pages! 23% fewer mentions of the semantic web!


I’m very pleased to announce that O’Reilly will make the second, expanded edition of my book Learning SPARQL available sometime in late June or early July. The early release “raw and unedited” version should be available this week.

I wonder if Bob is going to start an advertising trend with “fewer mentions of the semantic web?”

;-)

Looking forward to the update!

Not that I care about SPARQL all that much but I’ll learn something.

Besides, I have liked Bob’s writing style from back in the SGML days.

4Store

Saturday, June 1st, 2013

4Store

From the about page:

4store is a database storage and query engine that holds RDF data. It has been used by Garlik as their primary RDF platform for three years, and has proved itself to be robust and secure.

4store’s main strengths are its performance, scalability and stability. It does not provide many features over and above RDF storage and SPARQL queries, but if you are looking for a scalable, secure, fast and efficient RDF store, then 4store should be on your shortlist.

This was mentioned in a report by Bryan Thompson so I wanted to capture a link to the original site.

The latest tarball is dated 10 July 2012.

Big Data RDF Store Benchmarking Experiences

Friday, May 31st, 2013

Big Data RDF Store Benchmarking Experiences by Peter Boncz.

From the post:

Recently we were able to present new BSBM results, testing the RDF triple stores Jena TDB, BigData, BIGOWLIM and Virtuoso on various data sizes. These results extend the state-of-the-art in various dimensions:

  • scale: this is the first time that RDF store benchmark results on such a large size have been published. The previously published BSBM results were on 200M triples; the 150B experiments thus mark a 750x increase in scale.
  • workload: this is the first time that results on the Business Intelligence (BI) workload are published. In contrast to the Explore workload, which features short-running “transactional” queries, the BI workload consists of queries that go through possibly billions of triples, grouping and aggregating them (using the respective functionality, new in SPARQL1.1).
  • architecture: this is the first time that RDF store technology with cluster functionality has been publicly benchmarked.

Clusters are great but also difficult to use.

Peter’s post is one of those rare ones that exposes the second half of that statement.

Impressive hardware and results.

Given the hardware and effort required, are we pursuing “big data” for the sake of “big data?”

Not just where RDF is concerned but in general?

Shouldn’t the first question always be: What is the relevant data?

If you can’t articulate the relevant data, isn’t that a commentary on your understanding of the problem?

A Trillion Triples in Perspective

Saturday, May 18th, 2013

Mozart Meets MapReduce by Isaac Lopez.

From the post:

Big data has been around since the beginning of time, says Thomas Paulmichl, founder and CEO of Sigmaspecto, who says that what has changed is how we process the information. In a talk during Big Data Week, Paulmichl encouraged people to open up their perspective on what big data is, and how it can be applied.

During the talk, he admonished people to take a human element into big data. Paulmichl demonstrated this by examining the work of musical prodigy, Mozart – who Paulmichl noted is appreciated greatly by both music scientists, as well as the common music listener.

“When Mozart makes choices on writing a piece of work, the number of choices that he has and the kind of neural algorithms that his brain goes through to choose things is infinitesimally higher than what we call big data – it’s really small data in comparison,” he said.

Taking Mozart’s The Magic Flute as an example, Paulmichl, discussed the framework that Mozart used to make his choices by examining a music sheet outlining the number of bars, the time signature, the instrument and singer voicing.

“So from his perspective, he sits down, and starts to make what we as data scientists call quantitative choices,” explained Paulmichl. “Do I put a note here, down here, do I use a different instrument; do I use a parallel voicing for different violins – so these are all metrics that his brain has to decide.”

Exploring the mathematics of the music, Paulmichl concluded that in looking at The Magic Flute, Mozart had 4.72391E+21 creative variations (and then some) that he could have taken with the direction of it over the course of the piece. “We’re not talking about a trillion dataset; we’re talking about a sextillion or more,” he says adding that this is a very limited cut of the quantitative choice that his brain makes at every composition point.

“[A] sextillion or more…” puts the question of processing a trillion triples into perspective.

Another musical analogy?

Triples are the one-finger version of Jingle Bells* (video embedded in the original post).

*The gap is greater than the video represents but it is still amusing.

Does your analysis/data have one finger subtlety?

A self-updating road map of The Cancer Genome Atlas

Friday, May 17th, 2013

A self-updating road map of The Cancer Genome Atlas by David E. Robbins, Alexander Grüneberg, Helena F. Deus, Murat M. Tanik and Jonas S. Almeida. (Bioinformatics (2013) 29 (10): 1333-1340. doi: 10.1093/bioinformatics/btt141)

Abstract:

Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.

Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.

Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.

Curious how the granularity of required semantics and the uniformity of the underlying data set impact the choice of semantic approaches?

And does access to data files present different challenges than say access to research publications in the same field?

Non-Adoption of Semantic Web, Reason #1002

Monday, May 13th, 2013

Kingsley Idehen offers yet another explanation/excuse for non-adoption of the semantic web in On Hybrid Relational Databases. Interview with Kingsley Uyi Idehen by Roberto V. Zicari.

The highlight of this interview reads:

The only obstacle to Semantic Web technologies in the enterprise lies in better articulation of the value proposition in a manner that reflects the concerns of enterprises. For instance, the non disruptive nature of Semantic Web technologies with regards to all enterprise data integration and virtualization initiatives has to be the focal point

You may recall Kingsley’s demonstration of the non-complexity of authoring for the Semantic Web in The Semantic Web Is Failing — But Why? (Part 3).

Could it be users sense the “lock-in” of RDF/Semantic Web?

Q14. Big Data Analysis: could you connect Virtuoso with Hadoop? How does Virtuoso relate to commercial data analytics platforms, e.g. Hadapt, Vertica?

Kingsley Uyi Idehen: You can integrate data managed by Hadoop based ETL workflows via ODBC or Web Services driven by Hadoop clusters that expose RESTful interaction patterns for data access. As for how Virtuoso relates to the likes of Vertica re., analytics, this is about Virtuoso being the equivalent of Vertica plus the added capability of RDF based data management, Linked Data Deployment, and share-nothing clustering. There is no job that Vertica performs that Virtuoso can’t perform.

There are several jobs that Virtuoso can perform that Vertica, VoltDB, Hadapt, and many other NoSQL and NewSQL simply cannot perform with regards to scalable, high-performance RDF data management and Linked Data deployment. Remember, RDF based Linked Data is all about data management and data access without any kind of platform lock-in. Virtuoso locks you into a value proposition (performance and scale) not the platform itself. (emphasis added to last sentence)

It’s comforting to know RDF/Semantic Web “lock-in” has our best interest at heart.

See Kingsley dodging the next question on Virtuoso’s ability to scale:

Q15. Do you also benchmark loading trillion of RDF triples? Do you have current benchmark results? How much time does it take to querying them?

Kingsley Uyi Idehen: As per my earlier responses, there is no shortage of benchmark material for Virtuoso.

The benchmarks are also based on realistic platform configurations unlike the RDBMS patterns of the past which compromised the utility of TPC benchmarks.

Full Disclosure: I haven’t actually counted all of Kingsley’s reasons for non-adoption of the Semantic Web. The number I assign here may be high or low.

The ChEMBL database as linked open data

Thursday, May 9th, 2013

The ChEMBL database as linked open data by Egon L Willighagen, Andra Waagmeester, Ola Spjuth, Peter Ansell, Antony J Williams, Valery Tkachenko, Janna Hastings, Bin Chen and David J Wild. (Journal of Cheminformatics 2013, 5:23 doi:10.1186/1758-2946-5-23).

Abstract:

Background: Making data available as Linked Data using Resource Description Framework (RDF) promotes integration with other web resources. RDF documents can natively link to related data, and others can link back using Uniform Resource Identifiers (URIs). RDF makes the data machine-readable and uses extensible vocabularies for additional information, making it easier to scale up inference and data analysis.

Results: This paper describes recent developments in an ongoing project converting data from the ChEMBL database into RDF triples. Relative to earlier versions, this updated version of ChEMBL-RDF uses recently introduced ontologies, including CHEMINF and CiTO; exposes more information from the database; and is now available as dereferencable, linked data. To demonstrate these new features, we present novel use cases showing further integration with other web resources, including Bio2RDF, Chem2Bio2RDF, and ChemSpider, and showing the use of standard ontologies for querying.

Conclusions: We have illustrated the advantages of using open standards and ontologies to link the ChEMBL database to other databases. Using those links and the knowledge encoded in standards and ontologies, the ChEMBL-RDF resource creates a foundation for integrated semantic web cheminformatics applications, such as the presented decision support.

You already know about the fragility of ontologies so no need to repeat that rant here.

On the other hand, material encoded with an ontology can, after vetting, be a source that you wrap with a topic map.

So all that effort isn’t lost.

5 heuristics for writing better SPARQL queries

Wednesday, April 3rd, 2013

5 heuristics for writing better SPARQL queries by Paul Groth.

From the post:

In the context of the Open PHACTS and the Linked Data Benchmark Council projects, Antonis Loizou and I have been looking at how to write better SPARQL queries. In the Open PHACTS project, we’ve been writing super complicated queries to integrate multiple data sources and from experience we realized that different combinations and factors can dramatically impact performance. With this experience, we decided to do something more systematic and test how different techniques we came up with mapped to database theory and worked in practice. We just submitted a paper for review on the outcome. You can find a preprint (On the Formulation of Performant SPARQL Queries) on arxiv.org at http://arxiv.org/abs/1304.0567. The abstract is below. The fancy graphs are in the paper.

Paper Abstract:

The combination of the flexibility of RDF and the expressiveness of SPARQL provides a powerful mechanism to model, integrate and query data. However, these properties also mean that it is nontrivial to write performant SPARQL queries. Indeed, it is quite easy to create queries that tax even the most optimised triple stores. Currently, application developers have little concrete guidance on how to write “good” queries. The goal of this paper is to begin to bridge this gap. It describes 5 heuristics that can be applied to create optimised queries. The heuristics are informed by formal results in the literature on the semantics and complexity of evaluating SPARQL queries, which ensures that queries following these rules can be optimised effectively by an underlying RDF store. Moreover, we empirically verify the efficacy of the heuristics using a set of openly available datasets and corresponding SPARQL queries developed by a large pharmacology data integration project. The experimental results show improvements in performance across 6 state-of-the-art RDF stores.

Just in case you have to query RDF stores as part of your topic map work.

Be aware that: The effectiveness of your SPARQL query will vary based on the RDF Store.

Or as the authors say:

SPARQL, due to its expressiveness, provides a plethora of different ways to express the same constraints, thus developers need to be aware of the performance implications of the combination of query formulation and RDF Store. This work provides empirical evidence that can help developers in designing queries for their selected RDF Store. However, this raises questions about the effectiveness of writing complex generic queries that work across open SPARQL endpoints available in the Linked Open Data Cloud. We view the optimisation of queries independent of underlying RDF Store technology as a critical area of research to enable the most effective use of these endpoints. (page 21)

I hope their research is successful.

Varying performance, especially as reported in their paper, doesn’t bode well for cross-RDF Store queries.
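To make the “same constraints, different formulations” point concrete, here is a toy comparison of my own (not one of the paper’s five heuristics): both queries return the same bindings, but the second hands the store a constant inside the triple pattern instead of a post-hoc FILTER, which many stores can exploit with an index lookup.

```python
# Toy illustration: equivalent SPARQL formulations, run with rdflib.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex: <http://example.com/> .
ex:a ex:type ex:Protein ; ex:label "a" .
ex:b ex:type ex:Gene    ; ex:label "b" .
""", format="turtle")

q_filter = """
PREFIX ex: <http://example.com/>
SELECT ?s ?label WHERE {
  ?s ex:type ?t ; ex:label ?label .
  FILTER (?t = ex:Protein)
}"""

q_pattern = """
PREFIX ex: <http://example.com/>
SELECT ?s ?label WHERE {
  ?s ex:type ex:Protein ; ex:label ?label .
}"""

# Same answers; potentially very different work for a given RDF store.
assert set(g.query(q_filter)) == set(g.query(q_pattern))
```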

Dydra

Tuesday, March 26th, 2013

Dydra

From the webpage:

Dydra

Dydra is a cloud-based graph database. Whether you’re using existing social network APIs or want to build your own, Dydra treats your customers’ social graph as exactly that.

With Dydra, your data is natively stored as a property graph, directly representing the relationships in the underlying data.

Expressive

With Dydra, you access and update your data via an industry-standard query language specifically designed for graph processing, SPARQL. It’s easy to use and we provide a handy in-browser query editor to help you learn.

Despite my misgivings about RDF (Simple Web Semantics), if you want to investigate RDF and SPARQL, Dydra would be a good way to get your feet wet.

You can get an idea of the skill level required by RDF/SPARQL.

Currently in beta, free with some resource limitations.

I particularly liked the line:

We manage every piece of the data store, including versioning, disaster recovery, performance, and more. You just use it.

RDF/SPARQL skills will remain a barrier, but Dydra does its best to make those the only barriers you will face. (And it has reduced some of those.)

Definitely worth your attention, whether you simply want to practice on RDF/SPARQL as a data source or have other uses for it.
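If you do take the plunge, the client side is small. A minimal sketch with SPARQLWrapper; the endpoint URL is hypothetical, so check Dydra’s documentation for the actual pattern for your repository.

```python
# Minimal sketch: query a hosted SPARQL endpoint from Python.
from SPARQLWrapper import JSON, SPARQLWrapper

endpoint = SPARQLWrapper("https://dydra.com/your-account/your-repo/sparql")  # hypothetical URL
endpoint.setReturnFormat(JSON)
endpoint.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```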

I first saw this in a tweet by Stian Danenbarger.

Congratulations! You’re Running on OpenCalais 4.7!

Monday, March 25th, 2013

Congratulations! You’re Running on OpenCalais 4.7!

From the post:

This morning we upgraded OpenCalais to release 4.7. Our focus with 4.7 was on a significant improvement in the detection and disambiguation of companies as well as some behind-the-scenes tune-ups and bug fixes.

If your content contains company names you should already be seeing a significant improvement in detection and disambiguation. While company detection has always been very good in OpenCalais, now it’s great.

If you’re one of our high-volume commercial clients (1M+ transactions per day), we’ll be rolling out your upgrade toward the end of the month.

And, remember, you can always drop by the OpenCalais viewer for a quick test or exploration of OpenCalais with zero programming involved.

If you don’t already know OpenCalais:

From a user perspective it’s pretty simple: You hand the Web Service unstructured text (like news articles, blog postings, your term paper, etc.) and it returns semantic metadata in RDF format. What’s happening in the background is a little more complicated.

Using natural language processing and machine learning techniques, the Calais Web Service examines your text and locates the entities (people, places, products, etc.), facts (John Doe works for Acme Corporation) and events (Jane Doe was appointed as a Board member of Acme Corporation). Calais then processes the entities, facts and events extracted from the text and returns them to the caller in RDF format.

Please also check out the Calais blog and forums to see where Calais is headed. Significant development activities include the ability for downstream content consumers to retrieve previously generated metadata using a Calais-provided GUID, additional input languages, and user-defined processing extensions.

Did I mention it is a free service up to 50,000 submissions a day? (see the license terms for details)

OpenCalais won’t capture every entity or relationship known to you but it will do a lot of the rote work for you. You can then fill in the specialized parts.
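Calling the service is about as simple as the description suggests. A hedged sketch with requests; the endpoint URL, header names and content types are assumptions from memory, so verify them against the OpenCalais documentation and substitute your own API key.

```python
# Hedged sketch of a Calais-style "text in, RDF out" call. Endpoint and header
# names are assumptions -- check the OpenCalais docs before relying on them.
import requests

CALAIS_ENDPOINT = "https://api.opencalais.com/tag/rs/enrich"  # assumed endpoint
API_KEY = "your-api-key-here"

text = "Jane Doe was appointed as a Board member of Acme Corporation."

response = requests.post(
    CALAIS_ENDPOINT,
    data=text.encode("utf-8"),
    headers={
        "x-calais-licenseID": API_KEY,  # assumed header name
        "Content-Type": "text/raw",     # assumed: plain text input
        "Accept": "xml/rdf",            # assumed: ask for RDF back
    },
    timeout=30,
)
response.raise_for_status()
print(response.text)  # RDF describing the extracted entities, facts and events
```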

Tensors and Their Applications…

Saturday, March 23rd, 2013

Tensors and Their Applications in Graph-Structured Domains by Maximilian Nickel and Volker Tresp. (Slides.)

Along with the slides, you will like abstract and bibliography found at: Machine Learning on Linked Data: Tensors and their Applications in Graph-Structured Domains.

Abstract:

Machine learning has become increasingly important in the context of Linked Data as it is an enabling technology for many important tasks such as link prediction, information retrieval or group detection. The fundamental data structure of Linked Data is a graph. Graphs are also ubiquitous in many other fields of application, such as social networks, bioinformatics or the World Wide Web. Recently, tensor factorizations have emerged as a highly promising approach to machine learning on graph-structured data, showing both scalability and excellent results on benchmark data sets, while matching perfectly to the triple structure of RDF. This tutorial will provide an introduction to tensor factorizations and their applications for machine learning on graphs. By the means of concrete tasks such as link prediction we will discuss several factorization methods in-depth and also provide necessary theoretical background on tensors in general. Emphasis is put on tensor models that are of interest to Linked Data, which will include models that are able to factorize large-scale graphs with millions of entities and known facts or models that can handle the open-world assumption of Linked Data. Furthermore, we will discuss tensor models for temporal and sequential graph data, e.g. to analyze social networks over time.

Devising a system to deal with the heterogeneous nature of linked data.

Just from skimming the slides I could see, this looks very promising.

I first saw this in a tweet by Stefano Bertolo.


Update: I just got an email from Maximilian Nickel and he has altered the transition between slides. Working now!

From slide 53 forward is pure gold for topic map purposes.

Heavy sledding but let me give you one statement from the slides that should capture your interest:

Instance matching: Ranking of entities by their similarity in the entity-latent-component space.

Although written about linked data, not limited to linked data.

What is more, Maximilian offers proof that the technique scales!

Complex, configurable, scalable determination of subject identity!
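If “entity-latent-component space” sounds abstract, the ranking step itself is tiny. A toy numpy sketch on random stand-in data (the factor matrix would come from a factorization such as RESCAL, which the slides cover):

```python
# Toy sketch: rank candidate matches for an entity by cosine similarity of
# rows in an entity-by-latent-component matrix A. Random data, for shape only.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 50))   # 1000 entities, 50 latent components

def rank_matches(A, entity_index, top_k=5):
    """Indices of the entities whose latent vectors are most similar."""
    norms = np.linalg.norm(A, axis=1)
    sims = (A @ A[entity_index]) / (norms * norms[entity_index] + 1e-12)
    sims[entity_index] = -np.inf  # never match an entity with itself
    return np.argsort(-sims)[:top_k]

print(rank_matches(A, entity_index=42))
```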

[Update: deleted note about issues with slides, which read: (Slides for ISWC 2012 tutorial, Chrome is your best bet. Even better bet, Chrome on Windows. Chrome on Ubuntu crashed every time I tried to go to slide #15. Windows gets to slide #46 before failing to respond. I have written to inquire about the slides.)]

A Distributed Graph Engine…

Friday, March 22nd, 2013

A Distributed Graph Engine for Web Scale RDF Data by Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao and Zhongyuan Wang.

Abstract:

Much work has been devoted to supporting RDF data. But state-of-the-art systems and methods still cannot handle web scale RDF data effectively. Furthermore, many useful and general purpose graph-based operations (e.g., random walk, reachability, community discovery) on RDF data are not supported, as most existing systems store and index data in particular ways (e.g., as relational tables or as a bitmap matrix) to maximize one particular operation on RDF data: SPARQL query processing. In this paper, we introduce Trinity.RDF, a distributed, memory-based graph engine for web scale RDF data. Instead of managing the RDF data in triple stores or as bitmap matrices, we store RDF data in its native graph form. It achieves much better (sometimes orders of magnitude better) performance for SPARQL queries than the state-of-the-art approaches. Furthermore, since the data is stored in its native graph form, the system can support other operations (e.g., random walks, reachability) on RDF graphs as well. We conduct comprehensive experimental studies on real life, web scale RDF data to demonstrate the effectiveness of our approach.

From the conclusion:

We propose a scalable solution for managing RDF data as graphs in a distributed in-memory key-value store. Our query processing and optimization techniques support SPARQL queries without relying on join operations, and we report performance numbers of querying against RDF datasets of billions of triples. Besides scalability, our approach also has the potential to support queries and analytical tasks that are far more advanced than SPARQL queries, as RDF data is stored as graphs. In addition, our solution only utilizes basic (distributed) key-value store functions and thus can be ported to any in-memory key-value store.

A result that is:

  • scalable
  • goes beyond SPARQL
  • can be ported to any in-memory key-value store

Merits a very close read.
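The storage idea, as I read it, is easy to mock up: keep each entity’s incoming and outgoing edges as adjacency lists under a single key, so exploration is a chain of key lookups rather than joins. A toy sketch with a plain dict standing in for the distributed key-value store (my illustration, not the paper’s code):

```python
# Toy sketch: RDF as adjacency lists in a key-value store; exploration by lookups.
from collections import defaultdict

store = defaultdict(lambda: {"out": [], "in": []})

def add_triple(s, p, o):
    store[s]["out"].append((p, o))
    store[o]["in"].append((p, s))

def reachable(node, hops=1):
    """Entities reachable from `node` in at most `hops` steps."""
    frontier, seen = {node}, {node}
    for _ in range(hops):
        frontier = {o for n in frontier for _, o in store[n]["out"]} - seen
        seen |= frontier
    return seen - {node}

add_triple("ex:alice", "ex:knows", "ex:bob")
add_triple("ex:bob", "ex:worksAt", "ex:acme")
print(reachable("ex:alice", hops=2))  # {'ex:bob', 'ex:acme'} (order may vary)
```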

Makes me curious what other data models would work better if cast as graphs?

I first saw this in a tweet by Juan Sequeda.

Striking a Blow for Complexity

Thursday, March 21st, 2013

The W3C struck a blow for complexity today.

Its blog entry, entitled Eleven SPARQL 1.1 Specifications are W3C Recommendations, reads:

The SPARQL Working Group has completed development of its full-featured system for querying and managing data using the flexible RDF data model. It has now published eleven Recommendations for SPARQL 1.1, detailed in SPARQL 1.1 Overview. SPARQL 1.1 extends the 2008 Recommendation for SPARQL 1.0 by adding features to the query language such as aggregates, subqueries, negation, property paths, and an expanded set of functions and operators. Beyond the query language, SPARQL 1.1 adds other features that were widely requested, including update, service description, a JSON results format, and support for entailment reasoning. Learn more about the Semantic Web Activity.

I can’t wait for the movie version starring IBM’s Watson playing sudden death Jeopardy against Bob DuCharme, category SPARQL.

I’m betting on Bob!

Data.ac.uk

Thursday, March 21st, 2013

Data.ac.uk

From the website:

This is a landmark site for academia providing a single point of contact for linked open data development. It not only provides access to the know-how and tools to discuss and create linked data and data aggregation sites, but also enables access to, and the creation of, large aggregated data sets providing powerful and flexible collections of information.
Here at Data.ac.uk we’re working to inform national standards and assist in the development of national data aggregation subdomains.

I can’t imagine a greater contrast between my poor web authoring skills and a website than this one.

But having said that, I think you will be as disappointed as I was when you start looking for data on this “landmark site.”

There is some but not nearly enough to match the promise of such a cleverly designed website.

Perhaps they are hoping that someday RDF data (they also offer comma and tab delimited versions) will catch up to the site design.

I first saw this in a tweet by Frank van Harmelen.

Freebase Data Dumps

Thursday, March 21st, 2013

Freebase Data Dumps

From the webpage:

Data Dumps are a downloadable version of the data in Freebase. They constitute a snapshot of the data stored in Freebase and the Schema that structures it, and are provided under the same CC-BY license.

Full data dumps of every fact and assertion in Freebase are available as RDF and are updated every week. Deltas are not available at this time.

Total triples: 585 million
Compressed size: 14 GB
Uncompressed size: 87 GB
Data Format: Turtle RDF
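At 87 GB uncompressed you will not want to hand the dump to an in-memory parser. A hedged sketch of a first look: stream the gzipped file and tally predicates, assuming one triple per line (true of the Freebase dumps I have seen); the file name is hypothetical.

```python
# Hedged sketch: stream the gzipped dump and count predicates, one line at a time.
import gzip
from collections import Counter

counts = Counter()
with gzip.open("freebase-rdf-dump.gz", "rt", encoding="utf-8") as dump:  # hypothetical name
    for line in dump:
        parts = line.split("\t" if "\t" in line else None, 3)
        if len(parts) >= 2:
            counts[parts[1]] += 1  # predicate is the second field

for predicate, n in counts.most_common(20):
    print(n, predicate)
```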

I first saw this in a tweet by Thomas Steiner.

Open Annotation Data Model

Tuesday, March 19th, 2013

Open Annotation Data Model

Abstract:

The Open Annotation Core Data Model specifies an interoperable framework for creating associations between related resources, annotations, using a methodology that conforms to the Architecture of the World Wide Web. Open Annotations can easily be shared between platforms, with sufficient richness of expression to satisfy complex requirements while remaining simple enough to also allow for the most common use cases, such as attaching a piece of text to a single web resource.

An Annotation is considered to be a set of connected resources, typically including a body and target, where the body is somehow about the target. The full model supports additional functionality, enabling semantic annotations, embedding content, selecting segments of resources, choosing the appropriate representation of a resource and providing styling hints for consuming clients.

This is my first encounter with this proposal, so I need to compare it to my Simple Web Semantics.

At first blush, the Open Annotation Core Model looks a lot heavier than Simple Web Semantics.
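For the simplest case the abstract mentions, though, the core shape is small: an annotation linking a body to a target. A minimal rdflib sketch, with hypothetical resource URIs and only oa:Annotation, oa:hasBody and oa:hasTarget taken from the model:

```python
# Minimal sketch of the core Open Annotation shape.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

OA = Namespace("http://www.w3.org/ns/oa#")

g = Graph()
g.bind("oa", OA)

anno = URIRef("http://example.com/anno/1")              # hypothetical
body = URIRef("http://example.com/notes/comment-17")    # hypothetical: the note
target = URIRef("http://example.com/docs/report.html")  # hypothetical: what it annotates

g.add((anno, RDF.type, OA.Annotation))
g.add((anno, OA.hasBody, body))
g.add((anno, OA.hasTarget, target))

print(g.serialize(format="turtle"))
```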

I need to reform my blog posts into a formal document and perhaps attach a comparison as an annex.

Semantic Search Over The Web (SSW 2013)

Monday, March 18th, 2013

3RD International Workshop onSemantic Search Over The Web (SSW 2013)

Dates:

Abstract Papers submission: May 31, 2013 – 15:00 (3:00 pm) EDT
(Short) Full Paper submission: June 7, 2013 – 15:00 (3:00 pm) EDT
Author notification: July 19, 2013
Camera-ready copy due: August 2, 2013
Workshop date: During VLDB (Aug 26 – Aug 30)

From the webpage:

We are witnessing a smooth evolution of the Web from a worldwide information space of linked documents to a global knowledge base, composed of semantically interconnected resources. To date, the correlated and semantically annotated data available on the web amounts to 25 billion RDF triples, interlinked by around 395 million RDF links. The continuous publishing and the integration of the plethora of semantic datasets from companies, government and public sector projects is leading to the creation of the so-called Web of Knowledge. Each semantic dataset contributes to extend the global knowledge and increases its reasoning capabilities. As a matter of facts, researchers are now looking with growing interest to semantic issues in this huge amount of correlated data available on the Web. Many progresses have been made in the field of semantic technologies, from formal models to repositories and reasoning engines. While the focus of many practitioners is on exploiting such semantic information to contribute to IR problems from a document centric point of view, we believe that such a vast, and constantly growing, amount of semantic data raises data management issues that must be faced in a dynamic, highly distributed and heterogeneous environment such as the Web.

The third edition of the International Workshop on Semantic Search over the Web (SSW) will discuss about data management issues related to the search over the web and the relationships with semantic web technologies, proposing new models, languages and applications.

The research issues can be summarized by the following problems:

  • How can we model and efficiently access large amounts of semantic web data?
  • How can we effectively retrieve information exploiting semantic web technologies?
  • How can we employ semantic search in real world scenarios?

The SSW Workshop invites researchers, engineers, service developers to present their research and works in the field of data management for semantic search. Papers may deal with methods, models, case studies, practical experiences and technologies.

Apologies for the uncertainty of the workshop date. (There is confusion about the date on the workshop site, one place says the 26th, the other the 30th. Check before you make reservation/travel arrangements.)

I differ with the organizers on some issues, but on the presence of “…data management issues that must be faced in a dynamic, highly distributed and heterogeneous environment such as the Web,” there is no disagreement.

That’s the trick isn’t it? In any confined or small group setting, just about any consistent semantic solution will work.

The hurly-burly of a constant stream of half-heard, partially understood communications across distributed and heterogeneous systems tests the true mettle of semantic solutions.

Not a quest for perfect communication but “good enough.”

LDBC – Second Technical User Community (TUC) Meeting

Monday, March 18th, 2013

LDBC: Linked Data Benchmark Council – Second Technical User Community (TUC) Meeting – 22/23rd April 2013.

From the post:

The LDBC consortium are pleased to announce the second Technical User Community (TUC) meeting.

This will be a two day event in Munich on the 22/23rd April 2013.

The event will include:

  • Introduction to the objectives and progress of the LDBC project.
  • Description of the progress of the benchmarks being evolved through Task Forces.
  • Users explaining their use-cases and describing the limitations they have found in current technology.
  • Industry discussions on the contents of the benchmarks.

All users of RDF and graph databases are welcome to attend. If you are interested, please contact: ldbc AT ac DOT upc DOT edu.

Further meeting details at the post.

Beacons of Availability

Sunday, March 17th, 2013

From Records to a Web of Library Data – Pt3 Beacons of Availability by Richard Wallis.

Beacons of Availability

As I indicated in the first of this series, there are descriptions of a broader collection of entities, than just books, articles and other creative works, locked up in the Marc and other records that populate our current library systems. By mining those records it is possible to identify those entities, such as people, places, organisations, formats and locations, and model & describe them independently of their source records.

As I discussed in the post that followed, the library domain has often led in the creation and sharing of authoritative datasets for the description of many of these entity types. Bringing these two together, using URIs published by the Hubs of Authority, to identify individual relationships within bibliographic metadata published as RDF by individual library collections (for example the British National Bibliography, and WorldCat) is creating Library Linked Data openly available on the Web.

Why do we catalogue? is a question, I often ask, with an obvious answer – so that people can find our stuff. How does this entification, sharing of authorities, and creation of a web of library linked data help us in that goal. In simple terms, the more libraries can understand what resources each other hold, describe, and reference, the more able they are to guide people to those resources. Sounds like a great benefit and mission statement for libraries of the world but unfortunately not one that will nudge the needle on making library resources more discoverable for the vast majority of those that can benefit from them.

I have lost count of the number of presentations and reports I have seen telling us that upwards of 80% of visitors to library search interfaces start in Google. A similar weight of opinion can be found that complains how bad Google, and the other search engines, are at representing library resources. You will get some balancing opinion, supporting how good Google Book Search and Google Scholar are at directing students and others to our resources. Yet I am willing to bet that again we have another 80-20 equation or worse about how few, of the users that libraries want to reach, even know those specialist Google services exist. A bit of a sorry state of affairs when the major source of searching for our target audience, is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!

Library linked data helps solve both the problem of better description and findability of library resources in the major search engines. Plus it can help with the problem of identifying where a user can gain access to that resource to loan, download, view via a suitable license, or purchase, etc.

I am an ardent sympathizer with helping people find “our stuff.”

I don’t disagree with the description of Google as: “…the major source of searching for our target audience, is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!”

But in all fairness to Google, I would remind you of Drabenstott’s research that found for the Library of Congress subject headings:

Overall percentages of correct meanings for subject headings in the original order of subdivisions were as follows:

children 32%
adults 40%
reference 53%
technical services librarians 56%

The Library of Congress subject classification has been around for more than a century and just over half of the librarians can use it correctly.

Let’s not wait more than a century to test the claim:*

Library linked data helps solve both the problem of better description and findability of library resources in the major search engines.


* By “test” I don’t mean the sort of study, “…we recruited twelve LIS students but one had to leave before the study was complete….”

I am using “test” in the sense of a well designed and organized social science project with professional assistance from social scientists, UI test designers and the like.

I think OCLC is quite sincere in its promotion of linked data, but effectiveness is an empirical question, not one of sincerity.

Algebraix Data Achieves Unrivaled Semantic Benchmark Performance

Saturday, March 16th, 2013

Algebraix Data Achieves Unrivaled Semantic Benchmark Performance by Angela Guess.

From the post:

Algebraix Data Corporation today announced its SPARQL Server(TM) RDF database successfully executed all 17 of its queries on the SP2 benchmark up to one billion triples on one computer node. The SP2 benchmark is the most computationally complex for testing SPARQL performance and no other vendor has reported results for all queries on data sizes above five million triples.

Furthermore, SPARQL Server demonstrated linear performance in total SP2Bench query time on data sets from one million to one billion triples. These latest dramatic results are made possible by algebraic optimization techniques that maximize computing resource utilization.

“Our outstanding SPARQL Server performance is a direct result of the algebraic techniques enabled by our patented Algebraix technology,” said Charlie Silver, CEO of Algebraix Data. “We are investing heavily in the development of SPARQL Server to continue making substantial additional functional, performance and scalability improvements.”

Pretty much a copy of the press release from Algebraix.

You may find:

Doing the Math: The Algebraix DataBase Whitepaper: What it is, how it works, why we need it (PDF) by Robin Bloor, PhD

ALGEBRAIX Technology Mathematics Whitepaper (PDF), by Algebraix Data

and,

Granted Patents

more useful.

BTW, The SP²Bench SPARQL Performance Benchmark, will be useful as well.

Algebraix listed its patents but I supplied the links. Why the links were missing at Algebraix I cannot say.

If the claim that “…no other vendor has reported results for all queries on data sizes above five million triples…” is correct, isn’t scaling an issue for SPARQL?

From Records to a Web of Library Data – Pt2 Hubs of Authority

Saturday, March 16th, 2013

From Records to a Web of Library Data – Pt2 Hubs of Authority by Richard Wallis.

From the post:

Hubs of Authority

Libraries, probably because of their natural inclination towards cooperation, were ahead of the game in data sharing for many years. The moment computing technology became practical, in the late sixties, cooperative cataloguing initiatives started all over the world either in national libraries or cooperative organisations. Two from personal experience come to mind, BLCMP started in Birmingham, UK in 1969 eventually evolved in to the leading Semantic Web organisation Talis, and in 1967 Dublin, Ohio saw the creation of OCLC. Both in their own way having had significant impact on the worlds of libraries, metadata, and the web (and me!).

One of the obvious impacts of inter-library cooperation over the years has been the authorities, those sources of authoritative names for key elements of bibliographic records. A large number of national libraries have such lists of agreed formats for author and organisational names. The Library of Congress has in addition to its name authorities, subjects, classifications, languages, countries etc. Another obvious success in this area is VIAF, the Virtual International Authority File, which currently aggregates over thirty authority files from all over the world – well used and recognised in library land, and increasingly across the web in general as a source of identifiers for people & organisations.

These, Linked Data enabled, sources of information are developing importance in their own right, as a natural place to link to, when asserting the thing, person, or concept you are identifying in your data. As Sir Tim Berners-Lee’s fourth principle of Linked Data tells us to “Include links to other URIs. so that they can discover more things”. VIAF in particular is becoming such a trusted, authoritative, source of URIs that there is now a VIAFbot responsible for interconnecting Wikipedia and VIAF to surface hundreds of thousands of relevant links to each other. A great hat-tip to Max Klein, OCLC Wikipedian in Residence, for his work in this area.

I don’t deny that VIAF is a very useful tool, but if you search for the personal name “Marilyn Monroe,” it returns:

1. Miller, Arthur, 1915-2005
National Library of Australia National Library of the Czech Republic National Diet Library (Japan) Deutsche Nationalbibliothek RERO (Switzerland) SUDOC (France) Library and Archives Canada National Library of Israel (Latin) National Library of Sweden NUKAT Center (Poland) Bibliothèque nationale de France Biblioteca Nacional de España Library of Congress/NACO

Miller, Arthur (Arthur Asher), 1915-2005
National Library of the Netherlands-test

Miller, Arthur, 1915-
Vatican Library Biblioteca Nacional de Portugal

ميلر، ارثر، 1915-2005 م.
Bibliotheca Alexandrina (Egypt)

Miller, Arthur
Wikipedia (en)-test

מילר, ארתור, 1915-2005
National Library of Israel (Hebrew)

2. Monroe, Marilyn, 1926-1962
National Library of Israel (Latin) National Library of the Czech Republic National Diet Library (Japan) Deutsche Nationalbibliothek SUDOC (France) Library and Archives Canada National Library of Australia National Library of Sweden NUKAT Center (Poland) Bibliothèque nationale de France Biblioteca Nacional de España Library of Congress/NACO

Monroe, Marilyn
National Library of the Netherlands-test Wikipedia (en)-test RERO (Switzerland)

Monroe, Marilyn American actress, model, and singer, 1926-1962
Getty Union List of Artist Names

Monroe, Marilyn, pseud.
Biblioteca Nacional de Portugal

3. DiMaggio, Joe, 1914-1999
Library of Congress/NACO Bibliothèque nationale de France

Di Maggio, Joe 1914-1999
Deutsche Nationalbibliothek

Di Maggio, Joseph Paul, 1914-1999
National Diet Library (Japan)

DiMaggio, Joe, 1914-
National Library of Australia

Dimaggio, Joseph Paul, 1914-1999
SUDOC (France)

DiMaggio, Joe (Joseph Paul), 1914-1999
National Library of the Netherlands-test

Dimaggio, Joe
Wikipedia (en)-test

4. Monroe, Marilyn
Deutsche Nationalbibliothek

5. Hurst-Monroe, Marlene
Library of Congress/NACO

6. Wolf, Marilyn Monroe
Deutsche Nationalbibliothek

Maybe Sir Tim is right, users “…can discover more things.”

Some of them are related, some of them are not.

DRM/WWW, Wealth/Salvation: Theological Parallels

Thursday, March 14th, 2013

Cory Doctorow misses a teaching moment in his post What I wish Tim Berners-Lee understood about DRM.

Cory says:

Whenever Berners-Lee tells the story of the Web’s inception, he stresses that he was able to invent the Web without getting any permission. He uses this as a parable to explain the importance of an open and neutral Internet.

The “…without getting any permission” was a principle for Tim Berners-Lee when he was inventing the Web.

A principle then, not now.

Evidence? The fundamentals of RDF have been mired in the same model for fourteen (14) years. Impeding the evolution of the “Semantic” Web. Whatever its merits.

Another example? HTML5 violates prior definitions of URL in order to widen the reach of HTML5. (URL Homonym Problem: A Topic Map Solution)

Same “principle” as DRM support, expanding the label of “WWW” beyond what early supporters would recognize as the WWW.

HTML5 rewriting of URL and DRM support are membership building exercises.

The teaching moment comes from early Christian history.

You may (or may not) recall the parable of the rich young ruler (Matthew 19:16-30), where a rich young man asks Jesus what he must do to be saved?

Jesus replies:

One thing you still lack. Sell all that you have and distribute to the poor, and you will have treasure in heaven; and come, follow me.

And for the first hundred or more years of Christianity, so far as can be known, that rule of divesting yourself of property was followed.

Until Clement of Alexandria. Clement took the position that indeed the rich could retain their goods, so long as they used them charitably. (Now there’s a loophole!)

That created two paths to salvation: one for anyone foolish enough to take the Bible at its word, and another for anyone who wanted to call themselves Christian without any inconvenience or discomfort.

Following Clement of Alexandria, Tim Berners-Lee is creating two paths to the WWW.

One for people who are foolish enough to innovate and share information, the innovation model of the WWW that Cory speaks so highly of.

Another path for people (DRM crowd) who neither spin nor toil but who want to burden everyone who does.

Membership as a principle isn’t surprising considering how TBL sees himself in the mirror:

[Image: TBL as WWW Pope]

Data Catalog Vocabulary (DCAT) [Last Call ends 08 April 2013]

Tuesday, March 12th, 2013

Data Catalog Vocabulary (DCAT)

Abstract:

DCAT is an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web. This document defines the schema and provides examples for its use.

By using DCAT to describe datasets in data catalogs, publishers increase discoverability and enable applications easily to consume metadata from multiple catalogs. It further enables decentralized publishing of catalogs and facilitates federated dataset search across sites. Aggregated DCAT metadata can serve as a manifest file to facilitate digital preservation.

If you have comments, now would be a good time to finish them up for submission.

I first saw this in a tweet by Sandro Hawke.

RDF Data Cube Vocabulary [Last Call ends 08 April 2013]

Tuesday, March 12th, 2013

RDF Data Cube Vocabulary

Abstract:

There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts. The Data Cube vocabulary provides a means to do this using the W3C RDF (Resource Description Framework) standard. The model underpinning the Data Cube vocabulary is compatible with the cube model that underlies SDMX (Statistical Data and Metadata eXchange), an ISO standard for exchanging and sharing statistical data and metadata among organizations. The Data Cube vocabulary is a core foundation which supports extension vocabularies to enable publication of other aspects of statistical data flows or other multi-dimensional data sets.

If you have comments, now would be a good time to finish them up for submission.

I first saw this in a tweet by Sandro Hawke.