Archive for the ‘SPARQL’ Category

A self-updating road map of The Cancer Genome Atlas

Friday, May 17th, 2013

A self-updating road map of The Cancer Genome Atlas by David E. Robbins, Alexander Grüneberg, Helena F. Deus, Murat M. Tanik and Jonas S. Almeida. (Bioinformatics (2013) 29 (10): 1333-1340. doi: 10.1093/bioinformatics/btt141)

Abstract:

Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.

Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.

Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.

Curious how the granularity of required semantics and the uniformity of the underlying data set impact the choice of semantic approaches?

And does access to data files present different challenges than, say, access to research publications in the same field?
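To make the "discovery of arbitrary subsets of files, based on their metadata" idea from the abstract a little more concrete, here is a minimal Python sketch. It stores file metadata as subject/predicate/object triples and finds files by intersecting simple pattern matches, which is roughly what a conjunctive SPARQL query does. The file URLs and predicate names are invented for illustration; they are not the vocabulary the TCGA Roadmap actually uses, and a real index would sit behind a SPARQL endpoint rather than a Python list.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical file URLs and predicate names, invented for illustration only.
TRIPLES: List[Triple] = [
    ("http://example.org/files/a.txt", "diseaseStudy", "gbm"),
    ("http://example.org/files/a.txt", "platform", "rnaseq"),
    ("http://example.org/files/b.txt", "diseaseStudy", "gbm"),
    ("http://example.org/files/b.txt", "platform", "snp_array"),
    ("http://example.org/files/c.txt", "diseaseStudy", "ov"),
    ("http://example.org/files/c.txt", "platform", "rnaseq"),
]


def match(triples: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts like a SPARQL variable."""
    for triple in triples:
        if ((s is None or triple[0] == s)
                and (p is None or triple[1] == p)
                and (o is None or triple[2] == o)):
            yield triple


def files_with(triples: Iterable[Triple], **metadata: str) -> List[str]:
    """Return the sorted subjects that carry every predicate=value pair given.

    With no metadata constraints the result is an empty list.
    """
    triples = list(triples)
    result = None
    for predicate, value in metadata.items():
        subjects = {t[0] for t in match(triples, p=predicate, o=value)}
        result = subjects if result is None else result & subjects
    return sorted(result or [])


print(files_with(TRIPLES, diseaseStudy="gbm", platform="rnaseq"))
```

Each keyword argument plays the role of one triple pattern, and the set intersection plays the role of the join on the shared file variable.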

5 heuristics for writing better SPARQL queries

Wednesday, April 3rd, 2013

5 heuristics for writing better SPARQL queries by Paul Groth.

From the post:

In the context of the Open PHACTS and the Linked Data Benchmark Council projects, Antonis Loizou and I have been looking at how to write better SPARQL queries. In the Open PHACTS project, we’ve been writing super complicated queries to integrate multiple data sources and from experience we realized that different combinations and factors can dramatically impact performance. With this experience, we decided to do something more systematic and test how different techniques we came up with mapped to database theory and worked in practice. We just submitted a paper for review on the outcome. You can find a preprint (On the Formulation of Performant SPARQL Queries) on arxiv.org at http://arxiv.org/abs/1304.0567. The abstract is below. The fancy graphs are in the paper.

Paper Abstract:

The combination of the flexibility of RDF and the expressiveness of SPARQL provides a powerful mechanism to model, integrate and query data. However, these properties also mean that it is nontrivial to write performant SPARQL queries. Indeed, it is quite easy to create queries that tax even the most optimised triple stores. Currently, application developers have little concrete guidance on how to write “good” queries. The goal of this paper is to begin to bridge this gap. It describes 5 heuristics that can be applied to create optimised queries. The heuristics are informed by formal results in the literature on the semantics and complexity of evaluating SPARQL queries, which ensures that queries following these rules can be optimised effectively by an underlying RDF store. Moreover, we empirically verify the efficacy of the heuristics using a set of openly available datasets and corresponding SPARQL queries developed by a large pharmacology data integration project. The experimental results show improvements in performance across 6 state-of-the-art RDF stores.

Just in case you have to query RDF stores as part of your topic map work.

Be aware that: The effectiveness of your SPARQL query will vary based on the RDF Store.

Or as the authors say:

SPARQL, due to its expressiveness, provides a plethora of different ways to express the same constraints, thus, developers need to be aware of the performance implications of the combination of query formulation and RDF Store. This work provides empirical evidence that can help developers in designing queries for their selected RDF Store. However, this raises questions about the effectiveness of writing complex generic queries that work across open SPARQL endpoints available in the Linked Open Data Cloud. We view the optimisation of queries independent of underlying RDF Store technology as a critical area of research to enable the most effective use of these endpoints. (page 21)

I hope their research is successful.

Varying performance, especially as reported in their paper, doesn’t bode well for cross-RDF Store queries.
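As a small illustration of "different ways to express the same constraints," the Python sketch below builds two SPARQL queries that ask for the same subjects: one puts the value directly in the triple pattern, the other binds a variable and then filters it. The predicate and value IRIs are hypothetical, and I am not claiming this pair is one of the paper's five heuristics or that either form is faster; which one a given RDF store handles better is exactly the kind of thing you would have to measure.

```python
def direct_pattern(predicate: str, value_iri: str) -> str:
    """Constrain the object by writing the IRI directly in the triple pattern."""
    return f"SELECT ?s WHERE {{ ?s <{predicate}> <{value_iri}> . }}"


def filter_pattern(predicate: str, value_iri: str) -> str:
    """Express the same constraint with a variable plus a FILTER."""
    return (
        f"SELECT ?s WHERE {{ ?s <{predicate}> ?o . "
        f"FILTER(?o = <{value_iri}>) }}"
    )


# Hypothetical IRIs, used only to show the two formulations side by side.
PREDICATE = "http://example.org/vocab/target"
VALUE = "http://example.org/protein/P1"

print(direct_pattern(PREDICATE, VALUE))
print(filter_pattern(PREDICATE, VALUE))
```

Both strings can be sent to the same endpoint and timed, which is a cheap way to see how much your store cares about formulation.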

Dydra

Tuesday, March 26th, 2013

Dydra

From the webpage:

Dydra

Dydra is a cloud-based graph database. Whether you’re using existing social network APIs or want to build your own, Dydra treats your customers’ social graph as exactly that.

With Dydra, your data is natively stored as a property graph, directly representing the relationships in the underlying data.

Expressive

With Dydra, you access and update your data via an industry-standard query language specifically designed for graph processing, SPARQL. It’s easy to use and we provide a handy in-browser query editor to help you learn.

Despite my misgivings about RDF (Simple Web Semantics), if you want to investigate RDF and SPARQL, Dydra would be a good way to get your feet wet.

You can get an idea of the skill level required by RDF/SPARQL.

Currently in beta, free with some resource limitations.

I particularly liked the line:

We manage every piece of the data store, including versioning, disaster recovery, performance, and more. You just use it.

RDF/SPARQL skills will remain a barrier, but Dydra does its best to make those the only barriers you will face. (And has reduced some of those.)

Definitely worth your attention, whether you simply want to practice on RDF/SPARQL as a data source or have other uses for it.

I first saw this in a tweet by Stian Danenbarger.

Striking a Blow for Complexity

Thursday, March 21st, 2013

The W3C struck a blow for complexity today.

Its blog entry, entitled Eleven SPARQL 1.1 Specifications are W3C Recommendations, reads:

The SPARQL Working Group has completed development of its full-featured system for querying and managing data using the flexible RDF data model. It has now published eleven Recommendations for SPARQL 1.1, detailed in SPARQL 1.1 Overview. SPARQL 1.1 extends the 2008 Recommendation for SPARQL 1.0 by adding features to the query language such as aggregates, subqueries, negation, property paths, and an expanded set of functions and operators. Beyond the query language, SPARQL 1.1 adds other features that were widely requested, including update, service description, a JSON results format, and support for entailment reasoning. Learn more about the Semantic Web Activity.

I can’t wait for the movie version starring IBM’s Watson playing sudden death Jeopardy against Bob DuCharme, category SPARQL.

I’m betting on Bob!

SPIN: SPARQL Inferencing Notation

Tuesday, March 12th, 2013

SPIN: SPARQL Inferencing Notation

From the webpage:

SPIN is a W3C Member Submission that has become the de-facto industry standard to represent SPARQL rules and constraints on Semantic Web models. SPIN also provides meta-modeling capabilities that allow users to define their own SPARQL functions and query templates. Finally, SPIN includes a ready to use library of common functions.

SPIN in Five Slides.

In case you encounter SPARQL rules and constraints.

I first saw this in a tweet by Stian Danenbarger.

Marketplace in Query Libraries? Marketplace in Identified Entities?

Thursday, February 7th, 2013

Using SPARQL Query Libraries to Generate Simple Linked Data API Wrappers by Tony Hirst.

From the post:

A handful of open Linked Data have appeared through my feeds in the last couple of days, including (via RBloggers) SPARQL with R in less than 5 minutes, which shows how to query US data.gov Linked Data and then Leigh Dodds’ Brief Review of the Land Registry Linked Data.

I was going to post a couple of examples merging those two posts – showing how to access Land Registry data via Leigh’s example queries in R, then plotting some of the results using ggplot2, but another post of Leigh’s today – SPARQL-doc – a simple convention for documenting individual SPARQL queries, has sparked another thought…

For some time I’ve been intrigued by the idea of a marketplace in queries over public datasets, as well as the public sharing of generally useful queries. A good query is like a good gold pan, or a good interview question – it can get a dataset to reveal something valuable that may otherwise have laid hidden. Coming up with a good query in part requires having a good understanding of the structure of a dataset, in part having an eye for what sorts of secret the data may contain: the next step is crafting a well phrased query that can tease that secret out. Creating the query might take some time, some effort, and some degree of expertise in query optimisation to make it actually runnable in reasonable time (which is why I figure there may be a market for such things*) but once written, the query is there. And if it can be appropriately parameterised, it may generalise.

Tony’s marketplace of queries has a great deal of potential.

But I don’t think they need to be limited to SPARQL queries.

By extension his arguments should be true for searches on Google, Bing, etc., as well as vendor specialized search interfaces.

I would take that a step further into libraries for post-processing the results of such queries and presenting users with enhanced presentations and/or content.

And as part of that post-processing, I would add robust identification of entities as an additional feature of such a library/service.

For example, suppose you have curated a significant portion of the ACM Digital Library. When passed what could be an ambiguous reference to a concept, your service returns the properties that distinguish the possible meanings of that reference, sorted into several groups.

That frees every user from wading through unrelated papers and proceedings whenever that reference comes up.

Would that be a service users would pay for?

I suppose that depends on how valuable their time is to them and/or their employers.

SPARQL with R in less than 5 minutes [Fire Data]

Saturday, January 26th, 2013

SPARQL with R in less than 5 minutes

From the post:

In this article we’ll get up and running on the Semantic Web in less than 5 minutes using SPARQL with R. We’ll begin with a brief introduction to the Semantic Web then cover some simple steps for downloading and analyzing government data via a SPARQL query with the SPARQL R package.

What is the Semantic Web?

To newcomers, the Semantic Web can sound mysterious and ominous. By most accounts, it’s the wave of the future, but it’s hard to pin down exactly what it is. This is in part because the Semantic Web has been evolving for some time but is just now beginning to take a recognizable shape (DuCharme 2011). Detailed definitions of the Semantic Web abound, but simply put, it is an attempt to structure the unstructured data on the Web and to formalize the standards that make that structure possible. In other words, it’s an attempt to create a data definition for the Web.

I will have to re-read Bob DuCharme’s “Learning SPARQL.” I didn’t realize the “Semantic Web” was beginning to “…take a recognizable shape.” After a decade of attempting to find an achievable agenda, it’s about time.

The varying interpretations of Semantic Web origin tales are quite amusing. In the first creation account, independent agents were going to schedule medical appointments and tennis matches for us. In the second account, our machines were going to reason across structured data to produce new insights. More recently, the vision is of a web of CMU Coke machines connected to the WWW, along with other devices. (The Internet of Things.)

I suppose the next version will be computers that can exchange information using the TCP/IP protocol and various standards, like HTML, for formatting documents. Plus some declaration that semantics will be handled in a future version, sufficiently far off to keep grant managers from fearing an end to the project.

The post is a good example of using R to query SPARQL endpoints. Since you will encounter data at SPARQL endpoints, it is a useful exercise.

The example data set is one of wildfires and acres burned per year, 1960-2008.
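The post does this in R, but the mechanics are the same in any language: send the query text to the endpoint in a `query` parameter and read back a result table. Below is a minimal Python sketch of those two steps using only the standard library. It builds the request URL and parses the SPARQL JSON results format without touching the network. The endpoint URL, the variable names and the numbers in the sample document are all placeholders I made up, not the wildfire figures from the post.

```python
import json
from typing import Dict, List, Optional
from urllib.parse import urlencode


def build_query_url(endpoint: str, query: str) -> str:
    """Build a SPARQL Protocol GET URL; the query text goes in 'query'."""
    return endpoint + "?" + urlencode({"query": query})


def rows_from_json(text: str) -> List[Dict[str, Optional[str]]]:
    """Flatten a SPARQL JSON results document into a list of plain dicts.

    Variables left unbound in a solution are reported as None.
    """
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        rows.append({
            var: binding[var]["value"] if var in binding else None
            for var in variables
        })
    return rows


# Placeholder response with invented values, shaped like the standard format.
SAMPLE = """
{
  "head": {"vars": ["year", "acres"]},
  "results": {"bindings": [
    {"year": {"type": "literal", "value": "1960"},
     "acres": {"type": "literal", "value": "100"}},
    {"year": {"type": "literal", "value": "1961"}}
  ]}
}
"""

print(build_query_url("http://example.org/sparql", "SELECT * WHERE { ?s ?p ?o }"))
for row in rows_from_json(SAMPLE):
    print(row)
```

From there the rows can go into whatever plotting or analysis tool you prefer, which is the part the R post spends most of its time on.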

More interesting fire data sets can be found at: Fire Detection GIS Data.

Mapping that data by date, weather conditions/trends, known impact, would require coordination between diverse data sets.

SPARQL end-point of data.europeana.eu

Friday, December 21st, 2012

SPARQL end-point of data.europeana.eu

From the webpage:

Welcome on the SPARQL end-point of data.europeana.eu!

data.europeana.eu currently contains open metadata on 20 million texts, images, videos and sounds gathered by Europeana. Data is following the terms of the Creative Commons CC0 public domain dedication. Data is described in the Resource Description Framework (RDF) format, and structured using the Europeana Data Model (EDM). We give more detail on the EDM data we publish on the technical details page.

Please take the time to check out the list of collections currently included in the pilot.

The terms of use and external data sources appearing at data.europeana.eu are provided on the Europeana Data sources page.

Sample queries are available on the sparql page.

At first I wondered why this was news, because “Europeana opens up data on 20 million cultural items” appeared on 12 September 2012 in the Guardian.

I assume the data has been in use since its release last September.

If you have been using it, can you comment on how your use will change now that the data is available as a SPARQL end-point?

YASGUI: Web-based SPARQL client with bells ‘n wistles

Wednesday, December 5th, 2012

YASGUI: Web-based SPARQL client with bells ‘n wistles

From the post:

A few months ago Laurens Rietveld was looking for a query interface from which he could easily query any other SPARQL endpoint.

But he couldn’t find any that fit his requirements:

So he decided to make his own!

Give it a try at: http://aers.data2semantics.org/sparql/

Future work (next year probably):

Worth a look if you are interested in SPARQL per se or want to extract information for re-use in a topic map.

Good to see mention of our friends at Mondeca.

Normalizing company names with SPARQL and DBpedia

Wednesday, December 5th, 2012

Normalizing company names with SPARQL and DBpedia

Bob DuCharme writes:

Wikipedia page redirection data, waiting for you to query it.

If you send your browser to http://en.wikipedia.org/wiki/Big_Blue, you’ll end up at IBM’s page, because Wikipedia knows that this nickname usually refers to this company. (Apparently, it’s also a nickname for several high schools and universities.) This data pointing from nicknames to official names is also stored in DBpedia, which means that we can use SPARQL queries to normalize company names. You can use the same technique to normalize other kinds of names (for example, trying to send your browser to http://en.wikipedia.org/wiki/Bobby_Kennedy will actually send it to http://en.wikipedia.org/wiki/Robert_F._Kennedy), but a query that sticks to one domain will have a simpler job. Description Logics and all that.

As always Bob is on the cutting edge of the use of a markup standard!

Possible topic map analogies:

  • create a second name cluster and the “normalized name” is an additional base name
  • move the “nickname” to a variant name (scope?) and update the base name to be the normalized name (with changes to sort/display as necessary)

I am assuming that Bob’s lang(?redirectsTo) = "en" operates like scope in topic maps.

Except that scope in topic maps is represented by one or more topics, which means merging can occur between topics that represent the same language.
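The normalization step itself is simple once the redirect data is in hand. Here is a minimal Python sketch in which a small dictionary stands in for the redirect pairs a SPARQL query against DBpedia would return. The two nickname entries come from the examples in Bob's post; the page title "IBM" and the extra long-form entry are my assumptions, added only to have something to resolve to. This is not Bob's query, just the lookup you would run over its results.

```python
from typing import Dict

# Stand-in for redirect data fetched from DBpedia. "IBM" as the target title
# and the "International_Business_Machines" entry are assumptions.
REDIRECTS: Dict[str, str] = {
    "Big_Blue": "IBM",
    "International_Business_Machines": "IBM",
    "Bobby_Kennedy": "Robert_F._Kennedy",
}


def normalize(name: str, redirects: Dict[str, str], max_hops: int = 10) -> str:
    """Follow redirects from a name until reaching a page with no redirect.

    Spaces become underscores to mimic Wikipedia page titles. The 'seen' set
    and max_hops guard against redirect loops in dirty data.
    """
    current = name.replace(" ", "_")
    seen = set()
    while current in redirects and current not in seen and len(seen) < max_hops:
        seen.add(current)
        current = redirects[current]
    return current


for nickname in ("Big Blue", "Bobby Kennedy", "IBM"):
    print(nickname, "->", normalize(nickname, REDIRECTS))
```

In topic map terms, every key that resolves to the same value is a candidate for the same name cluster.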

Webnodes Semantic Integration Server (SDShare Protocol)

Thursday, November 22nd, 2012

Webnodes AS announces Webnodes Semantic Integration Server by Mike Johnston.

From the post:

Webnodes AS, a company developing a .NET based semantic content management system, today announced the release of a new product called Webnodes Semantic Integration Server.

Webnodes Semantic Integration Server is a standalone product that has two main components: A SPARQL endpoint for traditional semantic use-cases and the full-blown integration server based on the SDShare protocol. SDShare is a new protocol for allowing different software to share and consume data with each other, with minimal amount of setup.

The integration server ships with connectors out of the box for OData- and SPARQL endpoints and any ODBC compatible RDBMS. This means you can integrate many of the software systems on the market with very little work. If you want to support software not compatible with any of the available connectors, you can create custom connectors. In addition to full-blown connectors, the integration server can also push the raw data to another SPARQL endpoint (the internal data format is RDF) or a HTTP endpoint (for example Apache SOLR).

I wonder about the line:

This means you can integrate many of the software systems on the market with very little work.

I think wiring disparate systems together is a better description. To “integrate” systems implies some useful result.

Wiring systems together is a long way from the hard task of semantic mapping, which produces integration of systems.

I first saw this in a tweet by Paul Hermans.

Sindice SPARQL endpoint

Wednesday, November 21st, 2012

Sindice SPARQL endpoint by Gabi Vulcu.

From an email by Gabi:

We have released a new version of the Sindice SPARQL endpoint (http://sparql.sindice.com/) with two new datasets: sudoc and yago

Below are the current dump datasets that are in the Sparql endpoint:

dataset_uri dataset_name
http://sindice.com/dataspace/default/dataset/dbpedia “dbpedia”
http://sindice.com/dataspace/default/dataset/medicare “medicare”
http://sindice.com/dataspace/default/dataset/whoiswho “whoiswho”
http://sindice.com/dataspace/default/dataset/sudoc “sudoc”
http://sindice.com/dataspace/default/dataset/nytimes “nytimes”
http://sindice.com/dataspace/default/dataset/ookaboo “ookaboo”
http://sindice.com/dataspace/default/dataset/europeana “europeana”
http://sindice.com/dataspace/default/dataset/basekb “basekb”
http://sindice.com/dataspace/default/dataset/geonames “geonames”
http://sindice.com/dataspace/default/dataset/wordnet “wordnet”
http://sindice.com/dataspace/default/dataset/dailymed “dailymed”
http://sindice.com/dataspace/default/dataset/reactome “reactome”
http://sindice.com/dataspace/default/dataset/yago “yago”

The list of crawled website datasets that have been rdf-ized and loaded into the Sparql endpoint can be found here [1]

Due to space limitation we limited both the amount of dump datasets to the ones in the above table and the websites datasets to the top 1000 domains based on the DING[3] score.

However, upon request, if someone needs a particular dataset (there are more to choose from here [4]), we can arrange to get it into the Sparql endpoint in the next release.

[1] https://docs.google.com/spreadsheet/ccc?key=0AvdgRy2el8d9dERUZzBPNEZIbVJTTVVIRDVUWHhKdWc
[2] https://docs.google.com/spreadsheet/ccc?key=0AvdgRy2el8d9dGhDaHMta0MtaG9vWWhhbTd5SVVaX1E
[3] http://ding.sindice.com
[4] https://docs.google.com/spreadsheet/ccc?key=0AvdgRy2el8d9dGhDaHMta0MtaG9vWWhhbTd5SVVaX1E#gid=0

You may also be interested in: sindice-dev — Sindice developers list.

Stardog 1.1 Release

Thursday, November 15th, 2012

Stardog 1.1 Release

From the webpage:

Stardog is a fast, commercial RDF database: SPARQL for queries; OWL for reasoning; pure Java for the Enterprise.

Stardog 1.1 supports SPARQL 1.1.

I first saw this in a tweet from Kendall Clark.

IOGDS: International Open Government Dataset Search

Saturday, November 10th, 2012

IOGDS: International Open Government Dataset Search

Description:

The TWC International Open Government Dataset Search (IOGDS) is a linked data application based on metadata “scraped” from hundreds of international dataset catalog websites publishing a rich variety of government data. Metadata extracted from these catalog websites is automatically converted to RDF linked data and re-published via the TWC LOGD SPARQL endpoint and made available for download. The TWC IOGDS demo site features an efficient, reconfigurable faceted browser with search capabilities offering a compelling demonstration of the value of a common metadata model for open government dataset catalogs. We believe that the vocabulary choices demonstrated by IOGDS highlights the potential for useful linked data applications to be created from open government catalogs and will encourage the adoption of such a standard worldwide.

In addition to the datasets you will find tutorials, videos, demos, tools and technologies and other resources.

Worth a visit whether you are looking for Linked Data as such or for Linked Data to re-use in other ways.

Seen in a tweet by Tim O’Reilly.

Eleven SPARQL 1.1 Specifications Published

Friday, November 9th, 2012

Eleven SPARQL 1.1 Specifications Published

From the post:

The SPARQL Working Group has today published a set of eleven documents, advancing most of SPARQL 1.1 to Proposed Recommendation. Building on the success of SPARQL 1.0, SPARQL 1.1 is a full-featured standard system for working with RDF data, including a query/update language, two HTTP protocols (one full-featured, one using basic HTTP verbs), three result formats, and other features which allow SPARQL endpoints to be combined and work together. Most features of SPARQL 1.1 have already been implemented by a range of SPARQL suppliers, as shown in our table of implementations and test results.

The Proposed Recommendations are:

  1. SPARQL 1.1 Overview – Overview of SPARQL 1.1 and the SPARQL 1.1 documents
  2. SPARQL 1.1 Query Language – A query language for RDF data.
  3. SPARQL 1.1 Update – Specifies additions to the query language to allow clients to update stored data
  4. SPARQL 1.1 Query Results JSON Format – How to use JSON for SPARQL query results
  5. SPARQL 1.1 Query Results CSV and TSV Formats – How to use comma-separated values (CSV) and tab-separated values (TSV) for SPARQL query results
  6. SPARQL Query Results XML Format – How to use XML for SPARQL query results. (This contains only minor, editorial updates from SPARQL 1.0, and is actually a Proposed Edited Recommendation.)
  7. SPARQL 1.1 Federated Query – an extension of the SPARQL 1.1 Query Language for executing queries distributed over different SPARQL endpoints.
  8. SPARQL 1.1 Service Description – a method for discovering and a vocabulary for describing SPARQL services.

While you are waiting for news on SPARQL performance increases, some reading material to pass the time.

Federated SPARQL Queries [Take "Hit" From Multiple/Distributed Data Sets]

Thursday, November 8th, 2012

On the Impact of Data Distribution in Federated SPARQL Queries by Nur Aini Rakhmawati and Michael Hausenblas.

Abstract:

With the growing number of publicly available SPARQL endpoints, federated queries become more and more attractive and feasible. Compared to queries against a single endpoint, queries that range over a number of endpoints pose new challenges, ranging from the type and number of datasets involved to the data distribution across the datasets. Existing research focuses on the data distribution in a central store and is mainly concerned with adopting well-known, traditional database techniques. In this work we investigate the impact of the data distribution in the context of federated SPARQL queries. We perform a number of experiments with four federation frameworks (Sesame Alibaba, Splendid, FedX, and Darq) against an RDF dataset, Dailymed, that we partition by graph and class. Our preliminary results confirm the intuition that the more datasets involved in query processing, the worse performance of federation query is and that the data distribution significantly influences the performance.

It isn’t often I read in the same paragraph:

With the growing number of publicly available SPARQL endpoints, federated queries become more and more attractive and feasible.

and

Our preliminary results confirm the intuition that the more datasets involved in query processing, the worse performance of federation query is and that the data distribution significantly influences the performance.

I have trouble reconciling “…more and more attractive and feasible” with “…the more datasets…the worse performance of federation query is….”

Particularly in the age of “big data” where an increasing number of datasets and data distribution are the norms, not exceptions.

I commend the authors for creating data points to confirm “intuitions” about SPARQL performance.

At the same time, their results raise serious questions about SPARQL in big data environments.
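To see why more datasets means more work, it helps to look at what a federated query looks like when written by hand. The Python sketch below builds a SPARQL 1.1 query with one SERVICE block per endpoint, joined by UNION, so each added endpoint is another remote call. The endpoint URLs are hypothetical, and this is the explicit SERVICE form from the SPARQL 1.1 Federated Query specification, not necessarily what the federation frameworks tested in the paper do internally.

```python
from typing import List


def federated_query(endpoints: List[str], pattern: str) -> str:
    """Build a query that evaluates one graph pattern at every endpoint.

    Each endpoint gets its own SERVICE block; the blocks are combined with
    UNION so results from all endpoints are returned together.
    """
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    blocks = [
        f"  {{ SERVICE <{endpoint}> {{ {pattern} }} }}"
        for endpoint in endpoints
    ]
    body = "\n  UNION\n".join(blocks)
    return "SELECT * WHERE {\n" + body + "\n}"


# Hypothetical endpoints, used only to show the shape of the query.
ENDPOINTS = [
    "http://a.example.org/sparql",
    "http://b.example.org/sparql",
    "http://c.example.org/sparql",
]

print(federated_query(ENDPOINTS, "?drug ?p ?o"))
```

Three endpoints means three SERVICE blocks, and the slowest of them sets the pace for the whole query.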

SPARQL and Big Data (and NoSQL) [Identifying Winners and Losers - Cui Bono?]

Saturday, October 27th, 2012

SPARQL and Big Data (and NoSQL) by Bob DuCharme.

From the post:

How to pursue the common ground?

I think it’s obvious that SPARQL and other RDF-related technologies have plenty to offer to the overlapping worlds of Big Data and NoSQL, but this doesn’t seem as obvious to people who focus on those areas. For example, the program for this week’s Strata conference makes no mention of RDF or SPARQL. The more I look into it, the more I see that this flexible, standardized data model and query language align very well with what many of those people are trying to do.

But, we semantic web types can’t blame them for not noticing. If you build a better mouse trap, the world won’t necessarily beat a path to your door, because they have to find out about your mouse trap and what it does better. This requires marketing, which requires talking to those people in language that they understand, so I’ve been reading up on Big Data and NoSQL in order to better appreciate what they’re trying to do and how.

A great place to start is the excellent (free!) booklet Planning for Big Data by Edd Dumbill. (Others contributed a few chapters.) For a start, he describes data that “doesn’t fit the strictures of your database architectures” as a good candidate for Big Data approaches. That’s a good start for us. Here are a few longer quotes that I found interesting, starting with these two paragraphs from the section titled “Ingesting and Cleaning” after a discussion about collecting data from multiple different sources (something else that RDF and SPARQL are good at):

Bob has a very good point: marketing “…requires talking to those people in language that they understand….”

That is, no matter how “good” we think a solution may be, it won’t interest others until we explain it in terms they “get.”

But “marketing” requires more than a lingua franca.

Once an offer is made and understood, it must interest the other person. Or it is very poor marketing.

We may think that any sane person would jump at the chance to reduce the time and expense of data cleaning. But that isn’t necessarily the case.

I once made a proposal that would substantially reduce the time and expense for maintaining membership records. Records that spanned decades and were growing every year (hard copy). I made the proposal, thinking it would be well received.

Hardly. I was called into my manager’s office and got a lecture on how the department in question had more staff, a larger budget, etc., than any other department. They had no interest whatsoever in my proposal, and I should not presume to offer further advice. (Years later my suggestion was adopted when budget issues forced the issue.)

Efficient information flow interested me but not management.

Bob and the rest of us need to ask the traditional question: Cui bono? (To whose benefit?)

Semantic technologies, just like any other, have winners and losers.

To effectively market our wares, we need to identify both.

Relational Data to RDF [Bridge to Nowhere?]

Sunday, October 21st, 2012

Transforming Relational Data to RDF – R2RML Becomes Official W3C Recommendation by Eric Franzon.

From the post:

Today, the World Wide Web Consortium announced that R2RML has achieved Recommendation status. As stated on the W3C website, R2RML is “a language for expressing customized mappings from relational databases to RDF datasets. Such mappings provide the ability to view existing relational data in the RDF data model, expressed in a structure and target vocabulary of the mapping author’s choice.” In the life cycle of W3C standards creation, today’s announcement means that the specifications have gone through extensive community review and revision and that R2RML is now considered stable enough for wide-spread distribution in commodity software.

Richard Cyganiak, one of the Recommendation’s editors, explained why R2RML is so important. “In the early days of the Semantic Web effort, we’ve tried to convert the whole world to RDF and OWL. This clearly hasn’t worked. Most data lives in entrenched non-RDF systems, and that’s not likely to change.”

“That’s why technologies that map existing data formats to RDF are so important,” he continued. “R2RML builds a bridge between the vast amounts of existing data that lives in SQL databases and the SPARQL world. Having a standard for this makes SPARQL even more useful than it already is, because it can more easily access lots of valuable existing data. It also means that database-to-RDF middleware implementations can be more easily compared, which will create pressure on both open-source and commercial vendors, and will increase the level of play in the entire field.” (emphasis added)

If most data resides in non-RDF systems, what do I gain by converting it into RDF for querying with SPARQL?

Some possible costs:

  • Planning the conversion from non-RDF to RDF system
  • Debugging the conversion (unless it is trivial, the first few conversions won’t be right)
  • Developing the SPARQL queries
  • Debugging the SPARQL queries
  • Updating the conversion if new data is added to the source
  • Testing the SPARQL query against updated data
  • Maintenance of the source and target RDF systems (unless pushing SPARQL is a way to urge conversion from relational systems)
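For a sense of what even the simplest conversion involves, here is a hand-rolled Python sketch that turns rows of a relational table into subject/predicate/object triples using the standard sqlite3 module. It is neither R2RML nor the W3C Direct Mapping, just an illustration of the idea. The base IRI, the table and its rows are all hypothetical, and every decision baked into it (how to mint subject IRIs, what to do with NULLs, whether values keep their types) is one of the things you would have to plan and debug in a real conversion.

```python
import sqlite3
from typing import List, Tuple

BASE = "http://example.org/"  # hypothetical base IRI


def rows_to_triples(conn: sqlite3.Connection, table: str,
                    key: str) -> List[Tuple[str, str, str]]:
    """Emit one triple per non-NULL, non-key column value of each row.

    Subjects are minted from the table name and key value, predicates from
    the table and column names. All values are emitted as plain strings.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key}"')
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{BASE}{table}/{record[key]}"
        for column, value in record.items():
            if column == key or value is None:
                continue  # the key is in the subject; NULL yields no triple
            triples.append((subject, f"{BASE}{table}#{column}", str(value)))
    return triples


def demo() -> List[Tuple[str, str, str]]:
    """Convert a tiny in-memory table of invented membership records."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE member (id INTEGER PRIMARY KEY, name TEXT, joined INTEGER)"
        )
        conn.executemany(
            "INSERT INTO member VALUES (?, ?, ?)",
            [(1, "Ada", 1999), (2, "Bob", None)],
        )
        return rows_to_triples(conn, "member", "id")
    finally:
        conn.close()


for triple in demo():
    print(triple)
```

Note that the moment a row is added to the source table, the triples are stale, which is the "updating the conversion" cost in the list above.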

Or to put it another way, if most data is still on non-RDF data stores, why do I need a bridge to SPARQL world?

Or is this a Sarah Palin bridge to nowhere?

SPARQL 1.1 Query Language [Last Call - 21 August 2012]

Tuesday, July 24th, 2012

SPARQL 1.1 Query Language

From the W3C News page:

The SPARQL Working Group has published a Last Call Working Draft of SPARQL 1.1 Query Language. RDF is a directed, labeled graph data format for representing information in the Web. This specification defines the syntax and semantics of the SPARQL query language for RDF. SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. SPARQL contains capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions. SPARQL also supports aggregation, subqueries, negation, creating values by expressions, extensible value testing, and constraining queries by source RDF graph. The results of SPARQL queries can be result sets or RDF graphs. Comments are welcome through 21 August.

Linked Media Framework [Semantic Web vs. ROI]

Tuesday, July 10th, 2012

Linked Media Framework

From the webpage:

The Linked Media Framework is an easy-to-setup server application that bundles central Semantic Web technologies to offer advanced services. The Linked Media Framework consists of LMF Core and LMF Modules.

LMF Usage Scenarios

The LMF has been designed with a number of typical use cases in mind. We currently support the following tasks out of the box:

Target groups are in particular casual users who are not experts in Semantic Web technologies but still want to publish or work with Linked Data, e.g. in the Open Government Data and Linked Enterprise Data area.

It is a bad assumption that workers in business or government have free time to add semantics to their data sets.

If adding semantics to your data, by linked data or other means, is a core value, resource the task just like any other: with your internal staff or with outside help.

A Semantic Web shortcoming is the attitude that users are interested in building it or have the time to do so (assuming the project is worthwhile and/or doable).

Users are fully occupied with tasks of their own and don’t need a technical elite tossing more tasks onto them. You want the Semantic Web? Suggest you get on that right away.

Integrated data that meets a business need and has proven ROI isn’t the same thing as the Semantic Web. Give me a call if you are interested in the former, not the latter. (I would do the latter as well, but only on your dime.)

I first saw this at semanticweb.com, announcing version 2.2.0 of lmf – Linked Media Framework.

SparQLed…Writing SPARQL Queries [Less ZERO-result queries]

Friday, July 6th, 2012

SindiceTech Releases SparQLed As Open Source Project To Simplify Writing SPARQL Queries by Jennifer Zaino.

From the post:

SindiceTech today released SparQLed, the SindiceTech Assisted SPARQL Editor, as an open source project. SindiceTech, a spinoff company from the DERI Institute, commercializes large-scale, Big Data infrastructures for enterprises dealing with semantic data. It has roots in the semantic web index Sindice, which lets users collect, search, and query semantically marked-up web data (see our story here).

SparQLed also is one of the components of the commercial Sindice Suite for helping large enterprises build private linked data clouds. It is designed to give users all the help they need to write SPARQL queries to extract information from interconnected datasets.

“SPARQL is exciting but it’s difficult to develop and work with,” says Giovanni Tummarello, who led the efforts around the Sindice search and analysis engine and is founder and CEO of SindiceTech.

SparQLed Project page.

Maybe we have become spoiled by search engines that always return results, even bad ones:

With SQL, the advantage lies in having a schema which users can look at and understand how to write a query. RDF, on the other hand, has the advantage of providing great power and freedom, because information in RDF can be interconnected freely. But, Tummarello says, “with RDF there is no schema because there is all sorts of information from everywhere.” Without knowing which properties are available specifically for a certain URI and in what context, users can wind up writing queries that return no results and get frustrated by the constant iterating needed to achieve their ends.
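The missing-schema problem described here can be softened by summarizing the data itself. I have not checked how SparQLed does this; the sketch below only shows the general idea, with made-up triples and a hypothetical predicates_by_class helper: record which predicates are actually used on instances of each class, so a query author can see in advance which class/predicate combinations can match anything at all.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

# Hypothetical triples; prefixed names are kept as plain strings for brevity.
triples = [
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:bob", RDF_TYPE, "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:acme", RDF_TYPE, "org:Organization"),
    ("ex:acme", "org:identifier", "ACME-1"),
]


def predicates_by_class(triples):
    """Map each class to the sorted predicates used on its instances."""
    classes_of = defaultdict(set)  # subject -> classes it is typed with
    for s, p, o in triples:
        if p == RDF_TYPE:
            classes_of[s].add(o)

    summary = defaultdict(set)  # class -> predicates seen on its instances
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        for cls in classes_of.get(s, ()):
            summary[cls].add(p)
    return {cls: sorted(preds) for cls, preds in summary.items()}


def can_match(summary, cls, predicate):
    """True if some instance of cls uses predicate in the summarized data."""
    return predicate in summary.get(cls, [])


summary = predicates_by_class(triples)
for cls, preds in sorted(summary.items()):
    print(cls, preds)
```

A check such as can_match(summary, "foaf:Person", "org:identifier") comes back False before any query is run, which is the cheap way to avoid one class of zero-result queries.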

I am not encouraged by a features list that promises:

Less ZERO-result queries

Cascading map-side joins over HBase for scalable join processing

Sunday, July 1st, 2012

Cascading map-side joins over HBase for scalable join processing by Martin Przyjaciel-Zablocki, Alexander Schätzle, Thomas Hornung, Christopher Dorner, and Georg Lausen.

Abstract:

One of the major challenges in large-scale data processing with MapReduce is the smart computation of joins. Since Semantic Web datasets published in RDF have increased rapidly over the last few years, scalable join techniques become an important issue for SPARQL query processing as well. In this paper, we introduce the Map-Side Index Nested Loop Join (MAPSIN join) which combines scalable indexing capabilities of NoSQL storage systems like HBase, that suffer from an insufficient distributed processing layer, with MapReduce, which in turn does not provide appropriate storage structures for efficient large-scale join processing. While retaining the flexibility of commonly used reduce-side joins, we leverage the effectiveness of map-side joins without any changes to the underlying framework. We demonstrate the significant benefits of MAPSIN joins for the processing of SPARQL basic graph patterns on large RDF datasets by an evaluation with the LUBM and SP2Bench benchmarks. For most queries, MAPSIN join based query execution outperforms reduce-side join based execution by an order of magnitude.

Some topic map applications include Linked Data/RDF processing capabilities.

The salient comment here being: “For most queries, MAPSIN join based query execution outperforms reduce-side join based execution by an order of magnitude.”
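For readers who have not met the two join styles, here is a single-process sketch of the difference. It is not the authors' implementation and involves no MapReduce or HBase: a dictionary stands in for the HBase index, the triples are invented, and the function names are mine. The reduce-side version groups both inputs by the join key before joining; the index nested loop version takes each binding from the first triple pattern and probes the index for the second.

```python
from collections import defaultdict

# Hypothetical triples; the basic graph pattern joined below is
#   ?x advisor ?y .  ?y worksFor ?z
triples = [
    ("s1", "advisor", "p1"),
    ("s2", "advisor", "p2"),
    ("s3", "advisor", "p1"),
    ("p1", "worksFor", "d1"),
    ("p2", "worksFor", "d2"),
    ("p3", "worksFor", "d1"),
]


def build_index(triples):
    """Stand-in for an indexed store: (subject, predicate) -> objects."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[(s, p)].append(o)
    return index


def reduce_side_join(triples):
    """Group both inputs by the join key ?y (the "shuffle"), then join per key."""
    groups = defaultdict(lambda: ([], []))
    for s, p, o in triples:
        if p == "advisor":
            groups[o][0].append(s)  # ?y is the object of the first pattern
        elif p == "worksFor":
            groups[s][1].append(o)  # ?y is the subject of the second pattern
    return sorted(
        (x, y, z) for y, (xs, zs) in groups.items() for x in xs for z in zs
    )


def index_nested_loop_join(triples, index):
    """For each binding of the first pattern, look up matches in the index."""
    results = []
    for s, p, o in triples:
        if p == "advisor":
            for z in index.get((o, "worksFor"), ()):
                results.append((s, o, z))
    return sorted(results)


index = build_index(triples)
reduce_result = reduce_side_join(triples)
mapsin_like_result = index_nested_loop_join(triples, index)
print(mapsin_like_result)
```

Both functions return the same bindings. The difference that matters at scale is that the second one never has to move the whole worksFor input around, only the lookups it actually needs.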

Recycling RDF and SPARQL

Wednesday, June 6th, 2012

I was surprised to learn the W3C is recycling RDF and SPARQL for graph analytics:

RDF and SPARQL (both standards developed by the World Wide Web Consortium) [were developed] as the industry standard[s] for graph analytics.

It doesn’t hurt to repurpose those standards, assuming they are appropriate for graph analytics.

Or rather, assuming they are appropriate for your graph analytic needs.

BTW, there is a contest to promote recycling of RDF and SPARQL with a $70,000 first prize:

YarcData Announces $100,000 Big Data Graph Analytics Challenge

From the post:

At the 2012 Semantic Technology & Business Conference in San Francisco, YarcData, a Cray company, has announced the planned launch of a “Big Data” contest featuring $100,000 in prizes. The YarcData Graph Analytics Challenge will recognize the best submissions for solutions of un-partitionable, Big Data graph problems.

YarcData is holding the contest to showcase the increasing applicability and adoption of graph analytics in solving Big Data problems. The contest also is intended to promote the use and development of RDF and SPARQL (both standards developed by the World Wide Web Consortium) as the industry standard for graph analytics.

“Graph databases have a significant role to play in analytic environments, and they can solve problems like relationship discovery that other traditional technologies do not handle easily,” said Philip Howard, Research Director, Bloor Research. “YarcData driving thought leadership in this area will be positive for the overall graph database market, and this contest could help expand the use of RDF and SPARQL as valuable tools for solving Big Data problems.”

The grand prize for the first place winner is $70,000. The second place winner will receive $10,000, and the third place winner will receive $5,000. There also will be additional prizes for the other finalists. Contest judges, which will include a combination of Big Data industry analysts, experts from academia and semantic web, and YarcData customers, will review the submissions and select the 10 best contestants.

The YarcData Graph Analytics Challenge will officially begin on Tuesday, June 26, 2012, and winners will be announced during a live Web event on December 4, 2012. Full contest details, including specific criteria and the contest judges, will be announced on June 26. To pre-register for a contest information packet, please visit the YarcData website at www.yarcdata.com. Information packets will be sent out June 26. The contest will be open only to those individuals who are eligible to participate under U.S. and other applicable laws and regulations.

Full details to follow on June 26, 2012.

Simple federated queries with RDF [Part 1]

Thursday, May 10th, 2012

Simple federated queries with RDF [Part 1]

Bob DuCharme writes:

A few more triples to identify some relationships, and you’re all set.

[side note] Easy aggregation without conversion is where semantic web technology shines the brightest.

Once, at an XML Summer School session, I was giving a talk about semantic web technology to a group that included several presenters from other sessions. This included Henry Thompson, who I’ve known since the SGML days. He was still a bit skeptical about RDF, and said that RDF was in the same situation as XML—that if he and I stored similar information using different vocabularies, we’d still have to convert his to use the same vocabulary as mine or vice versa before we could use our data together. I told him he was wrong—that easy aggregation without conversion is where semantic web technology shines the brightest.

I’ve finally put together an example. Let’s say that I want to query across his address book and my address book together for the first name, last name, and email address of anyone whose email address ends with “.org”. Imagine that his address book uses the vCard vocabulary and the Turtle syntax and looks like this,

Bob is an expert in more areas (markup, SGML/XML, SPARQL and others) than I can easily count. Not to mention being a good friend.

Take a look at Bob’s post and decide for yourself how “simple” the federation is following Bob’s technique.

I am just going to let it speak for itself today.

I will outline obvious and some not so obvious steps in Bob’s “simple” federated queries in Part II.

Meronymy SPARQL Database Server To Debut With Emphasis on High Performance

Wednesday, February 22nd, 2012

Meronymy SPARQL Database Server To Debut With Emphasis on High Performance

From the post:

Coming in June from start-up Meronymy is a new RDF enterprise database management system, the Meronymy SPARQL Database Server. The company, founded by Inge Henriksen, began life because of the need he saw for a high-performance and more scalable RDF database server.

The idea to focus on a database server exclusively oriented to Linked Data and the Semantic Web came as a result of Henriksen’s work over the last decade as an IT consultant implementing many semantic solutions for customers in sectors such as government and education. “One issue that always came up was performance,” he explains, especially when performing more advanced SPARQL queries against triple stores using filters, for example.

“Once the data reached a certain size, which it often did very quickly, the size of the data became unmanageable and we had to fall back on caching and the like to resolve these performance issues.” The problem there is that caching isn’t compatible with situations where there is a need for real-time data.

A closed beta is due out soon. Register at Meronymy.

Introducing Meronymy SPARQL Database Server

Monday, January 16th, 2012

Introducing Meronymy SPARQL Database Server

Inge Henriksen writes:

I am pleased to announce today that the Meronymy SPARQL Database Server is ready for release later in 2012. Meronymy SPARQL Database Server is a high performance RDF Enterprise Database Management System (DBMS).

Our goal has been to make a really fast, ACID, OS portable, user friendly, secure, SPARQL-driven RDF database server usable with most programming languages.

Let’s not start any language wars about Meronymy being written in C++/assembly, ;-) , and concentrate on its performance in actual use.

Suggested RDF data sets to use to test that performance? (Knowing Inge I trust it is fast but the question is how fast under what circumstances?)

Or other RDF engines to test alongside it?

PS: If you don’t know SPARQL, check out Learning SPARQL by Bob DuCharme.

Meronymy SPARQL Database Server

Friday, January 13th, 2012

Meronymy SPARQL Database Server

Inge Henriksen writes:

We are pleased to announce that the Meronymy SPARQL Database Server is ready for release later in 2012. Those interested in our RDF database server software should consider registering today; those that do get exclusive early access to beta software in the upcoming closed beta testing period, insider news on the development progress, get to submit feature requests, and otherwise directly influence the finished product.

From the FAQ we learn some details:

A: All components in the database server and its drivers have been programmed from scratch so that we could optimize them in terms of their performance.
We developed the database server in C++ since this programming language has the most potential for optimization; there is also some inline assembly at key locations in the programming code.
Some more components that make our database management system very fast:

  • In-process query optimizer; determines the most efficient way to execute a query.
  • In-process memory manager; for much faster memory allocation and deallocation than the operating system can provide.
  • In-process multithreaded HTTP server; for much faster SPARQL Protocol endpoint than through a standard out-of-process web server.
  • In-process multithreaded TCP/IP-listener with thread pooling; for efficient thread management.
  • In-process directly coded lexical analyzer; for efficient query parsing.
  • Snapshot isolation; for fast transaction processing.
  • B+ trees; for fast indexing
  • In-process stream-oriented XML parser; for fast RDF/XML parsing.
  • An RDF data model; for no data model abstraction layers which results in faster processing of data.

I’m signing up for the beta. How about you?

UMBEL Services, Part 1: Overview

Tuesday, December 13th, 2011

UMBEL Services, Part 1: Overview

From the post:

UMBEL, the Upper Mapping and Binding Exchange Layer, is an upper ontology of about 28,000 reference concepts and a vocabulary designed for domain ontologies and ontology mapping [1]. When we first released UMBEL in mid-2008 it was accompanied by a number of Web services and a SPARQL endpoint, and general APIs. In fact, these were the first Web services developed for release by Structured Dynamics. They were the prototypes for what later became the structWSF Web services framework, which incorporated many lessons learned and better practices.

By the time that the structWSF framework had evolved with many additions to comprise the Open Semantic Framework (OSF), those original UMBEL Web services had become quite dated. Thus, upon the last major update to UMBEL to version 1.0 back in February of this year, we removed these dated services.

Like what I earlier mentioned about the cobbler’s children being the last to get new shoes, it has taken us a bit to upgrade the UMBEL services. However, I am pleased to announce we have now completed the transition of UMBEL’s earlier services to use the OSF framework, and specifically the structWSF platform-independent services. As a result, there are both upgraded existing services and some exciting new ones. We will now be using UMBEL as one of our showcases for these expanding OSF features. We will be elaborating upon these features throughout this series, some parts of which will appear on Fred Giasson’s blog.

In this first part, we provide a broad overview of the new UMBEL OSF implementation. We also begin to foretell some of the parts to come that will describe some of these features in more detail.

There are three more parts that follow this one.

If you have the time, I am interested in your take on this resource.

A lot of time and effort has gone into making this a useful site, so what parts do you like best/least? What would you change?

More to follow on this one.

British Museum Semantic Web Collection Online

Friday, December 9th, 2011

British Museum Semantic Web Collection Online

From the webpage:

Welcome to this Linked Data and SPARQL service. It provides access to the same collection data available through the Museum’s web presented Collection Online, but in a computer readable format. The use of the W3C open data standard, RDF, allows the Museum’s collection data to join and relate to a growing body of linked data published by other organisations around the world interested in promoting accessibility and collaboration.

The data has also been organised using the CIDOC-CRM (Conceptual Reference Model) crucial for harmonising with other cultural heritage data. The current version is beta and development work continues to improve the service. We hope that the service will be used by the community to develop friendly web applications that are freely available to the community.

Please use the SPARQL menu item to use the SPARQL user interface or click here.

With the British National Bibliography and the British Museum both accessible via SPARQL, and Bob DuCharme’s Learning SPARQL book in hand, the excuses for not knowing SPARQL cold are few and far between.

SPARQL 1.1 Overview

Friday, November 25th, 2011

SPARQL 1.1 Overview

From the webpage:

Abstract:

This document is an overview of SPARQL 1.1. It provides an introduction to a set of W3C specifications that facilitate querying and manipulating RDF graph content on the Web or in an RDF store. (First Public Working draft)

Not a deep introduction but does include enough pointers and other material that it is worth reading.