Archive for the ‘SPARQL’ Category

SPARQL queries of Beatles recording sessions

Monday, November 20th, 2017

SPARQL queries of Beatles recording sessions – Who played what when? by Bob DuCharme.

From the post:

While listening to the song Dear Life on the new Beck album, I wondered who played the piano on the Beatles’ Martha My Dear. A web search found the website Beatles Bible, where the Martha My Dear page showed that it was Paul.

This was not a big surprise, but one pleasant surprise was how that page listed absolutely everyone who played on the song and what they played. For example, a musician named Leon Calvert played both trumpet and flugelhorn. The site’s Beatles’ Songs page links to pages for every song, listing everyone who played on them, with very few exceptions–for example, for giant Phil Spector productions like The Long and Winding Road, it does list all the instruments, but not who played them. On the other hand, for the orchestra on A Day in the Life, it lists the individual names of all 12 violin players, all 4 violists, and the other 25 or so musicians who joined the Fab Four for that.

An especially nice surprise on this website was how syntactically consistent the listings were, leading me to think “with some curl commands, python scripting, and some regular expressions, I could, dare I say it, convert all these listings to an RDF database of everyone who played on everything, then do some really cool SPARQL queries!”

So I did, and the RDF is available in the file BeatlesMusicians.ttl. The great part about having this is the ability to query across the songs to find out things such as how many different people played a given instrument on Beatles recordings or what songs a given person may have played on, regardless of instrument. In a pop music geek kind of way, it’s been kind of exciting to think that I could ask and answer questions about the Beatles that may have never been answered before.
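A query in the spirit of Bob's examples might count how many different people played each instrument. Note that the property names below (`m:player`, `m:instrument`) are guesses at the shape of his vocabulary, not the actual terms in BeatlesMusicians.ttl:

```sparql
# Hypothetical vocabulary; check BeatlesMusicians.ttl for the real terms.
PREFIX m: <http://learningsparql.com/ns/musicians/>

SELECT ?instrument (COUNT(DISTINCT ?player) AS ?players)
WHERE {
  ?performance m:player     ?player ;
               m:instrument ?instrument .
}
GROUP BY ?instrument
ORDER BY DESC(?players)
```

Swap the GROUP BY variable and you have the reverse question: every song a given musician played on, regardless of instrument.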

Will the continuing popularity of the Beatles drive interest in SPARQL? Hard to say, but DuCharme gives it a hard push in this post. It will certainly appeal to Beatles devotees.


Is it coincidence that DuCharme posted this on November 19, 2017, the same day as the reported death of Charles Manson? (cf. Helter Skelter)

There’s a logical extension to DuCharme’s RDF file: Charles Manson, the Manson family and the music of that era.

Many foolish things were said about rock-n-roll in the ’60s that are now being repeated about social media and terrorists. Same rant, same lack of evidence, same intolerance, same ineffectual measures against it. Not only can elders not learn from the past, they can’t wait to repeat it.

Be inventive! Learn from past mistakes so you can make new ones in the future!

SPARQL in 11 minutes (Bob DuCharme)

Monday, May 4th, 2015

From the description:

An introduction to the W3C query language for RDF. See http://www.learningsparql.com for more.

I first saw this in Bob DuCharme’s post: SPARQL: the video.

Nothing new for old hands but useful to pass on to newcomers.

I say nothing new, but I did learn that Bob has a Korg Monotron synthesizer. Looking forward to more “accompanied” blog posts. 😉

Blazegraph™ Selected by Wikimedia Foundation to Power the Wikidata Query Service

Sunday, March 8th, 2015

Blazegraph™ Selected by Wikimedia Foundation to Power the Wikidata Query Service by Brad Bebee.

From the post:

Blazegraph™ has been selected by the Wikimedia Foundation to be the graph database platform for the Wikidata Query Service. Read the Wikidata announcement here. Blazegraph™ was chosen over Titan, Neo4j, Graph-X, and others by Wikimedia in their evaluation. There’s a spreadsheet link in the selection message, which has quite an interesting comparison of graph database platforms.

Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wikisource, and others. The Wikidata Query Service is a new capability being developed to allow users to be able to query and curate the knowledge base contained in Wikidata.

We’re super-psyched to be working with Wikidata and think it will be a great thing for Wikidata and Blazegraph™.
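The Wikidata Query Service eventually went live at query.wikidata.org; queries there look like the standard example below (the wd:/wdt:/wikibase:/bd: prefixes are predefined by the service; wdt:P31 is “instance of” and wd:Q146 is “house cat”). At the time of this announcement the service was still under development, so take this as a sketch of the query style:

```sparql
# All items that are instances of house cat, with English labels.
SELECT ?item ?itemLabel
WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 50
```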

From the Blazegraph™ SourceForge page:

Blazegraph™ is SYSTAP’s flagship graph database. It is specifically designed to support big graphs offering both Semantic Web (RDF/SPARQL) and Graph Database (tinkerpop, blueprints, vertex-centric) APIs. It is built on the same open source GPLv2 platform and maintains 100% binary and API compatibility with Bigdata®. It features robust, scalable, fault-tolerant, enterprise-class storage and query and high-availability with online backup, failover and self-healing. It is in production use with enterprises such as Autodesk, EMC, Yahoo7!, and many others. Blazegraph™ provides both embedded and standalone modes of operation.

Blazegraph has a High Availability and Scale Out architecture. It provides robust support for Semantic Web (RDF/SPARQL) and Property Graph (Tinkerpop) APIs. The highly scalable Blazegraph can handle 50 billion edges.

See also the Blazegraph wiki, which has forty-three (43) substantive links to further details on Blazegraph.

For an even deeper look, consider the white papers on the Blazegraph site.

Enjoy!

SPARQLES: Monitoring Public SPARQL Endpoints

Sunday, February 15th, 2015

SPARQLES: Monitoring Public SPARQL Endpoints by Pierre-Yves Vandenbussche, Jürgen Umbrich, Aidan Hogan, and Carlos Buil-Aranda.

Abstract:

We describe SPARQLES: an online system that monitors the health of public SPARQL endpoints on the Web by probing them with custom-designed queries at regular intervals. We present the architecture of SPARQLES and the variety of analytics that it runs over public SPARQL endpoints, categorised by availability, discoverability, performance and interoperability. To motivate the system, we give examples of some key questions about the health and maturation of public SPARQL endpoints that can be answered by the data it has collected in the past year(s). We also detail the interfaces that the system provides for human and software agents to learn more about the recent history and current state of an individual SPARQL endpoint or about overall trends concerning the maturity of all endpoints monitored by the system.

I started to pass on this article since it dates from 2009, but am now glad that I didn’t. The service is still active and can be found at: http://sparqles.okfn.org/.

The discoverability of SPARQL endpoints is reported to be:

[Figure: discoverability statistics for public SPARQL endpoints]

From the article:

[VoID Description:] The Vocabulary of Interlinked Data-sets (VoID) [2] has become the de facto standard for describing RDF datasets (in RDF). The vocabulary allows for specifying, e.g., an OpenSearch description, the number of triples a dataset contains, the number of unique subjects, a list of properties and classes used, number of triples associated with each property (used as predicate), number of instances of a given class, number of triples used to describe all instances of a given class, predicates used to describe class instances, and so forth. Likewise, the description of the dataset is often enriched using external vocabulary, such as for licensing information.

[SD Description:] Endpoint capabilities – such as supported SPARQL version, query and update features, I/O formats, custom functions, and/or entailment regimes – can be described in RDF using the SPARQL 1.1 Service Description (SD) vocabulary, which became a W3C Recommendation in March 2013 [21]. Such descriptions, if made widely available, could help a client find public endpoints that support the features it needs (e.g., find SPARQL 1.1 endpoints)
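As a sketch of how SD descriptions could aid discovery: given a catalog of service descriptions, a client could ask for every endpoint that advertises SPARQL 1.1 support (sd: is the W3C service-description namespace, and sd:supportedLanguage / sd:SPARQL11Query are terms from that vocabulary):

```sparql
PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>

# Endpoint URLs of services that declare SPARQL 1.1 Query support.
SELECT ?endpoint
WHERE {
  ?service sd:endpoint          ?endpoint ;
           sd:supportedLanguage sd:SPARQL11Query .
}
```

The catch, as the article notes, is that such descriptions are rarely published in the first place.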

No, I’m not calling your attention to this to pick on SPARQL, especially, but the lack of discoverability raises a serious issue for any information retrieval system that hopes to do better than dumb-luck searching.

Clearly SPARQL has the capability to increase discoverability; whether those mechanisms would be effective or not cannot be answered due to lack of use. So my first question is: why aren’t the mechanisms of SPARQL being used to increase discoverability?

Or perhaps better, having gone to the trouble to construct a SPARQL endpoint, why aren’t people taking the next step to make them more discoverable?

Is it because discoverability benefits some remote and faceless user instead of those being called upon to make the endpoint more discoverable? In that sense, is it a lack of positive feedback for the person tasked with increasing discoverability?

I ask because if we can’t find the key to motivating people to increase the discoverability of information (SPARQL or no), then we are in serious trouble. The amount of data will continue to grow while discoverability continues to go down. That can’t be a happy circumstance for anyone interested in discovering information.

Suggestions?

I first saw this in a tweet by Ruben Verborgh.

Black Friday Dreaming with Bob DuCharme

Saturday, November 15th, 2014

Querying aggregated Walmart and BestBuy data with SPARQL by Bob DuCharme.

From the post:

The combination of microdata and schema.org seems to have hit a sweet spot that has helped both to get a lot of traction. I’ve been learning more about microdata recently, but even before I did, I found that the W3C’s Microdata to RDF Distiller written by Ivan Herman would convert microdata stored in web pages into RDF triples, making it possible to query this data with SPARQL. With major retailers such as Walmart and BestBuy making such data available on—as far as I can tell—every single product’s web page, this makes some interesting queries possible to compare prices and other information from the two vendors.
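Bob’s actual queries are in his post; the general shape, once the microdata has been distilled into schema.org triples, is something like the following. The graph names here are hypothetical stand-ins for the two distilled page dumps, not anything from Bob’s setup:

```sparql
PREFIX schema: <http://schema.org/>

# Compare prices for products that appear, by name, in both vendors' data.
SELECT ?name ?walmartPrice ?bestbuyPrice
WHERE {
  GRAPH <urn:example:walmart> {
    ?w schema:name ?name ;
       schema:offers/schema:price ?walmartPrice .
  }
  GRAPH <urn:example:bestbuy> {
    ?b schema:name ?name ;
       schema:offers/schema:price ?bestbuyPrice .
  }
}
```

Joining on product names is naive (the two vendors rarely describe a product identically), which is part of what makes Bob’s working version worth reading.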

Bob’s use of SPARQL won’t be ready for this coming Black Friday but some Black Friday in the future?

One can imagine “blue light specials” being input by shoppers on location and driving traffic patterns at the larger malls.

Well worth your time to see where Bob was able to get using public tools.

I first saw this in a tweet by Ivan Herman.

Getting Started with S4, The Self-Service Semantic Suite

Tuesday, September 16th, 2014

Getting Started with S4, The Self-Service Semantic Suite by Marin Dimitrov.

From the post:

Here’s how S4 developers can get started with The Self-Service Semantic Suite. This post provides you with practical information on the following topics:

  • Registering a developer account and generating API keys
  • RESTful services & free tier quotas
  • Practical examples of using S4 for text analytics and Linked Data querying

Ontotext is up front about the limitations on the “free” service:

  • 250 MB of text processed monthly (via the text analytics services)
  • 5,000 SPARQL queries monthly (via the LOD SPARQL service)

The number of pages in a megabyte of text varies depending on content, but assuming a working average of one (1) megabyte = five hundred (500) pages of text, you can analyze up to one hundred and twenty-five thousand (125,000) pages of text a month. Chump change for serious NLP, but it is a free account.

The post goes on to detail two scenarios:

  • Annotate a news document via the News analytics service
  • Send a simple SPARQL query to the Linked Data service
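The second scenario amounts to sending an ordinary SPARQL query to the LOD endpoint. Something as simple as the following would count against the 5,000-query quota (the dbo:/rdfs: terms are DBpedia-style illustrations, not necessarily the exact dataset S4 exposes):

```sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# A small sanity-check query: ten company names from the Linked Data store.
SELECT ?company ?label
WHERE {
  ?company a dbo:Company ;
           rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 10
```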

Learn how effective entity recognition and SPARQL are with data of interest to you, at a minimum of investment.

I first saw this in a tweet by Tony Agresta.

Exploring a SPARQL endpoint

Monday, August 25th, 2014

Exploring a SPARQL endpoint by Bob DuCharme.

From the post:

In the second edition of my book Learning SPARQL, a new chapter titled “A SPARQL Cookbook” includes a section called “Exploring the Data,” which features useful queries for looking around a dataset that you know little or nothing about. I was recently wondering about the data available at the SPARQL endpoint http://data.semanticweb.org/sparql, so to explore it I put several of the queries from this section of the book to work.

An important lesson here is how easy SPARQL and RDF make it to explore a dataset that you know nothing about. If you don’t know about the properties used, or whether any schema or schemas were used and how much they were used, you can just query for this information. Most hypertext links below will execute the queries they describe using semanticweb.org’s SNORQL interface.
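The cookbook queries Bob uses are of the “what’s in here?” variety; the two workhorses are a census of predicates and a census of classes. In sketch form:

```sparql
# Which properties does this dataset use, and how often?
SELECT ?p (COUNT(*) AS ?uses)
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY DESC(?uses)
```

```sparql
# Which classes are instantiated?
SELECT DISTINCT ?class
WHERE { ?s a ?class }
```

From those two answers you can usually guess which vocabularies are in play and drill down from there.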

Bob’s ease at using SPARQL reminds me of a story of an ex-spy who was going through customs for the first time in years. As part of that process, he accused a customs officer of having memorized print that was too small to read easily. To which the officer replied, “I am familiar with it.” 😉

Bob’s book on SPARQL and his blog will help you become a competent SPARQL user.

I don’t suppose SPARQL is any worse off semantically than SQL, which has been in use for decades. It is troubling that I can discover dc:title but have no way to investigate how it was used by a particular content author.

Oh, to be sure, the term dc:title makes sense to me, but that sense is my own smoothing as a reader, and it may or may not be the “sense” intended by the person who used the term.

You can read data sets using your own understanding of tokens but I would do so with a great deal of caution.

Ontology-Based Interpretation of Natural Language

Thursday, July 10th, 2014

Ontology-Based Interpretation of Natural Language by Philipp Cimiano, Christina Unger, John McCrae.

Authors’ description:

For humans, understanding a natural language sentence or discourse is so effortless that we hardly ever think about it. For machines, however, the task of interpreting natural language, especially grasping meaning beyond the literal content, has proven extremely difficult and requires a large amount of background knowledge.

The book Ontology-based interpretation of natural language presents an approach to the interpretation of natural language with respect to specific domain knowledge captured in ontologies. It puts ontologies at the center of the interpretation process, meaning that ontologies not only provide a formalization of domain knowledge necessary for interpretation but also support and guide the construction of meaning representations.

The links under Resources for Ontologies, Lexica and Grammars, as of today return “coming soon.”

Implementations fare a bit better, returning information on various aspects of lemon.

lemon is a proposed meta-model for describing ontology lexica with RDF. It is declarative, thus abstracts from specific syntactic and semantic theories, and clearly separates lexicon and ontology. It follows the principle of semantics by reference, which means that the meaning of lexical entries is specified by pointing to elements in the ontology.

lemon-core

It may just be me but the Lemon model seems more complicated than asking users what identifies their subjects and distinguishes them from other subjects.

Lemon is said to be compatible with RDF, OWL, SPARQL, etc.

But, accurate (to a user) identification of subjects and their relationships to other subjects is more important to me than compatibility with RDF, SPARQL, etc.

You?

I first saw this in a tweet by Stefano Bertolo.

Foundations of an Alternative Approach to Reification in RDF

Tuesday, June 17th, 2014

Foundations of an Alternative Approach to Reification in RDF by Olaf Hartig and Bryan Thompson.

Abstract:

This document defines extensions of the RDF data model and of the SPARQL query language that capture an alternative approach to represent statement-level metadata. While this alternative approach is backwards compatible with RDF reification as defined by the RDF standard, the approach aims to address usability and data management shortcomings of RDF reification. One of the great advantages of the proposed approach is that it clarifies a means to (i) understand sparse matrices, the property graph model, hypergraphs, and other data structures with an emphasis on link attributes, (ii) map such data onto RDF, and (iii) query such data using SPARQL. Further, the proposal greatly expands both the freedom that database designers enjoy when creating physical indexing schemes and query plans for graph data annotated with link attributes and the interoperability of those database solutions.

The essence of the approach is to embed triples “in” triples that make statements about the embedded triples.
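Concretely: where standard RDF reification needs four bookkeeping triples before any metadata can be attached, the Hartig/Thompson approach (the foundation of what was later standardized as RDF-star and SPARQL-star) writes the statement inline and lets SPARQL match it directly. A sketch:

```sparql
# Standard RDF reification: four triples just to name the statement.
#   _:s rdf:type      rdf:Statement .
#   _:s rdf:subject   :bob .
#   _:s rdf:predicate :knows .
#   _:s rdf:object    :alice .
#   _:s :certainty    0.9 .
#
# The embedded-triple approach says the same thing in one statement,
# and the query language can match the embedded triple in place:
PREFIX : <http://example.org/>

SELECT ?c
WHERE {
  << :bob :knows :alice >> :certainty ?c .
}
```

The `<< … >>` syntax here follows the later RDF-star convention; the paper’s own surface syntax may differ in detail.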

Works more efficiently than the standard RDF alternative but that’s hardly surprising.

Of course, you remain bound to lexical “sameness” as the identification for the embedded triple but I suppose fixing that would not be backwards compatible with the RDF standard.

I recommend this if you are working with RDF data. No point in it being any more inefficient than absolutely necessary.

PS: Reification is one of those terms that should be stricken from the CS vocabulary.

The question is: Can you make a statement about X? If the answer is no, there is no “reification” of X. Your information system cannot speak of X, which includes assigning any properties to X.

If the answer is yes, then the question is how do you identify X? Olaf and Bryan answer by saying “put a copy of X right here.” That’s one solution.

I first saw this in a tweet by Marin Dimitrov.

Towards Web-scale Web querying [WWW vs. Internet]

Wednesday, March 12th, 2014

Towards Web-scale Web querying: The quest for intelligent clients starts with simple servers that scale. by Ruben Verborgh.

From the post:

Most public SPARQL endpoints are down for more than a day per month. This makes it impossible to query public datasets reliably, let alone build applications on top of them. It’s not a performance issue, but an inherent architectural problem: any server offering resources with an unbounded computation time poses a severe scalability threat. The current Semantic Web solution to querying simply doesn’t scale. The past few months, we’ve been working on a different model of query solving on the Web. Instead of trying to solve everything at the server side—which we can never do reliably—we should build our servers in such a way that enables clients to solve queries efficiently.

The Web of Data is filled with an immense amount of information, but what good is that if we cannot efficiently access those bits of information we need?

SPARQL endpoints aim to fulfill the promise of querying on the Web, but their notoriously low availability rates make that impossible. In particular, if you want high availability for your SPARQL endpoint, you have to compromise one of these:

  • offering public access,
  • allowing unrestricted queries,
  • serving many users.

Any SPARQL endpoint that tries to fulfill all of those inevitably has low availability. Low availability means unreliable query access to datasets. Unreliable access means we cannot build applications on top of public datasets.

Sure, you could just download a data dump and have your own endpoint, but then you move from Web querying to local querying, and that problem has been solved ages ago. Besides, it doesn’t give you access to up to date information, and who has enough storage to download a dump of the entire Web?

The whole “endpoint” concept will never work on a Web scale, because servers are subject to arbitrarily complex requests by arbitrarily many clients. (emphasis in original)

The prelude to an interesting proposal on Linked Data Fragments.

See the Linked Data Fragments website or Web-Scale Querying through Linked Data Fragments by Ruben Verborgh, et. al. (LDOW2014 workshop).

The paper gives a primary motivation as:

There is one issue: it appears to be very hard to make a sparql endpoint available reliably. A recent survey examining 427 public endpoints concluded that only one third of them have an availability rate above 99%; not even half of all endpoints reach 95% [6]. To put this into perspective: 95% availability means the server is unavailable for one and a half days every month. These figures are quite disturbing given the fact that availability is usually measured in “number of nines” [5, 25], counting the number of leading nines in the availability percentage. In comparison, the fairly common three nines (99.9%) amounts to 8.8 hours of downtime per year. The disappointingly low availability of public sparql endpoints is the Semantic Web community’s very own “Inconvenient Truth”.

Curious that on the twenty-fifth anniversary of the WWW I would realize the WWW re-created a networking problem solved by the Internet.

Unlike the WWW, to say nothing of Linked Data and its cousins in the SW activity, the Internet doesn’t have a single point of failure.

Or put more positively, the Internet is fault-tolerant by design. In contrast, the SW is fragile, by design.

While I applaud the Linked Data fragment exploration of the solution space, focusing on the design flaw of a single point of failure might be more profitable.

I first saw this in a tweet by Thomas Steiner.

Bersys 2014!

Thursday, March 6th, 2014

Bersys 2014!

From the webpage:

Following the 1st International workshop on Benchmarking RDF Systems (BeRSys 2013) the aim of the BeRSys 2014 workshop is to provide a discussion forum where researchers and industrials can meet to discuss topics related to the performance of RDF systems. BeRSys 2014 is the only workshop dedicated to benchmarking different aspects of RDF engines – in the line of TPCTC series of workshops. The focus of the workshop is to expose and initiate discussions on best practices, different application needs and scenarios related to different aspects of RDF data management.

We will solicit contributions presenting experiences with benchmarking RDF systems, real-life RDF application needs which are good candidates for benchmarking, as well as novel ideas on developing benchmarks for different aspects of RDF data management ranging from query processing, reasoning to data integration. More specifically, we will welcome contributions from a diverse set of domain areas such as life science (bio-informatics, pharmaceutical), social networks, cultural informatics, news, digital forensics, e-science (astronomy, geology) and geographical among others. More specifically, the topics of interest include but are not limited to:

  • Descriptions of RDF data management use cases and query workloads
  • Benchmarks for RDF SPARQL 1.0 and SPARQL 1.1 query workloads
  • Benchmarks for RDF data integration tasks including but not limited to ontology alignment, instance matching and ETL techniques
  • Benchmark metrics
  • Temporal and geospatial benchmarks
  • Evaluation of benchmark performance results on RDF engines
  • Benchmark principles
  • Query processing and optimization algorithms for RDF systems.

Venue:

The workshop is held in conjunction with the 40th International Conference on Very Large Data Bases (VLDB2014) in Hangzhou, China.

The only date listed on the announcement is September 1-5, 2014 for the workshop.

When other dates appear, I will update this post and re-post about the conference.

As you have seen in better papers on graphs, RDF, etc., benchmarking in this area is a perilous affair. Workshops, like this one, are one step towards building the experience necessary to consider the topic of benchmarking.

I first saw this in a tweet by Stefano Bertolo.

Storing and querying RDF in Neo4j

Sunday, January 26th, 2014

Storing and querying RDF in Neo4j by Bob DuCharme.

From the post:

In the typical classification of NoSQL databases, the “graph” category is one that was not covered in the “NoSQL Databases for RDF: An Empirical Evaluation” paper that I described in my last blog entry. (Several were “column-oriented” databases, which I always thought sounded like triple stores—the “table” part of the way people describe these always sounded to me like a stretched metaphor designed to appeal to relational database developers.) A triplestore is a graph database, and Brazilian software developer Paulo Roberto Costa Leite has developed a SPARQL plugin for Neo4j, the most popular of the NoSQL graph databases. This gave me enough incentive to install Neo4j and play with it and the SPARQL plugin.

As Bob points out, the plugin isn’t ready for prime time but I mention it in case you are interested in yet another storage solution for RDF.

Stardog 2.0.0 (26 September 2013)

Friday, September 27th, 2013

Stardog 2.0.0 (26 September 2013)

From the docs page:

Introducing Stardog

Stardog is a graph database—fast, lightweight, pure Java storage for mission-critical apps—that supports:

  • the RDF data model
  • SPARQL 1.1 query language
  • HTTP and SNARL protocols for remote access and control
  • OWL 2 and rules for inference and data analytics
  • Java, JavaScript, Ruby, Python, .Net, Groovy, Spring, etc.

New features in 2.0:

I was amused to read in Stardog Rules Syntax:

Stardog supports two different syntaxes for defining rules. The first is native Stardog Rules syntax and is based on SPARQL, so you can re-use what you already know about SPARQL to write rules. Unless you have specific requirements otherwise, you should use this syntax for user-defined rules in Stardog. The second is the de facto standard RDF/XML syntax for SWRL. It has the advantage of being supported in many tools; but it’s not fun to read or to write. You probably don’t want to use it. Better: don’t use this syntax! (emphasis in the original)

Install and play with it over the weekend. It’s a good way to experience RDF and SPARQL.

BrightstarDB 1.3 now available

Friday, June 14th, 2013

BrightstarDB 1.3 now available

From the post:

We are pleased to announce the release of BrightstarDB 1.3. This is the first “official” release of BrightstarDB under the open-source MIT license. All of the documentation and notices on the website should now have been updated to remove any mention of commercial licensing. To be clear: BrightstarDB is not dual licensed, the MIT license applies to all uses of BrightstarDB, commercial or non-commercial. If you spot something we missed in the docs that might indicate otherwise please let us know.

The main focus of this release has been to tidy up the licensing and use of third-party closed-source applications in the build process, but we also took the opportunity to extend the core RDF APIs to provide better support for named graphs within BrightstarDB stores. This release also incorporates the most recent version of dotNetRDF providing us with updated Turtle parsing and improved SPARQL query performance over the previous release.

Just to tempt you into looking further, the features are:

  • Schema-Free Triple Store
  • High Performance
  • LINQ & OData Support
  • Historical Data Access
  • Transactional (ACID)
  • NoSQL Entity Framework
  • SPARQL Support
  • Automatic Indexing

From Kal Ahmed and Graham Moore if you don’t recognize the software.

Coming soon: new, expanded edition of “Learning SPARQL”

Monday, June 3rd, 2013

Coming soon: new, expanded edition of “Learning SPARQL” by Bob DuCharme.

From the post:

55% more pages! 23% fewer mentions of the semantic web!


I’m very pleased to announce that O’Reilly will make the second, expanded edition of my book Learning SPARQL available sometime in late June or early July. The early release “raw and unedited” version should be available this week.

I wonder if Bob is going to start an advertising trend with “fewer mentions of the semantic web?”

😉

Looking forward to the update!

Not that I care about SPARQL all that much but I’ll learn something.

Besides, I have liked Bob’s writing style from back in the SGML days.

A self-updating road map of The Cancer Genome Atlas

Friday, May 17th, 2013

A self-updating road map of The Cancer Genome Atlas by David E. Robbins, Alexander Grüneberg, Helena F. Deus, Murat M. Tanik and Jonas S. Almeida. (Bioinformatics (2013) 29 (10): 1333-1340. doi: 10.1093/bioinformatics/btt141)

Abstract:

Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.

Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.

Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.
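The kinds of questions the roadmap index answers are file-discovery queries over metadata. In sketch form (the tcga: terms below are invented for illustration, not the engine’s actual vocabulary):

```sparql
PREFIX tcga: <http://example.org/tcga#>

# Hypothetical: open-access files for one disease study, newest first.
SELECT ?file ?platform ?lastModified
WHERE {
  ?file tcga:diseaseStudy "Breast invasive carcinoma" ;
        tcga:platform     ?platform ;
        tcga:lastModified ?lastModified .
}
ORDER BY DESC(?lastModified)
```

Because the index is itself RDF, provenance annotations on individual files can be queried the same way.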

Curious how the granularity of required semantics and the uniformity of the underlying data set impact the choice of semantic approaches?

And does access to data files present different challenges than say access to research publications in the same field?

5 heuristics for writing better SPARQL queries

Wednesday, April 3rd, 2013

5 heuristics for writing better SPARQL queries by Paul Groth.

From the post:

In the context of the Open PHACTS and the Linked Data Benchmark Council projects, Antonis Loizou and I have been looking at how to write better SPARQL queries. In the Open PHACTS project, we’ve been writing super complicated queries to integrate multiple data sources and from experience we realized that different combinations and factors can dramatically impact performance. With this experience, we decided to do something more systematic and test how different techniques we came up with mapped to database theory and worked in practice. We just submitted a paper for review on the outcome. You can find a preprint (On the Formulation of Performant SPARQL Queries) on arxiv.org at http://arxiv.org/abs/1304.0567. The abstract is below. The fancy graphs are in the paper.

Paper Abstract:

The combination of the flexibility of RDF and the expressiveness of SPARQL provides a powerful mechanism to model, integrate and query data. However, these properties also mean that it is nontrivial to write performant SPARQL queries. Indeed, it is quite easy to create queries that tax even the most optimised triple stores. Currently, application developers have little concrete guidance on how to write “good” queries. The goal of this paper is to begin to bridge this gap. It describes 5 heuristics that can be applied to create optimised queries. The heuristics are informed by formal results in the literature on the semantics and complexity of evaluating SPARQL queries, which ensures that queries following these rules can be optimised effectively by an underlying RDF store. Moreover, we empirically verify the efficacy of the heuristics using a set of openly available datasets and corresponding SPARQL queries developed by a large pharmacology data integration project. The experimental results show improvements in performance across 6 state-of-the-art RDF stores.
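The five heuristics are best read in the paper itself, but the flavor is query rewriting of this general sort (an illustrative rewrite of my own, not one of the paper’s examples): replacing a FILTER disjunction over the predicate position with a UNION of concrete patterns, so each branch can hit a predicate index:

```sparql
# Instead of forcing the store to scan all predicates and filter:
#   SELECT ?s ?o WHERE { ?s ?p ?o . FILTER (?p = foaf:name || ?p = rdfs:label) }
# bind the predicates concretely so each branch can use an index:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?o
WHERE {
  { ?s foaf:name  ?o }
  UNION
  { ?s rdfs:label ?o }
}
```

Which form wins depends on the store, which is exactly the paper’s point.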

Just in case you have to query RDF stores as part of your topic map work.

Be aware that the performance of your SPARQL queries will vary based on the RDF store.

Or as the authors say:

SPARQL, due to its expressiveness, provides a plethora of different ways to express the same constraints, thus, developers need to be aware of the performance implications of the combination of query formulation and RDF Store. This work provides empirical evidence that can help developers in designing queries for their selected RDF Store. However, this raises questions about the effectiveness of writing complex generic queries that work across open SPARQL endpoints available in the Linked Open Data Cloud. We view the optimisation of queries independent of underlying RDF Store technology as a critical area of research to enable the most effective use of these endpoints. (page 21)
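As a rough illustration of the kind of rewriting such heuristics involve (this example is mine, not taken from the paper, and the class URI is made up), consider replacing a FILTER over all bindings with an equivalent, more selective triple pattern that a store can answer directly from its indexes:

```sparql
# Formulation 1: bind every ?type, then discard most bindings with a FILTER
SELECT ?drug WHERE {
  ?drug a ?type .
  FILTER (?type = <http://example.org/Drug>)
}

# Formulation 2: the same constraint expressed directly as a triple pattern,
# which typically produces far fewer intermediate results
SELECT ?drug WHERE {
  ?drug a <http://example.org/Drug> .
}
```

Whether a given store optimises the first form into the second is exactly the kind of engine-dependent behaviour the paper measures.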

I hope their research is successful.

Varying performance, especially as reported in their paper, doesn’t bode well for cross-RDF Store queries.

Dydra

Tuesday, March 26th, 2013

Dydra

From the webpage:

Dydra

Dydra is a cloud-based graph database. Whether you’re using existing social network APIs or want to build your own, Dydra treats your customers’ social graph as exactly that.

With Dydra, your data is natively stored as a property graph, directly representing the relationships in the underlying data.

Expressive

With Dydra, you access and update your data via an industry-standard query language specifically designed for graph processing, SPARQL. It’s easy to use and we provide a handy in-browser query editor to help you learn.

Despite my misgivings about RDF (Simple Web Semantics), if you want to investigate RDF and SPARQL, Dydra would be a good way to get your feet wet.

It will also give you an idea of the skill level RDF/SPARQL requires.

Currently in beta, free with some resource limitations.
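To give a flavour of what querying a Dydra repository looks like, here is a minimal sketch. The account name "alice", repository name "friends", and the FOAF data are all assumptions for illustration; Dydra exposes each repository behind its own SPARQL endpoint.

```sparql
# Assumed endpoint: http://dydra.com/alice/friends/sparql (hypothetical account)
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# List people and who they know in a simple FOAF social graph
SELECT ?name ?friendName WHERE {
  ?person foaf:name  ?name ;
          foaf:knows ?friend .
  ?friend foaf:name  ?friendName .
}
LIMIT 10
```

The same query could be pasted into Dydra's in-browser query editor mentioned above.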

I particularly liked the line:

We manage every piece of the data store, including versioning, disaster recovery, performance, and more. You just use it.

RDF/SPARQL skills will remain a barrier, but Dydra does its best to make those the only barriers you will face. (And it has reduced some of those.)

Definitely worth your attention, whether you simply want to practice on RDF/SPARQL as a data source or have other uses for it.

I first saw this in a tweet by Stian Danenbarger.

Striking a Blow for Complexity

Thursday, March 21st, 2013

The W3C struck a blow for complexity today.

Its blog entry, entitled Eleven SPARQL 1.1 Specifications are W3C Recommendations, reads:

The SPARQL Working Group has completed development of its full-featured system for querying and managing data using the flexible RDF data model. It has now published eleven Recommendations for SPARQL 1.1, detailed in SPARQL 1.1 Overview. SPARQL 1.1 extends the 2008 Recommendation for SPARQL 1.0 by adding features to the query language such as aggregates, subqueries, negation, property paths, and an expanded set of functions and operators. Beyond the query language, SPARQL 1.1 adds other features that were widely requested, including update, service description, a JSON results format, and support for entailment reasoning. Learn more about the Semantic Web Activity.
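A small sketch (my own, against hypothetical data) shows several of the SPARQL 1.1 additions named above working together: a property path for transitive traversal, an aggregate, and negation via FILTER NOT EXISTS.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# For each person, count everyone reachable through one or more
# foaf:knows links, excluding contacts typed as ex:Bot (made-up class)
SELECT ?person (COUNT(DISTINCT ?contact) AS ?reach) WHERE {
  ?person foaf:knows+ ?contact .                       # SPARQL 1.1 property path
  FILTER NOT EXISTS { ?contact a <http://example.org/Bot> }   # negation
}
GROUP BY ?person                                       # aggregation
HAVING (COUNT(DISTINCT ?contact) > 10)
```

None of this was expressible in SPARQL 1.0 without client-side post-processing.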

I can’t wait for the movie version starring IBM’s Watson playing sudden-death Jeopardy against Bob DuCharme, category SPARQL.

I’m betting on Bob!

SPIN: SPARQL Inferencing Notation

Tuesday, March 12th, 2013

SPIN: SPARQL Inferencing Notation

From the webpage:

SPIN is a W3C Member Submission that has become the de-facto industry standard to represent SPARQL rules and constraints on Semantic Web models. SPIN also provides meta-modeling capabilities that allow users to define their own SPARQL functions and query templates. Finally, SPIN includes a ready-to-use library of common functions.

SPIN in Five Slides.
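As a rough sketch of what a SPIN constraint looks like in Turtle (the ex: names are made up; the spin: and sp: namespaces are SPIN's own), a SPARQL ASK query is attached to a class, and an instance violates the constraint when the query, with ?this bound to that instance, returns true:

```turtle
@prefix sp:   <http://spinrdf.org/sp#> .
@prefix spin: <http://spinrdf.org/spin#> .
@prefix ex:   <http://example.org/> .

# Constraint: a Person must not have a negative age.
# sp:text carries the SPARQL source; ?this is bound to each instance checked.
ex:Person
  spin:constraint [
    a sp:Ask ;
    sp:text """ASK WHERE { ?this ex:age ?age . FILTER (?age < 0) }"""
  ] .
```

SPIN-aware tools evaluate such constraints automatically against instance data.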

In case you encounter SPARQL rules and constraints.

I first saw this in a tweet by Stian Danenbarger.

Marketplace in Query Libraries? Marketplace in Identified Entities?

Thursday, February 7th, 2013

Using SPARQL Query Libraries to Generate Simple Linked Data API Wrappers by Tony Hirst.

From the post:

A handful of open Linked Data posts have appeared through my feeds in the last couple of days, including (via R-bloggers) SPARQL with R in less than 5 minutes, which shows how to query US data.gov Linked Data, and Leigh Dodds’ Brief Review of the Land Registry Linked Data.

I was going to post a couple of examples merging those two posts – showing how to access Land Registry data via Leigh’s example queries in R, then plotting some of the results using ggplot2, but another post of Leigh’s today – SPARQL-doc – a simple convention for documenting individual SPARQL queries, has sparked another thought…

For some time I’ve been intrigued by the idea of a marketplace in queries over public datasets, as well as the public sharing of generally useful queries. A good query is like a good gold pan, or a good interview question – it can get a dataset to reveal something valuable that may otherwise have lain hidden. Coming up with a good query in part requires having a good understanding of the structure of a dataset, in part having an eye for what sorts of secret the data may contain: the next step is crafting a well phrased query that can tease that secret out. Creating the query might take some time, some effort, and some degree of expertise in query optimisation to make it actually runnable in reasonable time (which is why I figure there may be a market for such things*) but once written, the query is there. And if it can be appropriately parameterised, it may generalise.
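A shareable, parameterised query of the sort Tony describes might look like the sketch below. The %%postcode_district%% placeholder convention, the lr: namespace, and the predicate names are all my own assumptions for illustration, not the real Land Registry vocabulary; the comment header gestures at the SPARQL-doc documentation convention from Leigh's post.

```sparql
# @title  Price-paid transactions for a postcode district
# @param  %%postcode_district%%  e.g. "CB4" – substituted before the query runs
PREFIX lr: <http://example.org/landregistry/>

SELECT ?address ?price ?date WHERE {
  ?txn lr:propertyAddress  ?address ;
       lr:pricePaid        ?price ;
       lr:transactionDate  ?date .
  FILTER (CONTAINS(STR(?address), "%%postcode_district%%"))
}
ORDER BY DESC(?date)
```

Once documented and parameterised this way, the query becomes an artifact that can be shared, or sold, independently of any one analysis.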

Tony’s marketplace of queries has a great deal of potential.

But I don’t think they need to be limited to SPARQL queries.

By extension his arguments should be true for searches on Google, Bing, etc., as well as vendor specialized search interfaces.

I would take that a step further into libraries for post-processing the results of such queries and presenting users with enhanced presentations and/or content.

And as part of that post-processing, I would add robust identification of entities as an additional feature of such a library/service.

For example, suppose you have curated some significant portion of the ACM digital library: when passed what could be an ambiguous reference to a concept, you return to the user the properties that distinguish the possible referents, sorted into groups.

Which frees every user from wading through unrelated papers and proceedings, when that reference comes up.

Would that be a service users would pay for?

I suppose that depends on how valuable their time is to them and/or their employers.

SPARQL with R in less than 5 minutes [Fire Data]

Saturday, January 26th, 2013

SPARQL with R in less than 5 minutes

From the post:

In this article we’ll get up and running on the Semantic Web in less than 5 minutes using SPARQL with R. We’ll begin with a brief introduction to the Semantic Web then cover some simple steps for downloading and analyzing government data via a SPARQL query with the SPARQL R package.

What is the Semantic Web?

To newcomers, the Semantic Web can sound mysterious and ominous. By most accounts, it’s the wave of the future, but it’s hard to pin down exactly what it is. This is in part because the Semantic Web has been evolving for some time but is just now beginning to take a recognizable shape (DuCharme 2011). Detailed definitions of the Semantic Web abound, but simply put, it is an attempt to structure the unstructured data on the Web and to formalize the standards that make that structure possible. In other words, it’s an attempt to create a data definition for the Web.

I will have to re-read Bob DuCharme’s “Learning SPARQL.” I didn’t realize the “Semantic Web” was beginning to “…take a recognizable shape.” After a decade of attempting to find an achievable agenda, it’s about time.

The varying Semantic Web origin tales are quite amusing. In the first creation account, independent agents were going to schedule medical appointments and tennis matches for us. In the second account, our machines were going to reason across structured data to produce new insights. More recently, the vision is of a web of CMU Coke machines connected to the WWW, along with other devices. (The Internet of Things.)

I suppose the next version will be computers that can exchange information using the TCP/IP protocol and various standards, like HTML, for formatting documents. Plus some declaration that semantics will be handled in a future version, sufficiently far off to keep grant managers from fearing an end to the project.

The post is a good example of using R with SPARQL, and since you will encounter data at SPARQL endpoints, it is a useful exercise.

The example data set is one of wildfires and acres burned per year, 1960-2008.
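A query over data of that shape might look like the following sketch. The ex: namespace and predicate names are assumptions for illustration, not the actual data.gov vocabulary used in the post.

```sparql
PREFIX ex: <http://example.org/fires/>

# One observation per year: number of wildfires and acres burned
SELECT ?year ?fires ?acresBurned WHERE {
  ?obs ex:year        ?year ;
       ex:fires       ?fires ;
       ex:acresBurned ?acresBurned .
}
ORDER BY ?year
```

The R SPARQL package simply ships such a query to an endpoint and returns the bindings as a data frame for plotting.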

More interesting fire data sets can be found at: Fire Detection GIS Data.

Mapping that data by date, weather conditions/trends, known impact, would require coordination between diverse data sets.

SPARQL end-point of data.europeana.eu

Friday, December 21st, 2012

SPARQL end-point of data.europeana.eu

From the webpage:

Welcome on the SPARQL end-point of data.europeana.eu!

data.europeana.eu currently contains open metadata on 20 million texts, images, videos and sounds gathered by Europeana. Data follows the terms of the Creative Commons CC0 public domain dedication. Data is described in the Resource Description Framework (RDF) format, and structured using the Europeana Data Model (EDM). We give more detail on the EDM data we publish on the technical details page.

Please take the time to check out the list of collections currently included in the pilot.

The terms of use and external data sources appearing at data.europeana.eu are provided on the Europeana Data sources page.

Sample queries are available on the sparql page.
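For a first taste of the endpoint, a minimal query can list some of the described cultural objects. edm:ProvidedCHO is the core EDM class for provided cultural heritage objects; anything beyond that here is a sketch, and the endpoint's own sample queries page is the authoritative source.

```sparql
PREFIX edm: <http://www.europeana.eu/schemas/edm/>

# List a handful of cultural heritage objects from the 20-million-item set
SELECT ?cho WHERE {
  ?cho a edm:ProvidedCHO .
}
LIMIT 10
```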

At first I wondered why this was news, because “Europeana opens up data on 20 million cultural items” appeared in the Guardian on 12 September 2012.

I assume the data has been in use since its release last September.

If you have been using it, can you comment on how your use will change now that the data is available as a SPARQL end-point?

YASGUI: Web-based SPARQL client with bells ‘n whistles

Wednesday, December 5th, 2012

YASGUI: Web-based SPARQL client with bells ‘n whistles

From the post:

A few months ago Laurens Rietveld was looking for a query interface from which he could easily query any other SPARQL endpoint.

But he couldn’t find any that fit his requirements.

So he decided to make his own!

Give it a try at: http://aers.data2semantics.org/sparql/

Future work (next year probably):

If you are interested in SPARQL per se, or want to extract information for re-use in a topic map, this could be interesting.

Good to see mention of our friends at Mondeca.

Normalizing company names with SPARQL and DBpedia

Wednesday, December 5th, 2012

Normalizing company names with SPARQL and DBpedia

Bob DuCharme writes:

Wikipedia page redirection data, waiting for you to query it.

If you send your browser to http://en.wikipedia.org/wiki/Big_Blue, you’ll end up at IBM’s page, because Wikipedia knows that this nickname usually refers to this company. (Apparently, it’s also a nickname for several high schools and universities.) This data pointing from nicknames to official names is also stored in DBpedia, which means that we can use SPARQL queries to normalize company names. You can use the same technique to normalize other kinds of names—for example, trying to send your browser to http://en.wikipedia.org/wiki/Bobby_Kennedy will actually send it to http://en.wikipedia.org/wiki/Robert_F._Kennedy—but a query that sticks to one domain will have a simpler job. Description Logics and all that.
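The DBpedia query behind this is along the following lines; this is my sketch rather than Bob's exact query, though dbo:wikiPageRedirects is the DBpedia ontology property that carries the redirect data he describes:

```sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Follow the redirect from the nickname resource to the official resource,
# then take its English label as the normalized name
SELECT ?officialName WHERE {
  <http://dbpedia.org/resource/Big_Blue> dbo:wikiPageRedirects ?official .
  ?official rdfs:label ?officialName .
  FILTER (lang(?officialName) = "en")
}
```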

As always Bob is on the cutting edge of the use of a markup standard!

Possible topic map analogies:

  • create a second name cluster and the “normalized name” is an additional base name
  • move the “nickname” to a variant name (scope?) and update the base name to be the normalized name (with changes to sort/display as necessary)

I am assuming that Bob’s lang(?redirectsTo) = "en" operates like scope in topic maps.

Except that scope in topic maps is represented by one or more topics, which means merging can occur between topics that represent the same language.

Webnodes Semantic Integration Server (SDShare Protocol)

Thursday, November 22nd, 2012

Webnodes AS announces Webnodes Semantic Integration Server by Mike Johnston.

From the post:

Webnodes AS, a company developing a .NET based semantic content management system, today announced the release of a new product called Webnodes Semantic Integration Server.

Webnodes Semantic Integration Server is a standalone product that has two main components: A SPARQL endpoint for traditional semantic use-cases and the full-blown integration server based on the SDShare protocol. SDShare is a new protocol for allowing different software to share and consume data with each other, with minimal amount of setup.

The integration server ships with connectors out of the box for OData- and SPARQL endpoints and any ODBC compatible RDBMS. This means you can integrate many of the software systems on the market with very little work. If you want to support software not compatible with any of the available connectors, you can create custom connectors. In addition to full-blown connectors, the integration server can also push the raw data to another SPARQL endpoint (the internal data format is RDF) or a HTTP endpoint (for example Apache SOLR).

I wonder about the line:

This means you can integrate many of the software systems on the market with very little work.

I think wiring disparate systems together is a better description. To “integrate” systems implies some useful result.

Wiring systems together is a long way from the hard task of semantic mapping, which produces integration of systems.

I first saw this in a tweet by Paul Hermans.

Sindice SPARQL endpoint

Wednesday, November 21st, 2012

Sindice SPARQL endpoint by Gabi Vulcu.

From an email by Gabi:

We have released a new version of the Sindice SPARQL endpoint (http://sparql.sindice.com/) with two new datasets: sudoc and yago.

Below are the current dump datasets that are in the Sparql endpoint:

dataset_uri dataset_name
http://sindice.com/dataspace/default/dataset/dbpedia “dbpedia”
http://sindice.com/dataspace/default/dataset/medicare “medicare”
http://sindice.com/dataspace/default/dataset/whoiswho “whoiswho”
http://sindice.com/dataspace/default/dataset/sudoc “sudoc”
http://sindice.com/dataspace/default/dataset/nytimes “nytimes”
http://sindice.com/dataspace/default/dataset/ookaboo “ookaboo”
http://sindice.com/dataspace/default/dataset/europeana “europeana”
http://sindice.com/dataspace/default/dataset/basekb “basekb”
http://sindice.com/dataspace/default/dataset/geonames “geonames”
http://sindice.com/dataspace/default/dataset/wordnet “wordnet”
http://sindice.com/dataspace/default/dataset/dailymed “dailymed”
http://sindice.com/dataspace/default/dataset/reactome “reactome”
http://sindice.com/dataspace/default/dataset/yago “yago”

The list of crawled website datasets that have been rdf-ized and loaded into the Sparql endpoint can be found here [1]

Due to space limitations we limited both the dump datasets to the ones in the above table and the website datasets to the top 1000 domains based on the DING [3] score.

However, upon request, if someone needs a particular dataset (there are more to choose from here [4]), we can arrange to get it into the Sparql endpoint in the next release.

[1] https://docs.google.com/spreadsheet/ccc?key=0AvdgRy2el8d9dERUZzBPNEZIbVJTTVVIRDVUWHhKdWc
[2] https://docs.google.com/spreadsheet/ccc?key=0AvdgRy2el8d9dGhDaHMta0MtaG9vWWhhbTd5SVVaX1E
[3] http://ding.sindice.com
[4] https://docs.google.com/spreadsheet/ccc?key=0AvdgRy2el8d9dGhDaHMta0MtaG9vWWhhbTd5SVVaX1E#gid=0
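If, as the dataset URIs in the table suggest, each dataset is loaded as a named graph under its dataset URI (an assumption on my part about how the endpoint is organised), a query can be scoped to a single dataset with GRAPH:

```sparql
# Browse a few triples from the geonames dataset only
SELECT ?s ?p ?o
FROM NAMED <http://sindice.com/dataspace/default/dataset/geonames>
WHERE {
  GRAPH <http://sindice.com/dataspace/default/dataset/geonames> {
    ?s ?p ?o .
  }
}
LIMIT 10
```

Dropping the FROM NAMED / GRAPH scoping would query across all loaded datasets at once.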

You may also be interested in: sindice-dev — Sindice developers list.

Stardog 1.1 Release

Thursday, November 15th, 2012

Stardog 1.1 Release

From the webpage:

Stardog is a fast, commercial RDF database: SPARQL for queries; OWL for reasoning; pure Java for the Enterprise.

Stardog 1.1 supports SPARQL 1.1.

I first saw this in a tweet from Kendall Clark.

IOGDS: International Open Government Dataset Search

Saturday, November 10th, 2012

IOGDS: International Open Government Dataset Search

Description:

The TWC International Open Government Dataset Search (IOGDS) is a linked data application based on metadata “scraped” from hundreds of international dataset catalog websites publishing a rich variety of government data. Metadata extracted from these catalog websites is automatically converted to RDF linked data and re-published via the TWC LOGD SPARQL endpoint and made available for download. The TWC IOGDS demo site features an efficient, reconfigurable faceted browser with search capabilities offering a compelling demonstration of the value of a common metadata model for open government dataset catalogs. We believe that the vocabulary choices demonstrated by IOGDS highlight the potential for useful linked data applications to be created from open government catalogs and will encourage the adoption of such a standard worldwide.

In addition to the datasets you will find tutorials, videos, demos, tools and technologies and other resources.

Worth a visit whether you are looking for Linked Data as such or for data to re-use in other ways.

Seen in a tweet by Tim O’Reilly.