Big Data RDF Store Benchmarking Experiences

Big Data RDF Store Benchmarking Experiences by Peter Boncz.

From the post:

Recently we were able to present new BSBM results, testing the RDF triple stores Jena TDB, BigData, BIGOWLIM and Virtuoso on various data sizes. These results extend the state-of-the-art in various dimensions:

  • scale: this is the first time that RDF store benchmark results on such a large size have been published. The previously published BSBM results were on 200M triples; the 150B experiments thus mark a 750x increase in scale.
  • workload: this is the first time that results on the Business Intelligence (BI) workload have been published. In contrast to the Explore workload, which features short-running “transactional” queries, the BI workload consists of queries that go through possibly billions of triples, grouping and aggregating them (using the respective functionality, new in SPARQL 1.1).
  • architecture: this is the first time that RDF store technology with cluster functionality has been publicly benchmarked.
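
To make the Explore/BI contrast in the workload bullet concrete, here is a minimal sketch in plain Python. It is not taken from Peter's post or from BSBM: the toy triples, the predicate names and both function names are assumptions made up for illustration. The point is only that an Explore-style query touches the handful of triples about one subject, while a BI-style query has to scan everything, group it and aggregate it (roughly what GROUP BY with AVG does in SPARQL 1.1).

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); not BSBM data.
TRIPLES = [
    ("offer1", "product", "p1"), ("offer1", "price", 10.0),
    ("offer2", "product", "p1"), ("offer2", "price", 14.0),
    ("offer3", "product", "p2"), ("offer3", "price", 30.0),
]


def explore_lookup(triples, subject):
    """Explore-style: return the few predicate/object pairs for one subject."""
    return {p: o for s, p, o in triples if s == subject}


def bi_average_price(triples):
    """BI-style: scan every triple, group offers by product, average the price."""
    product_of, price_of = {}, {}
    for s, p, o in triples:
        if p == "product":
            product_of[s] = o
        elif p == "price":
            price_of[s] = o

    # Group prices by product, skipping offers that have no price triple.
    prices = defaultdict(list)
    for offer, product in product_of.items():
        if offer in price_of:
            prices[product].append(price_of[offer])

    return {product: sum(v) / len(v) for product, v in prices.items()}
```

Calling explore_lookup(TRIPLES, "offer1") reads two triples; bi_average_price(TRIPLES) has to read all of them. At 150B triples that difference is presumably what makes the BI workload so much more demanding.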

Clusters are great but also difficult to use.

Peter’s post is one of those rare ones that exposes the second half of that statement.

Impressive hardware and results.

Given the hardware and effort required, are we pursuing “big data” for the sake of “big data”?

Not just where RDF is concerned but in general?

Shouldn’t the first question always be: What is the relevant data?

If you can’t articulate the relevant data, isn’t that a commentary on your understanding of the problem?
