The headline “Hadoop’s tremendous inefficiency on graph data management (and how to avoid it)” certainly got my attention.
But read the paper, Scalable SPARQL Querying of Large RDF Graphs, and the “tremendous inefficiency” turns out not to be Hadoop’s at all: it belongs to SHARD, an RDF triple store that keeps its data in flat text files.
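For readers who haven’t met a flat-file triple store, the point is that the data sits in plain text, one triple per line, with no indexes. A hypothetical N-Triples-style fragment (my illustration, not SHARD’s exact on-disk layout):

    <http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .
    <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .

Answering any query against that layout means scanning text, which goes a long way toward explaining why even a single-machine RDF-3X, with its heavily indexed storage, comes out ahead in the comparison quoted below.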
Or as the authors say in their paper (6.3 Performance Comparison):
Figure 6 shows the execution time for LUBM in the four benchmarked systems. Except for query 6, all queries take more time on SHARD than on the single-machine deployment of RDF-3X. This is because SHARD’s use of hash partitioning only allows it [to] optimize subject-subject joins. Every other type of join requires a complete redistribution of data over the network within a Hadoop job, which is extremely expensive. Furthermore, its storage layer is not at all optimized for RDF data (it stores data in flat files).
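To make the join point concrete, here is a sketch of my own (not from the paper) against LUBM-style data. In the first query both triple patterns share ?x as their subject, so with triples hash-partitioned by subject each node can compute the join locally. In the second, ?dept appears as an object in one pattern and a subject in the next, so the intermediate bindings live on arbitrary nodes and must be reshuffled across the network between Hadoop jobs:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ub:  <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>

    # Subject-subject join: both patterns share ?x as subject,
    # so it is locally computable under subject hash partitioning.
    SELECT ?x WHERE {
      ?x rdf:type    ub:GraduateStudent .
      ?x ub:memberOf ?dept .
    }

    # Subject-object join: the ?dept bindings must be
    # redistributed over the network before they can be joined.
    SELECT ?x ?univ WHERE {
      ?x    ub:memberOf          ?dept .
      ?dept ub:subOrganizationOf ?univ .
    }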
Saying that SHARD (not as well known as Hadoop) was using Hadoop inefficiently would not have the “draw” of allegations about Hadoop’s failure to process graph data efficiently.
Sure, I write blog headlines for “draw,” but let’s ’fess up in the body of the post. Readers shouldn’t have to run down other sources to find the real facts.