Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 3, 2012

Is That A Graph In Your Cray?

Filed under: Cray,Graphs,Neo4j,RDF,Semantic Web — Patrick Durusau @ 7:27 pm

If you want more information about graph processing in Cray’s uRIKA (I did), try: High-performance Computing Applied to Semantic Databases by Eric L. Goodman, Edward Jimenez, David Mizell, Sinan al-Saffar, Bob Adolf, and David Haglin.

Abstract:

To-date, the application of high-performance computing resources to Semantic Web data has largely focused on commodity hardware and distributed memory platforms. In this paper we make the case that more specialized hardware can offer superior scaling and close to an order of magnitude improvement in performance. In particular we examine the Cray XMT. Its key characteristics, a large, global shared memory, and processors with a memory-latency tolerant design, offer an environment conducive to programming for the Semantic Web and have engendered results that far surpass current state of the art. We examine three fundamental pieces requisite for a fully functioning semantic database: dictionary encoding, RDFS inference, and query processing. We show scaling up to 512 processors (the largest configuration we had available), and the ability to process 20 billion triples completely in memory.

Unusual to see someone apologize for only having “…512 processors (the largest configuration we had available)….,” but that isn’t why I am citing the paper. 😉

The “dictionary encoding” (think indexing) techniques may prove instructive, even if you don’t have time on a Cray XMT. The techniques presented compress the raw data by a factor of between 3.2 and 4.4.
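If “dictionary encoding” is new to you, the basic idea fits in a few lines. What follows is a minimal serial sketch, not the parallel algorithm from the paper: every distinct term gets a small integer ID, and the triples are rewritten as tuples of IDs. The function name `dictionary_encode` and the `example.org` terms are mine, made up for illustration.

```python
def dictionary_encode(triples):
    """Map each distinct term to an integer ID and rewrite the triples.

    Serial illustration only; the paper's version is built for the
    Cray XMT's shared memory and many hardware threads.
    """
    term_to_id = {}
    encoded = []
    for triple in triples:
        ids = []
        for term in triple:
            # Assign the next free integer the first time a term is seen.
            if term not in term_to_id:
                term_to_id[term] = len(term_to_id)
            ids.append(term_to_id[term])
        encoded.append(tuple(ids))
    return term_to_id, encoded


# Hypothetical data, N-Triples style terms without the trailing " ."
triples = [
    ("<http://example.org/alice>", "<http://example.org/knows>", "<http://example.org/bob>"),
    ("<http://example.org/bob>", "<http://example.org/knows>", "<http://example.org/carol>"),
    ("<http://example.org/alice>", "<http://example.org/worksFor>", "<http://example.org/acme>"),
]

term_to_id, encoded = dictionary_encode(triples)

# Decoding is just the inverse lookup.
id_to_term = {i: term for term, i in term_to_id.items()}
decoded = [tuple(id_to_term[i] for i in row) for row in encoded]
```

The savings come from repetition: a long URI that appears in millions of triples is stored once in the dictionary and referenced by a fixed-width integer everywhere else.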

Take special note of the statement: “To simplify the discussion, we consider only semantic web data represented in N-Triples.” Actually the system presented processes only subject, edge, object triples. Unlike Neo4j, for instance, it isn’t a generalized graph engine.

Specialized hardware/software is great but let’s be clear about that upfront. You may need more than RDF graphs can offer. Like edges with properties.
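To make the “edges with properties” point concrete, here is a small sketch contrasting the two shapes. The class names `Triple` and `PropertyEdge` and the sample data are hypothetical, invented for this illustration; they are not the Neo4j API or anything from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    """A bare subject, predicate, object statement."""
    subject: str
    predicate: str
    obj: str


@dataclass
class PropertyEdge:
    """An edge that carries its own key/value properties."""
    source: str
    label: str
    target: str
    properties: dict = field(default_factory=dict)


def to_triple(edge):
    """Flatten a property edge to a triple; the properties are dropped."""
    return Triple(edge.source, edge.label, edge.target)


# Hypothetical edge: the properties describe the relationship itself.
edge = PropertyEdge("alice", "knows", "bob", {"since": 2010, "confidence": 0.8})
flat = to_triple(edge)
```

In a triples-only model the `since` and `confidence` values have nowhere to live on the edge. You can recover them with extra statements about the statement (RDF reification, for example), but that means more triples and more complicated queries.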

Other specializations: the “closure” process makes several simplifications so that a single pass through the RDFS rule set suffices, and querying doesn’t allow a variable in the predicate position.
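The predicate restriction is easy to picture with a toy matcher. This is a sketch under my own assumptions (variables are strings starting with `?`, triples are indexed by predicate), not the query engine described in the paper. The names `is_variable`, `build_index` and `match` are hypothetical.

```python
from collections import defaultdict


def is_variable(term):
    """Treat any term starting with '?' as a query variable (an assumption)."""
    return term.startswith("?")


def build_index(triples):
    """Group (subject, object) pairs by predicate."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[p].append((s, o))
    return index


def match(pattern, index):
    """Return variable bindings for one triple pattern.

    The predicate must be a constant, mirroring the restriction above.
    """
    s, p, o = pattern
    if is_variable(p):
        raise ValueError("a variable in the predicate position is not supported")
    results = []
    for ts, to in index.get(p, []):
        binding = {}
        if is_variable(s):
            binding[s] = ts
        elif s != ts:
            continue
        if is_variable(o):
            # The same variable in both positions must bind to the same term.
            if o in binding and binding[o] != to:
                continue
            binding[o] = to
        elif o != to:
            continue
        results.append(binding)
    return results


# Hypothetical data.
index = build_index([
    ("alice", "knows", "bob"),
    ("carol", "knows", "dave"),
    ("alice", "worksFor", "acme"),
])
who_knows_bob = match(("?who", "knows", "bob"), index)
```

With the predicate fixed, the matcher only ever touches one bucket of the index. A question like “how are alice and bob related?” needs a variable predicate, and that is exactly the kind of question this restriction rules out.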

Granted, this results in a hardware/software combination that can claim “interactivity” on large data sets. But what is the cost of making that a requirement?

Take the best-known “connect the dots” problem of this century, 9/11. Analysts did not need “interactivity” with large data sets measured in nanoseconds. Batch processing that lasted for a week or more would have been more than sufficient. Most of the information that was known was “known” by various parties for months.

More than that, the amount of relevant information was quite small when compared to the “Semantic Web.” There were known suspects (as there are now), with known associates, with known travel patterns, so eliminating all the business/frequent flyers from travel data is a one-time filter, plus any > 40 females traveling on US passports (grandmothers). Similar criteria can reduce information clutter, allowing analysts to focus on important data, as opposed to paging through “hits” in a simulation of useful activity.

I would put batch processing of graphs of relevant information against interactive churning of big data in a restricted graph model any day. How about you?

