Archive for the ‘bigdata®’ Category

Picard and Dathon at El-Adrel

Friday, May 11th, 2012

Orri Erling’s account of the seeing Bryan Thompson reminded me of Picard and Dathon at El-Adrel, albeit with happier results.

See what you think:

I gave an invited talk (“Virtuoso 7 – Column Store and Adaptive Techniques for Graph” (Slides (ppt))) at the Graph Data Management Workshop at ICDE 2012.

Bryan Thompson of Systap (Bigdata® RDF store) was also invited, so we got to talk about our common interests. He told me about two cool things they have recently done, namely introducing tables to SPARQL, and adding a way of reifying statements that does not rely on extra columns. The table business is just about being able to store a multicolumn result set into a named persistent entity for subsequent processing. But this amounts to a SQL table, so the relational model has been re-arrived at, once more, from practical considerations. The reification just packs all the fields of a triple (or quad) into a single string and this string is then used as an RDF S or O (Subject or Object), less frequently a P or G (Predicate or Graph). This works because Bigdata® has variable length fields in all columns of the triple/quad table. The query notation then accepts a function-looking thing in a triple pattern to mark reification. Nice. Virtuoso has a variable length column in only the O but could of course have one in also S and even in P and G. The column store would still compress the same as long as reified values did not occur. These values on the other hand would be unlikely to compress very well but run length and dictionary would always work.

So, we could do it like Bigdata®, or we could add a “quad ID” column to one of the indices, to give a reification ID to quads. Again no penalty in a column store, if you do not access the column. Or we could make an extra table of PSOG->R.

Yet another variation would be to make the SPOG concatenation a literal that is interned in the RDF literal table, and then used as any literal would be in the O, and as an IRI in a special range when occurring as S. The relative merits depend on how often something will be reified and on whether one wishes to SELECT based on parts of reification. Whichever the case may be, the idea of a function-looking placeholder for a reification is a nice one and we should make a compatible syntax if we do special provenance/reification support. The model in the RDF reification vocabulary is a non-starter and a thing to discredit the sem web for anyone from database.

Pushing past the metaphors it sounds like both Orri and Bryan are working on interesting projects. ;-)

bigdata®

Tuesday, December 20th, 2011

bigdata®

Bryan Thompson, one of the creators of bigdata(R), was a member of the effort that resulted in the XTM syntax for topic maps.

If Bryan says it scales, it scales.

What I did not see was the ability to document mappings between data as representing the same subjects. Or the ability to query such mappings. Still, on further digging I may uncover something that works that way.

From the webpage:

This is a major version release of bigdata(R). Bigdata is a horizontally-scaled, open-source architecture for indexed data with an emphasis on RDF capable of loading 1B triples in under one hour on a 15 node cluster. Bigdata operates in both a single machine mode (Journal) and a cluster mode (Federation). The Journal provides fast scalable ACID indexed storage for very large data sets, up to 50 billion triples / quads. The federation provides fast scalable shard-wise parallel indexed storage using dynamic sharding and shard-wise ACID updates and incremental cluster size growth. Both platforms support fully concurrent readers with snapshot isolation.

Distributed processing offers greater throughput but does not reduce query or update latency. Choose the Journal when the anticipated scale and throughput requirements permit. Choose the Federation when the administrative and machine overhead associated with operating a cluster is an acceptable tradeoff to have essentially unlimited data scaling and throughput.

See [1,2,8] for instructions on installing bigdata(R), [4] for the javadoc, and [3,5,6] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata, see [7].

Starting with the 1.0.0 release, we offer a WAR artifact [8] for easy installation of the single machine RDF database. For custom development and cluster installations we recommend checking out the code from SVN using the tag for this release. The code will build automatically under eclipse. You can also build the code using the ant script. The cluster installer requires the use of the ant script.

You can download the WAR from:

http://sourceforge.net/projects/bigdata/

You can checkout this release from:

https://bigdata.svn.sourceforge.net/svnroot/bigdata/tags/BIGDATA_RELEASE_1_1_0

New features:

  • Fast, scalable native support for SPARQL 1.1 analytic queries;
  • %100 Java memory manager leverages the JVM native heap (no GC);
  • New extensible hash tree index structure.

Feature summary:

- Single machine data storage to ~50B triples/quads (RWStore);

  • Clustered data storage is essentially unlimited;
  • Simple embedded and/or webapp deployment (NanoSparqlServer);
  • Triples, quads, or triples with provenance (SIDs);
  • Fast 100% native SPARQL 1.0 evaluation;
  • Integrated “analytic” query package;
  • Fast RDFS+ inference and truth maintenance;
  • Fast statement level provenance mode (SIDs).

Road map [3]:

  • Simplified deployment, configuration, and administration for clusters; and
  • High availability for the journal and the cluster.

(footnotes omitted)

PS: Jack Park forwarded this to my attention. Will have to download and play with it over the holidays.