Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 10, 2015

MapGraph [Graphs, GPUs, 30 GTEPS (30 billion traversed edges per second)]

Filed under: bigdata®,GPU,Graphs,MapGraph — Patrick Durusau @ 4:40 pm

MapGraph [Graphs, GPUs, 30 GTEPS (30 billion traversed edges per second)]

From the post:

MapGraph is Massively Parallel Graph processing on GPUs. (Previously known as “MPGraph”).

  • The MapGraph API makes it easy to develop high performance graph analytics on GPUs. The API is based on the Gather-Apply-Scatter (GAS) model as used in GraphLab. To deliver high performance computation and efficiently utilize the high memory bandwidth of GPUs, MapGraph’s CUDA kernels use multiple sophisticated strategies, such as vertex-degree-dependent dynamic parallelism granularity and frontier compaction.
  • New algorithms can be implemented in a few hours that fully exploit the data-level parallelism of the GPU and offer throughput of up to 3 billion traversed edges per second on a single GPU.
  • Preliminary results for the multi-GPU version of MapGraph have traversal rates of nearly 30 GTEPS (30 billion traversed edges per second) on a scale-free random graph with 4.3 billion directed edges using a 64 GPU cluster. See the multi-GPU paper referenced below for details.
  • The MapGraph API also comes in a CPU-only version that is currently packaged and distributed with the bigdata open-source graph database. GAS programs operate over the graph data loaded into the database and are accessed via either a Java API or a SPARQL 1.1 Service Call. Packaging the GPU version inside bigdata will be in a future release.

MapGraph is under the Apache 2 license. You can download MapGraph from http://sourceforge.net/projects/mpgraph/. For the latest version of this documentation, see http://mapgraph.io. You can subscribe to receive notice of future updates on the project home page. For open source support, please ask a question on the MapGraph mailing lists or file a ticket. To inquire about commercial support, please email us at licenses@bigdata.com. You can follow MapGraph and the bigdata graph database platform at http://www.bigdata.com/blog.

This work was (partially) funded by the DARPA XDATA program under AFRL Contract #FA8750-13-C-0002.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. D14PC00029.

MapGraph Publications

You do have to wonder when the folks at Systap sleep. 😉 This is the same group that produced BlazeGraph, recently adopted by WikiData. Granted, WikiData has only 13.6 million data items as of today, but it isn’t “small” data.

The rest of the page has additional pointers and explanations for MapGraph.
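If the Gather-Apply-Scatter bullet above reads as jargon, here is a toy, CPU-only sketch of the pattern using level-synchronous BFS. To be clear, this is my illustration of the abstraction, not MapGraph code, and all the names are mine; MapGraph runs the same phases as data-parallel CUDA kernels over the current frontier.

import java.util.*;

// Toy CPU sketch of the Gather-Apply-Scatter (GAS) pattern, shown as
// level-synchronous BFS. Illustration only, not MapGraph code.
public class GasBfsSketch {

    /** Returns the BFS depth of every vertex, or -1 if unreachable. */
    static int[] bfs(List<List<Integer>> adj, int source) {
        int[] depth = new int[adj.size()];
        Arrays.fill(depth, -1);
        depth[source] = 0;
        Set<Integer> frontier = new HashSet<>(List.of(source));

        while (!frontier.isEmpty()) {
            // Gather: neighbors of the frontier collect candidate depths.
            Map<Integer, Integer> gathered = new HashMap<>();
            for (int u : frontier)
                for (int v : adj.get(u))
                    gathered.merge(v, depth[u] + 1, Math::min);

            // Apply + Scatter: vertices whose value improves adopt it and
            // become the next frontier ("frontier compaction").
            Set<Integer> next = new HashSet<>();
            for (Map.Entry<Integer, Integer> e : gathered.entrySet())
                if (depth[e.getKey()] == -1) {
                    depth[e.getKey()] = e.getValue();
                    next.add(e.getKey());
                }
            frontier = next;
        }
        return depth;
    }

    public static void main(String[] args) {
        // Edges: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 3
        List<List<Integer>> adj = List.of(
                List.of(1, 2), List.of(2), List.of(3), List.of());
        System.out.println(Arrays.toString(bfs(adj, 0))); // [0, 1, 1, 2]
    }
}

The per-iteration frontier set is what the MapGraph materials call frontier compaction; the GPU win comes from running the gather and scatter loops over edges in parallel rather than sequentially as above.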

Enjoy!

May 27, 2014

Bigdata and Blueprints

Filed under: bigdata®,Blueprints,Graphs,Gremlin,Rexster,TinkerPop — Patrick Durusau @ 4:04 pm

Bigdata and Blueprints

From the webpage:

Blueprints is an open-source property graph model interface useful for writing applications on top of a graph database. Gremlin is a domain specific language for traversing property graphs that comes with an excellent REPL useful for interacting with a Blueprints database. Rexster exposes a Blueprints database as a web service and comes with a web-based workbench application called DogHouse.

To get started with bigdata via Blueprints, Gremlin, and Rexster, start by getting your bigdata server running per the instructions here.

Then, go and download some sample GraphML data. The Tinkerpop Property Graph is a good starting point.
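If you have not written against Blueprints before, the round trip is short. Here is a minimal sketch, assuming Blueprints 2.x on the classpath and the TinkerPop sample file saved locally as graph-example-1.xml (the file name and class names are my assumptions); swap TinkerGraph for the bigdata-backed Graph implementation once your server is running:

import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.tg.TinkerGraph;
import com.tinkerpop.blueprints.util.io.graphml.GraphMLReader;

import java.io.FileInputStream;
import java.io.InputStream;

// Load sample GraphML into a Blueprints Graph and walk its vertices.
// TinkerGraph is the in-memory reference implementation; the point of
// Blueprints is that a database-backed Graph accepts the same calls.
public class GraphMLExample {
    public static void main(String[] args) throws Exception {
        Graph graph = new TinkerGraph();
        try (InputStream in = new FileInputStream("graph-example-1.xml")) {
            GraphMLReader.inputGraph(graph, in);
        }
        for (Vertex v : graph.getVertices()) {
            System.out.println(v.getId() + ": " + v.getProperty("name"));
        }
        graph.shutdown();
    }
}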

Just in case you aren’t familiar with bigdata(R):

bigdata(R) is a scale-out storage and computing fabric supporting optional transactions, very high concurrency, and very high aggregate IO rates. The bigdata RDF/graph database can load 1B edges in under one hour on a 15 node cluster. Bigdata operates in a single machine mode (Journal), a highly available replication cluster mode (HAJournalServer), and a horizontally sharded cluster mode (BigdataFederation). The Journal provides fast scalable ACID indexed storage for very large data sets, up to 50 billion edges. The HAJournalServer adds replication, online backup, horizontal scaling of query, and high availability. The federation provides fast scalable shard-wise parallel indexed storage using dynamic sharding and shard-wise ACID updates and incremental cluster size growth. Both platforms support fully concurrent readers with snapshot isolation. (http://sourceforge.net/projects/bigdata/)

So, this is a major event for Blueprints.

I first saw this in a tweet by Marko A. Rodriguez.

May 11, 2012

Picard and Dathon at El-Adrel

Filed under: bigdata®,Graphs,SQL — Patrick Durusau @ 4:50 pm

Orri Erling’s account of meeting Bryan Thompson reminded me of Picard and Dathon at El-Adrel, albeit with happier results.

See what you think:

I gave an invited talk (“Virtuoso 7 – Column Store and Adaptive Techniques for Graph” (Slides (ppt))) at the Graph Data Management Workshop at ICDE 2012.

Bryan Thompson of Systap (Bigdata® RDF store) was also invited, so we got to talk about our common interests. He told me about two cool things they have recently done, namely introducing tables to SPARQL, and adding a way of reifying statements that does not rely on extra columns. The table business is just about being able to store a multicolumn result set into a named persistent entity for subsequent processing. But this amounts to a SQL table, so the relational model has been re-arrived at, once more, from practical considerations.

The reification just packs all the fields of a triple (or quad) into a single string, and this string is then used as an RDF S or O (Subject or Object), less frequently a P or G (Predicate or Graph). This works because Bigdata® has variable length fields in all columns of the triple/quad table. The query notation then accepts a function-looking thing in a triple pattern to mark reification. Nice.

Virtuoso has a variable length column in only the O but could of course have one also in S and even in P and G. The column store would still compress the same as long as reified values did not occur. These values, on the other hand, would be unlikely to compress very well, but run length and dictionary encoding would always work.

So, we could do it like Bigdata®, or we could add a “quad ID” column to one of the indices, to give a reification ID to quads. Again no penalty in a column store, if you do not access the column. Or we could make an extra table of PSOG->R.

Yet another variation would be to make the SPOG concatenation a literal that is interned in the RDF literal table, and then used as any literal would be in the O, and as an IRI in a special range when occurring as S. The relative merits depend on how often something will be reified and on whether one wishes to SELECT based on parts of reification. Whichever the case may be, the idea of a function-looking placeholder for a reification is a nice one, and we should make a compatible syntax if we do special provenance/reification support. The model in the RDF reification vocabulary is a non-starter, and a thing that discredits the sem web for anyone coming from the database world.
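To see what the packing trick buys, recall what it replaces: the standard RDF reification vocabulary spends four extra statements (rdf:Statement, rdf:subject, rdf:predicate, rdf:object) just to name one triple before you can say anything about it. The scheme Bryan described collapses that naming into a single term. Here is a toy sketch of the mechanism in plain Java; it is my illustration, not Bigdata's actual encoding or API, and the <<...>> delimiters and the :source/:confidence properties are made up for the example.

import java.util.ArrayList;
import java.util.List;

// Toy illustration of reification-by-packing: serialize the fields of a
// triple into one string, then use that string as the Subject of further
// statements. Not Bigdata's encoding or API.
public class PackedReification {
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        public String toString() { return s + " " + p + " " + o + " ."; }
    }

    public static void main(String[] args) {
        List<Triple> store = new ArrayList<>();

        // The base statement: :alice :knows :bob
        Triple base = new Triple(":alice", ":knows", ":bob");
        store.add(base);

        // Pack the whole triple into a single string-valued term
        // (the <<...>> delimiters are an arbitrary choice here)...
        String sid = "<<" + base.s + " " + base.p + " " + base.o + ">>";

        // ...and hang provenance off it: no extra columns, and no
        // rdf:Statement / rdf:subject / rdf:predicate / rdf:object quartet.
        store.add(new Triple(sid, ":source", ":erling-blog-2012"));
        store.add(new Triple(sid, ":confidence", "\"0.9\""));

        store.forEach(System.out::println);
    }
}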

Pushing past the metaphors, it sounds like both Orri and Bryan are working on interesting projects. 😉

December 20, 2011

bigdata®

Filed under: bigdata®,NoSQL — Patrick Durusau @ 8:23 pm

bigdata®

Bryan Thompson, one of the creators of bigdata(R), was a member of the effort that resulted in the XTM syntax for topic maps.

If Bryan says it scales, it scales.

What I did not see was the ability to document mappings between different data items that represent the same subjects, or the ability to query such mappings. Still, on further digging I may uncover something that works that way.

From the webpage:

This is a major version release of bigdata(R). Bigdata is a horizontally-scaled, open-source architecture for indexed data with an emphasis on RDF capable of loading 1B triples in under one hour on a 15 node cluster. Bigdata operates in both a single machine mode (Journal) and a cluster mode (Federation). The Journal provides fast scalable ACID indexed storage for very large data sets, up to 50 billion triples / quads. The federation provides fast scalable shard-wise parallel indexed storage using dynamic sharding and shard-wise ACID updates and incremental cluster size growth. Both platforms support fully concurrent readers with snapshot isolation.

Distributed processing offers greater throughput but does not reduce query or update latency. Choose the Journal when the anticipated scale and throughput requirements permit. Choose the Federation when the administrative and machine overhead associated with operating a cluster is an acceptable tradeoff to have essentially unlimited data scaling and throughput.

See [1,2,8] for instructions on installing bigdata(R), [4] for the javadoc, and [3,5,6] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata, see [7].

Starting with the 1.0.0 release, we offer a WAR artifact [8] for easy installation of the single machine RDF database. For custom development and cluster installations we recommend checking out the code from SVN using the tag for this release. The code will build automatically under Eclipse. You can also build the code using the Ant script. The cluster installer requires the use of the Ant script.

You can download the WAR from:

http://sourceforge.net/projects/bigdata/

You can checkout this release from:

https://bigdata.svn.sourceforge.net/svnroot/bigdata/tags/BIGDATA_RELEASE_1_1_0

New features:

  • Fast, scalable native support for SPARQL 1.1 analytic queries;
  • 100% Java memory manager leverages the JVM native heap (no GC);
  • New extensible hash tree index structure.

Feature summary:

  • Single machine data storage to ~50B triples/quads (RWStore);
  • Clustered data storage is essentially unlimited;
  • Simple embedded and/or webapp deployment (NanoSparqlServer);
  • Triples, quads, or triples with provenance (SIDs);
  • Fast 100% native SPARQL 1.0 evaluation;
  • Integrated “analytic” query package;
  • Fast RDFS+ inference and truth maintenance;
  • Fast statement level provenance mode (SIDs).

Road map [3]:

  • Simplified deployment, configuration, and administration for clusters; and
  • High availability for the journal and the cluster.

(footnotes omitted)
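The “100% Java memory manager” bullet in the new features list is the one that catches my eye: the point is to keep index and cache pages in buffers outside the garbage-collected Java heap, so a large cache does not translate into GC pauses. Here is a minimal sketch of that idea using direct ByteBuffers, the standard Java mechanism for off-heap storage; it assumes nothing about bigdata's actual memory manager beyond what the bullet says, and the class is my own toy.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Toy sketch of a "native heap" allocator: a direct ByteBuffer lives
// outside the garbage-collected Java heap, and we hand out offsets into
// it ourselves. Illustration only, not bigdata's memory manager.
public class DirectArena {
    private final ByteBuffer arena = ByteBuffer.allocateDirect(1 << 20); // 1 MiB off-heap

    /** Copies data into the arena and returns its offset (a poor man's address). */
    int allocate(byte[] data) {
        int offset = arena.position();
        arena.put(data);
        return offset;
    }

    byte[] read(int offset, int length) {
        byte[] out = new byte[length];
        ByteBuffer view = arena.duplicate();
        view.position(offset);
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        DirectArena a = new DirectArena();
        int at = a.allocate("hello, off-heap".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(a.read(at, 15), StandardCharsets.UTF_8));
    }
}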

PS: Jack Park forwarded this to my attention. Will have to download and play with it over the holidays.
