Bye-bye Giraph-Gremlin, Hello Hadoop-Gremlin with GiraphGraphComputer Support

Bye-bye Giraph-Gremlin, Hello Hadoop-Gremlin with GiraphGraphComputer Support by Marko A. Rodriguez.

There are days when I wonder if Marko ever sleeps or if the problem of human cloning has already been solved.

This is one of those day:

The other day Dan LaRocque and I were working on a Hadoop-based GraphComputer for Titan so we could do bulk loading into Titan. First we wrote the BulkLoading VertexProgram: bulkloader/
…and then realized, “huh, we can just execute this with GiraphGraph. Huh! We can just execute this with TinkerGraph!” In fact, as a side note, the BulkLoaderVertexProgram is general enough to work for any TinkerPop Graph.

So great, we can just use GiraphGraph (or any other TinkerPop implementation that has a GraphComputer (e.g. TinkerGraph)). However, Titan is all about scale and when the size of your graph is larger than the total RAM in your cluster, we will still need a MapReduce-based GraphComputer. Thinking over this, it was realized: Giraph-Gremlin is very little Giraph and mostly just Hadoop — InputFormats, HDFS interactions, MapReduce wrappers, Configuration manipulations, etc. Why not make GiraphGraphComputer just a particular GraphComputer supported by Gremlin-Hadoop (a new package).

With that, Giraph-Gremlin no longer exists. Hadoop-Gremlin now exists. Hadoop-Gremlin behaves the exact same way as Giraph-Gremlin, save that we will be adding a MapReduceGraphComputer to Hadoop-Gremlin. In this way, Hadoop-Gremlin will support two GraphComputer: GiraphGraphComputer and MapReduceGraphComputer.

The master/ branch is updated and the docs for Giraph have been re-written, though I suspect there will be some dangling references in the docs here and there for a while.

Up next, Matthias and I will create MapReduceGraphComputer that is smart about “partitioned vertices” — so you don’t get the Faunus scene where if a vertex doesn’t fit in memory, an exception. This will allow vertices with as many edges as you want (though your data model is probably shotty if you have 100s of millions of edges on one vertex 😉 ……………….. Matthias will be driving that effort and I’m excited to learn about the theory of vertex partitioning (i.e. splitting a single vertex across machines).


Comments are closed.