I don’t find this change reflected in the 4.1 release notes but elsewhere Marko Rodriguez writes:
I tested the new code on a subset of the Friendster data (a 6-node Hadoop cluster and a 6-node Cassandra cluster).
- vertices: 7 minutes to write 39 million vertices at ~100mb/second from the Hadoop to the Cassandra cluster.
- edges: 15 minutes to write 245 million edges at ~40mb/second from the Hadoop to the Cassandra cluster.
This is the fastest bulk load time I’ve seen to date. This means DBpedia can be written in ~20 minutes! I’ve attached an annotated version of the Ganglia monitor to the email that shows the outgoing throughput for the various stages of the MapReduce job. In the past, I was lucky to get 5-10mb/second out of the edge writing stage (this had to do with how I was being dumb about how reduce worked in Hadoop — wasn’t considering the copy/shuffle aspect of the stage).
At this rate, we can load billion-edge graphs in a little over 1 hour. I bet, though, that I can speed this up further with some parameter tweaking, as I noticed that Cassandra was RED HOT and locked up a few times on transaction commits. Anywho, Faunus 0.4.1 is going to be gangbusters!
Approximately one billion edges an hour?
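Close to it, if you run the quoted numbers. A quick sanity check (figures taken from the post; variable names are mine):

```python
# Back-of-the-envelope check of the quoted Faunus 0.4.1 figures.
edges = 245_000_000   # edges written, per the quote
minutes = 15          # edge-writing time, per the quote

edges_per_hour = edges / minutes * 60
minutes_per_billion = 1_000_000_000 / edges_per_hour * 60

print(f"{edges_per_hour:,.0f} edges/hour")          # -> 980,000,000 edges/hour
print(f"{minutes_per_billion:.0f} min per billion") # -> 61 min per billion
```

So ~980 million edges an hour, which matches the "billion-edge graphs in a little over 1 hour" claim.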
It’s not > /dev/null speed but still quite respectable. 😉