I first encountered the Do the Schimmy… posts via Alex Popescu’s myNoSQL site, under Efficient Large-Scale Graph Analysis with Hadoop.
An excellent pair of articles on the use (and improvement) of Hadoop for graph processing.
Do the Schimmy: Efficient Large-Scale Graph Analysis with Hadoop
Question: What do PageRank, the Kevin Bacon game, and DNA sequencing all have in common?
As you might know, PageRank is one of the many features Google uses for computing the importance of a webpage based on the other pages that link to it. The intuition is that pages linked from many important pages are themselves important. In the Kevin Bacon game, we try to find the shortest path from Kevin Bacon to your favorite movie star based on shared movie appearances. For example, there is a two-hop path from Kevin Bacon to Jason Lee: Kevin Bacon starred in A Few Good Men with Tom Cruise, who also starred in Vanilla Sky with Jason Lee. In the case of DNA sequencing, we compute the full genome sequence of a person (~3 billion nucleotides) from many short DNA fragments (~100 nucleotides) by constructing and searching the genome assembly graph. The assembly graph connects fragments with the same or similar sequences, and thus long paths of a particular form can spell out entire genomes.
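To make the Kevin Bacon example concrete, here is a toy Python sketch of the game as a shortest-path search (breadth-first search) over a costar graph. The graph and names below are purely illustrative, apart from the Kevin Bacon → Tom Cruise → Jason Lee chain from the text, and this is not code from the articles.

```python
# Toy sketch of the Kevin Bacon game as breadth-first search over a costar graph.
# The edges below are illustrative, not a real movie database.
from collections import deque

costars = {
    "Kevin Bacon": ["Tom Cruise", "Kyra Sedgwick"],
    "Tom Cruise": ["Kevin Bacon", "Jason Lee"],
    "Jason Lee": ["Tom Cruise"],
    "Kyra Sedgwick": ["Kevin Bacon"],
}

def bacon_path(graph, start, goal):
    """Return the shortest chain of costars from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(bacon_path(costars, "Kevin Bacon", "Jason Lee"))
# ['Kevin Bacon', 'Tom Cruise', 'Jason Lee']  -> the two-hop path from the text
```

The same search works unchanged on any graph of this shape; the problem at web or genome scale is not the algorithm but the size of the graph it has to walk.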
The common aspect for these and countless other important problems, including those in defense & intelligence, recommendation systems & machine learning, social network analysis, and business intelligence, is the need to analyze enormous graphs: the Web consists of trillions of interconnected pages, IMDB has millions of movies and movie stars, and sequencing a single human genome requires searching for paths between billions of short DNA fragments. At this scale, searching or analyzing a graph on a single machine would be time-consuming at best and totally impossible at worst, especially when the graph cannot possibly be stored in memory.
Do the Schimmy: Efficient Large-Scale Graph Analysis with Hadoop, Part 2
In part 1, we looked at how extremely large graphs can be represented and analyzed in Hadoop/MapReduce. Here in part 2 we will examine this design in more depth to identify inefficiencies, and present some simple solutions that can be applied to many Hadoop/MapReduce graph algorithms. The speedup using these techniques is substantial: as a prototypical example, we were able to reduce the running time of PageRank on a webgraph with 50.2 million vertices and 1.4 billion edges by as much as 69% on a small 20-core Hadoop cluster at the University of Maryland (full details available here). We expect that similar levels of improvement will carry over to many of the other problems we discussed before (the Kevin Bacon game, and DNA sequence assembly in particular).
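To make the part 1 design concrete, here is a minimal, non-Hadoop Python sketch of how one PageRank iteration decomposes into a map step (each vertex distributes its rank along its outgoing edges) and a reduce step (each vertex sums the incoming contributions). Note that the mapper also passes the adjacency list through the shuffle so the reducer can re-emit the graph structure for the next iteration; moving the graph through the shuffle every iteration is roughly the kind of per-iteration overhead the articles set out to reduce. All names, the toy graph, and the damping factor are illustrative assumptions, not code from the articles, and an in-memory dictionary stands in for Hadoop's shuffle.

```python
# Minimal sketch of one PageRank iteration written as map and reduce steps.
# Illustrative only; a dictionary plays the role of Hadoop's shuffle/sort.
from collections import defaultdict

DAMPING = 0.85

def map_vertex(vertex, adjacency, rank):
    """Emit the graph structure plus this vertex's rank split among its out-edges."""
    yield (vertex, ("STRUCTURE", adjacency))
    if adjacency:
        share = rank / len(adjacency)
        for neighbor in adjacency:
            yield (neighbor, ("MASS", share))

def reduce_vertex(vertex, values, num_vertices):
    """Sum the incoming rank mass and re-emit the vertex with its new rank."""
    adjacency, total_mass = [], 0.0
    for kind, payload in values:
        if kind == "STRUCTURE":
            adjacency = payload
        else:
            total_mass += payload
    new_rank = (1 - DAMPING) / num_vertices + DAMPING * total_mass
    return vertex, adjacency, new_rank

def pagerank_iteration(graph, ranks):
    """graph: vertex -> adjacency list; ranks: vertex -> current rank."""
    n = len(graph)
    grouped = defaultdict(list)  # stand-in for the shuffle: group values by key
    for vertex, adjacency in graph.items():
        for key, value in map_vertex(vertex, adjacency, ranks[vertex]):
            grouped[key].append(value)
    new_ranks = {}
    for vertex, values in grouped.items():
        _, _, rank = reduce_vertex(vertex, values, n)
        new_ranks[vertex] = rank
    return new_ranks

# Tiny made-up link graph: every iteration reshuffles both rank mass and structure.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = {v: 1.0 / len(graph) for v in graph}
for _ in range(20):
    ranks = pagerank_iteration(graph, ranks)
print(ranks)  # "C" ends up highest: it is linked from the most pages
```

Even in this toy form you can see that the adjacency lists travel alongside the rank mass on every pass, which on a billion-edge webgraph is a lot of data to move for information that never changes between iterations.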