## Graph partitioning in MapReduce with Cascading

Graph partitioning in MapReduce with Cascading in two parts by Friso van Vollenhoven.

From the post:

I have recently had the joy of doing MapReduce based graph partitioning. Here’s a post about how I did that. I decided to use Cascading for writing my MR jobs, as it is a lot less verbose than raw Java based MR. The graph algorithm consists of one step to prepare the input data and then a iterative part, that runs until convergence. The program uses a Hadoop counter to check for convergence and will stop iterating once there. All code is available. Also, the explanation has colorful images of graphs. (And everything is written very informally and there is no math.)

From the post:

In a previous post, we talked about finding the partitions in a disconnected graph using Cascading. In reality, most graphs are actually fully connected, so only being able to partition already disconnected graphs is not very helpful. In this post, we’ll take a look at partitioning a connected graph based on some criterium for creating a partition boundary.

Very accessible explanations complete with source code (github).

What puzzles me about the efforts to develop an algorithm to automatically partition a graph database is that there is no corresponding effort to develop an algorithm to automatically partition relational databases. Yet we know that relational databases can be represented as graphs. So what’s the difference?

Conceding that graphs such as Facebook, the WWW, etc., have grown without planning and so aren’t subject to the same partitioning considerations as relational databases. But isn’t there a class of graphs that are closer to relational databases than Facebook?

Consider that diverse research facilities for a drug company could use graph databases for research purposes but that doesn’t mean that any user can create edges between nodes at random. Any more than a user of a sharded database can create arbitrary joins.

I deeply enjoy graph posts such as these by Friso van Vollenhoven but the “cool” aspects of using MapReduce should not keep us from seeing heuristics we can use to enhance the performance of graph databases.