HyperANF: Graph Neighborhood Functions < 15 Minutes On a Laptop

Wednesday, March 14th, 2012

HyperANF: Approximating the Neighbourhood Function of Very Large Graphs on a Budget (2011) by Paolo Boldi, Marco Rosa, and Sebastiano Vigna.

Inducement to read the abstract or paper:

Recently, a MapReduce-based distributed implementation of ANF called HADI [KTA+10] has been presented. HADI runs on one of the fifty largest supercomputers—the Hadoop cluster M45. The only published data about HADI’s performance is the computation of the neighbourhood function of a Kronecker graph with 2 billion links, which required half an hour using 90 machines. HyperANF can compute the same function in less than fifteen minutes on a laptop. (emphasis in original)

Abstract:

The neighbourhood function $N_G(t)$ of a graph $G$ gives, for each $t \in \mathbf{N}$, the number of pairs of nodes $\langle x, y \rangle$ such that $y$ is reachable from $x$ in less than $t$ hops. The neighbourhood function provides a wealth of information about the graph [PGF02] (e.g., it easily allows one to compute its diameter), but it is very expensive to compute it exactly. Recently, the ANF algorithm [PGF02] (approximate neighbourhood function) has been proposed with the purpose of approximating $N_G(t)$ on large graphs. We describe a breakthrough improvement over ANF in terms of speed and scalability. Our algorithm, called HyperANF, uses the new HyperLogLog counters [FFGM07] and combines them efficiently through broadword programming [Knu07]; our implementation uses task decomposition to exploit multi-core parallelism. With HyperANF, for the first time we can compute in a few hours the neighbourhood function of graphs with billions of nodes with a small error and good confidence using a standard workstation.

Then, we turn to the study of the distribution of distances between reachable nodes (that can be efficiently approximated by means of HyperANF), and discover the surprising fact that its index of dispersion provides a clear-cut characterisation of proper social networks vs. web graphs. We thus propose the spid (Shortest-Paths Index of Dispersion) of a graph as a new, informative statistic that is able to discriminate between the above two types of graphs. We believe this is the first proposal of a significant new non-local structural index for complex networks whose computation is highly scalable.
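To make the two quantities in the abstract concrete, here is a minimal Python sketch that computes the neighbourhood function exactly (one breadth-first search per node) and then derives an index of dispersion from it. This is not the authors' code, and it is only feasible on tiny graphs, which is the whole reason approximations exist. The function names, the adjacency-dict input format, the "at most t hops" convention (the abstract's "less than t" wording would shift the index by one), and the choice to leave zero-length pairs out of the distance distribution are all my assumptions.

```python
from collections import deque


def neighbourhood_function(adj):
    """Exact neighbourhood function of a directed graph.

    adj maps every node to an iterable of its successors (every node must
    appear as a key). Returns a list nf where nf[t] is the number of ordered
    pairs (x, y) with y reachable from x in at most t hops (assumed
    convention; pairs (x, x) count at t = 0).
    """
    pairs_at = {}  # distance -> number of ordered pairs at exactly that distance
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # plain breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            pairs_at[d] = pairs_at.get(d, 0) + 1

    nf, running = [], 0
    for t in range(max(pairs_at) + 1):
        running += pairs_at.get(t, 0)
        nf.append(running)
    return nf


def spid(nf):
    """Variance-to-mean ratio of the distances between reachable pairs.

    Assumption of this sketch: only pairs of distinct nodes (distance >= 1)
    enter the distribution.
    """
    exact = {t: nf[t] - nf[t - 1] for t in range(1, len(nf))}
    total = sum(exact.values())
    if total == 0:
        raise ValueError("graph has no pair of distinct reachable nodes")
    mean = sum(t * c for t, c in exact.items()) / total
    variance = sum(c * (t - mean) ** 2 for t, c in exact.items()) / total
    return variance / mean


# Usage: a directed path 0 -> 1 -> 2 -> 3.
path = {0: [1], 1: [2], 2: [3], 3: []}
nf = neighbourhood_function(path)
print(nf, spid(nf))
```

The differences between consecutive values of the neighbourhood function give the number of pairs at each exact distance, so anything that approximates the function also approximates the distance distribution and hence the dispersion index.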

HyperANF is a new algorithm for studying the structure of large graphs and is part of the WebGraph project. The “large” version of the software handles 2^31 nodes.
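For readers who want a feel for the idea, here is a rough Python sketch of an ANF-style iteration built on simplified HyperLogLog counters. It is an illustration under my own assumptions, not the WebGraph implementation: there is no broadword programming or parallelism, the register count, the 64-bit BLAKE2 hash, and all function names are hypothetical choices, and the large-range correction of the original HyperLogLog estimator is omitted. Each node starts with a counter containing only itself; every round merges in the counters of its successors, and summing the estimates gives an approximation of the number of pairs within t hops.

```python
import hashlib
import math

B = 10      # log2 of the number of registers (illustrative choice)
M = 1 << B  # registers per counter


def _hash64(node):
    """Deterministic 64-bit hash of a node label."""
    digest = hashlib.blake2b(repr(node).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def new_counter(node):
    """Counter holding a single element."""
    regs = [0] * M
    h = _hash64(node)
    index = h >> (64 - B)                # first B bits pick the register
    rest = h & ((1 << (64 - B)) - 1)     # remaining 64 - B bits
    # 1-based position of the leftmost 1-bit in the remaining bits
    regs[index] = (64 - B) - rest.bit_length() + 1
    return regs


def union(a, b):
    """Counter for the union of two sets: register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(regs):
    """Cardinality estimate with the small-range (linear counting) correction."""
    alpha = 0.7213 / (1 + 1.079 / M)     # bias constant, valid for M >= 128
    raw = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw


def approx_neighbourhood_function(adj):
    """adj maps every node to its successors (every node must be a key)."""
    counters = {x: new_counter(x) for x in adj}
    nf = [sum(estimate(c) for c in counters.values())]
    while True:
        updated, changed = {}, False
        for x, successors in adj.items():
            merged = counters[x]
            for y in successors:
                merged = union(merged, counters[y])
            changed = changed or merged != counters[x]
            updated[x] = merged
        if not changed:                  # counters stabilised: stop
            return nf
        counters = updated
        nf.append(sum(estimate(c) for c in counters.values()))


# Usage: the same directed path 0 -> 1 -> 2 -> 3 as a toy input.
print(approx_neighbourhood_function({0: [1], 1: [2], 2: [3], 3: []}))
```

The point of the design is that each counter needs only a fixed number of small registers regardless of how many nodes it has seen, and that a union is just a register-wise maximum, which is what makes whole-graph passes cheap enough for a single machine.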