Archive for the ‘Graph Traversal’ Category

hyperloglog-redis

Monday, February 18th, 2013

hyperloglog-redis

From the webpage:

This gem is a pure Ruby implementation of the HyperLogLog algorithm for estimating cardinalities of sets observed via a stream of events. A Redis instance is used for storing the counters. A minimal example:

require 'redis'
require 'hyperloglog-redis'

counter = HyperLogLog::Counter.new(Redis.new)
['john', 'paul', 'george', 'ringo', 'john', 'paul'].each 
       do |beatle|
  counter.add('beatles', beatle)
end

puts "There are approximately #{counter.count('beatles')} 
        distinct Beatles"

Each HyperLogLog counter uses a small, fixed amount of space but can estimate the cardinality of any set of up to around a billion values with relative error of 1.04 / Math.sqrt(2 ** b) with high probability, where b is a parameter passed to the HyperLogLog::Counter initializer that defaults to 10. With b = 10, each counter is represented by a 1 KB string in Redis and we get an expected relative error of 3%. Contrast this with the amount of space needed to compute set cardinality exactly, which is over 100 MB for a even a bit vector representing a set with a billion values.

The basic idea of HyperLogLog (and its predecessors PCSA, LogLog, and others) is to apply a good hash function to each value observed in the stream and record the longest run of zeros seen as a prefix of any hashed value. If the hash function is good, the bits in any hashed value should be close to statistically independent, so seeing a value that starts with exactly X zeros should happen with probability close to 2 ** -(X + 1). So, if you’ve seen a run of 5 zeros in one of your hash values, you’re likely to have around 2 ** 6 = 64 values in the underlying set. The actual implementation and analysis are much more advanced than this, but that’s the idea.

This gem implements a few useful extensions to the basic HyperLogLog algorithm which allow you to estimate unions and intersections of counters as well as counts within specific time ranges. These extensions are described in detail below.

The HyperLogLog algorithm is described and analyzed in the paper “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm” by Flajolet, Fusy, Gandouet, and Meunier. Our implementation closely follows the program described in Section 4 of that paper.

The same paper is mentioned in: Count a billion distinct objects w/ 1.5KB of Memory (Coarsening Graph Traversal). Consult the implementation details there as well.

I first saw this in NoSQL Weekly, Issue 116.

GRADES: Graph Data-management Experiences & Systems

Saturday, December 29th, 2012

GRADES: Graph Data-management Experiences & Systems

Workshop: Sunday June 23, 2013

Papers Due: March 31, 2013

Notification: April 22, 2013

Camera-ready: May 19, 2013

Workshop Scope:

Application Areas

A new data economy is emerging, based on the analysis of distributed, heterogeneous, and complexly structured data sets. GRADES focuses on the problem of managing such data, specifically when data takes the form of graphs that connect many millions of nodes, and the worth of the data and its analysis is not only in the attribute values of these nodes, but in the way these nodes are connected. Specific application areas that exhibit the growing need for management of such graph shaped data include:

  • life science analytics, e.g., tracking relationships between illnesses, genes, and molecular compounds.
  • social network marketing, e.g., identifying influential speakers and trends propagating through a community.
  • digital forensics, e.g., analyzing the relationships between persons and entities for law enforcement purposes.
  • telecommunication network analysis, e.g., directed at fixing network bottlenecks and costing of network traffic.
  • digital publishing, e.g., enriching entities occurring in digital content with external data sources, and finding relationships among the entities.

Perspectives

The GRADES workshop solicits contributions from two perspectives:

  • Experiences. This includes topics that describe use case scenarios, datasets, and analysis opportunities occurring in real-life graph-shaped, ans well as benchmark descriptions and benchmark results.
  • Systems. This includes topics that describe data management system architectures for processing of Graph and RDF data, and specific techniques and algorithms employed inside such systems.

The combination of the two (Experiences with Systems) and benchmarking RDF and graph database systems, is of special interest.

Topics Of Interest

The following is a non-exhaustive list describing the scope of GRADES:

  • vision papers describing potential applications and benefits of graph data management.
  • descriptions of graph data management use cases and query workloads.
  • experiences with applying data management technologies in such situations.
  • experiences or techniques for specific operations such as traversals or RDF reasoning.
  • proposals for benchmarks for data integration tasks (instance matching and ETL techniques).
  • proposals for benchmarks for RDF and graph database workloads.
  • evaluation of benchmark performance results on RDF or graph database systems.
  • system descriptions covering RDF or graph database technology.
  • data and index structures that can be used in RDF and graph database systems.
  • query processing and optimization algorithms for RDF and graph database systems.
  • methods and technique for measuring graph characteristics.
  • methods and techniques for visualizing graphs and graph query results.
  • proposals and experiences with graph query languages.

The GRADES workshop is co-located and sponsored by SIGMOD in recognition that these problems are only interesting at large-scale and the contribution of the SIGMOD community to handle such topics on large amounts of data of many millions or even billions of nodes and edges is of critical importance.

That sounds promising doesn’t it? (Please email, copy, post, etc.)

GraphPack

Wednesday, July 11th, 2012

GraphPack

From the webpage:

GraphPack is a network of autonomous services that manage graph structures. Each node in those graphs may refer to a node in another service, effectively forming a distributed graph. GraphPack deals with the processing of such decentralized graphs. GraphPack supports its own traverse/query language (inspired by neo4j::cypher) that can executed as transparently distributed traverses.

Amit Portnoy wrote about GraphPack on the neo4j mailing list:

The prototype, called GraphPack, has a very lite design, actual persistence and communication aspects can be easily pluged-in by injection (using Guice).

GraphPack enables transperantly distributed traverses (in a decentralized graph), which can be specified by Cypher-inspired traverse specification language.

That is, clients of a GraphPack service (which may have a graph the refer to other nodes in other GraphPack services) can write a cypher-like expressions and simply receive a result, while the actual implementation may make many remote communication steps. This is done by deriving a new traverse specification in every edge a long specified paths and sending this derived specification to the next node (in other words, the computation moves along the nodes that are matched by the traverse specification).

Sounds like there should be lessons here for distributed topic maps. Yes?

Theory and Techniques for Synthesizing a Family of Graph Algorithms

Thursday, July 5th, 2012

Theory and Techniques for Synthesizing a Family of Graph Algorithms by Srinivas Nedunuri, William R. Cook, and Douglas R. Smith.

Abstract:

Although Breadth-First Search (BFS) has several advantages over Depth-First Search (DFS) its prohibitive space requirements have meant that algorithm designers often pass it over in favor of DFS. To address this shortcoming, we introduce a theory of Efficient BFS (EBFS) along with a simple recursive program schema for carrying out the search. The theory is based on dominance relations, a long standing technique from the field of search algorithms. We show how the theory can be used to systematically derive solutions to two graph algorithms, namely the Single Source Shortest Path problem and the Minimum Spanning Tree problem. The solutions are found by making small systematic changes to the derivation, revealing the connections between the two problems which are often obscured in textbook presentations of them.

I don’t think it would satisfy the formal derivation requirements of the authors but am curious why the dominance relationships are derived? Wouldn’t it be easier to ask a user at each node, go/no-go as far as dominance relationship? Allowing an interactive application of the search algorithms based upon the users dominance judgement?

I say that because the derivation strategy depends upon the developer’s interpretation of dominance from a field that may be unfamiliar or incorrectly communicated by a user. To be sure the derivation and results may be formally correct but not produce an answer that is “correct” in the view of a user.

Not to mention that once a derivation is formally “correct,” there would be resistance to changing a “correct” derivation.

A interactive, dynamic, symbiotic search experience between users and their search systems is more likely to produce results thought to be “correct” by users. (What other measure of “correctness” comes to mind?)

PS: Srinivas Nedunuri‘s homepage promises a copy of his PhD dissertation, which formed the basis for this paper, soon!

Notes on the analysis of large graphs

Friday, May 18th, 2012

Notes on the analysis of large graphs

From the post:

This post is part of a series on managing and analyzing graph data. Posts to date include:

My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one interesting set of issues — analyzing large graphs, specifically ones that don’t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.

How big can a graph be? That of course depends on:

  • The number of nodes. If the nodes of a graph are people, there’s an obvious upper bound on the node count. Even if you include their houses, cars, and so on, you’re probably capped in the range of 10 billion.
  • The number of edges. (Even more important than the number of nodes.) If every phone call, email, or text message in the world is an edge, that’s a lot of edges.
  • The typical size of a (node, edge, node) triple. I don’t know why you’d have to go much over 100 bytes post-compression*, but maybe I’m overlooking something.

*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include weights, timestamps, and so on, but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.

I would think the specifics, for nodes and/or edges are going to depend upon the data set and your requirements for it. Neither one of which can be judged in the abstract or in advance.

Comments?

Count a billion distinct objects w/ 1.5KB of Memory (Coarsening Graph Traversal)

Friday, April 6th, 2012

Big Data Counting: How to count a billion distinct objects using only 1.5KB of Memory

From the post:

This is a guest post by Matt Abrams (@abramsm), from Clearspring, discussing how they are able to accurately estimate the cardinality of sets with billions of distinct elements using surprisingly small data structures. Their servers receive well over 100 billion events per month.

At Clearspring we like to count things. Counting the number of distinct elements (the cardinality) of a set is challenge when the cardinality of the set is large.

Cardinality estimation algorithms trade space for accuracy. To illustrate this point we counted the number of distinct words in all of Shakespeare’s works using three different counting techniques. Note that our input dataset has extra data in it so the cardinality is higher than the standard reference answer to this question. The three techniques we used were Java HashSet, Linear Probabilistic Counter, and a Hyper LogLog Counter. Here are the results:

Counter

Bytes Used

Count

Error

HashSet

10447016

67801

0%

Linear

3384

67080

1%

HyperLogLog

512

70002

3%

 

The table shows that we can count the words with a 3% error rate using only 512 bytes of space. Compare that to a perfect count using a HashMap that requires nearly 10 megabytes of space and you can easily see why cardinality estimators are useful. In applications where accuracy is not paramount, which is true for most web scale and network counting scenarios, using a probabilistic counter can result in tremendous space savings.

The post goes onto describe merging of counters from distributed machines and choosing an acceptable error rate for probabilistic counting.

Question: Can we make graph traversal resemble probabilistic counting?

I will have to work on a graphic but see if this word picture works for the moment.

Assume we have a 3-D graph and the top layer of nodes is composed of basketballs, the basketballs are sitting on a layer of baseballs, and the baseballs are sitting on top of marbles. Each layer represents the nodes and edges below it, except that the representation is coarser at the baseball level and coarser still at the level of basketballs.

Traversal at the “level” of basketballs may be sufficient until we reach a point of interest and then we traverse into greater detail levels of the graph.

Another illustration.

You draw and traverse from node a to node d the following graph:

Graph Traversal Illustration

Easy enough.

Now, same traversal but choose a molecule located in a to traverse to d and travel along edges between molecules.

Or, same traversal but choose an atom located in a to traverse to d and travel along edges between atoms.

In some sense the “same” path but substantially higher traversal cost at the level of greater detail.

Has someone suggested coarsening graph traversal (or having multiple levels of traversal)? Sure it has happened. Would appreciate a pointer.


The authors cite: Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm (2007) by Philippe Flajolet , Éric Fusy , Olivier Gandouet, et al.

And, stream-lib, a project with many useful implementations of the strategies in the post.