Archive for the ‘Probablistic Counting’ Category

hyperloglog-redis

Monday, February 18th, 2013

hyperloglog-redis

From the webpage:

This gem is a pure Ruby implementation of the HyperLogLog algorithm for estimating cardinalities of sets observed via a stream of events. A Redis instance is used for storing the counters. A minimal example:

require 'redis'
require 'hyperloglog-redis'

counter = HyperLogLog::Counter.new(Redis.new)
['john', 'paul', 'george', 'ringo', 'john', 'paul'].each 
       do |beatle|
  counter.add('beatles', beatle)
end

puts "There are approximately #{counter.count('beatles')} 
        distinct Beatles"

Each HyperLogLog counter uses a small, fixed amount of space but can estimate the cardinality of any set of up to around a billion values with relative error of 1.04 / Math.sqrt(2 ** b) with high probability, where b is a parameter passed to the HyperLogLog::Counter initializer that defaults to 10. With b = 10, each counter is represented by a 1 KB string in Redis and we get an expected relative error of 3%. Contrast this with the amount of space needed to compute set cardinality exactly, which is over 100 MB for a even a bit vector representing a set with a billion values.

The basic idea of HyperLogLog (and its predecessors PCSA, LogLog, and others) is to apply a good hash function to each value observed in the stream and record the longest run of zeros seen as a prefix of any hashed value. If the hash function is good, the bits in any hashed value should be close to statistically independent, so seeing a value that starts with exactly X zeros should happen with probability close to 2 ** -(X + 1). So, if you’ve seen a run of 5 zeros in one of your hash values, you’re likely to have around 2 ** 6 = 64 values in the underlying set. The actual implementation and analysis are much more advanced than this, but that’s the idea.

This gem implements a few useful extensions to the basic HyperLogLog algorithm which allow you to estimate unions and intersections of counters as well as counts within specific time ranges. These extensions are described in detail below.

The HyperLogLog algorithm is described and analyzed in the paper “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm” by Flajolet, Fusy, Gandouet, and Meunier. Our implementation closely follows the program described in Section 4 of that paper.

The same paper is mentioned in: Count a billion distinct objects w/ 1.5KB of Memory (Coarsening Graph Traversal). Consult the implementation details there as well.

I first saw this in NoSQL Weekly, Issue 116.

Count a billion distinct objects w/ 1.5KB of Memory (Coarsening Graph Traversal)

Friday, April 6th, 2012

Big Data Counting: How to count a billion distinct objects using only 1.5KB of Memory

From the post:

This is a guest post by Matt Abrams (@abramsm), from Clearspring, discussing how they are able to accurately estimate the cardinality of sets with billions of distinct elements using surprisingly small data structures. Their servers receive well over 100 billion events per month.

At Clearspring we like to count things. Counting the number of distinct elements (the cardinality) of a set is challenge when the cardinality of the set is large.

Cardinality estimation algorithms trade space for accuracy. To illustrate this point we counted the number of distinct words in all of Shakespeare’s works using three different counting techniques. Note that our input dataset has extra data in it so the cardinality is higher than the standard reference answer to this question. The three techniques we used were Java HashSet, Linear Probabilistic Counter, and a Hyper LogLog Counter. Here are the results:

Counter

Bytes Used

Count

Error

HashSet

10447016

67801

0%

Linear

3384

67080

1%

HyperLogLog

512

70002

3%

 

The table shows that we can count the words with a 3% error rate using only 512 bytes of space. Compare that to a perfect count using a HashMap that requires nearly 10 megabytes of space and you can easily see why cardinality estimators are useful. In applications where accuracy is not paramount, which is true for most web scale and network counting scenarios, using a probabilistic counter can result in tremendous space savings.

The post goes onto describe merging of counters from distributed machines and choosing an acceptable error rate for probabilistic counting.

Question: Can we make graph traversal resemble probabilistic counting?

I will have to work on a graphic but see if this word picture works for the moment.

Assume we have a 3-D graph and the top layer of nodes is composed of basketballs, the basketballs are sitting on a layer of baseballs, and the baseballs are sitting on top of marbles. Each layer represents the nodes and edges below it, except that the representation is coarser at the baseball level and coarser still at the level of basketballs.

Traversal at the “level” of basketballs may be sufficient until we reach a point of interest and then we traverse into greater detail levels of the graph.

Another illustration.

You draw and traverse from node a to node d the following graph:

Graph Traversal Illustration

Easy enough.

Now, same traversal but choose a molecule located in a to traverse to d and travel along edges between molecules.

Or, same traversal but choose an atom located in a to traverse to d and travel along edges between atoms.

In some sense the “same” path but substantially higher traversal cost at the level of greater detail.

Has someone suggested coarsening graph traversal (or having multiple levels of traversal)? Sure it has happened. Would appreciate a pointer.


The authors cite: Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm (2007) by Philippe Flajolet , Éric Fusy , Olivier Gandouet, et al.

And, stream-lib, a project with many useful implementations of the strategies in the post.