Archive for the ‘HyperLogLog’ Category

Open Source Release: java-hll

Friday, January 3rd, 2014

Open Source Release: java-hll

From the post:

We’re happy to announce our newest open-source project, java-hll, a HyperLogLog implementation in Java that is storage-compatible with the previously released postgresql-hll and js-hll implementations. And as the rule of three dictates, we’ve also extracted the storage specification that makes them interoperable into it’s own repository. Currently, all three implementations support reading storage specification v1.0.0, while only the PostgreSQL and Java implementations fully support writing v1.0.0. We hope to bring the JS implementation up to speed, with respect to serialization, shortly.

For reasons to be excited, see my HyperLogLog post archive.

Open Source Release: postgresql-hll

Saturday, May 25th, 2013

Open Source Release: postgresql-hll

From the post:

We’re happy to announce the first open-source release of AK’s PostgreSQL extension for building and manipulating HyperLogLog data structures in SQL, postgresql-hll. We are releasing this code under the Apache License, Version 2.0 which we feel is an excellent balance between permissive usage and liability limitation.

What is it and what can I do with it?

The extension introduces a new data type, hll, which represents a probabilistic distinct value counter that is a hybrid between a HyperLogLog data structure (for large cardinalities) and a simple set (for small cardinalities). These structures support the basic HLL methods: insert, union, and cardinality, and we’ve also provided aggregate and debugging functions that make using and understanding these things a breeze. We’ve also included a way to do schema versioning of the binary representations of hlls, which should allow a clear path to upgrading the algorithm, as new engineering insights come up.

A quick overview of what’s included in the release:

  • C-based extension that provides the hll data structure and algorithms
  • Austin Appleby’s MurmurHash3 implementation and SQL-land wrappers for integer numerics, bytes, and text
  • Full storage specification in STORAGE.markdown
  • Full function reference in REFERENCE.markdown
  • .spec file for rpmbuild
  • Full test suite

A quick note on why we included MurmurHash3 in the extension: we’ve done a good bit of research on the importance of a good hash function when using sketching algorithms like HyperLogLog and we came to the conclusion that it wouldn’t be very user-friendly to force the user to figure out how to get a good hash function into SQL-land. Sure, there are plenty of cryptographic hash functions available, but those are (computationally) overkill for what is needed. We did the research and found MurmurHash3 to be an excellent non-cryptographic hash function in both theory and practice. We’ve been using it in production for a while now with excellent results. As mentioned in the README, it’s of crucial importance to reliably hash the inputs to hlls.

Would you say topic maps aggregate data?

I thought so.

How would you adapt HLL to synonymous identifications?

I ask because of this line in the post:

Essentially, we’re doing interactive set-intersection of operands with millions or billions of entries in milliseconds. This is intersection computed using the inclusion-exclusion principle as applied to hlls:

Performing “…interactive set-intersection of operands with millions or billions of entries in milliseconds…” sounds like an attractive feature for topic map software.

Yes?

HyperLogLog — Cornerstone of a Big Data Infrastructure

Thursday, May 2nd, 2013

HyperLogLog — Cornerstone of a Big Data Infrastructure

From the introduction:

In the Zipfian world of AK, the HyperLogLog distinct value (DV) sketch reigns supreme. This DV sketch is the workhorse behind the majority of our DV counters (and we’re not alone) and enables us to have a real time, in memory data store with incredibly high throughput. HLL was conceived of by Flajolet et. al. in the phenomenal paper HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. This sketch extends upon the earlier Loglog Counting of Large Cardinalities (Durand et. al.) which in turn is based on the seminal AMS work FM-85, Flajolet and Martin’s original work on probabilistic counting. (Many thanks to Jérémie Lumbroso for the correction of the history here. I am very much looking forward to his upcoming introduction to probabilistic counting in Flajolet’s complete works.) UPDATE – Rob has recently published a blog about PCSA, a direct precursor to LogLog counting which is filled with interesting thoughts. There have been a few posts on HLL recently so I thought I would dive into the intuition behind the sketch and into some of the details.

After seeing the HyperLogLog references in Approximate Methods for Scalable Data Mining I started looking for a fuller explanation/illustration of HyperLogLog.

Stumbled on this posting.

Includes a great HyperLogLog (HLL) simulation written in JavaScript.

Enjoy!

Approximate Methods for Scalable Data Mining

Thursday, May 2nd, 2013

Approximate Methods for Scalable Data Mining by Andrew Clegg.

Slides from a presentation at: Data Science London 24/04/13.

To get your interest, a nice illustration of HyperLogLog algorithm, “Billions of distinct values in 1.5KB of RAM with 2% relative error.”

Has a number of other useful illustrations and great references.

hyperloglog-redis

Monday, February 18th, 2013

hyperloglog-redis

From the webpage:

This gem is a pure Ruby implementation of the HyperLogLog algorithm for estimating cardinalities of sets observed via a stream of events. A Redis instance is used for storing the counters. A minimal example:

require 'redis'
require 'hyperloglog-redis'

counter = HyperLogLog::Counter.new(Redis.new)
['john', 'paul', 'george', 'ringo', 'john', 'paul'].each 
       do |beatle|
  counter.add('beatles', beatle)
end

puts "There are approximately #{counter.count('beatles')} 
        distinct Beatles"

Each HyperLogLog counter uses a small, fixed amount of space but can estimate the cardinality of any set of up to around a billion values with relative error of 1.04 / Math.sqrt(2 ** b) with high probability, where b is a parameter passed to the HyperLogLog::Counter initializer that defaults to 10. With b = 10, each counter is represented by a 1 KB string in Redis and we get an expected relative error of 3%. Contrast this with the amount of space needed to compute set cardinality exactly, which is over 100 MB for a even a bit vector representing a set with a billion values.

The basic idea of HyperLogLog (and its predecessors PCSA, LogLog, and others) is to apply a good hash function to each value observed in the stream and record the longest run of zeros seen as a prefix of any hashed value. If the hash function is good, the bits in any hashed value should be close to statistically independent, so seeing a value that starts with exactly X zeros should happen with probability close to 2 ** -(X + 1). So, if you’ve seen a run of 5 zeros in one of your hash values, you’re likely to have around 2 ** 6 = 64 values in the underlying set. The actual implementation and analysis are much more advanced than this, but that’s the idea.

This gem implements a few useful extensions to the basic HyperLogLog algorithm which allow you to estimate unions and intersections of counters as well as counts within specific time ranges. These extensions are described in detail below.

The HyperLogLog algorithm is described and analyzed in the paper “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm” by Flajolet, Fusy, Gandouet, and Meunier. Our implementation closely follows the program described in Section 4 of that paper.

The same paper is mentioned in: Count a billion distinct objects w/ 1.5KB of Memory (Coarsening Graph Traversal). Consult the implementation details there as well.

I first saw this in NoSQL Weekly, Issue 116.

Count a billion distinct objects w/ 1.5KB of Memory (Coarsening Graph Traversal)

Friday, April 6th, 2012

Big Data Counting: How to count a billion distinct objects using only 1.5KB of Memory

From the post:

This is a guest post by Matt Abrams (@abramsm), from Clearspring, discussing how they are able to accurately estimate the cardinality of sets with billions of distinct elements using surprisingly small data structures. Their servers receive well over 100 billion events per month.

At Clearspring we like to count things. Counting the number of distinct elements (the cardinality) of a set is challenge when the cardinality of the set is large.

Cardinality estimation algorithms trade space for accuracy. To illustrate this point we counted the number of distinct words in all of Shakespeare’s works using three different counting techniques. Note that our input dataset has extra data in it so the cardinality is higher than the standard reference answer to this question. The three techniques we used were Java HashSet, Linear Probabilistic Counter, and a Hyper LogLog Counter. Here are the results:

Counter

Bytes Used

Count

Error

HashSet

10447016

67801

0%

Linear

3384

67080

1%

HyperLogLog

512

70002

3%

 

The table shows that we can count the words with a 3% error rate using only 512 bytes of space. Compare that to a perfect count using a HashMap that requires nearly 10 megabytes of space and you can easily see why cardinality estimators are useful. In applications where accuracy is not paramount, which is true for most web scale and network counting scenarios, using a probabilistic counter can result in tremendous space savings.

The post goes onto describe merging of counters from distributed machines and choosing an acceptable error rate for probabilistic counting.

Question: Can we make graph traversal resemble probabilistic counting?

I will have to work on a graphic but see if this word picture works for the moment.

Assume we have a 3-D graph and the top layer of nodes is composed of basketballs, the basketballs are sitting on a layer of baseballs, and the baseballs are sitting on top of marbles. Each layer represents the nodes and edges below it, except that the representation is coarser at the baseball level and coarser still at the level of basketballs.

Traversal at the “level” of basketballs may be sufficient until we reach a point of interest and then we traverse into greater detail levels of the graph.

Another illustration.

You draw and traverse from node a to node d the following graph:

Graph Traversal Illustration

Easy enough.

Now, same traversal but choose a molecule located in a to traverse to d and travel along edges between molecules.

Or, same traversal but choose an atom located in a to traverse to d and travel along edges between atoms.

In some sense the “same” path but substantially higher traversal cost at the level of greater detail.

Has someone suggested coarsening graph traversal (or having multiple levels of traversal)? Sure it has happened. Would appreciate a pointer.


The authors cite: Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm (2007) by Philippe Flajolet , Éric Fusy , Olivier Gandouet, et al.

And, stream-lib, a project with many useful implementations of the strategies in the post.

HyperANF: Graph Neighborhood Functions < 15 Minutes On a Laptop

Wednesday, March 14th, 2012

HyperANF: Approximating the Neighbourhood Function of Very Large Graphs on a Budget (2011) by Paolo Boldi, Marco Rosa, and Sebastiano Vigna.

Inducement to read the abstract or paper:

Recently, a MapReduce-based distributed implementation of ANF called HADI [KTA+10] has been presented. HADI runs on one of the fifty largest supercomputers—the Hadoop cluster M45. The only published data about HADI’s performance is the computation of the neighbourhood function of a Kronecker graph with 2 billion links, which required half an hour using 90 machines. HyperANF can compute the same function in less than fifteen minutes on a laptop. (emphasis in original)

Abstract:

The neighbourhood function N G.t / of a graph G gives, for each t 2 N, the number of pairs of nodes hx; yi such that y is reachable from x in less that t hops. The neighbourhood function provides a wealth of information about the graph [PGF02] (e.g., it easily allows one to compute its diameter), but it is very expensive to compute it exactly. Recently, the ANF algorithm [PGF02] (approximate neighbourhood function) has been proposed with the purpose of approximating N G.t / on large graphs. We describe a breakthrough improvement over ANF in terms of speed and scalability. Our algorithm, called HyperANF, uses the new HyperLogLog counters [FFGM07] and combines them efficiently through broadword programming [Knu07]; our implementation uses task decomposition to exploit multi-core parallelism. With HyperANF, for the first time we can compute in a few hours the neighbourhood function of graphs with billions of nodes with a small error and good confidence using a standard workstation.

Then, we turn to the study of the distribution of distances between reachable nodes (that can be efficiently approximated by means of HyperANF), and discover the surprising fact that its index of dispersion provides a clear-cut characterisation of proper social networks vs. web graphs. We thus propose the spid (Shortest-Paths Index of Dispersion) of a graph as a new, informative statistics that is able to discriminate between the above two types of graphs. We believe this is the first proposal of a significant new non-local structural index for complex networks whose computation is highly scalable.

New algorithm for studying the structure of large graphs. Part of the WebGraph project. The “large” version of the software handles 231 nodes.