Archive for the ‘Bloom Filters’ Category

Creating a Simple Bloom Filter

Friday, February 22nd, 2013

Creating a Simple Bloom Filter by Max Burstein.

From the post:

Bloom filters are super efficient data structures that allow us to tell if an object is most likely in a data set or not by checking a few bits. Bloom filters return some false positives but no false negatives. Luckily we can control the amount of false positives we receive with a trade off of time and memory.

You may have never heard of a bloom filter before but you’ve probably interacted with one at some point. For instance if you use Chrome, Chrome has a bloom filter of malicious URLs. When you visit a website it checks if that domain is in the filter. This prevents you from having to ping Google’s servers every time you visit a website to check if it’s malicious or not. Large databases such as Cassandra and Hadoop use bloom filters to see if it should do a large query or not.

I think you will appreciate the lookup performance difference. ;-)

I first saw this at: Alex Popescu: Creating a Simple Bloom Filter in Python.

…Efficient Approximate Data De-Duplication in Streams [Approximate Merging?]

Tuesday, December 18th, 2012

Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams by Suman K. Bera, Sourav Dutta, Ankur Narang, Souvik Bhattacherjee.

Abstract:

Applications involving telecommunication call data records, web pages, online transactions, medical records, stock markets, climate warning systems, etc., necessitate efficient management and processing of such massively exponential amount of data from diverse sources. De-duplication or Intelligent Compression in streaming scenarios for approximate identification and elimination of duplicates from such unbounded data stream is a greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) addresses this problem to a certain extent.
.
In this work, we present several novel algorithms for the problem of approximate detection of duplicates in data streams. We propose the Reservoir Sampling based Bloom Filter (RSBF) combining the working principle of reservoir sampling and Bloom Filters. We also present variants of the novel Biased Sampling based Bloom Filter (BSBF) based on biased sampling concepts. We also propose a randomized load balanced variant of the sampling Bloom Filter approach to efficiently tackle the duplicate detection. In this work, we thus provide a generic framework for de-duplication using Bloom Filters. Using detailed theoretical analysis we prove analytical bounds on the false positive rate, false negative rate and convergence rate of the proposed structures. We exhibit that our models clearly outperform the existing methods. We also demonstrate empirical analysis of the structures using real-world datasets (3 million records) and also with synthetic datasets (1 billion records) capturing various input distributions.

If you think of more than one representative for a subject as “duplication,” then merging is a special class of “deduplication.”

Deduplication that discards redundant information but that preserves unique additional information and relationships to other subjects.

As you move away from static and towards transient topic maps, representations of subjects in real time data streams, this and similar techniques will become very important.

I first saw this in a tweet from Stefano Bertolo.

PS: A new equivalent term (to me) for deduplication: “intelligent compression.” Pulls about 46K+ “hits” in a popular search engine today. May want to add it to your routine search queries.

Biff (Bloom Filter) Codes:…

Tuesday, August 7th, 2012

Biff (Bloom Filter) Codes: Fast Error Correction for Large Data Sets by M. Mitzenmacher and George Varghese.

Abstract:

Large data sets are increasingly common in cloud and virtualized environments. For example, transfers of multiple gigabytes are commonplace, as are replicated blocks of such sizes. There is a need for fast error-correction or data reconciliation in such settings even when the expected number of errors is small.

Motivated by such cloud reconciliation problems, we consider error-correction schemes designed for large data, after explaining why previous approaches appear unsuitable. We introduce Biff codes, which are based on Bloom filters and are designed for large data. For Biff codes with a message of length L and E errors, the encoding time is O(L), decoding time is O(L + E) and the space overhead is O(E). Biff codes are low-density parity-check codes; they are similar to Tornado codes, but are designed for errors instead of erasures. Further, Biff codes are designed to be very simple, removing any explicit graph structures and based entirely on hash tables. We derive Biff codes by a simple reduction from a set reconciliation algorithm for a recently developed data structure, invertible Bloom lookup tables. While the underlying theory is extremely simple, what makes this code especially attractive is the ease with which it can be implemented and the speed of decoding. We present results from a prototype implementation that decodes messages of 1 million words with thousands of errors in well under a second.

I followed this paper’s citation on set reconciliation to find the paper reported at: What’s the Difference? Efficient Set Reconciliation without Prior Context.

Suspect this line of work is far from finished and that you will find immediate and future applications of it for topic map applications.

“Modern” Algorithms and Data Structures (Bloom Filters, Merkle Trees)

Monday, March 5th, 2012

“Modern” Algorithms and Data Structures (Bloom Filters, Merkle Trees) by Lorenzo Alberton.

From the description:

The first part of a series of talks about modern algorithms and data structures, used by nosql databases like HBase and Cassandra. An explanation of Bloom Filters and several derivates, and Merkle Trees.

Looking forward to more of this series!