Archive for the ‘Bloom Filters’ Category

Bloom Filters

Wednesday, October 15th, 2014

Bloom Filters by Jason Davies.

From the post:

Everyone is always raving about bloom filters. But what exactly are they, and what are they useful for?

Very straightforward explanation along with interactive demo. The applications section will immediately suggest how Bloom filters could be used when querying.

There are other complexities, see the Bloom Filter entry at Wikipedia. But as a first blush explanation, you will be hard pressed to find one as good as Jason’s.

I first saw this in a tweet by Allen Day.

Writing a full-text search engine using Bloom filters

Saturday, January 4th, 2014

Writing a full-text search engine using Bloom filters by Stavros Korokithakis.

A few minutes ago I came across a Hacker News post that detailed a method of adding search to your static site. As you probably know, adding search to a static site is a bit tricky, because you can’t just send the query to a server and have the server process it and return the results. If you want full-text search, you have to implement something like an inverted index.

How an inverted index works

An inverted index is a data structure that basically maps every word in every document to the ID of the document it can be found in. For example, such an index might look like {"python": [1, 3, 6], "raspberry": [3, 7, 19]}. To find the documents that mention both “python” and “raspberry”, you look those terms up in the index and find the common document IDs (in our example, that is only document with ID 3).

However, when you have very long documents with varied words, this can grow a lot. It’s a hefty data structure, and, when you want to implement a client-side search engine, every byte you transmit counts.

Client-side search engine caveats

The problem with client-side search engines is that you (obviously) have to do all the searching on the client, so you have to transmit all available information there. What static site generators do is generate every required file when generating your site, then making those available for the client to download. Usually, search-engine plugins limit themselves to tags and titles, to cut down on the amount of information that needs to be transmitted. How do we reduce the size? Easy, use a Bloom filter!

An interesting alternative to indexing a topic map with an inverted index.

I mention it in part because of one of the “weaknesses” of Bloom filters for searching:

You can’t weight pages by relevance, since you don’t know how many times a word appears in a page, all you know is whether it appears or not. You may or may not care about this.

Unlike documents, which are more or less relevant due to work occurrences, topic maps cluster information about a subject into separate topics (or proxies if you prefer).

That being the case, it isn’t the case that one topic/proxy is more “relevant” than another. The question is whether this topic/proxy represents the subject you want?

Or to put it another way, topics/proxies have already been arranged by “relevance” by a topic map author.

If a topic map interface gives you hundreds or thousands of “relevant” topics/proxies, how are you any better off than a more traditional search engine?

If you need to look like you are working, go search any of the social media sites for useful content. It’s there, the difficulty is going to be finding it.

Creating a Simple Bloom Filter

Friday, February 22nd, 2013

Creating a Simple Bloom Filter by Max Burstein.

From the post:

Bloom filters are super efficient data structures that allow us to tell if an object is most likely in a data set or not by checking a few bits. Bloom filters return some false positives but no false negatives. Luckily we can control the amount of false positives we receive with a trade off of time and memory.

You may have never heard of a bloom filter before but you’ve probably interacted with one at some point. For instance if you use Chrome, Chrome has a bloom filter of malicious URLs. When you visit a website it checks if that domain is in the filter. This prevents you from having to ping Google’s servers every time you visit a website to check if it’s malicious or not. Large databases such as Cassandra and Hadoop use bloom filters to see if it should do a large query or not.

I think you will appreciate the lookup performance difference. 😉

I first saw this at: Alex Popescu: Creating a Simple Bloom Filter in Python.

…Efficient Approximate Data De-Duplication in Streams [Approximate Merging?]

Tuesday, December 18th, 2012

Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams by Suman K. Bera, Sourav Dutta, Ankur Narang, Souvik Bhattacherjee.

Abstract:

Applications involving telecommunication call data records, web pages, online transactions, medical records, stock markets, climate warning systems, etc., necessitate efficient management and processing of such massively exponential amount of data from diverse sources. De-duplication or Intelligent Compression in streaming scenarios for approximate identification and elimination of duplicates from such unbounded data stream is a greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) addresses this problem to a certain extent.
.
In this work, we present several novel algorithms for the problem of approximate detection of duplicates in data streams. We propose the Reservoir Sampling based Bloom Filter (RSBF) combining the working principle of reservoir sampling and Bloom Filters. We also present variants of the novel Biased Sampling based Bloom Filter (BSBF) based on biased sampling concepts. We also propose a randomized load balanced variant of the sampling Bloom Filter approach to efficiently tackle the duplicate detection. In this work, we thus provide a generic framework for de-duplication using Bloom Filters. Using detailed theoretical analysis we prove analytical bounds on the false positive rate, false negative rate and convergence rate of the proposed structures. We exhibit that our models clearly outperform the existing methods. We also demonstrate empirical analysis of the structures using real-world datasets (3 million records) and also with synthetic datasets (1 billion records) capturing various input distributions.

If you think of more than one representative for a subject as “duplication,” then merging is a special class of “deduplication.”

Deduplication that discards redundant information but that preserves unique additional information and relationships to other subjects.

As you move away from static and towards transient topic maps, representations of subjects in real time data streams, this and similar techniques will become very important.

I first saw this in a tweet from Stefano Bertolo.

PS: A new equivalent term (to me) for deduplication: “intelligent compression.” Pulls about 46K+ “hits” in a popular search engine today. May want to add it to your routine search queries.

Biff (Bloom Filter) Codes:…

Tuesday, August 7th, 2012

Biff (Bloom Filter) Codes: Fast Error Correction for Large Data Sets by M. Mitzenmacher and George Varghese.

Abstract:

Large data sets are increasingly common in cloud and virtualized environments. For example, transfers of multiple gigabytes are commonplace, as are replicated blocks of such sizes. There is a need for fast error-correction or data reconciliation in such settings even when the expected number of errors is small.

Motivated by such cloud reconciliation problems, we consider error-correction schemes designed for large data, after explaining why previous approaches appear unsuitable. We introduce Biff codes, which are based on Bloom filters and are designed for large data. For Biff codes with a message of length L and E errors, the encoding time is O(L), decoding time is O(L + E) and the space overhead is O(E). Biff codes are low-density parity-check codes; they are similar to Tornado codes, but are designed for errors instead of erasures. Further, Biff codes are designed to be very simple, removing any explicit graph structures and based entirely on hash tables. We derive Biff codes by a simple reduction from a set reconciliation algorithm for a recently developed data structure, invertible Bloom lookup tables. While the underlying theory is extremely simple, what makes this code especially attractive is the ease with which it can be implemented and the speed of decoding. We present results from a prototype implementation that decodes messages of 1 million words with thousands of errors in well under a second.

I followed this paper’s citation on set reconciliation to find the paper reported at: What’s the Difference? Efficient Set Reconciliation without Prior Context.

Suspect this line of work is far from finished and that you will find immediate and future applications of it for topic map applications.

“Modern” Algorithms and Data Structures (Bloom Filters, Merkle Trees)

Monday, March 5th, 2012

“Modern” Algorithms and Data Structures (Bloom Filters, Merkle Trees) by Lorenzo Alberton.

From the description:

The first part of a series of talks about modern algorithms and data structures, used by nosql databases like HBase and Cassandra. An explanation of Bloom Filters and several derivates, and Merkle Trees.

Looking forward to more of this series!