Flexible Indexing in Hadoop

Flexible Indexing in Hadoop by Dmitriy Ryaboy.

Summarized by Russell Jurney:

There was much excitement about Dmitriy Ryaboy’s talk on Flexible Indexing in Hadoop (slides available). Twitter has created a novel indexing system atop Hadoop to avoid “looking for needles in haystacks with snowplows,” that is, running MapReduce over lots of data just to pick out a few records. Twitter Analytics’s new tool, Elephant Twin, goes beyond the folder/subfolder partitioning schemes many teams use, for instance bucketing data by /year/month/week/day/hour. Elephant Twin is a framework for creating indexes in Hadoop using Lucene. This lets you push filtering down into Lucene so that a query returns only the few matching records, dramatically reducing the data streamed and the time spent on jobs that parse only a small subset of your overall data. A huge boon for the Hadoop community from Twitter!
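The core idea, consulting a prebuilt index so a job touches only matching records instead of streaming every record past a filter, can be sketched in a few lines. This is an illustrative toy in Python, not Elephant Twin's actual API (which builds Lucene indexes via MapReduce jobs); the record fields and function names here are invented for the example:

```python
# Toy contrast between a full scan (the "snowplow") and an
# inverted-index lookup (the idea behind pushing filters into Lucene).

records = [
    {"id": 1, "user": "alice", "text": "hadoop at scale"},
    {"id": 2, "user": "bob",   "text": "lucene indexing"},
    {"id": 3, "user": "alice", "text": "mapreduce tips"},
]

def full_scan(recs, user):
    """Stream every record and filter: cost grows with the whole dataset."""
    return [r for r in recs if r["user"] == user]

def build_index(recs):
    """Build an inverted index mapping a field value to record ids."""
    index = {}
    for r in recs:
        index.setdefault(r["user"], []).append(r["id"])
    return index

def indexed_lookup(recs, index, user):
    """Fetch only the matching records: cost grows with the result size."""
    by_id = {r["id"]: r for r in recs}
    return [by_id[i] for i in index.get(user, [])]

index = build_index(records)
print(full_scan(records, "alice") == indexed_lookup(records, index, "alice"))
```

The full scan reads all three records to find two; the indexed lookup jumps straight to them. At Twitter's scale the same trade-off separates a job that streams terabytes from one that reads a handful of blocks.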

The slides plus a slide-by-slide transcript of the presentation are available.

This goes in the opposite direction of some national security efforts, which build bigger haystacks for the sake of having bigger haystacks.

There are a number of legitimately large haystacks in medicine, physics, astronomy, chemistry, and any number of other disciplines. But grabbing all phone traffic rather than admitting you are choosing among fewer than 5,000 potential subjects of interest is just bad planning.
