Search Data at Scale in Five Minutes with Pig, Wonderdog and ElasticSearch

Russell Jurney continues his posts on searching at scale:

Working code examples for this post (for both Pig 0.10 and ElasticSearch 0.18.6) are available here.

ElasticSearch makes search simple. ElasticSearch is built on Lucene and provides a simple but rich JSON-over-HTTP query interface to search clusters of one or one hundred machines. You can get started with ElasticSearch in five minutes, and it can scale to support heavy loads in the enterprise. ElasticSearch has a Whirr Recipe, and there is even a Platform-as-a-Service provider, Bonsai.io.
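To make the JSON-over-HTTP interface concrete, here is a minimal Python sketch that builds (but does not send) a search request. It is an illustration and not code from Jurney's post: the host `localhost:9200`, the index name `enron`, and the field name `body` are assumptions, and the helper `build_search_request` is hypothetical.

```python
import json
from urllib.request import Request


def build_search_request(host, index, field, text, size=10):
    """Build an HTTP request for an ElasticSearch query_string search.

    Nothing is sent over the network; the caller decides when to open it.
    """
    # The query body is plain JSON: a query_string query against one field.
    body = {
        "query": {"query_string": {"default_field": field, "query": text}},
        "size": size,
    }
    # Searches go to the _search endpoint of the index.
    url = "http://{}/{}/_search".format(host, index)
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed values for illustration only: host, index name and field name.
request = build_search_request("localhost:9200", "enron", "body", "california")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With a node actually running at that address, passing `request` to `urllib.request.urlopen` would return the matching documents as JSON.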

Apache Pig makes Hadoop simple. In a previous post, we prepared the Berkeley Enron Emails in Avro format. The entire dataset is available in Avro format here: https://s3.amazonaws.com/rjurney.public/enron.avro. Let's check them out:

Scale is important for some queries, but what other factors are important for searches?

Consider that Google is searching at scale. Is that a counter-example to scale being the only measure of search success? Or the best measure?

Or is scale of searching just a starting point?

Where do you go after scale? Scale is easy to evaluate and measure, so whatever your next step is, how is it evaluated or measured?

Or is that the reason for the emphasis on scale and size? It's an easy mark (in several senses)?
