Email Indexing Using Cloudera Search by Jeff Shmain
From the post:
Why would any company be interested in searching through its vast trove of email? A better question is: Why wouldn’t everybody be interested?
Email has become the most widespread method of communication we have, so there is much value to be extracted by making all emails searchable and readily available for further analysis. Some common use cases that involve email analysis are fraud detection, customer sentiment and churn, lawsuit prevention, and that’s just the tip of the iceberg. Each and every company can extract tremendous value based on its own business needs.
A little over a year ago we described how to archive and index emails using HDFS and Apache Solr. However, at that time, searching and analyzing emails were still relatively cumbersome and technically challenging tasks. We have come a long way in document indexing automation since then — especially with the recent introduction of Cloudera Search, it is now easier than ever to extract value from the corpus of available information.
In this post, you’ll learn how to set up Apache Flume for near-real-time indexing and MapReduce for batch indexing of email documents. Note that although this post focuses on email data, there is no reason why the same concepts could not be applied to instant messages, voice transcripts, or any other data (both structured and unstructured).
If you want a beyond “Hello World” introduction to: Flume, Solr, Cloudera Morphlines, HDFS, Hue’s Search application, and Cloudera Search, this is the post for you.
With the added advantage that you can apply the basic principles in this post as you expand your knowledge of the Hadoop ecosystem.