Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 5, 2013

Email Indexing Using Cloudera Search and HBase

Filed under: Cloudera,HBase,Solr — Patrick Durusau @ 6:38 pm

Email Indexing Using Cloudera Search and HBase by Jeff Shmain.

From the post:

In my previous post you learned how to index email messages in batch mode, and in near real time, using Apache Flume with MorphlineSolrSink. In this post, you will learn how to index emails using Cloudera Search with Apache HBase and Lily HBase Indexer, maintained by NGDATA and Cloudera. (If you have not read the previous post, I recommend you do so for background before reading on.)

Which near-real-time method to choose, HBase Indexer or Flume MorphlineSolrSink, will depend entirely on your use case, but below are some things to consider when making that decision:

  • Is HBase an optimal storage medium for the given use case?
  • Is the data already ingested into HBase?
  • Is there any access pattern that will require the files to be stored in a format other than HFiles?
  • If HBase is not currently running, will there be enough hardware resources to bring it up?

There are two ways to configure Cloudera Search to index documents stored in HBase: to alter the configuration files directly and start Lily HBase Indexer manually or as a service, or to configure everything using Cloudera Manager. This post will focus on the latter, because it is by far the easiest way to enable Search on HBase — or any other service on CDH, for that matter.

This rocks!

Including the reminder to fit the solution to your requirements, not the other way around.

The phrase “…near real time…” reminds me that HBase can operate in “…near real time…” but no analyst using HBase can.

Think about it. A search result comes back, the analyst reads it, perhaps compares it to their memory of other results and/or looks for other results to make the comparison. Then the analyst has to decide what, if anything, the results mean in a particular context, and then communicate those results to others or take action based on them.

That doesn’t sound even close to “…near real time…” to me.

You?
