IDH Hbase & Lucene Integration by Ritu Kama.
From the post:
HBase is a non-relational, column-oriented database that runs on top of the Hadoop Distributed File System (HDFS). HBase tables contain rows and columns. Each table has an element defined as a primary key (the row key), which is used for all Get/Put/Scan/Delete operations on that table. This can be a shortcoming, because one may want to search by value within, say, a given column.
The IDH Integration with Lucene
The Intel® Distribution for Apache Hadoop* (IDH) solves this problem by incorporating native features that permit straightforward integration with Lucene. Lucene is a search library that acts upon documents containing data fields and their values. The IDH-to-Lucene integration builds on the HBase Observer and Endpoint concepts, which is what gives it the flexibility to search HBase data with Lucene queries.
The Observers can be likened to triggers in an RDBMS, while the Endpoints share some conceptual similarity with stored procedures. The mapping between HBase records and Lucene documents is done by a convenience class called IndexMetadata. The HBase Observer monitors data updates to the HBase table and builds indexes synchronously. The indexes are stored in multiple shards, with each shard tied to a region. The HBase Endpoint dispatches search requests from the client to those regions.
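The division of labor described above can be sketched in a few lines of plain Python. This is only a conceptual model, not the IDH or HBase API: the `Region` and `Table` classes, the `_post_put` hook, and the whitespace tokenizer are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class Region:
    """One slice of the table's key range, with its own index shard."""

    def __init__(self, start_key, end_key):
        self.start_key = start_key      # inclusive
        self.end_key = end_key          # exclusive; None means unbounded
        self.rows = {}                  # row key -> {column: value}
        self.shard = defaultdict(set)   # term -> row keys (this region's index shard)

    def owns(self, row_key):
        return row_key >= self.start_key and (
            self.end_key is None or row_key < self.end_key)

    def put(self, row_key, columns):
        self.rows[row_key] = columns
        self._post_put(row_key, columns)   # observer-style hook, runs synchronously

    def _post_put(self, row_key, columns):
        # Trigger-like step: index the new values before put() returns.
        # (Simplification: stale terms from an overwritten row are not removed.)
        for value in columns.values():
            for term in value.lower().split():
                self.shard[term].add(row_key)

    def search(self, term):
        # Endpoint-style step: runs next to the data, on this region only.
        return self.shard.get(term.lower(), set())


class Table:
    def __init__(self, regions):
        self.regions = regions

    def put(self, row_key, columns):
        region = next(r for r in self.regions if r.owns(row_key))
        region.put(row_key, columns)

    def search(self, term):
        # Client side: dispatch the request to every region and merge the results.
        hits = set()
        for region in self.regions:
            hits |= region.search(term)
        return sorted(hits)


table = Table([Region("a", "m"), Region("m", None)])
table.put("alice", {"bio": "Hadoop committer"})
table.put("zed", {"bio": "Lucene and Hadoop user"})
print(table.search("hadoop"))
```

Because each shard only covers its own region's rows, a write touches exactly one shard, while a search has to visit all of them and merge the partial results.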
When entering data into an HBase table, you’ll need to create an HBase-Lucene mapping using the IndexMetadata class. During insertion, the text in the mapped columns is broken into terms and stored in the Lucene index file. The IDH implementation creates the Lucene index automatically. Once the index exists, you can search on any keyword: the implementation looks the word up in the Lucene index and retrieves the row IDs of the rows that contain it. Using those keys, you can then directly access the relevant rows in the database.
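As a rough illustration of that two-step flow (keyword to row keys, then row keys to rows), here is a minimal in-memory sketch. It does not use the real IndexMetadata class; `INDEXED_COLUMNS`, `put`, `search`, and the regex tokenizer are assumptions made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the HBase-to-Lucene mapping: which columns get indexed.
INDEXED_COLUMNS = {"title", "body"}

rows = {}                    # row key -> {column: value}, standing in for the table
index = defaultdict(set)     # (column, term) -> row keys, standing in for the Lucene index


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def put(row_key, columns):
    rows[row_key] = columns
    for column, value in columns.items():
        if column in INDEXED_COLUMNS:      # unmapped columns are stored but not indexed
            for term in tokenize(value):
                index[(column, term)].add(row_key)


def search(column, keyword):
    """Return the row keys whose mapped column contains the keyword."""
    return sorted(index.get((column, keyword.lower()), set()))


put("doc1", {"title": "HBase basics", "body": "Rows and columns", "owner": "hbase"})
put("doc2", {"title": "Lucene search", "body": "Index HBase rows"})

keys = search("body", "rows")          # step 1: keyword -> row keys
matches = [rows[k] for k in keys]      # step 2: row keys -> direct lookups by key
print(keys)
```

Note that the `owner` column is stored with the row but never reaches the index, so a search on it returns nothing; only mapped columns are searchable by value.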
IDH’s HBase-Lucene integration extends HBase’s capability and provides many advantages:
- Search not only by row key but also by values.
- Use multiple query types such as Starts, Ends, Contains, Range, etc.
- Get ranking scores for the search results.
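To make the query types in the list concrete, the sketch below runs Starts, Ends, Contains, and Range matches over a tiny term dictionary and ranks rows by a score. It is a toy model under stated assumptions: the `postings` structure and `query` helper are hypothetical, and the score is a plain sum of term frequencies, which is far simpler than Lucene's actual scoring.

```python
from collections import Counter

docs = {
    "row1": "hadoop cluster tuning",
    "row2": "hadoop hadoop hdfs",
    "row3": "lucene search",
}

# term -> {row key: term frequency}
postings = {}
for row_key, text in docs.items():
    for term, freq in Counter(text.split()).items():
        postings.setdefault(term, {})[row_key] = freq


def query(matches):
    """Score rows by summed frequency of every term accepted by `matches`."""
    scores = Counter()
    for term, hits in postings.items():
        if matches(term):
            scores.update(hits)            # adds each row's frequency for this term
    # Highest score first; ties broken by row key for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


starts = query(lambda t: t.startswith("h"))     # Starts
ends = query(lambda t: t.endswith("ing"))       # Ends
contains = query(lambda t: "uce" in t)          # Contains
in_range = query(lambda t: "c" <= t <= "i")     # Range over terms, lexicographic

print(starts)
```

All four query types reduce to a predicate over indexed terms, which is why one inverted index can serve them; none of them requires scanning the rows themselves.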
See Ritu’s post for sample code and configuration procedures.
Definitely one for the short list of downloads to make.