HBase Real-time Analytics & Rollbacks via Append-based Updates by Alex Baranau.
From the post:
In this part 1 of a 3-part post series we’ll describe how we use HBase at Sematext for real-time analytics and how we can perform data rollbacks by using an append-only updates approach.
Some bits of this topic were already covered in Deferring Processing Updates to Increase HBase Write Performance and some were briefly presented at BerlinBuzzwords 2011 (video). We will also talk about some of the ideas below during HBaseCon-2012 in late May (see Real-time Analytics with HBase). The approach described in this post is used in our production systems (SPM & SA) and the implementation was open-sourced as the HBaseHUT project.
Problem we are Solving
While HDFS & MapReduce are designed for massive batch processing and with the idea of data being immutable (write once, read many times), HBase includes support for additional operations such as real-time and random read/write/delete access to data records. HBase performs its basic job very well, but there are times when developers have to think at a higher level about how to utilize HBase capabilities for specific use-cases. HBase is a great tool with good core functionality and implementation, but it does require one to do some thinking to ensure this core functionality is used properly and optimally. The use-case we’ll be working with in this post is a typical data analytics system where:
- new data are continuously streaming in
- data are processed and stored in HBase, usually as time-series data
- processed data are served to users who can navigate through most recent data as well as dig deep into historical data
Although the above points frame the use-case relatively narrowly, the approach and its implementation that we’ll describe here are really more general and applicable to a number of other systems, too. The basic issues we want to solve are the following:
- increase record update throughput. Ideally, despite the high volume of incoming data, changes can be applied in real time. Usually, due to the limitations of the “normal HBase update”, which requires Get+Put operations, updates are applied using a batch-processing approach (e.g. as MapReduce jobs). This, of course, is anything but real-time: incoming data is not seen immediately, only after it has been processed.
- ability to roll back changes in the served data. Human errors or any other issues should not permanently corrupt the data that the system serves.
- ability to fetch data interactively (i.e. fast enough for impatient humans). Retrieval should be fast both when one navigates through a small amount of recent data and when the selected time interval spans years.
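To make the first two goals concrete, here is a minimal sketch of the append-only idea using a toy in-memory store. It is not the HBase client API and not the HBaseHUT implementation; the class `AppendOnlyStore`, its methods, the sequence numbers, and the choice of summing deltas at read time are all assumptions made for illustration.

```python
from collections import defaultdict


class AppendOnlyStore:
    """Toy in-memory stand-in for a table (hypothetical, not HBase)."""

    def __init__(self):
        # key -> list of (sequence_number, delta), kept in write order
        self._log = defaultdict(list)
        self._seq = 0

    def append(self, key, delta):
        """Record a new delta without reading the current value first."""
        self._seq += 1
        self._log[key].append((self._seq, delta))
        return self._seq

    def read(self, key):
        """Merge all deltas for a key at read time (assumed merge: a sum)."""
        return sum(delta for _, delta in self._log[key])

    def rollback(self, after_seq):
        """Discard every delta written after the given sequence number."""
        for key in self._log:
            self._log[key] = [(s, d) for s, d in self._log[key] if s <= after_seq]


store = AppendOnlyStore()
store.append("page:/home", 3)
checkpoint = store.append("page:/home", 2)
store.append("page:/home", 1000)  # a bad write, e.g. from a faulty job

before = store.read("page:/home")  # includes the bad write
store.rollback(checkpoint)
after = store.read("page:/home")   # bad write is gone
```

Because a write never reads first, its cost does not depend on the existing record, and because earlier deltas are never overwritten, undoing a bad write only means dropping the entries that came after a known-good point.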
Here is what we consider an “update”:
- addition of a new record if no record with the same key exists
- update of an existing record with a particular key
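The two cases above amount to an upsert. A rough sketch of what that looks like with a read-then-write (Get+Put style) update follows, using a plain dictionary in place of an HBase table. The function name `upsert` and the rule of combining values by addition are assumptions for illustration only.

```python
def upsert(table, key, delta):
    """Read-then-write update against a dict standing in for a table."""
    current = table.get(key)  # the "Get": one read for every update
    if current is None:
        table[key] = delta  # no record with this key yet: add one
    else:
        table[key] = current + delta  # record exists: update it (assumed: add)
    return table[key]


table = {}
upsert(table, "2012-05-01", 4)
upsert(table, "2012-05-01", 6)
upsert(table, "2012-05-02", 1)
```

Every call pays for a read before it can write, which is the cost that the “normal HBase update” imposes on a high-volume stream.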
See anything familiar? Does that resemble your use cases?
The proffered solution may not fit your use case(s), but this is an example of exploring a solution rather than fitting a problem to a solution. Those are not the same thing.
HBase Real-time Analytics & Rollbacks via Append-based Updates Part 2 is available. The solution uses HBaseHUT, and Part 2 has some really informative graphics as well.
Very interested in seeing Part 3!