How-to: Use HBase Bulk Loading, and Why, by Jean-Daniel (JD) Cryans.
From the post:
Apache HBase is all about giving you random, real-time, read/write access to your Big Data, but how do you efficiently get that data into HBase in the first place? Intuitively, a new user will try to do that via the client APIs or by using a MapReduce job with TableOutputFormat, but those approaches are problematic, as you will learn below. Instead, the HBase bulk loading feature is much easier to use and can insert the same amount of data more quickly.
This blog post will introduce the basic concepts of the bulk loading feature, present two use cases, and propose two examples.
Overview of Bulk Loading
If you have any of these symptoms, bulk loading is probably the right choice for you:
- You needed to tweak your MemStores to use most of the memory.
- You needed to either use bigger WALs or bypass them entirely.
- Your compaction and flush queues are in the hundreds.
- Your GC is out of control because your inserts range in the MBs.
- Your latency goes out of your SLA when you import data.
Most of those symptoms are commonly referred to as “growing pains.” Using bulk loading can help you avoid them.
Great post!
I would be very leery of database or database-like software that doesn’t offer bulk loading.
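For anyone who hasn't seen the workflow end to end, here is a minimal sketch of what a bulk load driver can look like: a MapReduce job writes HFiles with HFileOutputFormat.configureIncrementalLoad, and LoadIncrementalHFiles then moves them into the table's regions. The TsvToPutMapper class, the "cf" column family, and the argument layout are illustrative assumptions, and the class names match the HBase 0.94/0.96-era client API (later releases renamed several of these classes), so treat this as a sketch rather than copy-paste code.

```java
// Sketch of a bulk load driver (HBase 0.94/0.96-era API; names differ in newer releases).
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {

  // Hypothetical mapper: turns "rowkey<TAB>value" lines into Puts for family "cf".
  public static class TsvToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(fields[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(fields[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    // args: <input dir> <hfile output dir> <table name>
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "hbase-bulk-load");
    job.setJarByClass(BulkLoadDriver.class);

    job.setMapperClass(TsvToPutMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Wires in the HFile output format, total-order partitioner, and sort reducer
    // so the generated HFiles line up with the table's current region boundaries.
    HTable table = new HTable(conf, args[2]);
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (!job.waitForCompletion(true)) {
      System.exit(1);
    }

    // Hands the finished HFiles to the region servers; the data is moved into
    // place instead of going through the normal write path (MemStore + WAL).
    new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
  }
}
```

If your input is already tab-separated, you generally don't need to write this driver at all: HBase ships an ImportTsv job that can emit HFiles (via the importtsv.bulk.output option) and a completebulkload tool that loads them, which is the path the post walks through.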