HBase Replication Overview by Himanshu Vashishtha.
From the post:
HBase Replication is a way of copying data from one HBase cluster to a different and possibly distant HBase cluster. It works on the principle that the transactions from the originating cluster are pushed to another cluster. In HBase jargon, the cluster doing the push is called the master, and the one receiving the transactions is called the slave. This push of transactions is done asynchronously, and these transactions are batched in a configurable size (default is 64MB). Asynchronous mode incurs minimal overhead on the master, and shipping edits in a batch increases the overall throughput.
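To make the batching idea concrete, here is a minimal sketch of size-bounded batching. It is not HBase's implementation: the `EditBatcher` class, its `ship` callback and the byte-string "edits" are all hypothetical names made up for illustration. Only the 64MB default comes from the post. The asynchronous part (shipping off the write path) is not modeled; the sketch only shows how edits accumulate until a size threshold is crossed and then leave as one batch.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# 64 MB default batch size cited in the post.
BATCH_CAPACITY_BYTES = 64 * 1024 * 1024


@dataclass
class EditBatcher:
    """Hypothetical helper: buffers edits and ships them once the batch is full."""

    ship: Callable[[List[bytes]], None]          # called with each full batch
    capacity_bytes: int = BATCH_CAPACITY_BYTES   # configurable threshold
    _pending: List[bytes] = field(default_factory=list)
    _pending_bytes: int = 0

    def add(self, edit: bytes) -> None:
        """Buffer one edit; ship the batch when the size threshold is reached."""
        self._pending.append(edit)
        self._pending_bytes += len(edit)
        if self._pending_bytes >= self.capacity_bytes:
            self.flush()

    def flush(self) -> None:
        """Ship whatever is buffered (no-op when the buffer is empty)."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._pending_bytes = 0
        self.ship(batch)


# Usage with a tiny 10-byte capacity so the batching is visible.
shipped: List[List[bytes]] = []
batcher = EditBatcher(ship=shipped.append, capacity_bytes=10)
for edit in [b"aaaa", b"bbbb", b"cccc", b"dd"]:
    batcher.add(edit)
batcher.flush()  # ship the final partial batch
```

With the toy capacity, the first three edits leave together once the threshold is crossed and the last one leaves in the final flush, which is the throughput trade the post describes: fewer, larger shipments instead of one per edit.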
This blog post discusses the possible use cases, underlying architecture and modes of HBase replication as supported in CDH4 (which is based on 0.92). We will discuss Replication configuration, bootstrapping, and fault tolerance in a follow-up blog post.
Use cases
HBase replication supports replicating data across datacenters. This can be used for disaster recovery scenarios, where we can have the slave cluster serve real time traffic in case the master site is down. Since HBase replication is not intended for automatic failover, the act of switching from the master to the slave cluster in order to start serving traffic is done by the user. Afterwards, once the master cluster is up again, one can do a CopyTable job to copy the deltas to the master cluster (by providing the start/stop timestamps) as described in the CopyTable blogpost.
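The start/stop timestamps for that CopyTable job are epoch milliseconds, which are easy to get wrong by hand. Here is a small sketch that builds the command line for the outage window. The helper names, the table name `usertable` and the ZooKeeper address are hypothetical. The `--starttime`, `--endtime` and `--peer.adr` options are CopyTable's own, but check them against the usage output of your HBase version before relying on this.

```python
from datetime import datetime, timedelta, timezone
from typing import List

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to integer epoch milliseconds."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


def copytable_args(table: str, start: datetime, end: datetime,
                   peer_adr: str) -> List[str]:
    """Build a CopyTable command line for edits in the given time window.

    peer_adr is the destination cluster, written as
    zookeeper-quorum:client-port:znode-parent.
    """
    if start >= end:
        raise ValueError("start must be earlier than end")
    return [
        "hbase", "org.apache.hadoop.hbase.mapreduce.CopyTable",
        f"--starttime={to_epoch_millis(start)}",
        f"--endtime={to_epoch_millis(end)}",
        f"--peer.adr={peer_adr}",
        table,
    ]


# Hypothetical outage window: all of 1 January 2013 (UTC).
args = copytable_args(
    table="usertable",                          # hypothetical table name
    start=datetime(2013, 1, 1, tzinfo=timezone.utc),
    end=datetime(2013, 1, 2, tzinfo=timezone.utc),
    peer_adr="zk1.example.com:2181:/hbase",     # hypothetical master cluster
)
print(" ".join(args))
```

Since the deltas live on the slave after a failover, the assumption here is that the job runs on the slave cluster with `--peer.adr` pointing back at the recovered master.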
Another replication use case is when a user wants to run load intensive MapReduce jobs on their HBase cluster; one can do so on the slave cluster while bearing a slight performance decrease on the master cluster.
So there is a non-romantic, sysadmin side to “big data.” As I understand it, no one ever speaks to a sysadmin unless something has gone wrong with the system. Sysadmins either get no contacts (a good thing) or pages, tweets, emails, phone calls and physical visits from irate users, managers, etc.
This post is a start towards always having the first case: no contacts. That leaves you more time for things that interest sysadmins. I won’t tell if you don’t.