Introduction to Apache HBase Snapshots by Matteo Bertozzi.
From the post:
The current (4.2) release of CDH — Cloudera’s 100% open-source distribution of Apache Hadoop and related projects (including Apache HBase) — introduced a new HBase feature, recently landed in trunk, that allows an admin to take a snapshot of a specified table.
Prior to CDH 4.2, the only way to back up or clone a table was to use Copy/Export Table or, after disabling the table, to copy all the HFiles in HDFS. Copy/Export Table is a set of tools that uses MapReduce to scan and copy the table, but with a direct impact on Region Server performance. Disabling the table stops all reads and writes, which will almost always be unacceptable.
In contrast, HBase snapshots allow an admin to clone a table without data copies and with minimal impact on Region Servers. Exporting the snapshot to another cluster does not directly affect any of the Region Servers; export is just a distcp with an extra bit of logic.
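As a rough illustration of the export step, here is a minimal Python sketch that only builds the command line for the snapshot export tool; it does not run it. The class name and the `-snapshot`, `-copy-to`, and `-mappers` flags follow the upstream Apache HBase reference guide and may differ in a particular CDH release. The snapshot name, destination URL, and the helper `export_snapshot_cmd` are hypothetical.

```python
import shlex

# Class name and flags as documented in the upstream Apache HBase reference
# guide; verify them against the release you actually run.
EXPORT_CLASS = "org.apache.hadoop.hbase.snapshot.ExportSnapshot"


def export_snapshot_cmd(snapshot, dest_root, mappers=16):
    """Build the argv that exports `snapshot` to another cluster's HBase root dir."""
    if mappers < 1:
        raise ValueError("mappers must be at least 1")
    return [
        "hbase", EXPORT_CLASS,
        "-snapshot", snapshot,     # name of an existing snapshot
        "-copy-to", dest_root,     # HBase root directory on the target cluster
        "-mappers", str(mappers),  # number of parallel copy tasks
    ]


# Hypothetical snapshot name and destination, for illustration only.
cmd = export_snapshot_cmd("MySnapshot", "hdfs://srv2:8082/hbase")
print(shlex.join(cmd))
```

Returning a list rather than a string means the result can be handed to `subprocess.run(cmd, check=True)` on a machine with the HBase client installed, without any shell quoting concerns.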
Here are a few of the use cases for HBase snapshots:
- Recovery from user/application errors
  - Restore/recover from a known safe state.
  - View previous snapshots and selectively merge the difference into production.
  - Save a snapshot right before a major application upgrade or change.
- Auditing and/or reporting on views of data at a specific time
  - Capture monthly data for compliance purposes.
  - Run end-of-day/month/quarter reports.
- Application testing
  - Test schema or application changes on data similar to that in production from a snapshot and then throw it away. For example: take a snapshot, create a new table from the snapshot content (schema plus data), and manipulate the new table by changing the schema, adding and removing rows, and so on. (The original table, the snapshot, and the new table remain mutually independent.)
- Offloading of work
  - Take a snapshot, export it to another cluster, and run your MapReduce jobs. Since the snapshot export operates at the HDFS level, you don’t slow down your main HBase cluster as much as CopyTable does.
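The application-testing cycle in the list above (take a snapshot, clone it into a new table, experiment, throw the clone away) might look roughly like the following. This Python sketch only generates HBase shell commands; `snapshot`, `clone_snapshot`, `disable`, `drop`, and `delete_snapshot` are standard HBase shell commands, while the table names, snapshot name, and the helper `snapshot_test_cycle` are hypothetical.

```python
def quote(name):
    """Wrap a name in single quotes, as the HBase shell expects for string arguments."""
    if "'" in name or "\\" in name:
        raise ValueError(f"unsupported character in name: {name!r}")
    return f"'{name}'"


def snapshot_test_cycle(table, snapshot, clone):
    """Return HBase shell lines that snapshot `table`, clone it, then clean up."""
    return [
        f"snapshot {quote(table)}, {quote(snapshot)}",
        f"clone_snapshot {quote(snapshot)}, {quote(clone)}",
        "# ... run schema or application experiments against the clone ...",
        f"disable {quote(clone)}",  # a table must be disabled before it is dropped
        f"drop {quote(clone)}",
        f"delete_snapshot {quote(snapshot)}",
    ]


# Hypothetical names, for illustration only.
lines = snapshot_test_cycle("orders", "orders_snap", "orders_test")
print("\n".join(lines))
```

Because the original table, the snapshot, and the clone are mutually independent, dropping the clone and deleting the snapshot at the end leaves the production table untouched.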
Under “application testing” I would include access to your HBase data by non-experts. A table cloned from a snapshot gives them something to tinker with while preserving the integrity of your production data.