Archive for the ‘HBase’ Category

Start of a new era: Apache HBase™ 1.0

Wednesday, February 25th, 2015

Start of a new era: Apache HBase™ 1.0

From the post:

The Apache HBase community has released Apache HBase 1.0.0. Seven years in the making, it marks a major milestone in the Apache HBase project’s development, offers some exciting features and new API’s without sacrificing stability, and is both on-wire and on-disk compatible with HBase 0.98.x.

In this blog, we look at the past, present and future of Apache HBase project. 

The 1.0.0 release has three goals:

1) to lay a stable foundation for future 1.x releases;

2) to stabilize running HBase cluster and its clients; and

3) make versioning and compatibility dimensions explicit 

Seven (7) years is a long time so kudos to everyone who contributed to getting Apache HBase to this point!

For those of you who like documentation, see the Apache HBase™ Reference Guide.

New in Cloudera Labs: SparkOnHBase

Friday, December 19th, 2014

New in Cloudera Labs: SparkOnHBase by Ted Malaska.

From the post:

Apache Spark is making a huge impact across our industry, changing the way we think about batch processing and stream processing. However, as we progressively migrate from MapReduce toward Spark, we shouldn’t have to “give up” anything. One of those capabilities we need to retain is the ability to interact with Apache HBase.

In this post, we will share the work being done in Cloudera Labs to make integrating Spark and HBase super-easy in the form of the SparkOnHBase project. (As with everything else in Cloudera Labs, SparkOnHBase is not supported and there is no timetable for possible support in the future; it’s for experimentation only.) You’ll learn common patterns of HBase integration with Spark and see Scala and Java examples for each. (It may be helpful to have the SparkOnHBase repository open as you read along.)

Is it too late to amend my wish list to include an eighty-hour week with Spark? 😉

This is an excellent opportunity to follow along with lab-quality research on an important technology.

The Cloudera Labs discussion group strikes me as dreadfully underused.


Analyzing 1.2 Million Network Packets…

Sunday, June 15th, 2014

Analyzing 1.2 Million Network Packets per Second in Real Time by James Sirota and Sheetal Dolas.

Slides giving an overview of OpenSOC (Open Security Operations Center).

I mention this in case you are not the NSA and simply streaming the backbone of the Internet to storage for later analysis. Some business cases require real time results.

The project is also a good demonstration of building a high throughput system using only open source software.

Not to mention a useful collaboration between Cisco and Hortonworks.

BTW, take a look at slide 18. I would say they are adding information to the representative of a subject, wouldn’t you? While on the surface this looks easy, merging that data with other data, say held by local law enforcement, might not be so easy.

For example, depending on where you are intercepting traffic, you will be told I am about thirty (30) miles from my present physical location or some other answer. 😉 Now, if someone had annotated an earlier packet with that information and it was accessible to you, well, your targeting of my location could be a good deal more precise.

And there is the question of using data annotated by different sources who may have been attacked by the same person or group.

Even at 1.2 million packets per second there is still a role for subject identity and merging.

Phoenix: Incubating at Apache!

Sunday, January 12th, 2014

Phoenix: Incubating at Apache!

From the webpage:

Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.

Tired of reading already and just want to get started? Take a look at our FAQs, listen to the Phoenix talks from Hadoop Summit 2013 and HBaseCon 2013, and jump over to our quick start guide here.

To see what’s supported, go to our language reference. It includes all typical SQL query statement clauses, including SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, etc. It also supports a full set of DML commands as well as table creation and versioned incremental alterations through our DDL commands. We try to follow the SQL standards wherever possible.

Incubating at Apache is no guarantee of success but it does mean sane licensing and a merit based organization/process.

If you are interested in non-NSA corrupted software, consider supporting the Apache Software Foundation.

Using Hive to interact with HBase, Part 2

Tuesday, December 3rd, 2013

Using Hive to interact with HBase, Part 2 by Nick Dimiduk.

From the post:

This is the second of two posts examining the use of Hive for interaction with HBase tables. This is a hands-on exploration so the first post isn’t required reading for consuming this one. Still, it might be good context.

“Nick!” you exclaim, “that first post had too many words and I don’t care about JIRA tickets. Show me how I use this thing!”

This post is exactly that: a concrete, end-to-end example of consuming HBase over Hive. The whole mess was tested to work on a tiny little 5-node cluster running HDP-1.3.2, which means Hive 0.11.0 and HBase

If you learn from concrete examples and then feel your way further out, you will love this post!

Approaches to Backup and Disaster Recovery in HBase

Saturday, November 23rd, 2013

Approaches to Backup and Disaster Recovery in HBase by Clint Heath.

From the post:

With increased adoption and integration of HBase into critical business systems, many enterprises need to protect this important business asset by building out robust backup and disaster recovery (BDR) strategies for their HBase clusters. As daunting as it may sound to quickly and easily backup and restore potentially petabytes of data, HBase and the Apache Hadoop ecosystem provide many built-in mechanisms to accomplish just that.

In this post, you will get a high-level overview of the available mechanisms for backing up data stored in HBase, and how to restore that data in the event of various data recovery/failover scenarios. After reading this post, you should be able to make an educated decision on which BDR strategy is best for your business needs. You should also understand the pros, cons, and performance implications of each mechanism. (The details herein apply to CDH 4.3.0/HBase 0.94.6 and later.)

Note: At the time of this writing, Cloudera Enterprise 4 offers production-ready backup and disaster recovery functionality for HDFS and the Hive Metastore via Cloudera BDR 1.0 as an individually licensed feature. HBase is not included in that GA release; therefore, the various mechanisms described in this blog are required. (Cloudera Enterprise 5, currently in beta, offers HBase snapshot management via Cloudera BDR.)

The critical line in this post reads:

As daunting as it may sound to quickly and easily backup and restore potentially petabytes of data, HBase and the Apache Hadoop ecosystem provide many built-in mechanisms to accomplish just that.

Note the emphasis on provide.

Great backup mechanisms don’t help much unless someone is making, testing and logging the backups.

Ask in writing about backups before performing any changes to a client’s system or data. Make the answer part of your documentation.

Using Hive to interact with HBase, Part 1

Monday, November 11th, 2013

Using Hive to interact with HBase, Part 1 by Nick Dimiduk.

From the post:

This is the first of two posts examining the use of Hive for interaction with HBase tables. Check back later in the week for the concluding article.

One of the things I’m frequently asked about is how to use HBase from Apache Hive. Not just how to do it, but what works, how well it works, and how to make good use of it. I’ve done a bit of research in this area, so hopefully this will be useful to someone besides myself. This is a topic that we did not get to cover in HBase in Action, perhaps these notes will become the basis for the 2nd edition 😉 These notes are applicable to Hive 0.11.x used in conjunction with HBase 0.94.x. They should be largely applicable to 0.12.x + 0.96.x, though I haven’t tested everything yet.

The Hive project includes an optional library for interacting with HBase. This is where the bridge layer between the two systems is implemented. The primary interface you use when accessing HBase from Hive queries is called the HBaseStorageHandler. You can also interact with HBase tables directly via Input and Output formats, but the handler is simpler and works for most uses.

If you want to be on the edge of Hive/HBase interaction, start here.

Be forewarned that you are in a land of folklore, JIRA issues and the like, but you will be ahead of the less brave.
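
To give a feel for what the storage handler looks like in use, here is a minimal sketch in Python that assembles the kind of Hive DDL that maps a Hive table onto an existing HBase table. The class name org.apache.hadoop.hive.hbase.HBaseStorageHandler and the hbase.columns.mapping and hbase.table.name properties come from the Hive HBase integration; the helper hbase_table_ddl and the table and column names are hypothetical, chosen only for illustration.

```python
def hbase_table_ddl(hive_table, hbase_table, columns):
    """Build a Hive CREATE TABLE statement backed by an existing HBase table.

    `columns` is a list of (hive_column, hive_type, hbase_mapping) tuples,
    where hbase_mapping is ":key" for the row key or "family:qualifier".
    """
    col_defs = ", ".join(f"{name} {typ}" for name, typ, _ in columns)
    # The mapping is positional: entry i maps Hive column i.
    mapping = ",".join(m for _, _, m in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({col_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )


# Hypothetical table and column names.
ddl = hbase_table_ddl(
    "pagecounts",
    "pagecounts_hbase",
    [("rowkey", "string", ":key"), ("views", "bigint", "f:views")],
)
print(ddl)
```

The resulting statement would be run from the Hive shell. Nick's post covers what actually works and how well, which this sketch does not attempt.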

Email Indexing Using Cloudera Search and HBase

Tuesday, November 5th, 2013

Email Indexing Using Cloudera Search and HBase by Jeff Shmain.

From the post:

In my previous post you learned how to index email messages in batch mode, and in near real time, using Apache Flume with MorphlineSolrSink. In this post, you will learn how to index emails using Cloudera Search with Apache HBase and Lily HBase Indexer, maintained by NGDATA and Cloudera. (If you have not read the previous post, I recommend you do so for background before reading on.)

Which near-real-time method to choose, HBase Indexer or Flume MorphlineSolrSink, will depend entirely on your use case, but below are some things to consider when making that decision:

  • Is HBase an optimal storage medium for the given use case?
  • Is the data already ingested into HBase?
  • Is there any access pattern that will require the files to be stored in a format other than HFiles?
  • If HBase is not currently running, will there be enough hardware resources to bring it up?

There are two ways to configure Cloudera Search to index documents stored in HBase: to alter the configuration files directly and start Lily HBase Indexer manually or as a service, or to configure everything using Cloudera Manager. This post will focus on the latter, because it is by far the easiest way to enable Search on HBase — or any other service on CDH, for that matter.

This rocks!

Including the reminder to fit the solution to your requirements, not the other way around.

The phrase “…near real time…” reminds me that HBase can operate in “…near real time…” but no analyst using HBase can.

Think about it. A search result comes back, the analyst reads it, perhaps compares it to their memory of other results and/or looks for other results to make the comparison. Then the analyst has to decide what if anything the results mean in a particular context and then communicate those results to others or take action based on those results.

That doesn’t sound even close to “…near real time…” to me.


Hadoop Weekly – October 28, 2013

Tuesday, October 29th, 2013

Hadoop Weekly – October 28, 2013 by Joe Crobak.

A weekly blog post that tracks all things in the Hadoop ecosystem.

I will keep posting on Hadoop things of particular interest for topic maps but will also be pointing to this blog for those who want/need more Hadoop coverage.

Applying the Big Data Lambda Architecture

Sunday, October 27th, 2013

Applying the Big Data Lambda Architecture by Michael Hausenblas.

From the article:

Based on his experience working on distributed data processing systems at Twitter, Nathan Marz recently designed a generic architecture addressing common requirements, which he called the Lambda Architecture. Marz is well-known in Big Data: He’s the driving force behind Storm and at Twitter he led the streaming compute team, which provides and develops shared infrastructure to support critical real-time applications.

Marz and his team described the underlying motivation for building systems with the lambda architecture as:

  • The need for a robust system that is fault-tolerant, both against hardware failures and human mistakes.
  • To serve a wide range of workloads and use cases, in which low-latency reads and updates are required. Related to this point, the system should support ad-hoc queries.
  • The system should be linearly scalable, and it should scale out rather than up, meaning that throwing more machines at the problem will do the job.
  • The system should be extensible so that features can be added easily, and it should be easily debuggable and require minimal maintenance.

From a bird’s eye view the lambda architecture has three major components that interact with new data coming in and responds to queries, which in this article are driven from the command line:

The goal of the article:

In this article, I employ the lambda architecture to implement what I call UberSocialNet (USN). This open-source project enables users to store and query acquaintanceship data. That is, I want to be able to capture whether I happen to know someone from multiple social networks, such as Twitter or LinkedIn, or from real-life circumstances. The aim is to scale out to several billions of users while providing low-latency access to the stored information. To keep the system simple and comprehensible, I limit myself to bulk import of the data (no capabilities to live-stream data from social networks) and provide only a very simple command-line user interface. The guts, however, use the lambda architecture.

Something a bit challenging for the start of the week. 😉
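
As a warm-up, here is a toy sketch of the idea in Python. It assumes the three components usually named for the lambda architecture (a batch layer, a serving layer and a speed layer) and uses acquaintance counts as the query. All class and function names are hypothetical, and the speed layer is included only to show the general pattern; the article itself limits USN to bulk import.

```python
from collections import Counter


class BatchLayer:
    """Keeps the append-only master dataset and recomputes views from scratch."""

    def __init__(self):
        self.master = []  # every (person, acquaintance) record ever imported

    def append(self, record):
        self.master.append(record)

    def compute_view(self):
        # Full recomputation over all data: acquaintances known per person.
        return Counter(person for person, _ in self.master)


class SpeedLayer:
    """Covers only the records that arrived since the last batch run."""

    def __init__(self):
        self.view = Counter()

    def update(self, record):
        person, _ = record
        self.view[person] += 1

    def reset(self):
        self.view.clear()


def query(person, batch_view, speed_view):
    # Serving side: merge the precomputed batch view with the recent delta.
    return batch_view[person] + speed_view[person]


batch, speed = BatchLayer(), SpeedLayer()
for rec in [("alice", "bob"), ("alice", "carol")]:
    batch.append(rec)
batch_view = batch.compute_view()

# A new record lands in the master dataset and in the speed layer.
new_rec = ("alice", "dave")
batch.append(new_rec)
speed.update(new_rec)
before = query("alice", batch_view, speed.view)

# Once the batch layer catches up, the speed layer's delta is discarded.
batch_view = batch.compute_view()
speed.reset()
after = query("alice", batch_view, speed.view)
print(before, after)
```

Both queries return 3. The answer stays the same whether the newest record is served from the speed layer or from a freshly recomputed batch view, which is the property the architecture is after.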

How-to: Use HBase Bulk Loading, and Why

Monday, September 30th, 2013

How-to: Use HBase Bulk Loading, and Why by Jean-Daniel (JD) Cryans.

From the post:

Apache HBase is all about giving you random, real-time, read/write access to your Big Data, but how do you efficiently get that data into HBase in the first place? Intuitively, a new user will try to do that via the client APIs or by using a MapReduce job with TableOutputFormat, but those approaches are problematic, as you will learn below. Instead, the HBase bulk loading feature is much easier to use and can insert the same amount of data more quickly.

This blog post will introduce the basic concepts of the bulk loading feature, present two use cases, and propose two examples.

Overview of Bulk Loading

If you have any of these symptoms, bulk loading is probably the right choice for you:

  • You needed to tweak your MemStores to use most of the memory.
  • You needed to either use bigger WALs or bypass them entirely.
  • Your compaction and flush queues are in the hundreds.
  • Your GC is out of control because your inserts range in the MBs.
  • Your latency goes out of your SLA when you import data.

Most of those symptoms are commonly referred to as “growing pains.” Using bulk loading can help you avoid them.

Great post!

I would be very leery of database or database-like software that doesn’t offer bulk loading.
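
The post explains the mechanics. As a rough mental model only, bulk loading prepares data files that are already sorted and split along region boundaries, then hands them to the cluster instead of pushing each row through the normal write path. The Python sketch below shows just that grouping and sorting step. The function name and the sample keys are hypothetical, and no HBase API is involved.

```python
from bisect import bisect_right


def partition_for_bulk_load(rows, split_points):
    """Group (row_key, value) pairs by region and sort each group by key.

    split_points are the sorted start keys of every region after the first,
    so N split points describe N + 1 regions. A region's start key is
    inclusive and its end key is exclusive.
    """
    regions = [[] for _ in range(len(split_points) + 1)]
    for key, value in rows:
        regions[bisect_right(split_points, key)].append((key, value))
    return [sorted(region) for region in regions]


# Hypothetical data: two regions, split at row key b"m".
rows = [(b"pear", b"1"), (b"apple", b"2"), (b"m", b"3"), (b"fig", b"4")]
files = partition_for_bulk_load(rows, [b"m"])
print(files)
```

Each inner list stands in for one sorted file destined for one region. In the real feature a MapReduce job typically does this work at scale.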

IDH Hbase & Lucene Integration

Tuesday, September 3rd, 2013

IDH Hbase & Lucene Integration by Ritu Kama.

From the post:

HBase is a non-relational, column-oriented database that runs on top of the Hadoop Distributed File System (HDFS). Hbase’s tables contain rows and columns. Each table has an element defined as a Primary Key which is used for all Get/Put/Scan/Delete operations on those tables. To some extent this can be a shortcoming because one may want to search within, say, a given column.

The IDH Integration with Lucene

The Intel® Distribution for Apache Hadoop* (IDH) solves this problem by incorporating native features that permit straightforward integration with Lucene. Lucene is a search library that acts upon documents containing data fields and their values. The IDH-to-Lucene integration leverages the HBase Observer and Endpoint concepts, and therein lies the flexibility to access the HBase data with Lucene searches more robustly.

The Observers can be likened to triggers in RDBMS’s, while the Endpoints share some conceptual similarity to stored procedures. The mapping of Hbase records and Lucene documents is done by a convenience class called IndexMetadata. The Hbase observer monitors data updates to the Hbase table and builds indexes synchronously. The Indexes are stored in multiple shards with each shard tied to a region. The Hbase Endpoint dispatches search requests from the client to those regions.

When entering data into an HBase table you’ll need to create an HBase-Lucene mapping using the IndexMetadata class. During the insertion, text in the columns that are mapped get broken into indexes and stored in the Lucene index file. This process of creating the Lucene index is done automatically by the IDH implementation. Once the Lucene index is created, you can search on any keyword. The implementation searches for the word in the Lucene index and retrieves the row ID’s of the target word. Then, using those keys you can directly access the relevant rows in the database.

IDH’s HBase-Lucene integration extends HBase’s capability and provides many advantages:

  1. Search not only by row key but also by values.
  2. Use multiple query types such as Starts, Ends, Contains, Range, etc.
  3. Ranking scores for the search are also available.


Interested yet?

See Ritu’s post for sample code and configuration procedures.

Definitely one for the short list on downloads to make.
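
The observer and index flow described above can be pictured with a toy model in Python. This is not the IDH API and not Lucene: a plain dictionary stands in for the index, a method called on every put stands in for the observer, and all names are hypothetical.

```python
from collections import defaultdict


class IndexedTable:
    """Toy table that keeps an inverted index in step with its writes."""

    def __init__(self, indexed_columns):
        self.rows = {}                      # row key -> {column: value}
        self.indexed_columns = set(indexed_columns)
        self.index = defaultdict(set)       # term -> row keys containing it

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value
        if column in self.indexed_columns:
            self._post_put(row_key, value)  # observer-style hook, run synchronously

    def _post_put(self, row_key, value):
        # Break the text into terms. Stale terms from overwritten values
        # are not cleaned up in this sketch.
        for term in value.lower().split():
            self.index[term].add(row_key)

    def search(self, term):
        # Look up row keys in the index, then fetch the rows by key.
        keys = sorted(self.index.get(term.lower(), ()))
        return [(key, self.rows[key]) for key in keys]


table = IndexedTable(["msg:body"])
table.put("r1", "msg:body", "HBase meets Lucene")
table.put("r2", "msg:body", "Lucene only")
table.put("r3", "msg:from", "lucene")       # column is not mapped, so not indexed
hits = table.search("lucene")
print([key for key, _ in hits])
```

The search returns r1 and r2 but not r3, because only the mapped column is indexed. That mirrors the two-step pattern in the post: find row IDs in the index, then read the rows directly.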

Hoya (HBase on YARN) : Application Architecture

Friday, August 9th, 2013

Hoya (HBase on YARN) : Application Architecture by Steve Loughran.

From the post:

At Hadoop Summit in June, we introduced a little project we’re working on: Hoya: HBase on YARN. Since then the code has been reworked and is now up on Github. It’s still very raw, and requires some local builds of bits of Hadoop and HBase – but it is there for the interested.

In this article we’re going to look at the architecture, and a bit of the implementation.

We’re not going to look at YARN in this article -for that we have a dedicated section of the Hortonworks site -including sample chapters of Arun Murthy’s forthcoming book. Instead we’re going to cover how Hoya makes use of YARN.

If you are interested in where Hadoop is likely to go beyond MapReduce and don’t mind getting your hands dirty, this is for you.

HBase CON2013 – Videos are UP!

Tuesday, August 6th, 2013

HBase CON2013 – Videos are UP!

The videos from HBase CON2013 are up!

I will create a sorted speakers list with links to the videos/presentations later this week.

Thought you might be as tired of the PBS fund raising specials as I am. 😉

I would prefer to have shows I like to watch as opposed to the “specials” they have on during fund raising.

…Apache HBase REST Interface, Part 3

Tuesday, July 9th, 2013

How-to: Use the Apache HBase REST Interface, Part 3 by Jesse Anderson.

From the post:

This how-to is the third in a series that explores the use of the Apache HBase REST interface. Part 1 covered HBase REST fundamentals, some Python caveats, and table administration. Part 2 showed you how to insert multiple rows simultaneously using XML and JSON. Part 3 below will show how to get multiple rows using XML and JSON.
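
For a sense of what reading rows back involves: the REST interface's JSON representation base64-encodes row keys, column names and cell values, so a client has to decode them. The Python sketch below decodes a hand-built sample body rather than calling a server. The helper names and sample data are hypothetical, and a real response would come from an HTTP GET with an Accept header of application/json.

```python
import base64
import json


def b64(text):
    """Base64-encode a string, as the REST interface does for JSON bodies."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_rows(body):
    """Turn an HBase REST JSON response body into {row_key: {column: value}}."""
    def plain(encoded):
        return base64.b64decode(encoded).decode("utf-8")

    table = {}
    for row in json.loads(body)["Row"]:
        table[plain(row["key"])] = {
            plain(cell["column"]): plain(cell["$"]) for cell in row["Cell"]
        }
    return table


# Hand-built stand-in for a response; a real one would come from the server.
sample_body = json.dumps({"Row": [
    {"key": b64("row1"),
     "Cell": [{"column": b64("cf:a"), "timestamp": 1, "$": b64("hello")}]},
    {"key": b64("row2"),
     "Cell": [{"column": b64("cf:a"), "timestamp": 2, "$": b64("world")}]},
]})
rows = decode_rows(sample_body)
print(rows)
```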

Jesse is an instructor with Cloudera University. I checked but Cloudera doesn’t offer a way to search for courses by instructor. 🙁

I will drop them a note.

Triggers for Apache HBase

Wednesday, July 3rd, 2013

Cloudera Search over Apache HBase: A Story of Collaboration by Steven Noels.

Great background story on the development of triggers and indexing updates for Apache HBase by NGDATA (for their Lily product), work that also underlies Cloudera Search.

From the post:

In this most recent edition, we introduced an order of magnitude performance improvement: a cleaner, more efficient, and fault-tolerant code path with no write performance penalty on HBase. In the interest of modularity, we decoupled the trigger and indexing component from Lily, making it into a stand-alone, collaborative open source project that is now underpinning both Cloudera Search HBase support as well as Lily.

This made sense for us, not just because we believe in HBase and its community but because our customers in Banking, Media, Pharma and Telecom have unqualified expectations for both the scalability and resilience of Lily. Outsourcing some part of that responsibility towards the infrastructure tier is efficient for us. We are very pleased with the collaboration, innovation, and quality that Cloudera has produced by working with us and look forward to a continued relationship that combines joint development in a community oriented way with responsible stewardship of the infrastructure code base we build upon.

Our HBase Triggering and Indexing software can be found on GitHub at:

Do you have any indexing or update side-effect needs for HBase? Tell us your thoughts on this solution.

Apache Bigtop: The “Fedora of Hadoop”…

Wednesday, June 26th, 2013

Apache Bigtop: The “Fedora of Hadoop” is Now Built on Hadoop 2.x by Roman Shaposhnik.

From the post:

Just in time for Hadoop Summit 2013, the Apache Bigtop team is very pleased to announce the release of Bigtop 0.6.0: The very first release of a fully integrated Big Data management distribution built on the currently most advanced Hadoop 2.x, Hadoop 2.0.5-alpha.

Bigtop, as many of you might already know, is a project aimed at creating a 100% open source and community-driven Big Data management distribution based on Apache Hadoop. (You can learn more about it by reading one of our previous blog posts on Apache Blogs.) Bigtop also plays an important role in CDH, which utilizes its packaging code from Bigtop — Cloudera takes pride in developing open source packaging code and contributing the same back to the community.

The very astute readers of this blog will notice that given our quarterly release schedule, Bigtop 0.6.0 should have been called Bigtop 0.7.0. It is true that we skipped a quarter. Our excuse is that we spent all this extra time helping the Hadoop community stabilize the Hadoop 2.x code line and making it a robust kernel for all the applications that are now part of the Bigtop distribution.

And speaking of applications, we haven’t forgotten to grow the Bigtop family: Bigtop 0.6.0 adds Apache HCatalog and Apache Giraph to the mix. The full list of Hadoop applications available as part of the Bigtop 0.6.0 release is:

  • Apache Zookeeper 3.4.5
  • Apache Flume 1.3.1
  • Apache HBase 0.94.5
  • Apache Pig 0.11.1
  • Apache Hive 0.10.0
  • Apache Sqoop 2 (AKA 1.99.2)
  • Apache Oozie 3.3.2
  • Apache Whirr 0.8.2
  • Apache Mahout 0.7
  • Apache Solr (SolrCloud) 4.2.1
  • Apache Crunch (incubating) 0.5.0
  • Apache HCatalog 0.5.0
  • Apache Giraph 1.0.0
  • LinkedIn DataFu 0.0.6
  • Cloudera Hue 2.3.0

And we were just talking about YARN and applications weren’t we? 😉


(Participate if you can but at least send a note of appreciation to Cloudera.)

Introducing Hoya – HBase on YARN

Tuesday, June 25th, 2013

Introducing Hoya – HBase on YARN by Steve Loughran, Devaraj Das & Eric Baldeschwieler.

From the post:

In the last few weeks, we have been getting together a prototype, Hoya, running HBase On YARN. This is driven by a few top level use cases that we have been trying to address. Some of them are:

  • Be able to create on-demand HBase clusters easily -by and or in apps
    • With different versions of HBase potentially (for testing etc.)
  • Be able to configure different Hbase instances differently
    • For example, different configs for read/write workload instances
  • Better isolation
    • Run arbitrary co-processors in user’s private cluster
    • User will own the data that the hbase daemons create
  • MR jobs should find it simple to create (transient) HBase clusters
    • For Map-side joins where table data is all in HBase, for example
  • Elasticity of clusters for analytic / batch workload processing
    • Stop / Suspend / Resume clusters as needed
    • Expand / shrink clusters as needed
  • Be able to utilize cluster resources better
    • Run MR jobs while maintaining HBase’s low latency SLAs

If you are interested in getting in on the ground floor on a promising project, here’s your chance!

True, it is an HBase cluster management project but cluster management abounds in as many subjects as any other IT management area.

Not to mention that few of us ever do just “one job” at most places. Having multiple skills makes you more marketable.

Introduction to Apache HBase Snapshots, Part 2: Deeper Dive

Sunday, June 23rd, 2013

Introduction to Apache HBase Snapshots, Part 2: Deeper Dive by Matteo Bertozzi.

From the post:

In Part 1 of this series about Apache HBase snapshots, you learned how to use the new Snapshots feature and a bit of theory behind the implementation. Now, it’s time to dive into the technical details a bit more deeply.

I have been reading about writing styles recently and one author suggested that every novel start with the second chapter.

That is, show the characters in action and get the audience caring about them before filling in the background.

All of the details in Matteo’s post are important, but you have to get near the end to answer the question: Why should I care?

Try this:

Have you ever deleted a file or table that should not have been deleted?

The cloning and restoration features of HBase snapshots can save you embarrassment, awkward explanations and possibly even your position.

Now read Matteo’s post.

Did that make a difference?

How to Contribute to HBase and Hadoop 2.0

Sunday, June 23rd, 2013

How to Contribute to HBase and Hadoop 2.0 by Nick Dimiduk.

From the post:

In case you haven’t heard, Hadoop 2.0 is on the way! There are loads more new features than I can begin to enumerate, including lots of interesting enhancements to HDFS for online applications like HBase. One of the most anticipated new features is YARN, an entirely new way to think about deploying applications across your Hadoop cluster. It’s easy to think of YARN as the infrastructure necessary to turn Hadoop into a cloud-like runtime for deploying and scaling data-centric applications. Early examples of such applications are rare, but two noteworthy examples are Knitting Boar and Storm on YARN. Hadoop 2.0 will also ship a MapReduce implementation built on top of YARN that is binary compatible with applications written for MapReduce on Hadoop-1.x.

The HBase project is raring to get onto this new platform as well. Hadoop2 will be a fully supported deployment environment for HBase 0.96 release. There are still lots of bugs to squish and the build lights aren’t green yet. That’s where you come in!

To really “know” software you can:

  • Teach it.
  • Write (good) documentation about it.
  • Squash bugs.

Nick is inviting you to squash bugs for HBase and Hadoop 2.0.

Memories of sun drenched debauchery will fade.

Being a contributor to an Apache project over the summer won’t.


What’s Next for HBase? Big Data Applications Using Frameworks Like Kiji

Tuesday, June 11th, 2013

What’s Next for HBase? Big Data Applications Using Frameworks Like Kiji by Michael Stack.

From the post:

Apache Hadoop and HBase have quickly become industry standards for storage and analysis of Big Data in the enterprise, yet as adoption spreads, new challenges and opportunities have emerged. Today, there is a large gap — a chasm, a gorge — between the nice application model your Big Data Application builder designed and the raw, byte-based APIs provided by HBase and Hadoop. Many Big Data players have invested a lot of time and energy in bridging this gap. Cloudera, where I work, is developing the Cloudera Development Kit (CDK). Kiji, an open source framework for building Big Data Applications, is another such thriving option. A lot of thought has gone into its design. More importantly, long experience building Big Data Applications on top of Hadoop and HBase has been baked into how it all works.

Kiji provides a model and a set of libraries that allow developers to get up and running quickly. Intuitive Java APIs and Kiji’s rich data model allow developers to build business logic and machine learning algorithms without having to worry about bytes, serialization, schema evolution, and lower-level aspects of the system. The Kiji framework is modularized into separate components to support a wide range of usage and encourage clean separation of functionality. Kiji’s main components include KijiSchema, KijiMR, KijiHive, KijiExpress, KijiREST, and KijiScoring. KijiSchema, for example, helps team members collaborate on long-lived Big Data management projects, and does away with common incompatibility issues, and helps developers build more integrated systems across the board. All of these components are available in a single download called a BentoBox.

When mainstream news only has political scandals, wars and rumors of wars, tech news can brighten your day!

Be sure to visit the Kiji Project website.

Turn-key tutorials to get you started.

Metrics2: The New Hotness for Apache HBase Metrics

Thursday, May 9th, 2013

Metrics2: The New Hotness for Apache HBase Metrics by Elliott Clark.

From the post:

Apache HBase is a distributed big data store modeled after Google’s Bigtable paper. As with all distributed systems, knowing what’s happening at a given time can help spot problems before they arise, debug on-going issues, evaluate new usage patterns, and provide insight into capacity planning.

Since October 2008, version 0.19.0 (HBASE-625), HBase has been using Apache Hadoop’s metrics system to export metrics to JMX, Ganglia, and other metrics sinks. As the code base grew, more and more metrics were added by different developers. New features got metrics. When users needed more data on issues, they added more metrics. These new metrics were not always consistently named, and some were not well documented.

As HBase’s metrics system grew organically, Hadoop developers were making a new version of the Metrics system called Metrics2. In HADOOP-6728 and subsequent JIRAs, a new version of the metrics system was created. This new subsystem has a new name space, different sinks, different sources, more features, and is more complete than the old metrics. When the Metrics2 system was completed, the old system (aka Metrics1) was deprecated. With all of these things in mind, it was time to update HBase’s metrics system so HBASE-4050 was started. I also wanted to clean up the implementation cruft that had accumulated.

Welcome news of a consistent metric system for HBase!

If you can’t measure it, it’s hard to brag about it. 😉

How Scaling Really Works in Apache HBase

Saturday, April 27th, 2013

How Scaling Really Works in Apache HBase by Matteo Bertozzi.

From the post:

At first glance, the Apache HBase architecture appears to follow a master/slave model where the master receives all the requests but the real work is done by the slaves. This is not actually the case, and in this article I will describe what tasks are in fact handled by the master and the slaves.

You can use a tool or master a tool.

Recommend the latter.

…Apache HBase REST Interface, Part 2

Friday, April 12th, 2013

How-to: Use the Apache HBase REST Interface, Part 2 by Jesse Anderson.

From the post:

This how-to is the second in a series that explores the use of the Apache HBase REST interface. Part 1 covered HBase REST fundamentals, some Python caveats, and table administration. Part 2 below will show you how to insert multiple rows at once using XML and JSON. The full code samples can be found on GitHub.

Only fair to cover both XML and TBL’s new favorite, JSON. (Tim Berners-Lee Renounces XML?)
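
For readers who want a feel for the JSON side, here is a minimal sketch of building a multi-row body in Python. It assumes the REST JSON layout as I understand it: a top-level "Row" list, with row keys, column names and values all base64-encoded, and each cell value stored under "$". The row keys, the "cf:message" column and the helper names are hypothetical and are not taken from Jesse's samples. The sketch only builds the body; sending it to a REST server is left out.

```python
import base64
import json


def b64(text):
    """Base64-encode a string and return it as ASCII text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def build_multi_row_payload(rows):
    """Build one JSON body that carries several rows.

    `rows` maps a row key to a dict of {"family:qualifier": value}.
    The "Row"/"key"/"Cell"/"column"/"$" layout is an assumption about
    the REST JSON format, not something quoted in the post above.
    """
    return json.dumps({
        "Row": [
            {
                "key": b64(row_key),
                "Cell": [
                    {"column": b64(column), "$": b64(value)}
                    for column, value in cells.items()
                ],
            }
            for row_key, cells in rows.items()
        ]
    })


# Hypothetical rows and column names, for illustration only.
payload = build_multi_row_payload({
    "row1": {"cf:message": "hello"},
    "row2": {"cf:message": "world"},
})
```

Batching rows into one body means one HTTP round trip instead of one per row, which is the point of the second post.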

Phoenix in 15 Minutes or Less

Sunday, April 7th, 2013

Phoenix in 15 Minutes or Less by Justin Kestelyn.

An amusing FAQ by “James Taylor of Salesforce, which recently open-sourced its Phoenix client-embedded JDBC driver for low-latency queries over HBase.”

From the post:

What is this new Phoenix thing I’ve been hearing about?
Phoenix is an open source SQL skin for HBase. You use the standard JDBC APIs instead of the regular HBase client APIs to create tables, insert data, and query your HBase data.

Doesn’t putting an extra layer between my application and HBase just slow things down?
Actually, no. Phoenix achieves as good or likely better performance than if you hand-coded it yourself (not to mention with a heck of a lot less code) by:

  • compiling your SQL queries to native HBase scans
  • determining the optimal start and stop for your scan key
  • orchestrating the parallel execution of your scans
  • bringing the computation to the data by
    • pushing the predicates in your where clause to a server-side filter
    • executing aggregate queries through server-side hooks (called co-processors)

In addition to these items, we’ve got some interesting enhancements in the works to further optimize performance:

  • secondary indexes to improve performance for queries on non row key columns
  • stats gathering to improve parallelization and guide choices between optimizations
  • skip scan filter to optimize IN, LIKE, and OR queries
  • optional salting of row keys to evenly distribute write load


Sounds authentic to me!
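
The "optional salting of row keys" item is easy to picture with a small sketch. This is only an illustration of the general idea, not Phoenix's implementation: the post does not say which hash Phoenix uses, so CRC32 is a stand-in, and the function name, bucket count and sample keys are hypothetical.

```python
import zlib


def salt_row_key(row_key: bytes, buckets: int) -> bytes:
    """Prefix a row key with one salt byte derived from a hash of the key.

    CRC32 is a stand-in hash chosen for this sketch; it is an assumption,
    not the function Phoenix uses.
    """
    if not 1 <= buckets <= 256:
        raise ValueError("buckets must fit in a single byte")
    salt = zlib.crc32(row_key) % buckets
    return bytes([salt]) + row_key


# Sequential keys (hypothetical) would otherwise all sort next to each other.
keys = [f"2013-04-07-event-{i:04d}".encode() for i in range(1000)]
salted = [salt_row_key(key, buckets=4) for key in keys]

# Count how many keys fall under each salt byte.
counts = {}
for key in salted:
    counts[key[0]] = counts.get(key[0], 0) + 1
```

Because the salt is computed from the key itself, a reader can recompute it for a point lookup. The trade-off is that a range scan has to visit every bucket.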


…Apache HBase REST Interface, Part 1

Tuesday, March 12th, 2013

How-to: Use the Apache HBase REST Interface, Part 1 by Jesse Anderson.

From the post:

There are various ways to access and interact with Apache HBase. The Java API provides the most functionality, but many people want to use HBase without Java.

There are two main approaches for doing that: One is the Thrift interface, which is the faster and more lightweight of the two options. The other way to access HBase is using the REST interface, which uses HTTP verbs to perform an action, giving developers a wide choice of languages and programs to use.

This series of how-to’s will discuss the REST interface and provide Python code samples for accessing it. The first post will cover HBase REST, some Python caveats, and table administration. The second post will explain how to insert multiple rows at a time using XML and JSON. The third post will show how to get multiple rows using XML and JSON. The full code samples can be found on my GitHub account.

Post also has a reminder about HBaseCon 2013 (June 13, San Francisco).

Introduction to Apache HBase Snapshots

Saturday, March 9th, 2013

Introduction to Apache HBase Snapshots by Matteo Bertozzi.

From the post:

The current (4.2) release of CDH — Cloudera’s 100% open-source distribution of Apache Hadoop and related projects (including Apache HBase) — introduced a new HBase feature, recently landed in trunk, that allows an admin to take a snapshot of a specified table.

Prior to CDH 4.2, the only way to back-up or clone a table was to use Copy/Export Table, or after disabling the table, copy all the hfiles in HDFS. Copy/Export Table is a set of tools that uses MapReduce to scan and copy the table but with a direct impact on Region Server performance. Disabling the table stops all reads and writes, which will almost always be unacceptable.

In contrast, HBase snapshots allow an admin to clone a table without data copies and with minimal impact on Region Servers. Exporting the snapshot to another cluster does not directly affect any of the Region Servers; export is just a distcp with an extra bit of logic.

Here are a few of the use cases for HBase snapshots:

  • Recovery from user/application errors
    • Restore/Recover from a known safe state.
    • View previous snapshots and selectively merge the difference into production.
    • Save a snapshot right before a major application upgrade or change.
  • Auditing and/or reporting on views of data at a specific time
    • Capture monthly data for compliance purposes.
    • Run end-of-day/month/quarter reports.
  • Application testing
    • Test schema or application changes on data similar to that in production from a snapshot and then throw it away. For example: take a snapshot, create a new table from the snapshot content (schema plus data), and manipulate the new table by changing the schema, adding and removing rows, and so on. (The original table, the snapshot, and the new table remain mutually independent.)
  • Offloading of work
    • Take a snapshot, export it to another cluster, and run your MapReduce jobs. Since the export snapshot operates at HDFS level, you don’t slow down your main HBase cluster as much as CopyTable does.

Under “application testing” I would include access to your HBase data by non-experts. Gives them something to tinker with and preserves the integrity of your production data.

WANdisco: Free Hadoop Training Webinars

Friday, March 1st, 2013

WANdisco: Free Hadoop Training Webinars

WANdisco has four Hadoop webinars to put on your calendar:

A Hadoop Overview

This webinar will include a review of major components including HDFS, MapReduce, and HBase – the NoSQL database management system used with Hadoop for real-time applications. An overview of Hadoop’s ecosystem will also be provided. Other topics covered will include a review of public and private cloud deployment options, and common business use cases.

Register now Weds, March 13, 10:00 a.m. PT/1:00 p.m. ET

A Hadoop Deep Dive

This webinar will cover Hadoop misconceptions (not all clusters are thousands of machines), information about real world Hadoop deployments, a detailed review of Hadoop’s ecosystem (Sqoop, Flume, Nutch, Oozie, etc.), an in-depth look at HDFS, and an explanation of MapReduce in relation to latency and dependence on other Hadoop activities.

This webinar will introduce attendees to concepts they will need as a prerequisite for subsequent training webinars covering MapReduce, HBase and other major components at a deeper technical level.

Register now Weds, March 27, 10:00 a.m. PT/1:00 p.m. ET

Hadoop: A MapReduce Tutorial

This webinar will cover MapReduce at a deep technical level.

This session will cover the history of MapReduce, how a MapReduce job works, its logical flow, the rules and types of MapReduce jobs, de-bugging and testing MapReduce jobs, writing foolproof MapReduce jobs, various workflow tools that are available, and more.

Register now Weds, April 10, 10:00 a.m. PT/1:00 p.m. ET

Hadoop: HBase In-Depth

This webinar will provide a deep technical review of HBase, and cover flexibility, scalability, components (cells, rows, columns, qualifiers), schema samples, hardware requirements and more.

Register now Weds, April 24, 10:00 a.m. PT/1:00 p.m. ET

I first saw this at: WANdisco Announces Free Hadoop Training Webinars.

A post with no link to WANdisco or to registration for any of the webinars.

If you would prefer that I put in fewer hyperlinks to resources, please let me know.

Apache HBase 0.94.5 is out!

Sunday, February 24th, 2013

Apache HBase 0.94.5 is out! by Enis Soztutar.

From the post:

Last week, the HBase community released 0.94.5, which is the most stable release of HBase so far. The release includes 76 jira issues resolved, with 61 bug fixes, 8 improvements, and 2 new features.

Have you upgraded your HBase installation?

Flatten entire HBase column families… [Mixing Labels and Data]

Monday, February 11th, 2013

Flatten entire HBase column families with Pig and Python UDFs by Chase Seibert.

From the post:

Most Pig tutorials you will find assume that you are working with data where you know all the column names ahead of time, and that the column names themselves are just labels, versus being composites of labels and data. For example, when working with HBase, it’s actually not uncommon for both of those assumptions to be false. Being a columnar database, it’s very common to be working with rows that have thousands of columns. Under that circumstance, it’s also common for the column names themselves to encode two dimensions, such as date and counter type.

How do you solve this mismatch? If you’re in the early stages of designing a schema, you could reconsider a more row based approach. If you have to work with an existing schema, however, you can with the help of Pig UDFs.

Now there’s an ugly problem.

You can split the label from the data as shown, but that doesn’t help when the label/data is still in situ.

Saying: “Don’t do that!” doesn’t help because it is already being done.

If anything, topic maps need to take subjects as they are found, not as we might wish for them to be.

Curious, would you write an identifier as a regex that parses such a mix of label and data, assigning each to further processing?
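
To make the question concrete, here is a rough sketch of what such a regex might look like in Python. Everything specific is an assumption for illustration: the "date:counter" column-name format, the sample row, and the function name are hypothetical and do not come from Chase's post.

```python
import re

# Hypothetical column-name format: "<YYYY-MM-DD>:<counter type>".
COLUMN_NAME = re.compile(r"^(?P<date>\d{4}-\d{2}-\d{2}):(?P<counter>[a-z_]+)$")


def split_column(name, value):
    """Split a composite column name into its date and counter parts.

    Returns None for names that do not follow the assumed format, so
    plain label columns pass through untouched by this step.
    """
    match = COLUMN_NAME.match(name)
    if match is None:
        return None
    return {
        "date": match.group("date"),
        "counter": match.group("counter"),
        "value": value,
    }


# One wide row (made up for this sketch) mixing composite names and a plain label.
row = {"2013-02-11:clicks": 42, "2013-02-11:views": 310, "owner": "chase"}
flattened = [
    record
    for record in (split_column(name, value) for name, value in row.items())
    if record is not None
]
```

The named groups are where the label parts company with the data: each group can be handed to its own further processing while the column name itself stays where it was found.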


I first saw this at Flatten Entire HBase Column Families With Pig and Python UDFs by Alex Popescu.