Archive for the ‘HBase’ Category

Metrics2: The New Hotness for Apache HBase Metrics

Thursday, May 9th, 2013

Metrics2: The New Hotness for Apache HBase Metrics by Elliott Clark.

From the post:

Apache HBase is a distributed big data store modeled after Google’s Bigtable paper. As with all distributed systems, knowing what’s happening at a given time can help spot problems before they arise, debug on-going issues, evaluate new usage patterns, and provide insight into capacity planning.

Since October 2008, version 0.19.0 (HBASE-625), HBase has been using Apache Hadoop’s metrics system to export metrics to JMX, Ganglia, and other metrics sinks. As the code base grew, more and more metrics were added by different developers. New features got metrics. When users needed more data on issues, they added more metrics. These new metrics were not always consistently named, and some were not well documented.

As HBase’s metrics system grew organically, Hadoop developers were making a new version of the Metrics system called Metrics2. In HADOOP-6728 and subsequent JIRAs, a new version of the metrics system was created. This new subsystem has a new name space, different sinks, different sources, more features, and is more complete than the old metrics. When the Metrics2 system was completed, the old system (aka Metrics1) was deprecated. With all of these things in mind, it was time to update HBase’s metrics system so HBASE-4050 was started. I also wanted to clean up the implementation cruft that had accumulated.

Welcome news of a consistent metric system for HBase!

If you can’t measure it, it’s hard to brag about it. ;-)

How Scaling Really Works in Apache HBase

Saturday, April 27th, 2013

How Scaling Really Works in Apache HBase by Matteo Bertozzi.

From the post:

At first glance, the Apache HBase architecture appears to follow a master/slave model where the master receives all the requests but the real work is done by the slaves. This is not actually the case, and in this article I will describe what tasks are in fact handled by the master and the slaves.

You can use a tool or master a tool.

Recommend the latter.

…Apache HBase REST Interface, Part 2

Friday, April 12th, 2013

How-to: Use the Apache HBase REST Interface, Part 2 by Jesse Anderson.

From the post:

This how-to is the second in a series that explores the use of the Apache HBase REST interface. Part 1 covered HBase REST fundamentals, some Python caveats, and table administration. Part 2 below will show you how to insert multiple rows at once using XML and JSON. The full code samples can be found on GitHub.

Only fair to cover both XML and TBL’s new favorite, JSON. (Tim Berners-Lee Renounces XML?)

Phoenix in 15 Minutes or Less

Sunday, April 7th, 2013

Phoenix in 15 Minutes or Less by Justin Kestelyn.

An amusing FAQ by “James Taylor of Salesforce, which recently open-sourced its Phoenix client-embedded JDBC driver for low-latency queries over HBase.”

From the post:

What is this new Phoenix thing I’ve been hearing about?
Phoenix is an open source SQL skin for HBase. You use the standard JDBC APIs instead of the regular HBase client APIs to create tables, insert data, and query your HBase data.

Doesn’t putting an extra layer between my application and HBase just slow things down?
Actually, no. Phoenix achieves as good or likely better performance than if you hand-coded it yourself (not to mention with a heck of a lot less code) by:

  • compiling your SQL queries to native HBase scans
  • determining the optimal start and stop for your scan key
  • orchestrating the parallel execution of your scans
  • bringing the computation to the data by
    • pushing the predicates in your where clause to a server-side filter
    • executing aggregate queries through server-side hooks (called co-processors)

In addition to these items, we’ve got some interesting enhancements in the works to further optimize performance:

  • secondary indexes to improve performance for queries on non row key columns
  • stats gathering to improve parallelization and guide choices between optimizations
  • skip scan filter to optimize IN, LIKE, and OR queries
  • optional salting of row keys to evenly distribute write load

…..

Sounds authentic to me!

You?

…Apache HBase REST Interface, Part 1

Tuesday, March 12th, 2013

How-to: Use the Apache HBase REST Interface, Part 1 by Jesse Anderson.

From the post:

There are various ways to access and interact with Apache HBase. The Java API provides the most functionality, but many people want to use HBase without Java.

There are two main approaches for doing that: One is the Thrift interface, which is the faster and more lightweight of the two options. The other way to access HBase is using the REST interface, which uses HTTP verbs to perform an action, giving developers a wide choice of languages and programs to use.

This series of how-to’s will discuss the REST interface and provide Python code samples for accessing it. The first post will cover HBase REST, some Python caveats, and table administration. The second post will explain how to insert multiple rows at a time using XML and JSON. The third post will show how to get multiples rows using XML and JSON. The full code samples can be found on my GitHub account.

Post also has a reminder about HBaseCon 2013 (June 13, San Francisco).

Introduction to Apache HBase Snapshots

Saturday, March 9th, 2013

Introduction to Apache HBase Snapshots by Matteo Bertozzi.

From the post:

The current (4.2) release of CDH — Cloudera’s 100% open-source distribution of Apache Hadoop and related projects (including Apache HBase) — introduced a new HBase feature, recently landed in trunk, that allows an admin to take a snapshot of a specified table.

Prior to CDH 4.2, the only way to back-up or clone a table was to use Copy/Export Table, or after disabling the table, copy all the hfiles in HDFS. Copy/Export Table is a set of tools that uses MapReduce to scan and copy the table but with a direct impact on Region Server performance. Disabling the table stops all reads and writes, which will almost always be unacceptable.

In contrast, HBase snapshots allow an admin to clone a table without data copies and with minimal impact on Region Servers. Exporting the snapshot to another cluster does not directly affect any of the Region Servers; export is just a distcp with an extra bit of logic.

Here are a few of the use cases for HBase snapshots:

  • Recovery from user/application errors
    • Restore/Recover from a known safe state.
    • View previous snapshots and selectively merge the difference into production.
    • Save a snapshot right before a major application upgrade or change.
  • Auditing and/or reporting on views of data at specific time
    • Capture monthly data for compliance purposes.
    • Run end-of-day/month/quarter reports.
  • Application testing
    • Test schema or application changes on data similar to that in production from a snapshot and then throw it away. For example: take a snapshot, create a new table from the snapshot content (schema plus data), and manipulate the new table by changing the schema, adding and removing rows, and so on. (The original table, the snapshot, and the new table remain mutually independent.)
  • Offloading of work
    • Take a snapshot, export it to another cluster, and run your MapReduce jobs. Since the export snapshot operates at HDFS level, you don’t slow down your main HBase cluster as much as CopyTable does.

Under “application testing” I would include access to your HBase data by non-experts. Gives them something to tinker with and preserves the integrity of your production data.

WANdisco: Free Hadoop Training Webinars

Friday, March 1st, 2013

WANdisco: Free Hadoop Training Webinars

WANdisco has four Hadoop webinars to put on your calendar:

A Hadoop Overview

This webinar will include a review of major components including HDFS, MapReduce, and HBase – the NoSQL database management system used with Hadoop for real-time applications. An overview of Hadoop’s ecosystem will also be provided. Other topics covered will include a review of public and private cloud deployment options, and common business use cases.

Register now Weds, March 13, 10:00 a.m. PT/1:00 p.m. ET

A Hadoop Deep Dive

This webinar will cover Hadoop misconceptions (not all clusters are thousands of machines), information about real world Hadoop deployments, a detailed review of Hadoop’s ecosystem (Sqoop, Flume, Nutch, Oozie, etc.), an in-depth look at HDFS, and an explanation of MapReduce in relation to latency and dependence on other Hadoop activities.

This webinar will introduce attendees to concepts they will need as a prerequisite for subsequent training webinars covering MapReduce, HBase and other major components at a deeper technical level.

Register now Weds, March 27, 10:00 a.m. PT/1:00 p.m. ET

Hadoop: A MapReduce Tutorial

This webinar will cover MapReduce at a deep technical level.

This session will cover the history of MapReduce, how a MapReduce job works, its logical flow, the rules and types of MapReduce jobs, de-bugging and testing MapReduce jobs, writing foolproof MapReduce jobs, various workflow tools that are available, and more.

Register now Weds, April 10, 10:00 a.m. PT/1:00 p.m. ET

Hadoop: HBase In-Depth

This webinar will provide a deep technical review of HBase, and cover flexibility, scalability, components (cells, rows, columns, qualifiers), schema samples, hardware requirements and more.

Register now Weds, April 24, 10:00 a.m. PT/1:00 p.m. ET

I first saw this at: WANdisco Announces Free Hadoop Training Webinars.

A post with no link to WANdisco or to registration for any of the webinars.

If you would prefer that I put in fewer hyperlinks to resources, please let me know.

Apache HBase 0.94.5 is out!

Sunday, February 24th, 2013

Apache HBase 0.94.5 is out! by Enis Soztutar.

From the post:

Last week, the HBase community released 0.94.5, which is the most stable release of HBase so far. The release includes 76 jira issues resolved, with 61 bug fixes, 8 improvements, and 2 new features.

Have you upgraded your HBase installation?

Flatten entire HBase column families… [Mixing Labels and Data]

Monday, February 11th, 2013

Flatten entire HBase column families with Pig and Python UDFs by Chase Seibert.

From the post:

Most Pig tutorials you will find assume that you are working with data where you know all the column names ahead of time, and that the column names themselves are just labels, versus being composites of labels and data. For example, when working with HBase, it’s actually not uncommon for both of those assumptions to be false. Being a columnar database, it’s very common to be working to rows that have thousands of columns. Under that circumstance, it’s also common for the column names themselves to encode to dimensions, such as date and counter type.

How do you solve this mismatch? If you’re in the early stages of designing a schema, you could reconsider a more row based approach. If you have to work with an existing schema, however, you can with the help of Pig UDFs.

Now there’s an ugly problem.

You can split the label from the data as shown, but that doesn’t help when the label/data is still in situ.

Saying: “Don’t do that!” doesn’t help because it is already being done.

If anything, topic maps need to take subjects as they are found, not as we might wish for them to be.

Curious, would you write an identifier as a regex that parses such a mix of label and data, assigning each to further processing?

Suggestions?

I first saw this at Flatten Entire HBase Column Families With Pig and Python UDFs by Alex Popescu.

Hortonworks Data Platform 1.2 Available Now!

Friday, January 18th, 2013

Hortonworks Data Platform 1.2 Available Now! by Kim Rose.

From the post:

Hortonworks Data Platform (HDP) 1.2, the industry’s only complete 100-percent open source platform powered by Apache Hadoop is available today. The enterprise-grade Hortonworks Data Platform includes the latest version of Apache Ambari for comprehensive management, monitoring and provisioning of Apache Hadoop clusters. By also introducing additional new capabilities for improving security and ease of use, HDP delivers an enterprise-class distribution of Apache Hadoop that is endorsed and adopted by some of the largest vendors in the IT ecosystem.

Hortonworks continues to drive innovation through a range of Hadoop-related projects, packaging the most enterprise-ready components, such as Ambari, into the Hortonworks Data Platform. Powered by an Apache open source community, Ambari represents the forefront of innovation in Apache Hadoop management. Built on Apache Hadoop 1.0, the most stable and reliable code available today, HDP 1.2 improves the ease of enterprise adoption for Apache Hadoop with comprehensive management and monitoring, enhanced connectivity to high-performance drivers, and increased enterprise-readiness of Apache HBase, Apache Hive and Apache HCatalog projects.

The Hortonworks Data Platform 1.2 features a number of new enhancements designed to improve the enterprise viability of Apache Hadoop, including:

  • Simplified Hadoop Operations—Using the latest release of Apache Ambari, HDP 1.2 now provides both cluster management and the ability to zoom into cluster usage and performance metrics for jobs and tasks to identify the root cause of performance bottlenecks or operations issues. This enables Hadoop users to identify issues and optimize future job processing.
  • Improved Security and Multi-threaded Query—HDP 1.2 provides an enhanced security architecture and pluggable authentication model that controls access to Hive tables and the metastore. In addition, HDP 1.2 improves scalability by supporting multiple concurrent query connections to Hive from business intelligence tools and Hive clients.
  • Integration with High-performance Drivers Built for Big Data—HDP 1.2 empowers organizations with a trusted and reliable ODBC connector that enables the integration of current systems with high-performance drivers built for big data. The ODBC driver enables integration with reporting or visualization components through a SQL engine built into the driver. Hortonworks has partnered with Simba to deliver a trusted, reliable high-performance ODBC connector that is enterprise ready and completely free.
  • HBase Enhancements—By including and testing HBase 0.94.2, HDP 1.2 delivers important performance and operational improvements for customers building and deploying highly scalable interactive applications using HBase.

There goes the weekend!

ANN: HBase 0.94.4 is available for download

Wednesday, January 16th, 2013

ANN: HBase 0.94.4 is available for download by lars Hofhansl

Bug fix release with 81 issues resolved plus performance enhancements!

First seen tweet by Stack

Apache Gora

Monday, December 10th, 2012

Apache Gora

From the webpage:

What is Apache Gora?

The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column
stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

Why Apache Gora?

Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differ profoundly from their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases, where one needs to use the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use in-memory data model and persistence for big data framework with data store specific mappings and built in Apache Hadoop support.

The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows.

  • Data Persistence : Persisting objects to Column stores such as HBase, Cassandra, Hypertable; key-value stores such as Voldermort, Redis, etc; SQL databases, such as MySQL, HSQLDB, flat files in local file system of Hadoop HDFS.
  • Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location.
  • Indexing : Persisting objects to Lucene and Solr indexes, accessing/querying the data with Gora API.
  • Analysis : Accesing the data and making analysis through adapters for Apache Pig, Apache Hive and Cascading
  • MapReduce support : Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.

When writing about the Nutch 2.X development path, I discovered my omission of Gora from this blog. Apologies for having overlooked it until now.

Apache Nutch v1.6 and Apache 2.1 Releases

Monday, December 10th, 2012

Apache Nutch v1.6 Released

From the news:

The Apache Nutch PMC are extremely pleased to announce the release of Apache Nutch v1.6. This release includes over 20 bug fixes, the same in improvements, as well as new functionalities including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME-type and functional enhancements to the Indexer API inluding the normalization of URL’s and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8. Please see the list of changes or the release report made in this version for a full breakdown. The release is available here.

See the Nutch 1.x tutorial.

Apache Nutch v2.1 Released

From the news:

The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.1. This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive which is growing in popularity amongst the community. As well as addressing ~20 bugs this release also offers improved properties for better Solr configuration, upgrades to various Gora dependencies and the introduction of the option to build indexes in elastic search. Please see the list of changes made in this version for a full breakdown. The release is available here.

See the Nutch 2.x tutorial.

I haven’t done a detailed comparison but roughly, Nutch 1.x relies upon Solr for storage and Nutch 2.x relies upon Gora and HBase.

Surprised that isn’t in the FAQ.

Perhaps I will investigate further and offer a short summary of the differences.

Streaming data into Apache HBase using Apache Flume

Thursday, November 29th, 2012

Streaming data into Apache HBase using Apache Flume

From the post:

Apache Flume was conceived as a fault-tolerant ingest system for the Apache Hadoop ecosystem. Flume comes packaged with an HDFS Sink which can be used to write events into HDFS, and two different implementations of HBase sinks to write events into HBase. You can read about the basic architecture of Apache Flume 1.x in this blog post. You can also read about how Flume’s File Channel persists events and still provides extremely high performance in an earlier blog post. In this article, we will explore how to configure Flume to write events into HBase, and write custom serializers to write events into HBase in a format of the user’s choice.

Streaming data is great, but being able to capture it when needed, is even better!

Kiji Project [Framework for HBase]

Wednesday, November 14th, 2012

Kiji Project: An Open Source Framework for Building Big Data Applications with Apache HBase by Aaron Kimball.

From the post:

Our team at WibiData has been developing applications on Hadoop since 2010 and we’ve helped many organizations transform how they use data by deploying Hadoop. HBase in particular has allowed companies of all types to drive their business using scalable, high performance storage. Organizations have started to leverage these capabilities for various big data applications, including targeted content, personalized recommendations, enhanced customer experience and social network analysis.

While building many of these applications, we have seen emerging tools, design patterns and best practices repeated across projects. One of the clear lessons learned is that Hadoop and HBase provide very low-level interfaces. Each large-scale application we have built on top of Hadoop has required a great deal of scaffolding and data management code. This repetitive programming is tedious, error-prone, and makes application interoperability more challenging in the long run.

Today, we are proud to announce the launch of the Kiji project (www.kiji.org), as well as the first Kiji component: KijiSchema. The Kiji project was developed to host a suite of open source components built on top of Apache HBase and Apache Hadoop, that makes it easier for developers to:

  1. Use HBase as a real-time data storage and serving layer for applications
  2. Maximize HBase performance using data management best practices
  3. Get started building data applications quickly with easy startup and configuration

Kiji is open source and licensed under the Apache 2.0 license. The Kiji project is modularized into separate components to simplify adoption and encourage clean separation of functionality. Our approach emphasizes interoperability with other systems, leveraging the open source HBase, Avro and MapReduce projects, enabling you to easily fit Kiji into your development process and applications.

KijiSchema: Schema Management for HBase

The first component within the Kiji project is KijiSchema, which provides layout and schema management on top of HBase. KijiSchema gives developers the ability to easily store both structured and unstructured data within HBase using Avro serialization. It supports a variety of rich schema features, including complex, compound data types, HBase column key and time-series indexing, as well cell-level evolving schemas that dynamically encode version information.

KijiSchema promotes the use of entity-centric data modeling, where all information about a given entity (user, mobile device, ad, product, etc.), including dimensional and transaction data, is encoded within the same row. This approach is particularly valuable for user-based analytics such as targeting, recommendations, and personalization.

This looks important!

Reading further about their “entiity-centric” approach:

Entity-Centric Data Model

KijiSchema’s data model is entity-centric. Each row typically holds information about a single entity in your information scheme. As an example, a consumer e-commerce web site may have a row representing each user of their site. The entity-centric data model enables easier analysis of individual entities. For example, to recommend products to a user, information such as the user’s past purchases, previously viewed items, search queries, etc. all need to be brought together. The entity-centric model stores all of these attributes of the user in the same row, allowing for efficient access to relevant information.

The entity-centric data model stands in comparison to a more typical log-based approach to data collection. Many MapReduce systems import log files for analysis. Logs are action-centric; each action performed by a user (adding an item to a shopping cart, checking out, performing a search, viewing a product) generates a new log entry. Collecting all the data required for a per-user analysis thus requires a scan of many logs. The entity-centric model is a “pivoted” form of this same information. By pivoting the information as the data is loaded into KijiSchema, later analysis can be run more efficiently, either in a MapReduce job operating over all users, or in a more narrowly-targeted fashion if individual rows require further computation.

I’m already convinced about a single representative for an entity. ;-)

Need to work through the documentation on capturing diverse information about a single entity in one row.

I suspect that the structures that capture data aren’t entities for purposes of this model.

Still, will be an interesting exploration.

HBase Futures

Monday, October 22nd, 2012

HBase Futures by Devaraj Das.

From the post:

As we have said here, Hortonworks has been steadily increasing our investment in HBase. HBase’s adoption has been increasing in the enterprise. To continue this trend, we feel HBase needs investments in the areas of:

  1. Reliability and High Availability (all data always available, and recovery from failures is quick)
  2. Autonomous operation (minimum operator intervention)
  3. Wire compatibility (to support rolling upgrades across a couple of versions at least)
  4. Cross data-center replication (for disaster recovery)
  5. Snapshots and backups (be able to take periodic snapshots of certain/all tables and be able to restore them at a later point if required)
  6. Monitoring and Diagnostics (which regionserver is hot or what caused an outage)

Probably just a personal prejudice but I would have mentioned semantics in that list.

You?

HBase at Hortonworks: An Update [Features, Consumer Side?]

Monday, October 22nd, 2012

HBase at Hortonworks: An Update by Devaraj Das.

From the post:

HBase is a critical component of the Apache Hadoop ecosystem and a core component of the Hortonworks Data Platform. HBase enables a host of low latency Hadoop use-cases; As a publishing platform, HBase exposes data refined in Hadoop to outside systems; As an online column store, HBase supports the blending of random access data read/write with application workloads whose data is directly accessible to Hadoop MapReduce.

The HBase community is moving forward aggressively, improving HBase in many ways. We are in the process of integrating HBase 0.94 into our upcoming HDP 1.1 refresh. This “minor upgrade” will include a lot of bug fixes (nearly 200 in number) and quite a few performance improvements and will be wire compatible with HBase 0.92 (in HDP 1.0).

The post concludes:

All of the above is just what we’ve been doing recently and Hortonworkers are only a small fraction of the HBase contributor base. When one factors in all the great contributions coming from across the Apache HBase community, we predict 2013 is going to be a great year for HBase. HBase is maturing fast, becoming both more operationally reliable and more feature rich.

When a technical infrastructure becomes “feature rich,” can “features” for consumer services/interfaces be far behind?

Delivering location-based coupons for latte’s on a cellphone may seem like a “feature.” But we can do that with a man wearing a sandwich board.

A “feature” for the consumer needs to be more than digital imitation of an analog capability.

What consumer “feature(s)” would you offer based on new features in HBase?

Full stack HA in Hadoop 1: HBase’s Resilience to Namenode Failover

Sunday, October 21st, 2012

Full stack HA in Hadoop 1: HBase’s Resilience to Namenode Failover by Devaraj Das.

From the post:

In this blog, I’ll cover how we tested Full Stack HA with NameNode HA in Hadooop 1 with Hadoop and HBase as components of the stack.

Yes, NameNode HA is finally available in the Hadoop 1 line. The test was done with Hadoop branch-1 and HBase-0.92.x on a cluster of roughly ten nodes. The aim was to try to keep a really busy HBase cluster up in the face of the cluster’s NameNode repeatedly going up and down. Note that, HBase would be functional during the time NameNode would be down. It’d only affect those operations that requires a trip to the NameNode (for example, rolling of the WAL, or compaction, or flush), and those would affect only the relevant end users (a user using the HBase get API may not be affected if that get didn’t require a new file open, for example).

A non-reliable cluster is just that, a non-reliable cluster. Not as bad as a backup that may or may not restore your data, but almost.

Regularly and routinely test any alleged HA capability along with backup restore capability. Document that testing.

As opposed to “testing” when either has to work or critical operations will fail or critical data will be lost.*

*Not Miller but résumé time.

Hadoop/HBase Cluster ~ 1 Hour/$10 (What do you have to lose?)

Thursday, October 11th, 2012

Set Up a Hadoop/HBase Cluster on EC2 in (About) an Hour by George London.

From the post:

I’m going to walk you through a (relatively) simple set of steps that will get you up and running MapReduce programs on a cloud-based, six-node distributed Hadoop/HBase cluster as fast as possible. This is all based on what I’ve picked up on my own, so if you know of better/faster methods, please let me know in comments!

We’re going to be running our cluster on Amazon EC2, and launching the cluster using Apache Whirr and configuring it using Cloudera Manager Free Edition. Then we’ll run some basic programs I’ve posted on Github that will parse data and load it into Apache HBase.

All together, this tutorial will take a bit over one hour and cost about $10 in server costs.

This is the sort of tutorial that I long to write for topic maps.

There is a longer version of this tutorial here.

CDH4.1 Now Released!

Wednesday, October 3rd, 2012

CDH4.1 Now Released! by Charles Zedlewski.

From the post:

We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:

  • Quorum based storage – Quorum-based Storage for HDFS provides the ability for HDFS to store its own NameNode edit logs, allowing you to run a highly available NameNode without external storage or custom fencing.
  • Hive security and concurrency – we’ve fixed some long standing issues with running Hive. With CDH4.1, it is now possible to run a shared Hive instance where users submit queries using Kerberos authentication. In addition this new Hive server supports multiple users submitting queries at the same time.
  • Support for DataFu – the LinkedIn data science team was kind enough to open source their library of Pig UDFs that make it easier to perform common jobs like sessionization or set operations. Big thanks to the LinkedIn team!!!
  • Oozie workflow builder – since we added Oozie to CDH more than two years ago, we have often had requests to make it easier to develop Oozie workflows. The newly enhanced job designer in Hue enables users to use a visual tool to build and run Oozie workflows.
  • FlumeNG improvements – since its release, FlumeNG has become the backbone for some exciting data collection projects, in some cases collecting as much as 20TB of new event data per day. In CDH4.1 we added an HBase sink as well as metrics for monitoring as well as a number of performance improvements.
  • Various performance improvements – CDH4.1 users should experience a boost in their MapReduce performance from CDH4.0.
  • Various security improvements – CDH4.1 enables users to configure the system to encrypt data in flight during the shuffle phase. CDH now also applies Hadoop security to users who access the filesystem via a FUSE mount.

It’s releases like this that make me wish I spent more time writing documentation for software. To try out all the cool features with no real goal other than trying them out.

Enjoy!

HAcid: A lightweight transaction system for HBase

Tuesday, October 2nd, 2012

HAcid: A lightweight transaction system for HBase

From the post:

HAcid is a client library that applications can use for operating multi-row transactions in HBase. Seems to be motivated by Google’s Percolator.

Link to the original paper

Understanding User Authentication and Authorization in Apache HBase

Wednesday, September 26th, 2012

Understanding User Authentication and Authorization in Apache HBase by Matteo Bertozzi.

From the post:

With the default Apache HBase configuration, everyone is allowed to read from and write to all tables available in the system. For many enterprise setups, this kind of policy is unacceptable.

Administrators can set up firewalls that decide which machines are allowed to communicate with HBase. However, machines that can pass the firewall are still allowed to read from and write to all tables. This kind of mechanism is effective but insufficient because HBase still cannot differentiate between multiple users that use the same client machines, and there is still no granularity with regard to HBase table, column family, or column qualifier access.

In this post, we will discuss how Kerberos is used with Hadoop and HBase to provide User Authentication, and how HBase implements User Authorization to grant users permissions for particular actions on a specified set of data.

When you think about security, remember: Accumulo: Why The World Needs Another NoSQL Database. Accumulo was written to provide cell level security.

Nice idea but the burden of administering cell level authorizations is going to lead to sloppy security practices. Or granting higher level permissions, inadvisedly, to some users.

Not to mention the truck sized security hole in Accumulo for imported data changing access tokens.

You can get a lot of security mileage out of HBase and Kerberos, long before you get to cell level security permissions.

Pig as Hadoop Connector, Part Two: HBase, JRuby and Sinatra

Monday, August 27th, 2012

Pig as Hadoop Connector, Part Two: HBase, JRuby and Sinatra by Russell Jurney.

From the post:

Hadoop is about freedom as much as scale: providing you disk spindles and processor cores together to process your data with whatever tool you choose. Unleash your creativity. Pig as duct tape facilitates this freedom, enabling you to connect distributed systems at scale in minutes, not hours. In this post we’ll demonstrate how you can turn raw data into a web service using Hadoop, Pig, HBase, JRuby and Sinatra. In doing so we will demonstrate yet another way to use Pig as connector to publish data you’ve processed on Hadoop.

When (not if) the next big cache of emails or other “sensitive” documents drops, everyone who has followed this and similar tutorials should be ready.

HBase Replication: Operational Overview

Thursday, August 16th, 2012

HBase Replication: Operational Overview by Himanshu Vashishtha

From the post:

This is the second blogpost about HBase replication. The previous blogpost, HBase Replication Overview, discussed use cases, architecture and different modes supported in HBase replication. This blogpost is from an operational perspective and will touch upon HBase replication configuration, and key concepts for using it — such as bootstrapping, schema change, and fault tolerance.

The sort of post that makes you long for one or more mini-clusters. ;-)

HBase FuzzyRowFilter: Alternative to Secondary Indexes

Saturday, August 11th, 2012

HBase FuzzyRowFilter: Alternative to Secondary Indexes by Alex Baranau.

From the post:

In this post we’ll explain the usage of FuzzyRowFilter which can help in many situations where secondary indexes solutions seems to be the only choice to avoid full table scans.

Background

When it comes to HBase the way you design your row key affects everything. It is a common pattern to have composite row key which consists of several parts, e.g. userId_actionId_timestamp. This allows for fast fetching of rows (or single row) based on start/stop row keys which have to be a prefix of the row keys you want to select. E.g. one may select last time of userX logged in by specifying row key prefix “userX_login_”. Or last action of userX by fetching the first row with prefix “userX_”. These partial row key scans work very fast and does not require scanning the whole table: HBase storage is optimized to make them fast.

Problem

However, there are cases when you need to fetch data based on key parts which happen to be in the middle of the row key. In the example above you may want to find last logged in users. When you don’t know the first parts of the key partial row key scan turns into full table scan which might be very slow and resource intensive.

Although Alex notes the solution he presents is no “silver bullet,” it illustrates:

  • The impact of key design on later usage.
  • Important of knowing all your options for query performance.

I would capture the availability of the “FuzzyRowFilter,” key structure and cardinality of data using a topic map. Saves then next HBase administrator time and effort.

True, they can always work out the details for themselves but then they make not have your analytical skills.

Apache HBase (DZone Refcard)

Monday, August 6th, 2012

Apache HBase (DZone Refcard) by Otis Gospodnetic and Alex Baranau.

From the webpage:

The Essential HBase Cheat Sheet

HBase is the Hadoop database: a distributed, scalable Big Data store that lets you host very large tables — billions of rows multiplied by millions of columns — on clusters built with commodity hardware. Just as Google Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

If you aren’t familiar with the DZone Refcards, they are a valuable resource.

HBase Replication Overview

Monday, July 30th, 2012

HBase Replication Overview by Himanshu Vashishtha.

From the post:

HBase Replication is a way of copying data from one HBase cluster to a different and possibly distant HBase cluster. It works on the principle that the transactions from the originating cluster are pushed to another cluster. In HBase jargon, the cluster doing the push is called the master, and the one receiving the transactions is called the slave. This push of transactions is done asynchronously, and these transactions are batched in a configurable size (default is 64MB). Asynchronous mode incurs minimal overhead on the master, and shipping edits in a batch increases the overall throughput.

This blogpost discusses the possible use cases, underlying architecture and modes of HBase replication as supported in CDH4 (which is based on 0.92). We will discuss Replication configuration, bootstrapping, and fault tolerance in a follow up blogpost.

Use cases

HBase replication supports replicating data across datacenters. This can be used for disaster recovery scenarios, where we can have the slave cluster serve real time traffic in case the master site is down. Since HBase replication is not intended for automatic failover, the act of switching from the master to the slave cluster in order to start serving traffic is done by the user. Afterwards, once the master cluster is up again, one can do a CopyTable job to copy the deltas to the master cluster (by providing the start/stop timestamps) as described in the CopyTable blogpost.

Another replication use case is when a user wants to run load intensive MapReduce jobs on their HBase cluster; one can do so on the slave cluster while bearing a slight performance decrease on the master cluster.

So there is a non-romantic, sysadmin side to “big data.” I understand, no one ever even speaks unless something has gone wrong with the system. Sysadmins either get no contacts (a good thing) or pages, tweets, emails, phone calls and physical visits from irate users, managers, etc.

This post is a start towards always having the first case, no contacts. Leaves you more time for things that interest sysadmins. I won’t tell if you don’t.

U.S. Senate vs. Apache Accumulo: Whose side are you on?

Wednesday, July 18th, 2012

Jack Park sent a link to NSA Mimics Google, Pisses Off Senate this morning. If you are unfamiliar with the software, see: Apache Accumulo.

Long story made short:

The bill bars the DoD from using the database unless the department can show that the software is sufficiently different from other databases that mimic BigTable. But at the same time, the bill orders the director of the NSA to work with outside organizations to merge the Accumulo security tools with alternative databases, specifically naming HBase and Cassandra.

At issue is:

The bill indicates that Accumulo may violate OMB Circular A-130, a government policy that bars agencies from building software if it’s less expensive to use commercial software that’s already available. And according to one congressional staffer who worked on the bill, this is indeed the case. He asked that his name not be used in this story, as he’s not authorized to speak with the press.

On its face, OMB Circular A-130 sounds like a good idea. Don’t build your own if it is cheaper to buy commercial.

But here the Senate trying to play favorites.

I have a suggestion: Let’s disappoint them.

Let’s contribute to all three projects:

Apache Accumulo

Apache Cassandra

Apache HBase

Would you look at that! All three of these projects are based at Apache!

Let’s push all three projects forward in terms of working on releases, documentation, testing, etc.

But more than that, let’s build applications based on all three projects that analyze political contributions, patterns of voting, land transfers, stock purchases, virtually every fact than can be known about members of the Senate and the Senate Armed Services Committee in particular.

They are accustomed to living in a gold fish bowl.

Let’s move them into a frying pan.

PS: +1 if the NSA is ordered to contribute to open source projects, if the projects are interested. Direction from the U.S. Senate is not part of the Apache governance model.

Configuring HBase Memstore: What You Should Know

Tuesday, July 17th, 2012

Configuring HBase Memstore: What You Should Know by Alex Baranau.

Alex gives three good reasons to care about HBase Memstore:

There are number of reasons HBase users and/or administrators should be aware of what Memstore is and how it is used:

  • There are number of configuration options for Memstore one can use to achieve better performance and avoid issues. HBase will not adjust settings for you based on usage pattern.
  • Frequent Memstore flushes can affect reading performance and can bring additional load to the system
  • The way Memstore flushes work may affect your schema design

Which reason is yours?

HBase Log Splitting

Friday, July 13th, 2012

HBase Log Splitting by Jimmy Xiang.

When I was being certified in NetWare many years ago, the #1 reason for dismissal of sysadmins was failure to maintain backups. I don’t know what the numbers are like now but suspect it is probably about the same. But in any event, you are less likely to be fired if you don’t lose data when your servers fail. That’s just a no-brainer.

If you are running HBase, take the time to review this post and make sure the job that gets lost isn’t yours.

From the post:

In the recent blog post about the HBase Write Path, we talked about the write-ahead-log (WAL), which plays an important role in preventing data loss should a HBase region server failure occur. This blog post describes how HBase prevents data loss after a region server crashes, using an especially critical process for recovering lost updates called log splitting.
Log splitting

As we mentioned in the write path blog post, HBase data updates are stored in a place in memory called memstore for fast write. In the event of a region server failure, the contents of the memstore are lost because they have not been saved to disk yet. To prevent data loss in such a scenario, the updates are persisted in a WAL file before they are stored in the memstore. In the event of a region server failure, the lost contents in the memstore can be regenerated by replaying the updates (also called edits) from the WAL file.

A region server serves many regions. All of the regions in a region server share the same active WAL file. Each edit in the WAL file has information about which region it belongs to. When a region is opened, we need to replay those edits in the WAL file that belong to that region. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting. It is a critical process for recovering data if a region server fails.

Log splitting is done by HMaster as the cluster starts or by ServerShutdownHandler as a region server shuts down. Since we need to guarantee consistency, affected regions are unavailable until data is restored. So we need to recover and replay all WAL edits before letting those regions become available again. As a result, regions affected by log splitting are unavailable until the process completes and any required edits are applied.