Archive for the ‘HCatalog’ Category

Hadoop Tutorials – Hortonworks

Wednesday, October 16th, 2013

With the GA release of Hadoop 2, it seems appropriate to list a set of tutorials for the Hortonworks Sandbox.

Tutorial 1: Hello World – An Overview of Hadoop with HCatalog, Hive and Pig

Tutorial 2: How To Process Data with Apache Pig

Tutorial 3: How to Process Data with Apache Hive

Tutorial 4: How to Use HCatalog, Pig & Hive Commands

Tutorial 5: How to Use Basic Pig Commands

Tutorial 6: How to Load Data for Hadoop into the Hortonworks Sandbox

Tutorial 7: How to Install and Configure the Hortonworks ODBC driver on Windows 7

Tutorial 8: How to Use Excel 2013 to Access Hadoop Data

Tutorial 9: How to Use Excel 2013 to Analyze Hadoop Data

Tutorial 10: How to Visualize Website Clickstream Data

Tutorial 11: How to Install and Configure the Hortonworks ODBC driver on Mac OS X

Tutorial 12: How to Refine and Visualize Server Log Data

Tutorial 13: How To Refine and Visualize Sentiment Data

Tutorial 14: How To Analyze Machine and Sensor Data

By the time you finish these, I am sure there will be more tutorials or even proposed additions to the Hadoop stack!

(Updated December 3, 2013 to add #13 and #14.)

Apache Bigtop: The “Fedora of Hadoop”…

Wednesday, June 26th, 2013

Apache Bigtop: The “Fedora of Hadoop” is Now Built on Hadoop 2.x by Roman Shaposhnik.

From the post:

Just in time for Hadoop Summit 2013, the Apache Bigtop team is very pleased to announce the release of Bigtop 0.6.0: The very first release of a fully integrated Big Data management distribution built on the currently most advanced Hadoop 2.x, Hadoop 2.0.5-alpha.

Bigtop, as many of you might already know, is a project aimed at creating a 100% open source and community-driven Big Data management distribution based on Apache Hadoop. (You can learn more about it by reading one of our previous blog posts on Apache Blogs.) Bigtop also plays an important role in CDH, which utilizes its packaging code from Bigtop — Cloudera takes pride in developing open source packaging code and contributing the same back to the community.

The very astute readers of this blog will notice that given our quarterly release schedule, Bigtop 0.6.0 should have been called Bigtop 0.7.0. It is true that we skipped a quarter. Our excuse is that we spent all this extra time helping the Hadoop community stabilize the Hadoop 2.x code line and making it a robust kernel for all the applications that are now part of the Bigtop distribution.

And speaking of applications, we haven’t forgotten to grow the Bigtop family: Bigtop 0.6.0 adds Apache HCatalog and Apache Giraph to the mix. The full list of Hadoop applications available as part of the Bigtop 0.6.0 release is:

  • Apache Zookeeper 3.4.5
  • Apache Flume 1.3.1
  • Apache HBase 0.94.5
  • Apache Pig 0.11.1
  • Apache Hive 0.10.0
  • Apache Sqoop 2 (AKA 1.99.2)
  • Apache Oozie 3.3.2
  • Apache Whirr 0.8.2
  • Apache Mahout 0.7
  • Apache Solr (SolrCloud) 4.2.1
  • Apache Crunch (incubating) 0.5.0
  • Apache HCatalog 0.5.0
  • Apache Giraph 1.0.0
  • LinkedIn DataFu 0.0.6
  • Cloudera Hue 2.3.0

And we were just talking about YARN and applications weren’t we? 😉


(Participate if you can but at least send a note of appreciation to Cloudera.)

So, what’s brewing with HCatalog

Monday, February 18th, 2013

So, what’s brewing with HCatalog

From the post:

Apache HCatalog announced release of version 0.5.0 in the past week. Along with that, it has initiated steps to graduate from an incubator project to be an Apache Top Level project or sub-project. Let’s look at the current state of HCatalog, its increasing relevance and where it is heading.

HCatalog for a small introduction, is a “table management and storage management layer for Apache Hadoop” which:

  • enables Pig, MapReduce, and Hive users to easily share data on the grid.
  • provides a table abstraction for a relational view of data in HDFS
  • ensures format indifference (viz RCFile format, text files, sequence files)
  • provides a notification service when new data becomes available

Nice summary of the current state of HCatalog, pointing to a presentation by Alan Gates from Big Data Spain 2012.

Alan Gates CHUGs HCatalog in Windy City

Friday, September 28th, 2012

Alan Gates CHUGs HCatalog in Windy City (Chicago Hadoop User Group) by Kim Truong

From the post:

Alan Gates presented HCatalog to the Chicago Hadoop User Group (CHUG) on 9/17/12. There was a great turnout, and the strength of CHUG is evidence that Chicago is a Hadoop city. Below are some kind words from the host, Mark Slusar.

“On 9/17/12, the Chicago Hadoop User Group (CHUG) was delighted to host Hortonworks Co-Founder Alan Gates to give an overview of HCatalog. In addition to downtown Chicago meetups, Allstate Insurance Company in Northbrook, IL hosts regular Chicago Hadoop User Group Meetups. After noshing on refreshments provided by Hortonworks, attendees were treated to an in-depth overview of HCatalog, its history, as well as how and when to use it. Alan’s experience and expertise were an excellent contribution to CHUG. Alan made a great connection with every attendee. With his detailed lecture, he answered many questions, and also joined a handful of attendees for drinks after the meetup. CHUG would be thrilled to have Alan & Hortonworks team return in the future!” – Mark Slusar

What a great way to start the weekend!


HCatalog Meetup at Twitter

Thursday, September 20th, 2012

HCatalog Meetup at Twitter by Russell Jurney.

From the post:

Representatives from Twitter, Yahoo, LinkedIn, Hortonworks and IBM met at Twitter HQ on Thursday to talk HCatalog. Committers from HCatalog, Pig and Hive were on hand to discuss the state of HCatalog and its future.

Apache HCatalog is a table and storage management service for data created using Apache Hadoop.

See Russell’s post for more details.

Then brush up on HCatalog (if you aren’t already following it).

Data Integration Services & Hortonworks Data Platform

Thursday, June 28th, 2012

Data Integration Services & Hortonworks Data Platform by Jim Walker

From the post:

What’s possible with all this data?

Data Integration is a key component of the Hadoop solution architecture. It is the first obstacle encountered once your cluster is up and running. Ok, I have a cluster… now what? Do I write a script to move the data? What is the language? Isn’t this just ETL with HDFS as another target? Well, yes…

Sure you can write custom scripts to perform a load, but that is hardly repeatable and not viable in the long term. You could also use Apache Sqoop (available in HDP today), which is a tool to push bulk data from relational stores into HDFS. While effective and great for basic loads, there is work to be done on the connections and transforms necessary in these types of flows. While custom scripts and Sqoop are both viable alternatives, they won’t cover everything and you still need to be a bit technical to be successful.

For wide scale adoption of Apache Hadoop, tools that abstract integration complexity are necessary for the rest of us. Enter Talend Open Studio for Big Data. We have worked with Talend in order to deeply integrate their graphical data integration tools with HDP as well as extend their offering beyond HDFS, Hive, Pig and HBase into HCatalog (metadata service) and Oozie (workflow and job scheduler).

Jim covers four advantages of using Talend:

  • Bridge the skills gap
  • HCatalog Integration
  • Connect to the entire enterprise
  • Graphic Pig Script Creation

Definitely something to keep in mind.

Booting HCatalog on Elastic MapReduce [periodic discovery audits?]

Wednesday, June 27th, 2012

The Data Lifecycle, Part Three: Booting HCatalog on Elastic MapReduce by Russell Jurney.

From the post:

Series Introduction

This is part three of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re exploring the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in Hive, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

  • Series Part One: Avroizing the Enron Emails. In that post, we used Pig to extract, transform and load a MySQL database of the Enron emails to document format and serialize them in Avro. The Enron emails are available in Avro format here.
  • Series Part Two: Mining Avros with Pig, Consuming Data with Hive. In part two of the series, we extracted new and interesting properties from our data for consumption by analysts and users, using Pig, EC2 and Hive. Code examples for this post are available here:
  • Series Part Three: Booting HCatalog on Elastic MapReduce. Here we will use HCatalog to streamline the sharing of data between Pig and Hive, and to aid data discovery for consumers of processed data.

Russell continues walking the Enron Emails through a full data lifecycle in the Hadoop ecosystem.

Given the current use and foreseeable use of email, these are important lessons for more than one reason.

What about periodic discovery audits on enterprise email archives?

To see what others may find, or to identify poor wording/disclosure practices?

Apache HCatalog 0.4.0 Released

Saturday, May 19th, 2012

Apache HCatalog 0.4.0 Released by Alan Gates.

From the post:

In case you didn’t see the news, I wanted to share the announcement that HCatalog 0.4.0 is now available.

For those of you that are new to the project, HCatalog provides a metadata and table management system that simplifies data sharing between Apache Hadoop and other enterprise data systems. You can learn more about the project on the Apache project site.

From the HCatalog documentation (0.4.0):

HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored – RCFile format, text files, or sequence files.

HCatalog supports reading and writing files in any format for which a SerDe can be written. By default, HCatalog supports RCFile, CSV, JSON, and sequence file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
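
To make the "need not worry about where or in what format" idea concrete, here is a toy Python model. It is not HCatalog's API: the names ToyCatalog, register_table and read_table are invented for illustration, and the per-format reader functions only loosely stand in for the role an InputFormat and SerDe pair plays. The point of the sketch is that the caller asks for a table by name and gets the same rows back whichever format the data was stored in.

```python
import csv
import io
import json

# Toy model only: none of these names come from HCatalog.
# Each reader turns stored text into rows (dicts keyed by column name),
# which is roughly the job a SerDe does for a real storage format.

def read_csv(text, columns):
    """Parse comma-separated text into a list of row dicts."""
    return [dict(zip(columns, row)) for row in csv.reader(io.StringIO(text))]

def read_json_lines(text, columns):
    """Parse one JSON object per line into a list of row dicts."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            record = json.loads(line)
            # Values are stringified so both readers yield comparable rows.
            rows.append({c: str(record[c]) for c in columns})
    return rows

READERS = {"csv": read_csv, "json": read_json_lines}

class ToyCatalog:
    """Hypothetical in-memory catalog: table name -> columns, format, data."""

    def __init__(self):
        self._tables = {}

    def register_table(self, name, columns, fmt, data):
        if fmt not in READERS:
            raise ValueError(f"no reader registered for format {fmt!r}")
        self._tables[name] = {"columns": columns, "format": fmt, "data": data}

    def read_table(self, name):
        # The caller never names the format; the catalog looks it up.
        meta = self._tables[name]
        return READERS[meta["format"]](meta["data"], meta["columns"])

catalog = ToyCatalog()
catalog.register_table("users_csv", ["id", "name"], "csv", "1,ann\n2,bob\n")
json_data = "\n".join([
    '{"id": 1, "name": "ann"}',
    '{"id": 2, "name": "bob"}',
])
catalog.register_table("users_json", ["id", "name"], "json", json_data)

rows_csv = catalog.read_table("users_csv")
rows_json = catalog.read_table("users_json")
```

Supporting a new format in this sketch means adding one reader to READERS, which mirrors the documentation's remark that a custom format requires supplying the InputFormat, OutputFormat, and SerDe.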

Being curious about a reference to partitions having the capacity to be multidimensional, I set off looking for information on supported data types and found:

The table shows how Pig will interpret the HCatalog data type.

HCatalog Data Type | Pig Data Type
primitives (int, long, float, double, string) | int, long, float, double, string to chararray
map (key type should be string, valuetype must be string) | map
List<any type> | bag
struct<any type fields> | tuple
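
Read as a lookup, that mapping might be sketched in Python as follows. Two assumptions to flag: the Pig types used for map, List and struct (map, bag, tuple) are what I recall from the HCatalog 0.4.0 load/store documentation, so check them against the docs before relying on them, and pig_type_for is a hypothetical helper, not something HCatalog or Pig provides.

```python
# Hypothetical helper; not part of HCatalog or Pig.
# Primitive types keep their names, except string, which Pig reads as chararray.
PRIMITIVES = {
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "string": "chararray",
}

# Assumed complex-type mapping (recalled from the HCatalog 0.4.0 docs).
COMPLEX = {
    "map": "map",
    "list": "bag",
    "struct": "tuple",
}

def pig_type_for(hcat_type):
    """Return the Pig type name for an HCatalog type such as 'List<int>'."""
    # Drop any type parameters and normalize case: 'List<int>' -> 'list'.
    base = hcat_type.strip().lower().split("<", 1)[0]
    if base in PRIMITIVES:
        return PRIMITIVES[base]
    if base in COMPLEX:
        return COMPLEX[base]
    raise ValueError(f"no mapping known for HCatalog type {hcat_type!r}")
```

For example, pig_type_for("string") gives "chararray", while a type outside the table raises ValueError rather than guessing.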


The Hadoop ecosystem is evolving at a fast and furious pace!

HCatalog, tables and metadata for Hadoop

Monday, May 2nd, 2011

HCatalog, tables and metadata for Hadoop

HCatalog is described at its Apache site as:

Apache HCatalog is a table and storage management service for data created using Apache Hadoop.

This includes:

  • Providing a shared schema and data type mechanism.
  • Providing a table abstraction so that users need not be concerned with where or how their data is stored.
  • Providing interoperability across data processing tools such as Pig, Map Reduce, Streaming, and Hive.

From the post:

Last month the HCatalog project (formerly known as Howl) was accepted into the Apache Incubator. We have already branched for a 0.1 release, which we hope to push in the next few weeks. Given all this activity, I thought it would be a good time to write a post on the motivation behind HCatalog, what features it will provide, and who is working on it.