HCatalog « Another Word For It

Just in time for Hadoop Summit 2013, the Apache Bigtop team is very pleased to announce the release of Bigtop 0.6.0: The very first release of a fully integrated Big Data management distribution built on the currently most advanced Hadoop 2.x, Hadoop 2.0.5-alpha.

Bigtop, as many of you might already know, is a project aimed at creating a 100% open source and community-driven Big Data management distribution based on Apache Hadoop. (You can learn more about it by reading one of our previous blog posts on Apache Blogs.) Bigtop also plays an important role in CDH, which takes its packaging code from Bigtop. Cloudera takes pride in developing open source packaging code and contributing the same back to the community.

The very astute readers of this blog will notice that given our quarterly release schedule, Bigtop 0.6.0 should have been called Bigtop 0.7.0. It is true that we skipped a quarter. Our excuse is that we spent all this extra time helping the Hadoop community stabilize the Hadoop 2.x code line and making it a robust kernel for all the applications that are now part of the Bigtop distribution.

And speaking of applications, we haven’t forgotten to grow the Bigtop family: Bigtop 0.6.0 adds Apache HCatalog and Apache Giraph to the mix. The full list of Hadoop applications available as part of the Bigtop 0.6.0 release is:

Apache Zookeeper 3.4.5

Apache Flume 1.3.1

Apache HBase 0.94.5

Apache Pig 0.11.1

Apache Hive 0.10.0

Apache Sqoop 2 (AKA 1.99.2)

Apache Oozie 3.3.2

Apache Whirr 0.8.2

Apache Mahout 0.7

Apache Solr (SolrCloud) 4.2.1

Apache Crunch (incubating) 0.5.0

Apache HCatalog 0.5.0

Apache Giraph 1.0.0

LinkedIn DataFu 0.0.6

Cloudera Hue 2.3.0

And we were just talking about YARN and applications, weren’t we?

Enjoy!

(Participate if you can but at least send a note of appreciation to Cloudera.)


February 18, 2013

So, what’s brewing with HCatalog

Filed under: Hadoop,HCatalog — Patrick Durusau @ 11:36 am


From the post:

Apache HCatalog announced release of version 0.5.0 in the past week. Along with that, it has initiated steps to graduate from an incubator project to be an Apache Top Level project or sub-project. Let’s look at the current state of HCatalog, its increasing relevance and where it is heading.

HCatalog, by way of a short introduction, is a “table management and storage management layer for Apache Hadoop” which:

enables Pig, MapReduce, and Hive users to easily share data on the grid.

provides a table abstraction for a relational view of data in HDFS

ensures format indifference (viz RCFile format, text files, sequence files)

provides a notification service when new data becomes available

Nice summary of the current state of HCatalog, pointing to a presentation by Alan Gates from Big Data Spain 2012.


September 28, 2012

Alan Gates CHUGs HCatalog in Windy City

Filed under: Hadoop,HCatalog — Patrick Durusau @ 2:19 pm

Alan Gates CHUGs HCatalog in Windy City (Chicago Hadoop User Group) by Kim Truong

From the post:

Alan Gates presented HCatalog to the Chicago Hadoop User Group (CHUG) on 9/17/12. There was a great turnout, and the strength of CHUG is evidence that Chicago is a Hadoop city. Below are some kind words from the host, Mark Slusar.

“On 9/17/12, the Chicago Hadoop User Group (CHUG) was delighted to host Hortonworks Co-Founder Alan Gates to give an overview of HCatalog. In addition to downtown Chicago meetups, Allstate Insurance Company in Northbrook, IL hosts regular Chicago Hadoop User Group Meetups. After noshing on refreshments provided by Hortonworks, attendees were treated to an in-depth overview of HCatalog, its history, as well as how and when to use it. Alan’s experience and expertise were an excellent contribution to CHUG. Alan made a great connection with every attendee. With his detailed lecture, he answered many questions, and also joined a handful of attendees for drinks after the meetup. CHUG would be thrilled to have Alan & Hortonworks team return in the future!” – Mark Slusar

What a great way to start the weekend!

Enjoy!


September 27, 2012

Hadoop and Metadata (Removing the Impedance Mis-match)

Filed under: Hadoop,HCatalog,Metadata — Patrick Durusau @ 7:11 pm

Hadoop and Metadata (Removing the Impedance Mis-match) by Alan Gates, Russell Jurney.

From the post:

Apache Hadoop enables a revolution in how organizations process data, with the freedom and scale Hadoop provides enabling new kinds of applications building new kinds of value and delivering results from big data on shorter timelines than ever before. The shift towards a Hadoop-centric mode of data processing in the enterprise has however posed a challenge: how do we collaborate in the context of the freedom that Hadoop provides us? How do we share data which can be stored and processed in any format the user desires? Furthermore, how do we integrate between different tools and with other systems that make up the data center as computer?

As a Hadoop user, the need for a metadata directory is clear. Users don’t want to ‘reinvent the wheel’ and repeat the work of others. They want to share results and intermediate data-sets and collaborate with colleagues. Given the needs of users, the case for a generic metadata mechanism on top of Hadoop is easy to make: increased visibility into data assets by registering them with a metadata registry for discovery and sharing enables increased efficiency. Less work for the user.

Users also want to be able to use different tool-sets and systems together – Hadoop and non-Hadoop alike. As a Hadoop user, there is a clear need for interoperability among the diverse tools on today’s Hadoop cluster: Hive, Pig, Cascading, Java MapReduce and streaming Python, C/C++, Perl, and Ruby with data stored in formats from CSV, TSV, Thrift, Protobuf, Avro, SequenceFiles, Hive’s RCFile as well as proprietary formats.

Finally, raw data does not usually originate on the Hadoop Distributed Filesystem. There is a clear need for a central point to register resources from different kinds of systems for ETL onto HDFS, and to publish results of analyses on Hadoop onto other systems.

Sounds topic mappish doesn’t it?

Marketable HCatalog data products anyone?

I first saw this at Hortonworks.


September 20, 2012

HCatalog Meetup at Twitter

Filed under: Hadoop,HCatalog,Pig — Patrick Durusau @ 7:22 pm

HCatalog Meetup at Twitter by Russell Jurney.

From the post:

Representatives from Twitter, Yahoo, LinkedIn, Hortonworks and IBM met at Twitter HQ on Thursday to talk HCatalog. Committers from HCatalog, Pig and Hive were on hand to discuss the state of HCatalog and its future.

Apache HCatalog is a table and storage management service for data created using Apache Hadoop.

See Russell’s post for more details.

Then brush up on HCatalog (if you aren’t already following it).


June 28, 2012

Data Integration Services & Hortonworks Data Platform

Filed under: Data Integration,HCatalog,Hortonworks,Pig,Talend — Patrick Durusau @ 6:30 pm

Data Integration Services & Hortonworks Data Platform by Jim Walker

From the post:

What’s possible with all this data?

Data Integration is a key component of the Hadoop solution architecture. It is the first obstacle encountered once your cluster is up and running. Ok, I have a cluster… now what? Do I write a script to move the data? What is the language? Isn’t this just ETL with HDFS as another target? Well, yes…

Sure you can write custom scripts to perform a load, but that is hardly repeatable and not viable in the long term. You could also use Apache Sqoop (available in HDP today), which is a tool to push bulk data from relational stores into HDFS. While effective and great for basic loads, there is work to be done on the connections and transforms necessary in these types of flows. While custom scripts and Sqoop are both viable alternatives, they won’t cover everything and you still need to be a bit technical to be successful.

For wide scale adoption of Apache Hadoop, tools that abstract integration complexity are necessary for the rest of us. Enter Talend Open Studio for Big Data. We have worked with Talend in order to deeply integrate their graphical data integration tools with HDP as well as extend their offering beyond HDFS, Hive, Pig and HBase into HCatalog (metadata service) and Oozie (workflow and job scheduler).

Jim covers four advantages of using Talend:

Bridge the skills gap
HCatalog Integration
Connect to the entire enterprise
Graphic Pig Script Creation

Definitely something to keep in mind.


June 27, 2012

Booting HCatalog on Elastic MapReduce [periodic discovery audits?]

Filed under: Amazon Web Services AWS,HCatalog,Hive,Pig — Patrick Durusau @ 8:06 am

The Data Lifecycle, Part Three: Booting HCatalog on Elastic MapReduce by Russell Jurney.

From the post:

Series Introduction

This is part three of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re exploring the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in Hive, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

Series Part One: Avroizing the Enron Emails. In that post, we used Pig to extract, transform and load a MySQL database of the Enron emails to document format and serialize them in Avro. The Enron emails are available in Avro format here.

Series Part Two: Mining Avros with Pig, Consuming Data with Hive. In part two of the series, we extracted new and interesting properties from our data for consumption by analysts and users, using Pig, EC2 and Hive. Code examples for this post are available here: https://github.com/rjurney/enron-hcatalog.

Series Part Three: Booting HCatalog on Elastic MapReduce. Here we will use HCatalog to streamline the sharing of data between Pig and Hive, and to aid data discovery for consumers of processed data.

Russell continues walking the Enron Emails through a full data lifecycle in the Hadoop ecosystem.

Given the current use and foreseeable use of email, these are important lessons for more than one reason.

What about periodic discovery audits on enterprise email archives?

To see what others may find, or to identify poor wording/disclosure practices?


May 19, 2012

Apache HCatalog 0.4.0 Released

Filed under: Hadoop,HCatalog — Patrick Durusau @ 5:02 pm

Apache HCatalog 0.4.0 Released by Alan Gates.

From the post:

In case you didn’t see the news, I wanted to share the announcement that HCatalog 0.4.0 is now available.

For those of you that are new to the project, HCatalog provides a metadata and table management system that simplifies data sharing between Apache Hadoop and other enterprise data systems. You can learn more about the project on the Apache project site.

From the HCatalog documentation (0.4.0):

HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored – RCFile format, text files, or sequence files.

HCatalog supports reading and writing files in any format for which a SerDe can be written. By default, HCatalog supports RCFile, CSV, JSON, and sequence file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
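To make the "need not worry about format" idea concrete, here is a toy sketch in Python. It is not HCatalog code and uses no HCatalog API: `ToyCatalog`, `TableEntry`, and the `DESERIALIZERS` registry are hypothetical names invented for illustration, with plain functions standing in for the role a SerDe plays. The catalog records each table's columns and storage format, and readers get the same rows back whichever format the data was stored in.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


def read_csv(text: str, columns: List[str]) -> Iterator[dict]:
    """Hypothetical CSV deserializer: one record per line, columns by position."""
    for row in csv.reader(io.StringIO(text)):
        yield dict(zip(columns, row))


def read_json(text: str, columns: List[str]) -> Iterator[dict]:
    """Hypothetical JSON deserializer: one JSON object per line."""
    for line in text.splitlines():
        if line.strip():
            record = json.loads(line)
            yield {name: record.get(name) for name in columns}


# Assumption: a format name maps to one deserializer function.
DESERIALIZERS: Dict[str, Callable[[str, List[str]], Iterator[dict]]] = {
    "csv": read_csv,
    "json": read_json,
}


@dataclass
class TableEntry:
    columns: List[str]
    storage_format: str
    data: str  # stands in for a file location in HDFS


class ToyCatalog:
    """Maps table names to metadata so callers never pick a parser themselves."""

    def __init__(self) -> None:
        self._tables: Dict[str, TableEntry] = {}

    def register(self, name: str, entry: TableEntry) -> None:
        if entry.storage_format not in DESERIALIZERS:
            raise ValueError(f"no deserializer for format {entry.storage_format!r}")
        self._tables[name] = entry

    def read(self, name: str) -> List[dict]:
        entry = self._tables[name]
        reader = DESERIALIZERS[entry.storage_format]
        return list(reader(entry.data, entry.columns))


catalog = ToyCatalog()
catalog.register(
    "emails_csv",
    TableEntry(["sender", "subject"], "csv", "alice,hello\nbob,report\n"),
)
catalog.register(
    "emails_json",
    TableEntry(
        ["sender", "subject"],
        "json",
        '{"sender": "alice", "subject": "hello"}\n'
        '{"sender": "bob", "subject": "report"}\n',
    ),
)

# Both tables yield the same rows even though they are stored differently.
same_rows = catalog.read("emails_csv") == catalog.read("emails_json")
```

Adding a new format in this sketch means adding one entry to `DESERIALIZERS`, which is loosely analogous to supplying a SerDe for a custom format.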

Being curious about a reference to partitions having the capacity to be multidimensional, I set off looking for information on supported data types and found:

The table shows how Pig will interpret the HCatalog data type.

HCatalog Data Type	Pig Data Type
primitives (int, long, float, double, string)	int, long, float, double, string to chararray
map (key type should be string, valuetype must be string)	map
List<any type>	bag
struct<any type fields>	tuple
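The mapping in that table is small enough to write down as a lookup. The Python sketch below is only an illustration of the table as quoted from the 0.4.0 documentation; `pig_type_for` is a hypothetical helper of my own, not part of HCatalog or Pig, and it assumes type names are written in the `map<...>`, `List<...>`, `struct<...>` style the table uses.

```python
# Primitive HCatalog types and the Pig type each is read as (from the table).
PRIMITIVES = {
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "string": "chararray",
}

# Complex HCatalog types, matched by prefix, and their Pig counterparts.
COMPLEX_PREFIXES = {
    "map<": "map",
    "list<": "bag",
    "struct<": "tuple",
}


def pig_type_for(hcat_type: str) -> str:
    """Return the Pig type for an HCatalog type name, per the table above.

    Hypothetical helper: raises ValueError for anything the table does not list.
    """
    name = hcat_type.strip().lower()
    if name in PRIMITIVES:
        return PRIMITIVES[name]
    for prefix, pig_type in COMPLEX_PREFIXES.items():
        if name.startswith(prefix):
            return pig_type
    raise ValueError(f"type not covered by the table: {hcat_type!r}")


print(pig_type_for("string"))
print(pig_type_for("List<int>"))
print(pig_type_for("struct<name:string, age:int>"))
```

The only renaming among the primitives is string to chararray; the complex types change name entirely (list to bag, struct to tuple), which is the part worth remembering when moving between Hive and Pig.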

The Hadoop ecosystem is evolving at a fast and furious pace!


May 2, 2011

HCatalog, tables and metadata for Hadoop

Filed under: Hadoop,HCatalog — Patrick Durusau @ 10:33 am


HCatalog is described at its Apache site as:

Apache HCatalog is a table and storage management service for data created using Apache Hadoop.

This includes:

Providing a shared schema and data type mechanism.

Providing a table abstraction so that users need not be concerned with where or how their data is stored.

Providing interoperability across data processing tools such as Pig, Map Reduce, Streaming, and Hive.

From the post:

Last month the HCatalog project (formerly known as Howl) was accepted into the Apache Incubator. We have already branched for a 0.1 release, which we hope to push in the next few weeks. Given all this activity, I thought it would be a good time to write a post on the motivation behind HCatalog, what features it will provide, and who is working on it.

