Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 5, 2012

The Data Lifecycle, Part Two: Mining Avros with Pig, Consuming Data with HIVE

Filed under: Avro,Hive,Pig — Patrick Durusau @ 7:58 pm

The Data Lifecycle, Part Two: Mining Avros with Pig, Consuming Data with HIVE by Russell Jurney.

From the post:

Series Introduction

This is part two of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

Part one of this series is available here.

Code examples for this post are available here: https://github.com/rjurney/enron-hive.

In the last post, we used Pig to Extract-Transform-Load a MySQL database of the Enron emails to document format and serialize them in Avro. Now that we’ve done this, we’re ready to get to the business of data science: extracting new and interesting properties from our data for consumption by analysts and users. We’re also going to use Amazon EC2, as HIVE local mode requires Hadoop local mode, which can be tricky to get working.

This post continues the high standard set in part one, walking through an entire data lifecycle in the Hadoop ecosystem.
