The Data Lifecycle, Part One: Avroizing the Enron Emails by Russell Jurney.
From the post:
Series Introduction
This is part one of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.
The Berkeley Enron Emails
In this project we will convert a MySQL database of Enron emails into Avro document format for analysis on Hadoop with Pig. Complete code for this example is available here on GitHub.
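For readers playing along, the first step of such a conversion can be sketched as follows. This is a minimal illustration, not the code from the original post. An Avro schema is a JSON document, so the sketch defines a hypothetical schema for one email and reshapes a database row into a matching record. The field names, the `example.enron` namespace, and the `row_to_record` helper are all assumptions made for illustration. Writing the records to an Avro file would be done with an Avro library and is not shown.

```python
import json

# Hypothetical Avro schema for a single email. The field names are
# illustrative assumptions, not the schema used in the original post.
EMAIL_SCHEMA = {
    "type": "record",
    "name": "Email",
    "namespace": "example.enron",
    "fields": [
        {"name": "message_id", "type": "string"},
        # A union with "null" makes a field optional in Avro.
        {"name": "date", "type": ["null", "string"], "default": None},
        {"name": "sender", "type": "string"},
        {"name": "recipients", "type": {"type": "array", "items": "string"}},
        {"name": "subject", "type": ["null", "string"], "default": None},
        {"name": "body", "type": ["null", "string"], "default": None},
    ],
}


def row_to_record(row, recipients):
    """Shape one message row (a dict) and its recipient addresses
    into a dict whose keys match EMAIL_SCHEMA."""
    return {
        "message_id": row["message_id"],
        "date": row.get("date"),
        "sender": row["sender"],
        "recipients": list(recipients),
        "subject": row.get("subject"),
        "body": row.get("body"),
    }


def schema_json():
    """Serialize the schema as the JSON text an Avro library would parse."""
    return json.dumps(EMAIL_SCHEMA, indent=2)


# Made-up sample row standing in for a result from the MySQL archive.
sample_row = {
    "message_id": "<1@example.com>",
    "sender": "alice@example.com",
    "subject": "Meeting",
}
record = row_to_record(sample_row, ["bob@example.com"])
```

Denormalizing each message and its recipients into one self-contained record is the design choice that makes the data convenient to process as documents; how the original project models recipients may differ.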
Email is a rich source of information for analysis by many means. During the investigation of the Enron scandal of 2001, 517,431 messages from 114 inboxes of key Enron executives were collected. These emails were published and have become a common dataset for academics to analyze document collections and social networks. Andrew Fiore and Jeff Heer at UC Berkeley have cleaned this email set and provided it as a MySQL archive.
We hope that this dataset can become a sort of common set for examples and questions, as anonymizing one’s own data in public forums can make asking questions and getting quality answers tricky and time consuming.
More information about the Enron Emails is available:
Covering the data lifecycle in any detail is a rare event.
To do so with a meaningful data set is even rarer.
You will get the maximum benefit from this series by “playing along” and posting your comments and observations.