Exploring Enron Email Dataset with Kiji and Hive; Apache YARN and Apache Tez Hadoop-DC.
Tuesday, January 7, 2014 6:00 PM to 9:30 PM
Neustar (Room: Neuview) 21575 Ridgetop Circle, Sterling, VA
From the webpage:
Exploring Enron Email Dataset with Kiji and Hive
Lee Sheng, WibiData
Apache Hive is a data warehousing system for large volumes of data stored in Hadoop that provides SQL-based access for exploring datasets. KijiSchema provides evolvable schemas of primitive and compound types on top of HBase. The integration between these provides the best aspects of both worlds (ad hoc SQL-based querying on top of datasets using evolvable schemas containing complex objects). This talk will present examples of queries utilizing this integration to do exploratory analysis of the Enron email corpus. Delving into topics such as email responder pairs and sentiment analysis can expose many of the interesting points in the rise and fall of Enron.
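An aside that is not part of the talk abstract: to make "email responder pairs" concrete, here is a minimal Python sketch of one way such pairs might be counted. The message records and the field names (`id`, `sender`, `in_reply_to`) are made up for illustration; they are not the Enron corpus schema, and the talk itself presumably does this with Hive queries over Kiji tables rather than plain Python.

```python
from collections import Counter

# Toy messages. The field names are illustrative assumptions,
# not the layout of the actual Enron corpus.
messages = [
    {"id": "m1", "sender": "alice", "in_reply_to": None},
    {"id": "m2", "sender": "bob", "in_reply_to": "m1"},
    {"id": "m3", "sender": "alice", "in_reply_to": "m2"},
    {"id": "m4", "sender": "bob", "in_reply_to": "m1"},
    {"id": "m5", "sender": "carol", "in_reply_to": "m9"},  # parent not in the sample
]


def responder_pairs(msgs):
    """Count (responder, original_sender) pairs for replies whose parent is known."""
    sender_by_id = {m["id"]: m["sender"] for m in msgs}
    pairs = Counter()
    for m in msgs:
        parent = m["in_reply_to"]
        # Skip messages that are not replies or whose parent is missing.
        if parent in sender_by_id:
            pairs[(m["sender"], sender_by_id[parent])] += 1
    return pairs


pairs = responder_pairs(messages)
for (responder, original), count in pairs.most_common():
    print(f"{responder} -> {original}: {count}")
```

In SQL terms this is roughly a self-join of the message table on the reply reference followed by a GROUP BY on the two senders, which is the kind of ad hoc query the abstract has in mind.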
Apache YARN & Apache Tez
Tom McCuch, Technical Director, Hortonworks
Apache Hadoop has become synonymous with Big Data and powers large scale data processing across some of the biggest companies in the world. Hadoop 2 is the next generation release of Hadoop and marks a pivotal point in its maturity with YARN – the new Hadoop compute framework. YARN – Yet Another Resource Negotiator – is a complete re-architecture of the Hadoop compute stack with a clean separation between platform and application. This opens up Hadoop data processing to new applications that can be executed IN Hadoop instead of outside Hadoop, thus improving efficiency, performance, data sharing and lowering operation costs. The Big Data ecosystem is already converging on YARN with new applications like Apache Tez being written specifically for YARN. Apache Tez aims to provide high performance and efficiency out of the box, across the spectrum of low latency queries and heavy-weight batch processing. The talk will provide a brief overview of key Hadoop 2 innovations, focusing in on YARN and Tez – covering architecture, motivational use cases and future roadmap. Finally, the impact of YARN on the Hadoop community will be demonstrated through running interactive queries with both Hive on Tez and with Hive on MapReduce, and comparing their performance side-by-side on the same Hadoop 2 cluster.
When I saw the low tomorrow in DC is going to be 16F and the high 21F, I thought I should pass this along.
Does anyone have a very large set of phone metadata that is public?
I'm thinking that, rather than grinding over Enron's stumbles yet again, phone metadata could offer hands-on training for a variety of careers. 😉
Looking forward to seeing videos of these presentations!