Archive for the ‘KIji Project’ Category

Enron, Email, Kiji, Hive, YARN, Tez (Jan. 7th, DC)

Monday, January 6th, 2014

Exploring Enron Email Dataset with Kiji and Hive; Apache YARN and Apache Tez Hadoop-DC.

Tuesday, January 7, 2014 6:00 PM to 9:30 PM
Neustar (Room: Neuview) 21575 Ridgetop Circle, Sterling, VA

From the webpage:

Exploring Enron Email Dataset with Kiji and Hive

Lee Sheng, WibiData

Apache Hive is a data warehousing system for large volumes of data stored in Hadoop that provides SQL based access for exploring datasets. KijiSchema provides evolvable schemas of primitive and compound types on top of HBase. The integration between these provides the best aspects of both worlds (ad hoc SQL based querying on top of datasets using evolvable schemas containing complex objects). This talk will present an examples of queries utilizing this integration to do exploratory analysis of the Enron email corpus. Delving into topics such as email responder pairs and sentiment analysis can expose many of the interesting points in the rise and fall of Enron.

Apache YARN & Apache Tez

Tom McCuch Technical Director, Hortonworks

Apache Hadoop has become synonymous with Big Data and powers large scale data processing across some of the biggest companies in the world. Hadoop 2 is the next generation release of Hadoop and marks a pivotal point in its maturity with YARN – the new Hadoop compute framework. YARN – Yet Another Resource Negotiator – is a complete re-architecture of the Hadoop compute stack with a clean separation between platform and application. This opens up Hadoop data processing to new applications that can be executed IN Hadoop instead of outside Hadoop, thus improving efficiency, performance, data sharing and lowering operation costs. The Big Data ecosystem is already converging on YARN with new applications like Apache Tez being written specifically for YARN. Apache Tez aims to provide high performance and efficiency out of the box, across the spectrum of low latency queries and heavy-weight batch processing. The talk will provide a brief overview of key Hadoop 2 innovations, focusing in on YARN and Tez – covering architecture, motivational use cases and future roadmap. Finally, the impact of YARN on the Hadoop community will be demonstrated through running interactive queries with both Hive on Tez and with Hive on MapReduce, and comparing their performance side-by-side on the same Hadoop 2 cluster.

When I saw the low tomorrow in DC is going to be 16F and the high 21F, I thought I should pass this along.

Does anyone have a very large set of phone metadata that is public?

Thinking rather than grinding over Enron’s stumbles, again, phone metadata could be hands-on training for a variety of careers. 😉

Looking forward to seeing videos of these presentations!


Tuesday, June 11th, 2013

What’s Next for HBase? Big Data Applications Using Frameworks Like Kiji by Michael Stack.

From the post:

Apache Hadoop and HBase have quickly become industry standards for storage and analysis of Big Data in the enterprise, yet as adoption spreads, new challenges and opportunities have emerged. Today, there is a large gap — a chasm, a gorge — between the nice application model your Big Data Application builder designed and the raw, byte-based APIs provided by HBase and Hadoop. Many Big Data players have invested a lot of time and energy in bridging this gap. Cloudera, where I work, is developing the Cloudera Development Kit (CDK). Kiji, an open source framework for building Big Data Applications, is another such thriving option. A lot of thought has gone into its design. More importantly, long experience building Big Data Applications on top of Hadoop and HBase has been baked into how it all works.

Kiji provides a model and set of libraries that you to get up and running quickly.

Kiji provides a model and a set of libraries that allow developers to get up and running quickly. Intuitive Java APIs and Kiji’s rich data model allow developers to build business logic and machine learning algorithms without having to worry about bytes, serialization, schema evolution, and lower-level aspects of the system. The Kiji framework is modularized into separate components to support a wide range of usage and encourage clean separation of functionality. Kiji’s main components include KijiSchema, KijiMR, KijiHive, KijiExpress, KijiREST, and KijiScoring. KijiSchema, for example, helps team members collaborate on long-lived Big Data management projects, and does away with common incompatibility issues, and helps developers build more integrated systems across the board. All of these components are available in a single download called a BentoBox.

When mainstream news only has political scandals, wars and rumors of wars, tech news can brighten your day!

Be sure to visit the Kiji Project website.

Turn-key tutorials to get you started.

Kiji Project [Framework for HBase]

Wednesday, November 14th, 2012

Kiji Project: An Open Source Framework for Building Big Data Applications with Apache HBase by Aaron Kimball.

From the post:

Our team at WibiData has been developing applications on Hadoop since 2010 and we’ve helped many organizations transform how they use data by deploying Hadoop. HBase in particular has allowed companies of all types to drive their business using scalable, high performance storage. Organizations have started to leverage these capabilities for various big data applications, including targeted content, personalized recommendations, enhanced customer experience and social network analysis.

While building many of these applications, we have seen emerging tools, design patterns and best practices repeated across projects. One of the clear lessons learned is that Hadoop and HBase provide very low-level interfaces. Each large-scale application we have built on top of Hadoop has required a great deal of scaffolding and data management code. This repetitive programming is tedious, error-prone, and makes application interoperability more challenging in the long run.

Today, we are proud to announce the launch of the Kiji project (, as well as the first Kiji component: KijiSchema. The Kiji project was developed to host a suite of open source components built on top of Apache HBase and Apache Hadoop, that makes it easier for developers to:

  1. Use HBase as a real-time data storage and serving layer for applications
  2. Maximize HBase performance using data management best practices
  3. Get started building data applications quickly with easy startup and configuration

Kiji is open source and licensed under the Apache 2.0 license. The Kiji project is modularized into separate components to simplify adoption and encourage clean separation of functionality. Our approach emphasizes interoperability with other systems, leveraging the open source HBase, Avro and MapReduce projects, enabling you to easily fit Kiji into your development process and applications.

KijiSchema: Schema Management for HBase

The first component within the Kiji project is KijiSchema, which provides layout and schema management on top of HBase. KijiSchema gives developers the ability to easily store both structured and unstructured data within HBase using Avro serialization. It supports a variety of rich schema features, including complex, compound data types, HBase column key and time-series indexing, as well cell-level evolving schemas that dynamically encode version information.

KijiSchema promotes the use of entity-centric data modeling, where all information about a given entity (user, mobile device, ad, product, etc.), including dimensional and transaction data, is encoded within the same row. This approach is particularly valuable for user-based analytics such as targeting, recommendations, and personalization.

This looks important!

Reading further about their “entiity-centric” approach:

Entity-Centric Data Model

KijiSchema’s data model is entity-centric. Each row typically holds information about a single entity in your information scheme. As an example, a consumer e-commerce web site may have a row representing each user of their site. The entity-centric data model enables easier analysis of individual entities. For example, to recommend products to a user, information such as the user’s past purchases, previously viewed items, search queries, etc. all need to be brought together. The entity-centric model stores all of these attributes of the user in the same row, allowing for efficient access to relevant information.

The entity-centric data model stands in comparison to a more typical log-based approach to data collection. Many MapReduce systems import log files for analysis. Logs are action-centric; each action performed by a user (adding an item to a shopping cart, checking out, performing a search, viewing a product) generates a new log entry. Collecting all the data required for a per-user analysis thus requires a scan of many logs. The entity-centric model is a “pivoted” form of this same information. By pivoting the information as the data is loaded into KijiSchema, later analysis can be run more efficiently, either in a MapReduce job operating over all users, or in a more narrowly-targeted fashion if individual rows require further computation.

I’m already convinced about a single representative for an entity. 😉

Need to work through the documentation on capturing diverse information about a single entity in one row.

I suspect that the structures that capture data aren’t entities for purposes of this model.

Still, will be an interesting exploration.