Archive for the ‘Avro’ Category

Integrating Kafka and Spark Streaming: Code Examples and State of the Game

Wednesday, October 1st, 2014

Integrating Kafka and Spark Streaming: Code Examples and State of the Game by Michael G. Noll.

From the post:

Spark Streaming has been getting some attention lately as a real-time data processing tool, often mentioned alongside Apache Storm. If you ask me, no real-time data processing tool is complete without Kafka integration (smile), hence I added an example Spark Streaming application to kafka-storm-starter that demonstrates how to read from Kafka and write to Kafka, using Avro as the data format and Twitter Bijection for handling the data serialization.

In this post I will explain this Spark Streaming example in further detail and also shed some light on the current state of Kafka integration in Spark Streaming. All this with the disclaimer that this happens to be my first experiment with Spark Streaming.

If mid-week is when you like to brush up on emerging technologies, Michael’s post is a good place to start.

The post is well organized and has enough notes, asides and references to enable you to duplicate the example and to expand your understanding of Kafka and Spark Streaming.

Kafka-Storm-Starter

Friday, May 23rd, 2014

Kafka-Storm-Starter by Michael G. Noll.

From the webpage:

Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+, while using Apache Avro as the data serialization format.

If you aren’t excited already (from their respective homepages):

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

Apache Storm is a free and open source distributed realtime computation system.

Apache Avro™ is a data serialization system.

Now are you excited?

Good!

Note the superior organization of the project documentation!

Following the table of contents you find:

Quick Start

Show me!

$ ./sbt test

Short of starting the project up remotely and entering the data for you, I can’t imagine an easier way to begin project documentation.

It’s a long weekend in the United States so check out Michael G. Noll’s GitHub repository for other interesting projects.

CDH3 update 5 is now available

Monday, August 13th, 2012

CDH3 update 5 is now available by Arvind Prabhakar.

From the post:

We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of the CDH3 platform and provides a considerable number of bug fixes and stability enhancements. Alongside these fixes, we have also included a few new features, the most notable of which are the following:

  • Flume 1.2.0 – Provides a durable file channel and many more features over the previous release.
  • Hive AvroSerDe – Replaces the Haivvreo SerDe and provides robust support for the Avro data format.
  • WebHDFS – A full read/write REST API to HDFS.
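Because WebHDFS exposes HDFS as plain HTTP, no Hadoop client library is required to talk to it. As a rough sketch (the hostname, port, path, and user below are made-up placeholder values, not from the release notes), here is how WebHDFS v1 request URLs are put together:

```python
# Sketch: building WebHDFS v1 request URLs (RESTful read/write access to HDFS).
# Host, port, paths, and user below are hypothetical placeholder values.
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS v1 URL for the given HDFS path and operation."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# Read a file (issue an HTTP GET against this URL):
print(webhdfs_url("namenode.example.com", 50070, "/logs/events.avro", "OPEN"))
# Create a file as a given user (issue an HTTP PUT):
print(webhdfs_url("namenode.example.com", 50070, "/logs/new.avro", "CREATE",
                  **{"user.name": "hdfs"}))
```

The `op` query parameter selects the operation; additional parameters such as `user.name` ride along in the query string.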

A maintenance release. Installing it is good practice before the next major release arrives.

The Data Lifecycle, Part Two: Mining Avros with Pig, Consuming Data with HIVE

Tuesday, June 5th, 2012

The Data Lifecycle, Part Two: Mining Avros with Pig, Consuming Data with HIVE by Russell Jurney.

From the post:

Series Introduction

This is part two of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

Part one of this series is available here.

Code examples for this post are available here: https://github.com/rjurney/enron-hive.

In the last post, we used Pig to Extract-Transform-Load a MySQL database of the Enron emails to document format and serialize them in Avro. Now that we’ve done this, we’re ready to get to the business of data science: extracting new and interesting properties from our data for consumption by analysts and users. We’re also going to use Amazon EC2, as HIVE local mode requires Hadoop local mode, which can be tricky to get working.
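The core of that ETL step, reshaping flat relational rows into self-contained documents that Avro can then serialize, can be sketched in plain Python. The field names below are illustrative only, not the actual Enron schema from the post:

```python
# Sketch: grouping flat relational rows into nested "document" records,
# the shape that Avro then serializes. Field names are illustrative only.
from collections import defaultdict

def rows_to_documents(message_rows, recipient_rows):
    """Join per-message rows with their recipient rows into nested docs."""
    recipients = defaultdict(list)
    for r in recipient_rows:
        recipients[r["message_id"]].append(r["address"])
    return [
        {
            "message_id": m["message_id"],
            "from": m["sender"],
            "subject": m["subject"],
            "to": recipients[m["message_id"]],  # nested list replaces a join table
        }
        for m in message_rows
    ]

docs = rows_to_documents(
    [{"message_id": 1, "sender": "a@enron.com", "subject": "Q3"}],
    [{"message_id": 1, "address": "b@enron.com"},
     {"message_id": 1, "address": "c@enron.com"}],
)
print(docs[0]["to"])  # ['b@enron.com', 'c@enron.com']
```

Once each email is a single nested record, downstream tools read one document at a time instead of re-joining tables on every query.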

Continues the high standard set in part one for walking through an entire data lifecycle in the Hadoop ecosystem.

The Data Lifecycle, Part One: Avroizing the Enron Emails

Friday, May 25th, 2012

The Data Lifecycle, Part One: Avroizing the Enron Emails by Russell Jurney.

From the post:

Series Introduction

This is part one of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

The Berkeley Enron Emails

In this project we will convert a MySQL database of Enron emails into Avro document format for analysis on Hadoop with Pig. Complete code for this example is available here on GitHub.

Email is a rich source of information for analysis by many means. During the investigation of the Enron scandal of 2001, 517,431 messages from 114 inboxes of key Enron executives were collected. These emails were published and have become a common dataset for academics to analyze document collections and social networks. Andrew Fiore and Jeff Heer at UC Berkeley have cleaned this email set and provided it as a MySQL archive.

We hope that this dataset can become a sort of common set for examples and questions, as anonymizing one’s own data in public forums can make asking questions and getting quality answers tricky and time consuming.

More information about the Enron Emails is available in the original post.

Covering the data lifecycle in any detail is a rare event.

To do so with a meaningful data set is even rarer.

You will get the maximum benefit from this series by “playing along” and posting your comments and observations.

Apache Avro at RichRelevance

Friday, December 23rd, 2011

Apache Avro at RichRelevance

From the post:

In early 2010 at RichRelevance, we were searching for a new way to store our long-lived data that was compact, efficient, and maintainable over time. We had been using Hadoop for about a year, and started with the basics – text formats and SequenceFiles. Neither of these was sufficient. Text formats are not compact enough, and can be painful to maintain over time. A basic binary format may be more compact, but it has the same maintenance issues as text. Furthermore, we needed rich data types including lists and nested records.

After analysis similar to Doug Cutting’s blog post, we chose Apache Avro. As a result we were able to eliminate manual version management, reduce joins during data processing, and adopt a new vision for what data belongs in our event logs. On Cyber Monday 2011, we logged 343 million page view events, and nearly 100 million other events into Avro data files.
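The “rich data types including lists and nested records” the post mentions map directly onto Avro’s schema language. A hypothetical schema for a page view event (the field names here are my own invention, not RichRelevance’s actual log format) might look like:

```json
{
  "type": "record",
  "name": "PageViewEvent",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "user", "type": {
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "id", "type": "string"},
        {"name": "segments", "type": {"type": "array", "items": "string"}}
      ]
    }},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}
```

Nested records, arrays, and nullable unions are all first-class, and the schema travels with the data files, which is what makes the “eliminate manual version management” claim possible.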

I think you are going to like this post and Avro as well!

IBM InfoSphere BigInsights

Friday, June 3rd, 2011

IBM InfoSphere BigInsights

Two items stand out in the usual laundry list of “easy administration” and “IBM supports open source” claims:

The Jaql query language. Jaql, a Query Language for JavaScript Object Notation (JSON), provides the capability to process both structured and non-traditional data. Its SQL-like interface is well suited for quick ramp-up by developers familiar with the SQL language and makes it easier to integrate with relational databases.

….

Integrated installation. BigInsights includes IBM value-added technologies, as well as open source components, such as Hadoop, Lucene, Hive, Pig, ZooKeeper, HBase, and Avro, to name a few.

I guess it must include a “few” things, since the 64-bit Linux download is 398 MB.

Just pointing out its availability. More commentary to follow.

Using Apache Avro – Repeatable/Shareable?

Wednesday, January 26th, 2011

Using Apache Avro by Boris Lublinsky.

From the post:

Avro[1] is a recent addition to Apache’s Hadoop family of projects. Avro defines a data format designed to support data-intensive applications, and provides support for this format in a variety of programming languages.

Avro provides functionality that is similar to the other marshalling systems such as Thrift, Protocol Buffers, etc. The main differentiators of Avro include[2]:

  • “Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.”

I wonder about the symbolic resolution of differences using field names.

Is it at least repeatable and shareable?

By repeatable I mean that six months or even six weeks from now one understands the resolution. Not much use if the transformation is opaque to its author.

And shareable should mean that I can transfer the resolution to someone else who can then decide to follow, not follow or modify the resolution.

In another lifetime I was a sysadmin. I can count on less than one finger the number of times I would have followed a symbolic resolution that was not transparent. Simply not done.

Wait until the data folks, who must be incredibly trusting (anyone have some candy?), encounter someone who cares about critical systems and data.

Topic maps can help with that encounter.