Archive for the ‘Flume’ Category

How-to: Process Data using Morphlines (in Kite SDK)

Friday, April 11th, 2014

How-to: Process Data using Morphlines (in Kite SDK) by Janos Matyas.

From the post:

SequenceIQ has an Apache Hadoop-based platform and API that consume and ingest various types of data from different sources to offer predictive analytics and actionable insights. Our datasets are structured, unstructured, log files, and communication records, and they require constant refining, cleaning, and transformation.

These datasets come from different sources (industry-standard and proprietary adapters, Apache Flume, MQTT, iBeacon, and so on), so we need a flexible, embeddable framework to support our ETL process chain. Hello, Morphlines! (As you may know, originally the Morphlines library was developed as part of Cloudera Search; eventually, it graduated into the Kite SDK as a general-purpose framework.)

To define a Morphline transformation chain, you need to describe the steps in a configuration file, and the framework will then turn into an in-memory container for transformation commands. Commands perform tasks such as transforming, loading, parsing, and processing records, and they can be linked in a processing chain.

In this blog post, I’ll demonstrate such an ETL process chain containing custom Morphlines commands (defined via config file and Java), and use the framework within MapReduce jobs and Flume. For the sample ETL with Morphlines use case, we have picked a publicly available “million song” dataset from The raw data consist of one JSON file/entry for each track; the dictionary contains the following keywords:

A welcome demonstration of Morphines but I do wonder about the statement:

Our datasets are structured, unstructured, log files, and communication records, and they require constant refining, cleaning, and transformation. (Emphasis added.)

If you don’t have experience with S3 and this pipleine, it is a good starting point for your investigations.

Apache Bigtop: The “Fedora of Hadoop”…

Wednesday, June 26th, 2013

Apache Bigtop: The “Fedora of Hadoop” is Now Built on Hadoop 2.x by Roman Shaposhnik.

From the post:

Just in time for Hadoop Summit 2013, the Apache Bigtop team is very pleased to announce the release of Bigtop 0.6.0: The very first release of a fully integrated Big Data management distribution built on the currently most advanced Hadoop 2.x, Hadoop 2.0.5-alpha.

Bigtop, as many of you might already know, is a project aimed at creating a 100% open source and community-driven Big Data management distribution based on Apache Hadoop. (You can learn more about it by reading one of our previous blog posts on Apache Blogs.) Bigtop also plays an important role in CDH, which utilizes its packaging code from Bigtop — Cloudera takes pride in developing open source packaging code and contributing the same back to the community.

The very astute readers of this blog will notice that given our quarterly release schedule, Bigtop 0.6.0 should have been called Bigtop 0.7.0. It is true that we skipped a quarter. Our excuse is that we spent all this extra time helping the Hadoop community stabilize the Hadoop 2.x code line and making it a robust kernel for all the applications that are now part of the Bigtop distribution.

And speaking of applications, we haven’t forgotten to grow the Bigtop family: Bigtop 0.6.0 adds Apache HCatalog and Apache Giraph to the mix. The full list of Hadoop applications available as part of the Bigtop 0.6.0 release is:

  • Apache Zookeeper 3.4.5
  • Apache Flume 1.3.1
  • Apache HBase 0.94.5
  • Apache Pig 0.11.1
  • Apache Hive 0.10.0
  • Apache Sqoop 2 (AKA 1.99.2)
  • Apache Oozie 3.3.2
  • Apache Whirr 0.8.2
  • Apache Mahout 0.7
  • Apache Solr (SolrCloud) 4.2.1
  • Apache Crunch (incubating) 0.5.0
  • Apache HCatalog 0.5.0
  • Apache Giraph 1.0.0
  • LinkedIn DataFu 0.0.6
  • Cloudera Hue 2.3.0

And we were just talking about YARN and applications weren’t we? 😉


(Participate if you can but at least send a note of appreciation to Cloudera.)

Apache Flume 1.3.1

Saturday, January 5th, 2013

Apache Flume 1.3.1

From the webpage:

This release is the third release of Apache Flume as an Apache top level project and is the third release that is considered ready for production use. This release is primarily a maintenance release for Flume 1.3.0, and includes several bug fixes and performance enhancements.

If you are using Flume in production, maintenance releases are important.

If you are learning Flume, why start with working around fixed bugs? Use the latest stable release.

Streaming data into Apache HBase using Apache Flume

Thursday, November 29th, 2012

Streaming data into Apache HBase using Apache Flume

From the post:

Apache Flume was conceived as a fault-tolerant ingest system for the Apache Hadoop ecosystem. Flume comes packaged with an HDFS Sink which can be used to write events into HDFS, and two different implementations of HBase sinks to write events into HBase. You can read about the basic architecture of Apache Flume 1.x in this blog post. You can also read about how Flume’s File Channel persists events and still provides extremely high performance in an earlier blog post. In this article, we will explore how to configure Flume to write events into HBase, and write custom serializers to write events into HBase in a format of the user’s choice.

Streaming data is great, but being able to capture it when needed, is even better!

Analyzing Twitter Data with Hadoop, Part 2: Gathering Data with Flume

Tuesday, October 16th, 2012

Analyzing Twitter Data with Hadoop, Part 2: Gathering Data with Flume by Jon Natkins.

From the post:

This is the second article in a series about analyzing Twitter data using some of the components of the Hadoop ecosystem available in CDH, Cloudera’s open-source distribution of Hadoop and related projects. In the first article, you learned how to pull CDH components together into a single cohesive application, but to really appreciate the flexibility of each of these components, we need to dive deeper.

Every story has a beginning, and every data pipeline has a source. So, to build Hadoop applications, we need to get data from a source into HDFS.

Apache Flume is one way to bring data into HDFS using CDH. The Apache Flume website describes Flume as “a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.” At the most basic level, Flume enables applications to collect data from its origin and send it to a resting location, such as HDFS. At a slightly more detailed level, Flume achieves this goal by defining dataflows consisting of three primary structures: sources, channels and sinks. The pieces of data that flow through Flume are called events, and the processes that run the dataflow are called agents.

A very good introduction to the use of Flume!

Does it seem to you that the number of examples using Twitter, not just for “big data” but in general seems to be on the rise?

Just a personal observation and subject to all the flaws, “all the buses were going the other way,” of such.

Judging from the state of my inbox, some people are still writing more than 140 characters at a time.

Will it make a difference in our tools/thinking if we focus on shorter strings as opposed to longer ones?

CDH4.1 Now Released!

Wednesday, October 3rd, 2012

CDH4.1 Now Released! by Charles Zedlewski.

From the post:

We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:

  • Quorum based storage – Quorum-based Storage for HDFS provides the ability for HDFS to store its own NameNode edit logs, allowing you to run a highly available NameNode without external storage or custom fencing.
  • Hive security and concurrency – we’ve fixed some long standing issues with running Hive. With CDH4.1, it is now possible to run a shared Hive instance where users submit queries using Kerberos authentication. In addition this new Hive server supports multiple users submitting queries at the same time.
  • Support for DataFu – the LinkedIn data science team was kind enough to open source their library of Pig UDFs that make it easier to perform common jobs like sessionization or set operations. Big thanks to the LinkedIn team!!!
  • Oozie workflow builder – since we added Oozie to CDH more than two years ago, we have often had requests to make it easier to develop Oozie workflows. The newly enhanced job designer in Hue enables users to use a visual tool to build and run Oozie workflows.
  • FlumeNG improvements – since its release, FlumeNG has become the backbone for some exciting data collection projects, in some cases collecting as much as 20TB of new event data per day. In CDH4.1 we added an HBase sink as well as metrics for monitoring as well as a number of performance improvements.
  • Various performance improvements – CDH4.1 users should experience a boost in their MapReduce performance from CDH4.0.
  • Various security improvements – CDH4.1 enables users to configure the system to encrypt data in flight during the shuffle phase. CDH now also applies Hadoop security to users who access the filesystem via a FUSE mount.

It’s releases like this that make me wish I spent more time writing documentation for software. To try out all the cool features with no real goal other than trying them out.


About Apache Flume FileChannel

Thursday, September 27th, 2012

About Apache Flume FileChannel by Brock Noland.

From the post:

This blog post is about Apache Flume’s File Channel. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

FileChannel is a persistent Flume channel that supports writing to multiple disks in parallel and encryption.

Just in case you are one of those folks with large amounts of data to move about.

Analyzing Twitter Data with Hadoop [Hiding in a Public Data Stream]

Wednesday, September 19th, 2012

Analyzing Twitter Data with Hadoop by Jon Natkins

From the post:

Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. Given limited resources, and knowing we may not be able to talk to everyone we want to target directly, marketing departments can be more efficient by being selective about whom we reach out to.

In this post, we’ll learn how we can use Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data pipeline that will enable us to analyze Twitter data. This will be the first post in a series. The posts to follow to will describe, in more depth, how each component is involved and how the custom code operates. All the code and instructions necessary to reproduce this pipeline is available on the Cloudera Github.

Looking forward to more posts in this series!

Social media is a focus for marketing teams for obvious reasons.

Analysis of snaps, crackles and pops en masse.

What if you wanted to communicate securely with others using social media?

Thinking of something more robust and larger than two (or three) lovers agreeing on code words.

How would you hide in a public data stream?

Or the converse, how would you hunt for someone in a public data stream?

How would you use topic maps to manage the semantic side of such a process?

Welcome Hortonworks Data Platform 1.1

Wednesday, September 12th, 2012

Welcome Hortonworks Data Platform 1.1 by Jim Walker.

From the post:

Hortonworks Data Platform 1.1 Brings Expanded High Availability and Streaming Data Capture, Easier Integration with Existing Tools to Improve Enterprise Reliability and Performance of Apache Hadoop

It is exactly three months to the day that Hortonworks Data Platform version 1.0 was announced. A lot has happened since that day…

  • Our distribution has been downloaded by thousands and is delivering big value to organizations throughout the world,
  • Hadoop Summit gathered over 2200 Hadoop enthusiasts into the San Jose Convention Center,
  • And, our Hortonworks team grew by leaps and bounds!

In these same three months our growing team of committers, engineers, testers and writers have been busy knocking out our next release, Hortonworks Data Platform 1.1. We are delighted to announce availability of HDP 1.1 today! With this release, we expand our high availability options with the addition of Red Hat based HA, add streaming capability with Flume, expand monitoring API enhancements and have made significant performance improvements to the core platform.

New features include high availability, capturing data streams (Flume), improved operations management and performance increases.

For the details, see the post, documentation or even download Hortonworks Data Platform 1.1 for a spin.

Unlike Odo’s Klingon days, a day with several items from Hortonworks is a good day. Enjoy!

CDH3 update 5 is now available

Monday, August 13th, 2012

CDH3 update 5 is now available by Arvind Prabhakar

From the post:

We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of CDH3 platform and provides a considerable amount of bug-fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following:

  • Flume 1.2.0 – Provides a durable file channel and many more features over the previous release.
  • Hive AvroSerDe – Replaces the Haivvreo SerDe and provides robust support for Avro data format.
  • WebHDFS – A full read/write REST API to HDFS.

Maintenance release. Installation is good practice before major releases.

Apache Flume Development Status Update

Tuesday, July 3rd, 2012

Apache Flume Development Status Update by Hari Shreedharan.

From the post:

Apache Flume is a scalable, reliable, fault-tolerant, distributed system designed to collect, transfer, and store massive amounts of event data into HDFS. Apache Flume recently graduated from the Apache Incubator as a Top Level Project at Apache. Flume is designed to send data over multiple hops from the initial source(s) to the final destination(s). Click here for details of the basic architecture of Flume. In this article, we will discuss in detail some new components in Flume 1.x (also known as Flume NG), which is currently on the trunk branch, techniques and components that can be be used to route the data, configuration validation, and finally support for serializing events.

In the past several months, contributors have been busy adding several new sources, sinks and channels to Flume. Flume now supports Syslog as a source, where sources have been added to support Syslog over TCP and UDP.

Flume now has a high performance, persistent channel – the File Channel. This means if the agent fails for any reason before events committed by the source are not removed and the transaction committed by the sink, the events will reloaded from disk and can be taken when the agent starts up again. The events will only be removed from the channel when the transaction is committed by the sink. The File channel uses a Write Ahead Log to save events.

Among the other features that have been added to Flume is the ability to modify events “in flight.”

I would not construe “event” too narrowly.

Emails, tweets, arrivals, departures, temperatures, wind direction, speed, etc., can all be viewed as one or more “events.”

The merging and other implications of one or more event modifiers will be the subject of a future post.

CDH3 update 4 is now available

Saturday, May 12th, 2012

CDH3 update 4 is now available by David Wang.

From the post:

We are happy to officially announce the general availability of CDH3 update 4. This update consists primarily of reliability enhancements as well as a number of minor improvements.

First, there have been a few notable HBase updates. In this release, we’ve upgraded Apache HBase to upstream version 0.90.6, improving system robustness and availability. Also, some of the recent hbck changes were incorporated to better detect and handle various types of corruptions. Lastly, HDFS append support is now disabled by default in this release as it is no longer needed for HBase. Please see the CDH3 Known Issues and Workarounds page for details.

In addition to the HBase updates, CDH3 update 4 also includes the latest release of Apache Flume (incubating) – version 1.1.0. A detailed description of what it brings to the table is found in a previous Cloudera blog post describing its architecture. Please note that we will continue to ship Flume 0.9.4 as well.

Apache Bigtop 0.3.0 (incubating) has been released

Wednesday, April 4th, 2012

Apache Bigtop 0.3.0 (incubating) has been released by Roman Shaposhnik.

From the post:

Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested:

  • Apache Hadoop 1.0.1
  • Apache Zookeeper 3.4.3
  • Apache HBase 0.92.0
  • Apache Hive 0.8.1
  • Apache Pig 0.9.2
  • Apache Mahout 0.6.1
  • Apache Oozie 3.1.3
  • Apache Sqoop 1.4.1
  • Apache Flume 1.0.0
  • Apache Whirr 0.7.0

Thoughts on what is missing from this ecosystem?

What if you moved from the company where you wrote the scripts? And they needed new scripts?

Re-write? On what basis?

Is your “big data” big enough to need “big documentation?”

Fluentd: the missing log collector

Saturday, December 10th, 2011

Fluentd: the missing log collector

From the post:

The Problems

The fundamental problem with logs is that they are usually stored in files although they are best represented as streams (by Adam Wiggins, CTO at Heroku). Traditionally, they have been dumped into text-based files and collected by rsync in hourly or daily fashion. With today’s web/mobile applications, this creates two problems.

Problem 1: Need Ad-Hoc Parsing

The text-based logs have their own format, and the analytics engineer needs to write a dedicated parser for each format. However, You are a DATA SCIENTIST, NOT A PARSER GENERATOR, right? 🙂

Problem 2: Lacks Freshness

The logs lag. The realtime analysis of user behavior makes feature iterations a lot faster. A nimbler A/B testing will help you differentiate your service from competitors.

This is where Fluentd comes in. We believe Fluentd solves all issues of scalable log collection by getting rid of files, and turns logs into true semi-structured data streams.

If you are interested in log file processing, take a look at Fluentd and compare it to the competition.

As far as logs as streams, I think the “file view” of most data, logs or not, isn’t helpful. What does it matter to me if the graphs for a document are being generated in real time by a server and updated in my document? Or that a select bibliography is being updated so that readers get the late breaking research in a fast developing field?

The “fixed text” of a document is a view based upon the production means for documents. When those production means change, so should our view of documents.

Apache Flume – Architecture of Flume NG

Friday, December 9th, 2011

Apache Flume – Architecture of Flume NG by Arvind Prabhakar.

From the post:

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Flume is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at Flume NG is work related to new major revision of Flume and is the subject of this post.

Prior to entering the incubator, Flume saw incremental releases leading up to version 0.9.4. As Flume became adopted it became clear that certain design choices would need to be reworked in order to address problems reported in the field. The work necessary to make this change began a few months ago under the JIRA issue FLUME-728. This work currently resides on a separate branch by the name flume-728, and is informally referred to as Flume NG. At the time of writing this post Flume NG had gone through two internal milestones – NG Alpha 1, and NG Alpha 2 and a formal incubator release of Flume NG is in the works.

At a high-level, Flume NG uses a single-hop message delivery guarantee semantics to provide end-to-end reliability for the system. To accomplish this, certain new concepts have been incorporated into its design, while certain other existing concepts have been either redefined, reused or dropped completely.

In this blog post, I will describe the fundamental concepts incorporated in Flume NG and talk about it’s high-level architecture. This is a first in a series of blog posts by Flume team that will go into further details of it’s design and implementation.

Log data from disparate sources is one likely use case for topic maps. See what you think of the new architecture for Apache Flume.

Good pointers to additional information as well.

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Sunday, November 20th, 2011

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Slides from Jeff’s presentation and numerous references, including to a live blogging summary by Jeff Dalton.

In terms of the new analytical platform, I would strongly suggest that you take Cloudera’s substrate:

Cloudera starts with a substrate architecture of Open Compute commodity Linux servers configured using Puppet and Chef and coordinated using ZooKeeper. Naturally this entire stack is open-source. They use HFDS and Ceph to provide distributed, schema-less storage. They offer append-only table storage and metadata using Avro, RCFile, and HCatalog; and mutable table storage and metadata using HBase. For computation, they offer YARN (inter-job scheduling, like Grid Engine, for data intensive computing) and Mesos for cluster resource management; MapReduce, Hamster (MPI), Spark, Dryad / DryadLINQ, Pregel (Giraph), and Dremel as processing frameworks; and Crunch (like Google’s FlumeJava), PigLatin, HiveQL, and Oozie as high-level interfaces. Finally, Cloudera offers tool access through FUSE, JDBC, and ODBC; and data ingest through Sqoop and Flume.

Rather than asking the usual questions, how to make this faster, more storage, etc., all of which are important, ask the more difficult questions:

  1. In or between which of these elements, would human analysis/judgment have the greatest impact?
  2. Would human analysis/judgment be best made by experts or crowds?
  3. What sort of interface would elicit the best human analysis/judgment? (visual/aural; contest/game/virtual)
  4. Performance with feedback or homeostasis mechanisms?

That is a very crude and uninformed starter set of questions.

Putting higher speed access to more data with better tools at our fingertips expands the questions we can ask of interfaces and our interaction with the data. (Before we ever ask questions of the data.)

Hadoop User Group UK: Data Integration

Sunday, October 16th, 2011

Hadoop User Group UK: Data Integration

Three presentations captured as podcasts from the Hadoop User Group UK:




Fresh as of 13 October 2011.

Thanks to Skills Matter for making the podcasts available!

Apache Flume incubation wiki

Sunday, October 2nd, 2011

Apache Flume incubation wiki

From the website:

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop’s HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.

A number of resources for Flume.

Will “data flows” as the dominant means of accessing data be a consequence of an environment where a “local copy” of data is no longer meaningful or an enabler of such an environment? Or both?

I think topic maps would do well to develop models for streaming and perhaps probabilistic merging or even probabilistic creation of topics/associations from data streams. Static structures only give the appearance of certainty.

Real-time Streaming Analysis for Hadoop and Flume

Saturday, August 6th, 2011

Real-time Streaming Analysis for Hadoop and Flume

From the description:

This talk introduces an open-source SQL-based system for continuous or ad-hoc analysis of streaming data built on top of the Flume data collection platform for Hadoop.

Big data analytics based on Hadoop often require aggregating data in a large data store like HDFS or HBase, and then running periodic MapReduce processes over this data set. Getting “near real time” results requires running MapReduce jobs more frequently over smaller data sets, which has a practical frequency limit based on the size of the data and complexity of the analytics; the lower bound on analysis latency is on the order of minutes. This has spawned a trend of building custom analytics directly into the data ingestion pipeline, enabling some streaming operations such as early alerting, index generation, or real-time tuning of ad systems before performing less time-sensitive (but more comprehensive) analysis in MapReduce.

We present an open-source tool which extends the Flume data collection platform with a SQL-like language for analysis over streaming event-based data sets. We will discuss the motivation for the system, its architecture and interaction with Flume, potential applications, and examples of its usage.

Deeply awesome! Just wish I had been present to see the demo!

Makes me think of topic map creation from data streams with the ability to test different subject identity merging conditions, in real time. Rather than repetitive stories about a helicopter being downed, you get a summary report and a listing by location and time of publication of repetitive reports. Say one screen full of content and access to the noise. Better use of your time?

Real-Time Log Processing System based on Flume and Cassandra – Post

Thursday, March 3rd, 2011

Real-Time Log Processing System based on Flume and Cassandra

Very cool!

What would be even cooler, would be to have real-time associations with subjects that have information from outside the data set.

Or better yet, real-time on-demand associations with subjects that have information from outside the data set.

I suppose the classic use case would be running stats on all the sports events on a Saturday or Sunday, including individuals stats and merging in the latest doping, paternity and similar tests.

Other applications?