Real-time Streaming Analysis for Hadoop and Flume
From the description:
This talk introduces an open-source SQL-based system for continuous or ad-hoc analysis of streaming data built on top of the Flume data collection platform for Hadoop.
Big data analytics based on Hadoop often require aggregating data in a large data store like HDFS or HBase, and then running periodic MapReduce processes over this data set. Getting “near real time” results requires running MapReduce jobs more frequently over smaller data sets, which has a practical frequency limit based on the size of the data and complexity of the analytics; the lower bound on analysis latency is on the order of minutes. This has spawned a trend of building custom analytics directly into the data ingestion pipeline, enabling some streaming operations such as early alerting, index generation, or real-time tuning of ad systems before performing less time-sensitive (but more comprehensive) analysis in MapReduce.
We present an open-source tool which extends the Flume data collection platform with a SQL-like language for analysis over streaming event-based data sets. We will discuss the motivation for the system, its architecture and interaction with Flume, potential applications, and examples of its usage.
Deeply awesome! Just wish I had been present to see the demo!
This makes me think of creating topic maps from data streams, with the ability to test different subject identity merging conditions in real time. Rather than reading repetitive stories about a helicopter being downed, you would get a summary report plus a listing of the repetitive reports by location and time of publication. Say, one screenful of content, with access to the noise if you want it. A better use of your time?
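To make the merging idea a little more concrete, here is a minimal Python sketch of it. It is not the tool from the talk and not any topic map engine's API: the `Report` fields, the `Subject` class, the `merge_stream` function, the two identity rules, and the sample reports are all invented for illustration. The point is only that the subject identity rule is a pluggable function, so the same stream can be merged under different conditions and the results compared.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


@dataclass(frozen=True)
class Report:
    """One incoming report. All fields are hypothetical."""
    event: str      # normalized event label
    location: str
    published: str  # ISO 8601 timestamp, so string order matches time order
    source: str


@dataclass
class Subject:
    """A merged subject: one summary entry plus the reports behind it."""
    identity: Hashable
    reports: List[Report] = field(default_factory=list)

    def listing(self) -> List[Tuple[str, str, str]]:
        # The repetitive reports, ordered by location, then publication time.
        return sorted((r.location, r.published, r.source) for r in self.reports)


def merge_stream(
    reports: Iterable[Report],
    identity: Callable[[Report], Hashable],
) -> Dict[Hashable, Subject]:
    """Consume reports one at a time, merging those with equal identity keys."""
    subjects: Dict[Hashable, Subject] = {}
    for report in reports:
        key = identity(report)
        if key not in subjects:
            subjects[key] = Subject(key)
        subjects[key].reports.append(report)
    return subjects


# Two candidate merging conditions (assumptions, not taken from the talk).
def by_event(report: Report) -> Hashable:
    return report.event


def by_event_and_location(report: Report) -> Hashable:
    return (report.event, report.location)


# Made-up sample stream.
sample = [
    Report("helicopter downed", "north", "2011-11-08T09:00", "wire-a"),
    Report("helicopter downed", "north", "2011-11-08T09:20", "wire-b"),
    Report("helicopter downed", "south", "2011-11-08T10:05", "wire-a"),
]

if __name__ == "__main__":
    for rule in (by_event, by_event_and_location):
        merged = merge_stream(sample, rule)
        print(rule.__name__, "->", len(merged), "subject(s)")
        for subject in merged.values():
            print("  ", subject.identity, subject.listing())
```

Swapping `by_event` for `by_event_and_location` changes how many subjects the same stream collapses into, which is the kind of side-by-side test of merging conditions I have in mind. A real version would read from a live stream and would need a much more careful notion of identity than string equality.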