Archive for the ‘Stream Analytics’ Category

Streaming 101 & 102 – [Stream Processing with Batch Identities?]

Sunday, February 21st, 2016

The world beyond batch: Streaming 101 by Tyler Akidau.

From part 1:

Streaming data processing is a big deal in big data these days, and for good reasons. Amongst them:

  • Businesses crave ever more timely data, and switching to streaming is a good way to achieve lower latency.
  • The massive, unbounded data sets that are increasingly common in modern business are more easily tamed using a system designed for such never-ending volumes of data.
  • Processing data as they arrive spreads workloads out more evenly over time, yielding more consistent and predictable consumption of resources.

Despite this business-driven surge of interest in streaming, the majority of streaming systems in existence remain relatively immature compared to their batch brethren, which has resulted in a lot of exciting, active development in the space recently.

Since I have quite a bit to cover, I’ll be splitting this across two separate posts:

  1. Streaming 101: This first post will cover some basic background information and clarify some terminology before diving into details about time domains and a high-level overview of common approaches to data processing, both batch and streaming.
  2. The Dataflow Model: The second post will consist primarily of a whirlwind tour of the unified batch + streaming model used by Cloud Dataflow, facilitated by a concrete example applied across a diverse set of use cases. After that, I’ll conclude with a brief semantic comparison of existing batch and streaming systems.

The world beyond batch: Streaming 102

In this post, I want to focus further on the data-processing patterns from last time, but in more detail, and within the context of concrete examples. The arc of this post will traverse two major sections:

  • Streaming 101 Redux: A brief stroll back through the concepts introduced in Streaming 101, with the addition of a running example to highlight the points being made.
  • Streaming 102: The companion piece to Streaming 101, detailing additional concepts that are important when dealing with unbounded data, with continued use of the concrete example as a vehicle for explaining them.

By the time we’re finished, we’ll have covered what I consider to be the core set of principles and concepts required for robust out-of-order data processing; these are the tools for reasoning about time that truly get you beyond classic batch processing.

You should also catch the paper by Tyler and others, The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing.

Cloud Dataflow, known as Beam at the Apache incubator, offers a variety of operations for combining and/or merging collections of values in data.

I mention that because I would hate to hear of you doing stream processing with batch identities. You know, where you decide on some fixed set of terms and those are applied across dynamic data.

Hmmm, fixed terms applied to dynamic data. Doesn’t even sound right does it?

Sometimes, fixed terms (read schema, ontology) are fine but in linguistically diverse environments (read real life), that isn’t always adequate.

Enjoy the benefits of stream processing but don’t artificially limit them with batch identities.

I first saw this in a tweet by Bob DuCharme.

Streamtools – Update

Saturday, April 19th, 2014

streamtools 0.2.4

From the webpage:

This release contains:

  • toEmail and fromEmail blocks: use streamtools to receive and create emails!
  • linear modelling blocks: use streamtools to perform linear and logistic regression using stochastic gradient descent.
  • GUI updates : a new block reference/creation panel.
  • a kullback leibler block for comparing distributions.
  • added a tutorials section to streamtools available at /tutorials in your streamtools server.
  • many small bug fixes and tweaks.

See also: Introducing Streamtools.

+1 on news input becoming more stream-like. But streams, of water and news, can both become polluted.

Filtering water is a well-known science.

Filtering information is doable but with less certain results.

How do you filter your input? (Not necessarily automatically, algorithmic, etc. You have to define the filter first, then choose the means implement it.)

I first saw this in a tweet by Micahael Dewar.

Hadoop Alternative Hydra Re-Spawns as Open Source

Sunday, March 16th, 2014

Hadoop Alternative Hydra Re-Spawns as Open Source by Alex Woodie.

From the post:

It may not have the name recognition or momentum of Hadoop. But Hydra, the distributed task processing system first developed six years ago by the social bookmarking service maker AddThis, is now available under an open source Apache license, just like Hadoop. And according to Hydra’s creator, the multi-headed platform is very good at some big data tasks that the yellow pachyderm struggles with–namely real-time processing of very big data sets.

Hydra is a big data storage and processing platform developed by Matt Abrams and his colleagues at AddThis (formerly Clearspring), the company that develops the Web server widgets that allow visitors to easily share something via their Twitter, Facebook, Pintrest, Google+, or Instagram accounts.

When AddThis started scaling up its business in the mid-2000s, it got flooded with data about what users were sharing. The company needed a scalable, distributed system that could deliver real-time analysis of that data to its customers. Hadoop wasn’t a feasible option at that time. So it built Hydra instead.

So, what is Hydra? In short, it’s a distributed task processing system that supports streaming and batch operations. It utilizes a tree-based data structure to store and process data across clusters with thousands of individual nodes. It features a Linux-based file system, which makes it compatible with ext3, ext4, or even ZFS. It also features a job/cluster management component that automatically allocates new jobs to the cluster and rebalance existing jobs. The system automatically replicates data and handles node failures automatically.

The tree-based structure allows it to handle streaming and batch jobs at the same time. In his January 23 blog post announcing that Hydra is now open source, Chris Burroughs, a member of AddThis’ engineering department, provided this useful description of Hydra: “It ingests streams of data (think log files) and builds trees that are aggregates, summaries, or transformations of the data. These trees can be used by humans to explore (tiny queries), as part of a machine learning pipeline (big queries), or to support live consoles on websites (lots of queries).”

To learn a lot more about Hydra, see its GitHub page.

Another candidate for “real-time processing of very big data sets.”

The reflex is to applaud improvements in processing speed. But what sort of problems require that kind of speed? I know the usual suspects, modeling the weather, nuclear explosions, chemical reactions, but at some point, the processing ends and a human reader has to comprehend the results.

Better to get information to the human reader sooner rather than later, but there is a limit to the speed at which a user can understand the results of a computational process.

From a UI perspective, what research is there on how fast/slow information should be pushed at a user?

Could make the difference between an app that is just annoying and one that is truly useful.

I first saw this in a tweet by Joe Crobak.

Zooming Through Historical Data…

Monday, January 20th, 2014

Zooming Through Historical Data with Streaming Micro Queries by Alex Woodie.

From the post:

Stream processing engines, such as Storm and S4, are commonly used to analyze real-time data as it flows into an organization. But did you know you can use this technology to analyze historical data too? A company called ZoomData recently showed how.

In a recent YouTube presentation, Zoomdata Justin Langseth demonstrated his company’s technology, which combines open source stream processing engines like Apache with data connection and visualization libraries based on D3.js.

“We’re doing data analytics and visualization a little differently than it’s traditionally done,” Langseth says in the video. “Legacy BI tools will generate a big SQL statement, run it against Oracle or Teradata, then wait for two to 20 to 200 seconds before showing it to the user. We use a different approach based on the Storm stream processing engine.”

Once hooked up to a data source–such as Cloudera Impala or Amazon Redshift–data is then fed into the Zoomdata platform, which performs calculations against the data as it flows in, “kind of like continues event processing but geared more toward analytics,” Langseth says.

From the video description:

In this hands-on webcast you’ll learn how LivePerson and Zoomdata perform stream processing and visualization on mobile devices of structured site traffic and unstructured chat data in real-time for business decision making. Technologies include Kafka, Storm, and d3.js for visualization on mobile devices. Byron Ellis, Data Scientist for LivePerson will join Justin Langseth of Zoomdata to discuss and demonstrate the solution.

After watching the video, what do you think the concept of “micro queries?”

I ask because I don’t know of any technical reason why a “large” query could not stream out interim results and display those as more results were arriving.

Visualization isn’t usually done that way but that brings me to my next question: Assuming we have interim results visualized, how useful are interim results? Being actionable on interim results really depends on the domain.

I rather like Zoomdata’s emphasis on historical data and the video is impressive.

You can download a VM at Zoomdata.

If you can think of upsides/downsides to the interim results issue, please give a shout!


Saturday, November 23rd, 2013

Introducing SAMOA, an open source platform for mining big data streams by Gianmarco De Francisci Morales and Albert Bifet.

From the post:

Machine learning and data mining are well established techniques in the world of IT and especially among web companies and startups. Spam detection, personalization and recommendations are just a few of the applications made possible by mining the huge quantity of data available nowadays. However, “big data” is not only about Volume, but also about Velocity (and Variety, 3V of big data).

The usual pipeline for modeling data (what “data scientists” do) involves taking a sample from production data, cleaning and preprocessing it to make it usable, training a model for the task at hand and finally deploying it to production. The final output of this process is a pipeline that needs to run periodically (and be maintained) in order to keep the model up to date. Hadoop and its ecosystem (e.g., Mahout) have proven to be an extremely successful platform to support this process at web scale.

However, no solution is perfect and big data is “data whose characteristics forces us to look beyond the traditional methods that are prevalent at the time”. The current challenge is to move towards analyzing data as soon as it arrives into the system, nearly in real-time.

For example, models for mail spam detection get outdated with time and need to be retrained with new data. New data (i.e., spam reports) comes in continuously and the model starts being outdated the moment it is deployed: all the new data is sitting without creating any value until the next model update. On the contrary, incorporating new data as soon as it arrives is what the “Velocity” in big data is about. In this case, Hadoop is not the ideal tool to cope with streams of fast changing data.

Distributed stream processing engines are emerging as the platform of choice to handle this use case. Examples of these platforms are Storm, S4, and recently Samza. These platforms join the scalability of distributed processing with the fast response of stream processing. Yahoo has already adopted Storm as a key technology for low-latency big data processing.

Alas, currently there is no common solution for mining big data streams, that is, for doing machine learning on streams on a distributed environment.


SAMOA (Scalable Advanced Massive Online Analysis) is a framework for mining big data streams. As most of the big data ecosystem, it is written in Java. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.

After you get SAMOA installed, you may want to read: Distributed Decision Tree Learning for Mining Big Data Streams by Arinto Murdopo (thesis).

The nature of streaming data prevents SAMOA from offering the range of machine learning algorithms common in machine learning packages.

But if the SAMOA algorithms fit your use cases, what other test would you apply?

AK Data Science Summit – Streaming and Sketching

Tuesday, July 2nd, 2013

AK Data Science Summit – Streaming and Sketching – June 20, 2013

From the post:

Aggregate Knowledge, along with Foundation Capital, proudly presented the AK Data Science Summit on Streaming and Sketching Algorithms in Big Data and Analaytics at 111 Minna Gallery in San Francisco on June 20th, 2013. It was a one day conference dedicated to bridging the gap between implementers and researchers. You can find video and slides of the talks given and panels held on that day. Thank you again to our speakers and panelists for their time and efforts!

What do you do when big data is too big?

Use streaming and sketching algorithms!

Not every subject identity problem allows time for human editorial intervention.

Consider target acquisition in a noisy environment, where some potential targets are hostile and some are not.

Capturing what caused a fire/no-fire decision (identification as hostile) enables refinement or transfer to other systems.

…Efficient Approximate Data De-Duplication in Streams [Approximate Merging?]

Tuesday, December 18th, 2012

Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams by Suman K. Bera, Sourav Dutta, Ankur Narang, Souvik Bhattacherjee.


Applications involving telecommunication call data records, web pages, online transactions, medical records, stock markets, climate warning systems, etc., necessitate efficient management and processing of such massively exponential amount of data from diverse sources. De-duplication or Intelligent Compression in streaming scenarios for approximate identification and elimination of duplicates from such unbounded data stream is a greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) addresses this problem to a certain extent.
In this work, we present several novel algorithms for the problem of approximate detection of duplicates in data streams. We propose the Reservoir Sampling based Bloom Filter (RSBF) combining the working principle of reservoir sampling and Bloom Filters. We also present variants of the novel Biased Sampling based Bloom Filter (BSBF) based on biased sampling concepts. We also propose a randomized load balanced variant of the sampling Bloom Filter approach to efficiently tackle the duplicate detection. In this work, we thus provide a generic framework for de-duplication using Bloom Filters. Using detailed theoretical analysis we prove analytical bounds on the false positive rate, false negative rate and convergence rate of the proposed structures. We exhibit that our models clearly outperform the existing methods. We also demonstrate empirical analysis of the structures using real-world datasets (3 million records) and also with synthetic datasets (1 billion records) capturing various input distributions.

If you think of more than one representative for a subject as “duplication,” then merging is a special class of “deduplication.”

Deduplication that discards redundant information but that preserves unique additional information and relationships to other subjects.

As you move away from static and towards transient topic maps, representations of subjects in real time data streams, this and similar techniques will become very important.

I first saw this in a tweet from Stefano Bertolo.

PS: A new equivalent term (to me) for deduplication: “intelligent compression.” Pulls about 46K+ “hits” in a popular search engine today. May want to add it to your routine search queries.

MOA Massively Online Analysis

Saturday, December 1st, 2012

MOA Massively Online Analysis : Real Time Analytics for Data Streams

From the homepage:

What is MOA?

MOA is an open source framework for data stream mining. It includes a collection of machine learning algorithms (classification, regression, and clustering) and tools for evaluation. Related to the WEKA project, MOA is also written in Java, while scaling to more demanding problems.

What can MOA do for you?

MOA performs BIG DATA stream mining in real time, and large scale machine learning. MOA can be easily used with Hadoop, S4 or Storm, and extended with new mining algorithms, and new stream generators or evaluation measures. The goal is to provide a benchmark suite for the stream mining community. Details.

Short tutorials and a manual are available. Enough to get started but you will need additional resources on machine learning if it isn’t already familiar.

A small niggle about documentation. Many projects have files named “tutorial” or in this case “Tutorial1,” or “Manual.” Those files are easier to discover/save, if the project name, version(?), is prepended to tutorial or manual. Thus “Moa-2012-08-tutorial1” or “Moa-2012-08-manual.”

If data streams are in your present or future, definitely worth a look.

Design a Twitter Like Application with Nati Shalom

Thursday, November 1st, 2012

Design a Twitter Like Application with Nati Shalom

From the description:

Design a large scale NoSQL/DataGrid application similar to Twitter with Nati Shalom.

The use case is solved with Gigaspaces and Cassandra but other NoSQL and DataGrids solutions could be used.

Slides :

If you enjoyed the posts I pointed to at: Building your own Facebook Realtime Analytics System, you will enjoy the video. (Same author.)

Not to mention Nati teaches patterns, the specific software being important but incidental.

Sketching and streaming algorithms for processing massive data [Heraclitus and /dev/null]

Saturday, October 27th, 2012

Sketching and streaming algorithms for processing massive data by Jelani Nelson.

From the introduction:

Several modern applications require handling data so massive that traditional algorithmic models do not provide accurate means to design and evaluate efficient algorithms. Such models typically assume that all data fits in memory, and that running time is accurately modeled as the number of basic instructions the algorithm performs. However in applications such as online social networks, large-scale modern scientific experiments, search engines, online content delivery, and product and consumer tracking for large retailers such as Amazon and Walmart, data too large to fit in memory must be analyzed. This consideration has led to the development of several models for processing such large amounts of data: The external memory model [1, 2] and cache-obliviousness [3, 4], where one aims to minimize the number of blocks fetched from disk; property testing [5], where it is assumed the data is so massive that we do not wish to even look at it all and thus aim to minimize the number of probes made into the data; and massively parallel algorithms operating in such systems as MapReduce and Hadoop [6, 7]. Also in some applications, data arrives in a streaming fashion and must be processed on the fly. Such cases arise, for example, with packet streams in network traffic monitoring, or query streams arriving at a Web-based service such as a search engine.

In this article we focus on this latter streaming model of computation, where a given algorithm must make one pass over a data set to then compute some function. We pursue such streaming algorithms, which use memory that is sublinear in the amount of data, since we assume the data is too large to fit in memory. Sometimes it becomes useful to consider algorithms that are allowed not just one, but a few passes over the data, in cases where the data set lives on disk and the number of passes may dominate the overall running time. We also occasionally discuss sketches. A sketch is with respect to some function f, and a sketch of a data set x is a compressed representation of x from which one can compute f(x). Of course under this definition f(x) is itself a valid sketch of x, but we often require more of our sketch than just being able to compute f(x). For example, we typically require that it should be possible for the sketch to be updated as more data arrives, and sometimes we also require sketches of two different data sets that are prepared independently can be compared to compute some function of the aggregate data, or similarity or difference measures across different data sets.

Our goal in this article is not to be comprehensive in our coverage of streaming algorithms. Rather, we discuss in some detail a few surprising results in order to convince the reader that it is possible to obtain some non-trivial algorithms within this model. Those interested in learning more about this area are encouraged to read the surveys [8, 9], or view the notes online for streaming courses taught by Chakrabarti at Dartmouth [10], Indyk at MIT [11], and McGregor at UMass Amherst [12].

Projections of data growth are outdated nearly as soon as they are uttered.

Suffice it to say that whatever data we are called upon to process today, will be larger next year. How much larger depends on the domain, the questions to be answered and a host of other factors. But it will be larger.

We need to develop methods of subject recognition, when like Heraclitus, we cannot ever step in the same stream twice.

If we miss it on the first pass, there isn’t going to be a second one. Next stop for some data streams is going to be /dev/null.

What approaches are you working on?

Wrapping Up TimesOpen: Sockets and Streams

Saturday, September 15th, 2012

Wrapping Up TimesOpen: Sockets and Streams by Joe Fiore.

From the post:

This past Wednesday night, more than 80 developers came to the Times building for the second TimesOpen event of 2012, “Sockets and Streams.”

If you were one of the 80 developers, good for you! The rest of us will have to wait for the videos.

Links to the slides are given but a little larger helping of explanation would be useful.

Data streams have semantic diversity, just like static data, only less time to deal with it.

Ups the semantic integration bar.

Are you ready?

Sockets and Streams [Registration Open – Event 12 September – Hurry]

Tuesday, September 4th, 2012

Sockets and Streams

Wednesday, September 12
7 p.m.–10 p.m

The New York Times
620 Eighth Avenue
New York, NY
15th Floor

From the webpage:

Explore innovations in real-time web systems and content, as well as related topics in interaction design.

Nice way to spend an evening in New York City!

Expect to hear good reports!

Streaming Data Mining Tutorial slides (and more)

Tuesday, August 21st, 2012

Streaming Data Mining Tutorial slides (and more) by Igor Carron.

From the post:

Jelani Nelson.and Edo Liberty just released an important tutorial they gave at KDD 12 on the state of the art and practical algorithms used in mining streaming data, entitled: Streaming Data Mining I personally marvel at the development of these deep algorithms which, because of the large data streams constraints, get to redefine what it means to do seemingly simple functions such as counting in the Big Data world. Here are some slides that got my interest, but the 111 pages are worth the read:

Pointers to more slides and videos follow.

Themes in streaming algorithms (workshop at TU Dortmund)

Saturday, August 11th, 2012

Themes in streaming algorithms (workshop at TU Dortmund) by Anna C. Gilbert.

From the post:

I recently attended the streaming algorithms workshop at Technische Universitat Dortumund. It was a follow-on to the very successful series of streaming algorithms workshops held in Kanpur over the last six years. Suresh and Andrew have both given excellent summaries of the individual talks at the workshop (see day 1, day 2, and day 3) so, as both a streaming algorithms insider and outsider, I thought it would be good to give a high-level overview of what themes there are in streaming algorithms research these days, to identify new areas of research and to highlight advances in existing areas.

Anna gives the briefest of summaries but I think they will entice you to look further.

Curious, how would you distinguish a “stream of data” from “read once data?”

That is in the second case you only get one pass at reading the data. Errors are just that, errors, but you can’t look back to be sure.

Is data “integrity” an artifact of small data sets and under-powered computers?

Hadoop Streaming Made Simple using Joins and Keys with Python

Monday, July 9th, 2012

Hadoop Streaming Made Simple using Joins and Keys with Python

From the post:

There are a lot of different ways to write MapReduce jobs!!!

Sample code for this post

I find streaming scripts a good way to interrogate data sets (especially when I have not worked with them yet or are creating new ones) and enjoy the lifecycle when the initial elaboration of the data sets lead to the construction of the finalized scripts for an entire job (or series of jobs as is often the case).

When doing streaming with Hadoop you do have a few library options. If you are a Ruby programmer then wukong is awesome! For Python programmers you can use dumbo and more recently released mrjob.

I like working under the hood myself and getting down and dirty with the data and here is how you can too.

Interesting post and good tips on data exploration. Can’t really query/process the unknown.

Suggestions of other data exploration examples? (Not so much processing the known but looking to “learn” about data sources.)

Visualizing Streaming Text Data with Dynamic Maps

Monday, July 2nd, 2012

Visualizing Streaming Text Data with Dynamic Maps by Emden Gansner, Yifan Hu, and Stephen North.


The many endless rivers of text now available present a serious challenge in the task of gleaning, analyzing and discovering useful information. In this paper, we describe a methodology for visualizing text streams in real time. The approach automatically groups similar messages into “countries,” with keyword summaries, using semantic analysis, graph clustering and map generation techniques. It handles the need for visual stability across time by dynamic graph layout and Procrustes projection techniques, enhanced with a novel stable component packing algorithm. The result provides a continuous, succinct view of evolving topics of interest. It can be used in passive mode for overviews and situational awareness, or as an interactive data exploration tool. To make these ideas concrete, we describe their application to an online service called TwitterScope.

Or, see: TwitterScope, at

Worth the visit to see the static pics in the paper in action.

Definitely a tool with a future in data exploration.

I know “Procrustes” from the classics so had to look up Procrustes transformation. Which was reported to mean:

A Procrustes transformation is a geometric transformation that involves only translation, rotation, uniform scaling, or a combination of these transformations. Hence, it may change the size, but not the shape of a geometric object.

Sounds like abuse of “Procrustes” because I would think having my limbs cut off would change my shape. 😉

Intrigued by the notion of not changing “…the shape of a geometric object.”

Could we say that adding identifications to a subject representative does not change the subject it identifies?

From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications

Friday, May 18th, 2012

From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications

From the post:

Here is the first series of videos from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012). Congratulations to the local Organizing committee: Joshua Bloom, Damian Eads, Berian James, Peter Nugent, John Rice, Joseph Richards and Dan Starr for making the meeting happen and putting it all on videos for others to learn from. (in near real time!).The titles of the talks are linked to the presentation slides. The full program which ends tomorrow is here.. All the videos are here: Part 1, Part 2, Part 3, Part 4, Part 5.

Posted by Igor Carron at Nuit Blanche.

Finding enough hours to watch all of these is going to be a problem!

Which ones do you like best?

Event Stream Processor Matrix

Thursday, October 6th, 2011

Event Stream Processor Matrix

From the post:

We published our first ever UI-focused post on Top JavaScript Dynamic Table Libraries the other day and got some valuable feedback – thanks!

We are back to talking about the backend again. Our Search Analytics and Scalable Performance Monitoring services/products accept, process, and store huge amounts of data. One thing both of these services do is process a stream of events in real-time (and batch, of course). So what solutions are there that help one process data in real-time and perform some operations on a rolling window of data, such as the last 5 or 30 minutes of incoming event stream? We know of several solutions that fit that bill, so we decided to put together a matrix with essential attributes of those tools in order to compare them and make our pick. Below is the matrix we came up with. If you are viewing this on our site, the table is likely going to be too wide, but it should look find in a proper feed reader.

Another great collection of resources from Sematext!

So, must have subjects because something is being recognized in order to have “events.” That implies to me that with subjects being recognized, it is likely enterprises have other information about those subjects that would be useful to have together, “merged” I think is the term I want. 😉

Has anyone seen “events” recognized in one system being populated with information from another system? With different criteria for the same subject?