Archive for the ‘Streams’ Category

Streaming 101 & 102 – [Stream Processing with Batch Identities?]

Sunday, February 21st, 2016

The world beyond batch: Streaming 101 by Tyler Akidau.

From part 1:

Streaming data processing is a big deal in big data these days, and for good reasons. Amongst them:

  • Businesses crave ever more timely data, and switching to streaming is a good way to achieve lower latency.
  • The massive, unbounded data sets that are increasingly common in modern business are more easily tamed using a system designed for such never-ending volumes of data.
  • Processing data as they arrive spreads workloads out more evenly over time, yielding more consistent and predictable consumption of resources.

Despite this business-driven surge of interest in streaming, the majority of streaming systems in existence remain relatively immature compared to their batch brethren, which has resulted in a lot of exciting, active development in the space recently.

Since I have quite a bit to cover, I’ll be splitting this across two separate posts:

  1. Streaming 101: This first post will cover some basic background information and clarify some terminology before diving into details about time domains and a high-level overview of common approaches to data processing, both batch and streaming.
  2. The Dataflow Model: The second post will consist primarily of a whirlwind tour of the unified batch + streaming model used by Cloud Dataflow, facilitated by a concrete example applied across a diverse set of use cases. After that, I’ll conclude with a brief semantic comparison of existing batch and streaming systems.

The world beyond batch: Streaming 102

In this post, I want to focus further on the data-processing patterns from last time, but in more detail, and within the context of concrete examples. The arc of this post will traverse two major sections:

  • Streaming 101 Redux: A brief stroll back through the concepts introduced in Streaming 101, with the addition of a running example to highlight the points being made.
  • Streaming 102: The companion piece to Streaming 101, detailing additional concepts that are important when dealing with unbounded data, with continued use of the concrete example as a vehicle for explaining them.

By the time we’re finished, we’ll have covered what I consider to be the core set of principles and concepts required for robust out-of-order data processing; these are the tools for reasoning about time that truly get you beyond classic batch processing.

You should also catch the paper by Tyler and others, The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing.

Cloud Dataflow, known as Beam at the Apache incubator, offers a variety of operations for combining and/or merging collections of values in data.

I mention that because I would hate to hear of you doing stream processing with batch identities. You know, where you decide on some fixed set of terms and those are applied across dynamic data.

Hmmm, fixed terms applied to dynamic data. Doesn’t even sound right does it?

Sometimes, fixed terms (read schema, ontology) are fine but in linguistically diverse environments (read real life), that isn’t always adequate.

Enjoy the benefits of stream processing but don’t artificially limit them with batch identities.

I first saw this in a tweet by Bob DuCharme.

Spark Release 1.5.0

Thursday, September 10th, 2015

Spark Release 1.5.0

From the post:

Spark 1.5.0 is the sixth release on the 1.x line. This release represents 1400+ patches from 230+ contributors and 80+ institutions. To download Spark 1.5.0 visit the downloads page.

You can consult JIRA for the detailed changes. We have curated a list of high level changes here:

Time for your Fall Spark Upgrade!


Streaming Data IO in R

Monday, June 29th, 2015

Streaming Data IO in R – curl, jsonlite, mongolite by Jeroem Ooms.


The jsonlite package provides a powerful JSON parser and generator that has become one of standard methods for getting data in and out of R. We discuss some recent additions to the package, in particular support streaming (large) data over http(s) connections. We then introduce the new mongolite package: a high-performance MongoDB client based on jsonlite. MongoDB (from “humongous”) is a popular open-source document database for storing and manipulating very big JSON structures. It includes a JSON query language and an embedded V8 engine for in-database aggregation and map-reduce. We show how mongolite makes inserting and retrieving R data to/from a database as easy as converting it to/from JSON, without the bureaucracy that comes with traditional databases. Users that are already familiar with the JSON format might find MongoDB a great companion to the R language and will enjoy the benefits of using a single format for both serialization and persistency of data.

R, JSON, MongoDB, what’s there not to like? 😉

From UseR! 2015.


Signatures, patterns and trends: Timeseries data mining at Etsy

Sunday, June 7th, 2015

From the description:

Etsy loves metrics. Everything that happens in our data centres gets recorded, graphed and stored. But with over a million metrics flowing in constantly, it’s hard for any team to keep on top of all that information. Graphing everything doesn’t scale, and traditional alerting methods based on thresholds become very prone to false positives.

That’s why we started Kale, an open-source software suite for pattern mining and anomaly detection in operational data streams. These are big topics with decades of research, but many of the methods in the literature are ineffective on terabytes of noisy data with unusual statistical characteristics, and techniques that require extensive manual analysis are unsuitable when your ops teams have service levels to maintain.

In this talk I’ll briefly cover the main challenges that traditional statistical methods face in this environment, and introduce some pragmatic alternatives that scale well and are easy to implement (and automate) on Elasticsearch and similar platforms. I’ll talk about the stumbling blocks we encountered with the first release of Kale, and the resulting architectural changes coming in version 2.0. And I’ll go into a little technical detail on the algorithms we use for fingerprinting and searching metrics, and detecting different kinds of unusual activity. These techniques have potential applications in clustering, outlier detection, similarity search and supervised learning, and they are not limited to the data centre but can be applied to any high-volume timeseries data.

Blog post:

Signature, patterns and trends? Sounds relevant to monitoring network patterns. Yes?

Good focus on anomaly detection, pointing out that many explanations are overly simplistic.

Use case is one (1) million incoming metrics.

Looking forward to seeing this released as open source!

Streaming Big Data with Spark, Spark Streaming, Kafka, Cassandra and Akka

Thursday, January 22nd, 2015

Webinar: Streaming Big Data with Spark, Spark Streaming, Kafka, Cassandra and Akka by Helena Edelson.

From the post:

On Tuesday, January 13 I gave a webinar on Apache Spark, Spark Streaming and Cassandra. Over 1700 registrants from around the world signed up. This is a follow-up post to that webinar, answering everyone’s questions. In the talk I introduced Spark, Spark Streaming and Cassandra with Kafka and Akka and discussed wh​​​​y these particular technologies are a great fit for lambda architecture due to some key features and strategies they all have in common, and their elegant integration together. We walked through an introduction to implementing each, then showed how to integrate them into one clean streaming data platform for real-time delivery of meaning at high velocity. All this in a highly distributed, asynchronous, parallel, fault-tolerant system.

Video | Slides | Code | Diagram

About The Presenter: Helena Edelson is a committer on several open source projects including the Spark Cassandra Connector, Akka and previously Spring Integration and Spring AMQP. She is a Senior Software Engineer on the Analytics team at DataStax, a Scala and Big Data conference speaker, and has presented at various Scala, Spark and Machine Learning Meetups.

I have long contended that it is possible to have a webinar that has little if any marketing fluff and maximum technical content. Helena’s presentation is an example of that type of webinar.

Very much worth the time to watch.

BTW, being so content full, questions were answered as part of this blog post. Technical webinars just don’t get any better organized than this one.

Perhaps technical webinars should be marked with TW and others with CW (for c-suite webinars). To prevent disorientation in the first case and disappointment in the second one.

Improved Fault-tolerance and Zero Data Loss in Spark Streaming

Tuesday, January 20th, 2015

Improved Fault-tolerance and Zero Data Loss in Spark Streaming by Tathagata Das.

From the post:

Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system. Since its beginning, Spark Streaming has included support for recovering from failures of both driver and worker machines. However, for some data sources, input data could get lost while recovering from the failures. In Spark 1.2, we have added preliminary support for write ahead logs (also known as journaling) to Spark Streaming to improve this recovery mechanism and give stronger guarantees of zero data loss for more data sources. In this blog, we are going to elaborate on how this feature works and how developers can enable it to get those guarantees in Spark Streaming applications.


Spark and its RDD abstraction is designed to seamlessly handle failures of any worker nodes in the cluster. Since Spark Streaming is built on Spark, it enjoys the same fault-tolerance for worker nodes. However, the demand of high uptimes of a Spark Streaming application require that the application also has to recover from failures of the driver process, which is the main application process that coordinates all the workers. Making the Spark driver fault-tolerant is tricky because it is an arbitrary user program with arbitrary computation patterns. However, Spark Streaming applications have an inherent structure in the computation — it runs the same Spark computation periodically on every micro-batch of data. This structure allows us to save (aka, checkpoint) the application state periodically to reliable storage and recover the state on driver restarts.

For sources like files, this driver recovery mechanism was sufficient to ensure zero data loss as all the data was reliably stored in a fault-tolerant file system like HDFS or S3. However, for other sources like Kafka and Flume, some of the received data that was buffered in memory but not yet processed could get lost. This is because of how Spark applications operate in a distributed manner. When the driver process fails, all the executors running in a standalone/yarn/mesos cluster are killed as well, along with any data in their memory. In case of Spark Streaming, all the data received from sources like Kafka and Flume are buffered in the memory of the executors until their processing has completed. This buffered data cannot be recovered even if the driver is restarted. To avoid this data loss, we have introduced write ahead logs in Spark Streaming in the Spark 1.2 release.

Solid piece on the principles and technical details you will need for zero data loss in Spark Streaming. With suggestions for changes that may be necessary to support zero data loss at no loss in throughput. The latter being a non-trivial consideration.

Curious, I understand that many systems require zero data loss but do you have examples of systems were some data loss is acceptable? To what extend is data loss acceptable? (Given lost baggage rates, is airline baggage one of those?)

Akka – Streams and HTTP

Wednesday, December 24th, 2014

New documentation for Akka Streams and HTTP.

From Akka Streams:

The way we consume services from the internet today includes many instances of streaming data, both downloading from a service as well as uploading to it or peer-to-peer data transfers. Regarding data as a stream of elements instead of in its entirety is very useful because it matches the way computers send and receive them (for example via TCP), but it is often also a necessity because data sets frequently become too large to be handled as a whole. We spread computations or analyses over large clusters and call it “big data”, where the whole principle of processing them is by feeding those data sequentially—as a stream—through some CPUs.

Actors can be seen as dealing with streams as well: they send and receive series of messages in order to transfer knowledge (or data) from one place to another. We have found it tedious and error-prone to implement all the proper measures in order to achieve stable streaming between actors, since in addition to sending and receiving we also need to take care to not overflow any buffers or mailboxes in the process. Another pitfall is that Actor messages can be lost and must be retransmitted in that case lest the stream have holes on the receiving side. When dealing with streams of elements of a fixed given type, Actors also do not currently offer good static guarantees that no wiring errors are made: type-safety could be improved in this case.

For these reasons we decided to bundle up a solution to these problems as an Akka Streams API. The purpose is to offer an intuitive and safe way to formulate stream processing setups such that we can then execute them efficiently and with bounded resource usage—no more OutOfMemoryErrors. In order to achieve this our streams need to be able to limit the buffering that they employ, they need to be able to slow down producers if the consumers cannot keep up. This feature is called back-pressure and is at the core of the Reactive Streams initiative of which Akka is a founding member. For you this means that the hard problem of propagating and reacting to back-pressure has been incorporated in the design of Akka Streams already, so you have one less thing to worry about; it also means that Akka Streams interoperate seamlessly with all other Reactive Streams implementations (where Reactive Streams interfaces define the interoperability SPI while implementations like Akka Streams offer a nice user API).

From HTTP:

The purpose of the Akka HTTP layer is to expose Actors to the web via HTTP and to enable them to consume HTTP services as a client. It is not an HTTP framework, it is an Actor-based toolkit for interacting with web services and clients….

Just in case you tire of playing with your NVIDIA GRID K2 board and want to learn more about Akka streams and HTTP. 😉