Archive for the ‘Storm’ Category

History of Apache Storm and lessons learned

Thursday, December 31st, 2015

History of Apache Storm and lessons learned by Nathan Marz.

From the post:

Apache Storm recently became a top-level project, marking a huge milestone for the project and for me personally. It’s crazy to think that four years ago Storm was nothing more than an idea in my head, and now it’s a thriving project with a large community used by a ton of companies. In this post I want to look back at how Storm got to this point and the lessons I learned along the way.

The topics I will cover through Storm’s history naturally follow whatever key challenges I had to deal with at those points in time. The first 25% of this post is about how Storm was conceived and initially created, so the main topics covered there are the technical issues I had to figure out to enable the project to exist. The rest of the post is about releasing Storm and establishing it as a widely used project with active user and developer communities. The main topics discussed there are marketing, communication, and community development.

Any successful project requires two things:

  1. It solves a useful problem
  2. You are able to convince a significant number of people that your project is the best solution to their problem

What I think many developers fail to understand is that achieving that second condition is as hard and as interesting as building the project itself. I hope this becomes apparent as you read through Storm’s history.

All projects are different but the requirements for success:

  1. It solves a useful problem
  2. You are able to convince a significant number of people that your project is the best solution to their problem

sound universal to me!

To clarify point #2, “people” means “other people.”

Preaching to a mirror or choir isn’t going to lead to success.

Nor will focusing on “your problem” as opposed to “their problem.”

PS: New Year’s Eve advice – Don’t download large files. 😉 Slower than you want to think. Suspect people on my subnet are streaming football games and/or porno videos, perhaps both (screen within screen).

I first saw this in a tweet by Bob DuCharme.


Friday, February 13th, 2015


From the webpage:

An Apache Storm topology that will, by design, trigger failures at run-time.

The purpose of this bolt-of-death topology is to help testing Storm cluster stability. It was originally created to identify the issues surrounding the Storm defects described at STORM-329 and STORM-404.

This reminds me of PANIC! UNIX System Crash Dump Analysis Handbook by Chris Drake. Has it really been twenty (20) years since that came out?

If you need something a bit more up to date, Linux Kernel Crash Book: Everything you need to know by Igor Ljubuncic aka Dedoimedo, is available as both free and $ PDF files (to support the website).

Everyone needs a hobby, perhaps analyzing clusters and core dumps will be yours!


I first saw storm-bolt-of-death in a tweet by Michael G. Noll.

Announcing Apache Storm 0.9.3

Thursday, December 18th, 2014

Announcing Apache Storm 0.9.3 by Taylor Goetz

From the post:

With Apache Hadoop YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it for batch, interactive and real-time streaming use cases. Apache Storm brings real-time data processing capabilities to help capture new business opportunities by powering low-latency dashboards, security alerts, and operational enhancements integrated with other applications running in the Hadoop cluster.


Now there’s an early holiday surprise!



Sunday, December 14th, 2014

GearPump (GitHub)

From the wiki homepage:

GearPump is a lightweight, real-time, big data streaming engine. It is inspired by recent advances in the Akka framework and a desire to improve on existing streaming frameworks. GearPump draws from a number of existing frameworks including MillWheel, Apache Storm, Spark Streaming, Apache Samza, Apache Tez, and Hadoop YARN while leveraging Akka actors throughout its architecture.

What originally caught my attention was this passage on the GitHub page:

Per initial benchmarks we are able to process 11 million messages/second (100 bytes per message) with a 17ms latency on a 4-node cluster.

Think about that for a second.

Per initial benchmarks we are able to process 11 million messages/second (100 bytes per message) with a 17ms latency on a 4-node cluster.

The GitHub page features a word count example and pointers to the wiki with more examples.

What if every topic “knew” the index value of every topic that should merge with it on display to a user?

When added to a topic map it broadcasts its merging property values and any topic with those values responds by transmitting its index value.

When you retrieve a topic, it has all the IDs necessary to create a merged view of the topic on the fly and on the client side.

There would be redundancy in the map but de-duplication for storage space went out with preferences for 7-bit character values to save memory space. So long as every topic returns the same result, who cares?

Well, it might make a difference when the CIA want to give every contractor full access to its datastores 24×7 via their cellphones. But, until that is an actual requirement, I would not worry about the storage space overmuch.

I first saw this in a tweet from Suneel Marthi.

Introducing Pyleus: An Open-source Framework for Building Storm Topologies in Pure Python

Saturday, October 18th, 2014

Introducing Pyleus: An Open-source Framework for Building Storm Topologies in Pure Python by Patrick L.

From the post:

Yelp loves Python, and we use it at scale to power our websites and process the huge amount of data we produce.

Pyleus is a new open-source framework that aims to do for Storm what mrjob, another open-source Yelp project, does for Hadoop: let developers process large amounts of data in pure Python and iterate quickly, spending more time solving business-related problems and less time concerned with the underlying platform.

First, a brief introduction to Storm. From the project’s website, “Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.”

A Pyleus topology consists of, at minimum, a YAML file describing the structure of the topology, declaring each component and how tuples flow between them. The pyleus command-line tool builds a self-contained Storm JAR which can be submitted to any Storm cluster.

Since the U.S. baseball league championships are over, something to occupy you over the weekend. 😉

History of Apache Storm and lessons learned

Tuesday, October 7th, 2014

History of Apache Storm and lessons learned by Nathan Marz.

From the post:

Apache Storm recently became a top-level project, marking a huge milestone for the project and for me personally. It’s crazy to think that four years ago Storm was nothing more than an idea in my head, and now it’s a thriving project with a large community used by a ton of companies. In this post I want to look back at how Storm got to this point and the lessons I learned along the way.


The topics I will cover through Storm’s history naturally follow whatever key challenges I had to deal with at those points in time. The first 25% of this post is about how Storm was conceived and initially created, so the main topics covered there are the technical issues I had to figure out to enable the project to exist. The rest of the post is about releasing Storm and establishing it as a widely used project with active user and developer communities. The main topics discussed there are marketing, communication, and community development.

Any successful project requires two things:

  1. It solves a useful problem
  2. You are able to convince a significant number of people that your project is the best solution to their problem

What I think many developers fail to understand is that achieving that second condition is as hard and as interesting as building the project itself. I hope this becomes apparent as you read through Storm’s history.

Every project/case is somewhat different but this history of Storm is a relevant and great read!

I would highlight: It solves a useful problem.

I don’t read that to say:

  • It solves a problem I want to solve
  • It solves a problem you didn’t know you had
  • It solves a problem I care about
  • etc.

To be a “useful” problem, some significant segment of users must recognize it as a problem. If they don’t see it as a problem, then it doesn’t need a solution.

The Apache Software Foundation Announces Apache™ Storm™ as a Top-Level Project

Monday, September 29th, 2014

The Apache Software Foundation Announces Apache™ Storm™ as a Top-Level Project

From the post:

The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 200 Open Source projects and initiatives, announced today that Apache™ Storm™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project’s community and products have been well-governed under the ASF’s meritocratic process and principles.

“Apache Storm’s graduation is not only an indication of its maturity as a technology, but also of the robust, active community that develops and supports it,” said P. Taylor Goetz, Vice President of Apache Storm. “Storm’s vibrant community ensures that Storm will continue to evolve to meet the demands of real-time stream processing and computation use cases.”

Apache Storm is a high-performance, easy-to-implement distributed real-time computation framework for processing fast, large streams of data, adding reliable data processing capabilities to Apache Hadoop. Using Storm, a Hadoop cluster can efficiently process a full range of workloads, from real-time to interactive to batch.

As with all Apache products, Apache Storm software is released under the Apache License v2.0, and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project’s day-to-day operations, including community development and product releases. For documentation and ways to become involved with Apache Storm, visit and @Apache_Storm on Twitter.

You will see many notices of Apache™ Storm™’s graduation to a Top-Level Project. Odds are you have already seen one. But, like the weather channel reporting rain at your location, someone may have missed the news. 😉

Apache Storm 0.9 Training Deck and Tutorial

Monday, September 15th, 2014

Apache Storm 0.9 Training Deck and Tutorial by Michael G. Noll.

From the post:

Today I am happy to share an extensive training deck on Apache Storm version 0.9, which covers Storm’s core concepts, operating Storm in production, and developing Storm applications. I also discuss data serialization with Apache Avro and Twitter Bijection.

The training deck (130 slides) is aimed at developers, operations, and architects.

What the training deck covers

  1. Introducing Storm: history, Storm adoption in the industry, why Storm
  2. Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism
  3. Operating Storm: architecture, hardware specs, deploying, monitoring
  4. Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps (with kafka-storm-starter), performance and scalability tuning
  5. Playing with Storm using Wirbelsturm

What a great way to start the week! Well, at least if you were intending to start learning about Storm this week.

BTW, see Michael’s post for links to other resources, such as his tutorial on Kafka.

HDP 2.1 Tutorials

Wednesday, August 13th, 2014

HDP 2.1 tutorials from Hortonworks:

  1. Securing your Data Lake Resource & Auditing User Access with HDP Security
  2. Searching Data with Apache Solr
  3. Define and Process Data Pipelines in Hadoop with Apache Falcon
  4. Interactive Query for Hadoop with Apache Hive on Apache Tez
  5. Processing streaming data in Hadoop with Apache Storm
  6. Securing your Hadoop Infrastructure with Apache Knox

The quality you have come to expect from Hortonwork tutorials but the data sets are a bit dull.

What data sets would you suggest to spice up this tutorials?

Summingbird:… [VLDB 2014]

Monday, August 4th, 2014

Summingbird: A Framework for Integrating Batch and Online MapReduce Computations by Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin.


Summingbird is an open-source domain-specifi c language implemented in Scala and designed to integrate online and batch MapReduce computations in a single framework. Summingbird programs are written using data flow abstractions such as sources, sinks, and stores, and can run on diff erent execution platforms: Hadoop for batch processing (via Scalding/Cascading) and Storm for online processing. Different execution modes require di fferent bindings for the data flow abstractions (e.g., HDFS files or message queues for the source) but do not require any changes to the program logic. Furthermore, Summingbird can operate in a hybrid processing mode that transparently integrates batch and online results to efficiently generate up-to-date aggregations over long time spans. The language was designed to improve developer productivity and address pain points in building analytics solutions at Twitter where often, the same code needs to be written twice (once for batch processing and again for online processing) and indefi nitely maintained in parallel. Our key insight is that certain algebraic structures provide the theoretical foundation for integrating batch and online processing in a seamless fashion. This means that Summingbird imposes constraints on the types of aggregations that can be performed, although in practice we have not found these constraints to be overly restrictive for a broad range of analytics tasks at Twitter.

Heavy sledding but deeply interesting work. Particularly about “…integrating batch and online processing in a seamless fashion.”

I first saw this in a tweet by Jimmy Lin.

Storm 0.9.2 released

Wednesday, June 25th, 2014

Storm 0.9.2 released

From the post:

We are pleased to announce that Storm 0.9.2-incubating has been released and is available from the downloads page. This release includes many important fixes and improvements.

There are a number of fixes and improvements but the topology visualization tool by Kyle Nusbaum (@knusbaum) will be the one that catches your eye.

Upgrade before the next release catches you. 😉

Analyzing 1.2 Million Network Packets…

Sunday, June 15th, 2014

Analyzing 1.2 Million Network Packets per Second in Real Time by James Sirota and Sheetal Dolas.

Slides giving an overview of OpenSOC (Open Security Operations Center).

I mention this in case you are not the NSA and simply streaming the backbone of the Internet to storage for later analysis. Some business cases require real time results.

The project is also a good demonstration of building a high throughput system using only open source software.

Not to mention a useful collaboration between Cisco and Hortonworks.

BTW, take a look at slide 18. I would say they are adding information to the representative of a subject, wouldn’t you? While on the surface this looks easy, merging that data with other data, say held by local law enforcement, might not be so easy.

For example, depending on where you are intercepting traffic, you will be told I am about thirty (30) miles from my present physical location or some other answer. 😉 Now, if someone had annotated an earlier packet with that information and it was accessible to you, well, your targeting of my location could be a good deal more precise.

And there is the question of using data annotated by different sources who may have been attacked by the same person or group.

Even at 1.2 million packets per second there is still a role for subject identity and merging.


Friday, May 23rd, 2014

Kafka-Storm-Starter by Michael G. Noll.

From the webpage:

Code examples that show to integrate Apache Kafka 0.8+ with Apache Storm 0.9+, while using Apache Avro as the data serialization format.

If you aren’t excited already (from their respective homepages):

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

Apache Storm is a free and open source distributed realtime computation system.

Apache Avro™ is a data serialization system.

Now are you excited?


Note the superior organization of the project documentation!

Following the table of contents you find:

Quick Start

Show me!

$ ./sbt test

Short of starting up remotely and allowing you to import/keyboard data, I can’t imagine an easier way to start project documentation.

It’s a long weekend in the United States so check out Michael G. Noll’s GitHub repository for other interesting projects.

Clojure and Storm

Sunday, April 13th, 2014

Two recent posts by Charles Ditzel, here and here, offer pointers to resources on Clojure and Storm.

For your reading convenience:

Storm Apache site.

Building an Activity Feed Stream with Storm Recipe from the Clojure Cookbook.

Storm: distributed and fault-tolerant realtime computation by Nathan Marz. (Presentation, 2012)

Storm Topologies

StormScreenCast2 Storm in use at Twitter (2011)


Hortonworks Data Platform 2.1

Wednesday, April 2nd, 2014

Hortonworks Data Platform 2.1 by Jim Walker.

From the post:

The pace of innovation within the Apache Hadoop community is truly remarkable, enabling us to announce the availability of Hortonworks Data Platform 2.1, incorporating the very latest innovations from the Hadoop community in an integrated, tested, and completely open enterprise data platform.

A VM available now, full releases to follow later in April.

Just grabbing the headings from Jim’s post:

The Stinger Initiative: Apache Hive, Tez and YARN for Interactive Query

Data Governance with Apache Falcon

Security with Apache Knox

Stream Processing with Apache Storm

Searching Hadoop Data with Apache Solr

Advanced Operations with Apache Ambari

See Jim’s post for some of the details and the VM for others.

Algebird 0.5.0 Released

Thursday, March 6th, 2014

Algebird 0.5.0

From the webpage:

Abstract algebra for Scala. This code is targeted at building aggregation systems (via Scalding or Storm). It was originally developed as part of Scalding’s Matrix API, where Matrices had values which are elements of Monoids, Groups, or Rings. Subsequently, it was clear that the code had broader application within Scalding and on other projects within Twitter.

Other links you will find helpful:

0.5.0 Release notes.

Algebird mailing list.

Algebird Wiki.

Zooming Through Historical Data…

Monday, January 20th, 2014

Zooming Through Historical Data with Streaming Micro Queries by Alex Woodie.

From the post:

Stream processing engines, such as Storm and S4, are commonly used to analyze real-time data as it flows into an organization. But did you know you can use this technology to analyze historical data too? A company called ZoomData recently showed how.

In a recent YouTube presentation, Zoomdata Justin Langseth demonstrated his company’s technology, which combines open source stream processing engines like Apache with data connection and visualization libraries based on D3.js.

“We’re doing data analytics and visualization a little differently than it’s traditionally done,” Langseth says in the video. “Legacy BI tools will generate a big SQL statement, run it against Oracle or Teradata, then wait for two to 20 to 200 seconds before showing it to the user. We use a different approach based on the Storm stream processing engine.”

Once hooked up to a data source–such as Cloudera Impala or Amazon Redshift–data is then fed into the Zoomdata platform, which performs calculations against the data as it flows in, “kind of like continues event processing but geared more toward analytics,” Langseth says.

From the video description:

In this hands-on webcast you’ll learn how LivePerson and Zoomdata perform stream processing and visualization on mobile devices of structured site traffic and unstructured chat data in real-time for business decision making. Technologies include Kafka, Storm, and d3.js for visualization on mobile devices. Byron Ellis, Data Scientist for LivePerson will join Justin Langseth of Zoomdata to discuss and demonstrate the solution.

After watching the video, what do you think the concept of “micro queries?”

I ask because I don’t know of any technical reason why a “large” query could not stream out interim results and display those as more results were arriving.

Visualization isn’t usually done that way but that brings me to my next question: Assuming we have interim results visualized, how useful are interim results? Being actionable on interim results really depends on the domain.

I rather like Zoomdata’s emphasis on historical data and the video is impressive.

You can download a VM at Zoomdata.

If you can think of upsides/downsides to the interim results issue, please give a shout!

Storm Technical Preview Available Now!

Friday, December 13th, 2013

Storm Technical Preview Available Now! by Himanshu Bari.

From the post:

In October, we announced our intent to include and support Storm as part of Hortonworks Data Platform. With this commitment, we also outlined and proposed an open roadmap to improve the enterprise readiness of this key project. We are committed to doing this with a 100% open source approach and your feedback is immensely valuable in this process.

Today, we invite you to take a look at our Storm technical preview. This preview includes the latest release of Storm with instructions on how to install Storm on Hortonworks Sandbox and run a sample topology to familiarize yourself with the technology. This is the final pre-Apache release of Storm.

You know this but I wanted to emphasize how your participation in alpha/beta/candidate/preview releases benefits not only the community but yourself as well.

Bugs that are found and squashed now won’t bother you (or anyone else) later in production.

Not to mention you get to exercise your skills before using the software become routine and so does your use of it.

Enjoy the weekend!

Of Algebirds, Monoids, Monads, …

Tuesday, December 3rd, 2013

Of Algebirds, Monoids, Monads, and Other Bestiary for Large-Scale Data Analytics by Michael G. Noll.

From the post:

Have you ever asked yourself what monoids and monads are, and particularly why they seem to be so attractive in the field of large-scale data processing? Twitter recently open-sourced Algebird, which provides you with a JVM library to work with such algebraic data structures. Algebird is already being used in Big Data tools such as Scalding and SummingBird, which means you can use Algebird as a mechanism to plug your own data structures – e.g. Bloom filters, HyperLogLog – directly into large-scale data processing platforms such as Hadoop and Storm. In this post I will show you how to get started with Algebird, introduce you to monoids and monads, and address the question why you get interested in those in the first place.

Goal of this article

The main goal of this is article is to spark your curiosity and motivation for Algebird and the concepts of monoid, monads, and category theory in general. In other words, I want to address the questions “What’s the big deal? Why should I care? And how can these theoretical concepts help me in my daily work?”

You can call this a “blog post” but I rarely see blog posts with a table of contents! 😉

The post should come with a warning: May require substantial time to read, digest, understand.

Just so you know, I was hooked by this paragraph early on:

So let me use a different example because adding Int values is indeed trivial. Imagine that you are working on large-scale data analytics that make heavy use of Bloom filters. Your applications are based on highly-parallel tools such as Hadoop or Storm, and they create and work with many such Bloom filters in parallel. Now the money question is: How do you combine or add two Bloom filters in an easy way?

Are you motivated?

I first saw this in a tweet by CompSciFact.

Storm, Neo4j and Python:…

Wednesday, November 20th, 2013

Storm, Neo4j and Python: Real-Time Stream Computation on Graphs by Sonal Raj.

From the webpage:

This page serves a resource repository for my talk at Pycon India 2013 held at Bangalore, India on 30th August – 1st September, 2013. The talk introduces the basics of the Storm real-time distributed Computation Platform popularised by Twitter, and the Neo4J Graph Database and goes on to explain how they can be used in conjuction to perform real-time computations on Graph Data with the help of emerging python libraries – py2neo (for Neo4J) and petrel (for Storm)

Great slides, code skeletons, pointers to references and a live visualization!

See the video at: PyCon India 2013.

Demo gremlins mar the demonstration part but you can see:

A Storm Topology on AWS showing signup locations for people joining based on a sample Social Network data

A quote from the slides that sticks with me:

Process Infinite Streams of data one-tuple-at-a-time.


Applying the Big Data Lambda Architecture

Sunday, October 27th, 2013

Applying the Big Data Lambda Architecture by Michael Hausenblas.

From the article:

Based on his experience working on distributed data processing systems at Twitter, Nathan Marz recently designed a generic architecture addressing common requirements, which he called the Lambda Architecture. Marz is well-known in Big Data: He’s the driving force behind Storm and at Twitter he  led the streaming compute team, which provides and develops shared infrastructure to support critical real-time applications.

Marz and his team described the underlying motivation for building systems with the lambda architecture as:

  • The need for a robust system that is fault-tolerant, both against hardware failures and human mistakes.
  • To serve a wide range of workloads and use cases, in which low-latency reads and updates are required. Related to this point, the system should support ad-hoc queries.
  • The system should be linearly scalable, and it should scale out rather than up, meaning that throwing more machines at the problem will do the job.
  • The system should be extensible so that features can be added easily, and it should be easily debuggable and require minimal maintenance.

From a bird’s eye view the lambda architecture has three major components that interact with new data coming in and responds to queries, which in this article are driven from the command line:

The goal of the article:

In this article, I employ the lambda architecture to implement what I call UberSocialNet (USN). This open-source project enables users to store and query acquaintanceship data. That is, I want to be able to capture whether I happen to know someone from multiple social networks, such as Twitter or LinkedIn, or from real-life circumstances. The aim is to scale out to several billions of users while providing low-latency access to the stored information. To keep the system simple and comprehensible, I limit myself to bulk import of the data (no capabilities to live-stream data from social networks) and provide only a very simple a command-line user interface. The guts, however, use the lambda architecture.

Something a bit challenging for the start of the week. 😉


Sunday, September 22nd, 2013


From the webpage:

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.

  • Simple API: Unlike most low-level messaging system APIs, Samza provides a very simple call-back based "process message" API that should be familiar to anyone who's used Map/Reduce.
  • Managed state: Samza manages snapshotting and restoration of a stream processor's state. Samza will restore a stream processor's state to a snapshot consistent with the processor's last read messages when the processor is restarted.
  • Fault tolerance: Samza will work with YARN to restart your stream processor if there is a machine or processor failure.
  • Durability: Samza uses Kafka to guarantee that messages will be processed in the order they were written to a partition, and that no messages will ever be lost.
  • Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, re-playable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.
  • Pluggable: Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.
  • Processor isolation: Samza works with Apache YARN, which supports processor security through Hadoop's security model, and resource isolation through Linux CGroups.

Check out Hello Samza to try Samza. Read the Background page to learn more about Samza.

Ironic that I should find Samza when I was searching for the new incubator page for Storm. 😉

In fact, I found a comparison of Samza and Storm.

You can learn a great deal about Storm (and Samza) reading the comparison. It’s quite good.

Apache Takes Storm Into Incubation

Sunday, September 22nd, 2013

Apache Takes Storm Into Incubation by Isaac Lopez.

From the post:

Get used to saying it: “Apache Storm.”

On Wednesday night, Doug Cutting, Director for the Apache Software Foundation (ASF), announced that the organization will be adding the distributed real time computation system known as Storm as the foundations newest Incubator podling.

Storm was created by BackType lead engineer, Nathan Marz in early 2011, before the software (along with the entire company) was acquired by Twitter. At Twitter, Storm became the back bone of the social giant’s web analytics framework, tracking every click happening within the rapidly-expanding Twittersphere. The Blue Bird also uses Storm as part of its “What’s Trending” widget.

In September of 2011, Marz announced that Storm would be released into open source, where it has enjoyed a great deal of success, getting used by such companies as Groupon, Yahoo!, InfoChimps, NaviSite, Nodeable, Ooyala, The Weather Channel, and more.

In case you don’t know Storm:

Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, is used by many companies, and is a lot of fun to use!

The Rationale page on the wiki explains what Storm is and why it was built. This presentation is also a good introduction to the project.

Storm has a website at Follow @stormprocessor on Twitter for updates on the project. (From the Storm Github page.)

When the Apache pages for Storm are posted I will update this post.

Summingbird [Twitter open sources]

Tuesday, September 3rd, 2013

Twitter open sources Storm-Hadoop hybrid called Summingbird by Derrick Harris.

I look away for a few hours to review a specification and look what pops up:

Twitter has open sourced a system that aims to mitigate the tradeoffs between batch processing and stream processing by combining them into a hybrid system. In the case of Twitter, Hadoop handles batch processing, Storm handles stream processing, and the hybrid system is called Summingbird. It’s not a tool for every job, but it sounds pretty handy for those it’s designed to address.

Twitter’s blog post announcing Summingbird is pretty technical, but the problem is pretty easy to understand if you think about how Twitter works. Services like Trending Topics and search require real-time processing of data to be useful, but they eventually need to be accurate and probably analyzed a little more thoroughly. Storm is like a hospital’s triage unit, while Hadoop is like longer-term patient care.

This description of Summingbird from the project’s wiki does a pretty good job of explaining how it works at a high level.


While the Summingbird announcement is heavy sledding, it is well written. The projects spawned by Summingbird are rife with possibilities.

I appreciate Derrick’s comment:

It’s not a tool for every job, but it sounds pretty handy for those it’s designed to address.

I don’t know of any tools “for every job,” the opinions of some graph advocates notwithstanding. 😉

If Summingbird fits your problem set, spend some serious time seeing what it has to offer.

Streaming IN Hadoop: Yahoo! release Storm-YARN

Saturday, June 15th, 2013

Streaming IN Hadoop: Yahoo! release Storm-YARN by Jim Walker.

From the post:

Over the past year, customers have told us they want to store all their data in one place and interact with it in multiple ways… they want to use Hadoop, but in order to do so, it needs to extend beyond batch. It also needs to be interactive and real-time (among others).

This is the entire principle behind YARN, which together with others in the community, Arun Murthy and the team at Hortonworks have been working on for more than 5 years! The YARN based architecture of Hadoop 2.0 is hugely significant and we have been working closely with many partners to incorporate it into their applications.

Storm-YARN Released as Open Source

Yahoo! has been testing Hadoop 2 and its YARN-based architecture for quite some time. All the while they have worked on the convergence of the streaming framework Storm with Hadoop. This work has resulted in a YARN based version of Storm that will radically improve performance and resource management for streaming.

The release blog post from Yahoo.

Processing of data, even big data, is approaching “interactive and real-time,” although I suspect definitions of those terms vary. What is “interactive” for an automated trader might be too fast for human trader.

What I haven’t seen is concurrent development on the handling of the semantics of big data.

After the initial hysteria over the scope of NSA snooping, except for cases where the NSA was given the identity of a suspect (and not always then), was its data gathering of any use.

In topic map terms, the semantic impedance between the data systems was too great for useful manipulation of the data sets as one.

Streaming in Hadoop is welcome news, but until we can robustly manages the semantics of data in streams, much gold is going to pass uncollected from streams.

Real-Time Data Aggregation [White Paper Registration Warning]

Tuesday, April 30th, 2013

Real-Time Data Aggregation by Caroline Lim.

From the post:

Fast response times generate costs savings and greater revenue. Enterprise data architectures are incomplete unless they can ingest, analyze, and react to data in real-time as it is generated. While previously inaccessible or too complex — scalable, affordable real-time solutions are now finally available to any enterprise.

Infochimps Cloud::Streams

Read Infochimps’ newest whitepaper on how Infochimps Cloud::Streams is a proprietary stream processing framework based on four years of experience with sourcing and analyzing both bulk and in-motion data sources. It offers a linearly and fault-tolerant stream processing engine that leverages a number of well-proven web-scale solutions built by Twitter and Linkedin engineers, with an emphasis on enterprise-class scalability, robustness, and ease of use.

The price of this whitepaper is disclosure of your contact information.

Annoying considering the lack of substantive content about the solution. The use cases are mildly interesting but admit to any number of similar solutions.

If you need real-time data aggregation, skip the white paper and contact your IT consultant/vendor. (Including Infochimps, who do very good work, which is why a non-substantive white paper is so annoying.)

MOA Massively Online Analysis

Saturday, December 1st, 2012

MOA Massively Online Analysis : Real Time Analytics for Data Streams

From the homepage:

What is MOA?

MOA is an open source framework for data stream mining. It includes a collection of machine learning algorithms (classification, regression, and clustering) and tools for evaluation. Related to the WEKA project, MOA is also written in Java, while scaling to more demanding problems.

What can MOA do for you?

MOA performs BIG DATA stream mining in real time, and large scale machine learning. MOA can be easily used with Hadoop, S4 or Storm, and extended with new mining algorithms, and new stream generators or evaluation measures. The goal is to provide a benchmark suite for the stream mining community. Details.

Short tutorials and a manual are available. Enough to get started but you will need additional resources on machine learning if it isn’t already familiar.

A small niggle about documentation. Many projects have files named “tutorial” or in this case “Tutorial1,” or “Manual.” Those files are easier to discover/save, if the project name, version(?), is prepended to tutorial or manual. Thus “Moa-2012-08-tutorial1” or “Moa-2012-08-manual.”

If data streams are in your present or future, definitely worth a look.

Real-time Big Data Analytics Engine – Twitter’s Storm

Monday, October 15th, 2012

Real-time Big Data Analytics Engine – Twitter’s Storm by Istvan Szegedi.

From the post:

Hadoop is a batch-oriented big data solution at its heart and leaves gaps in ad-hoc and real-time data processing at massive scale so some people have already started counting its days as we know it now. As one of the alternatives, we have already seen Google BigQuery to support ad-hoc analytics and this time the post is about Twitter’s Storm real-time computation engine which aims to provide solution in the real-time data analytics world. Storm was originally developed by BackType and running now under Twitter’s name, after BackType has been acquired by them. The need for having a dedicated real-time analytics solution was explained by Nathan Marz as follows: “There’s no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing…. The lack of a “Hadoop of realtime” has become the biggest hole in the data processing ecosystem. Storm fills that hole.”

Introduction to Storm, including a walk through the word count typology example that comes with the current download.

A useful addition to your toolkit!

Choking Cassandra Bolt

Tuesday, August 7th, 2012

Got your attention? Good!

Brian O’Neill details in A Big Data Trifecta: Storm, Kafka and Cassandra an architecture that was fast enough to choke the Cassandra Bolt component. (And also details how to fix that problem.)

Based on the exchange of tuples. Writing at 5,000 writes per second on a laptop.

More details to follow but I think you can get enough from the post to start experimenting on your own.

I first saw this at: Alex Popesu’s myNoSQL under A Big Data Triefecta: Storm, Kafka and Cassandra.

Dempsy – a New Real-time Framework for Processing BigData

Friday, May 4th, 2012

Dempsy – a New Real-time Framework for Processing BigData by Boris Lublinsky.

From the post:

Real time processing of BigData seems to be one of the hottest topics today. Nokia has just released a new open-source project – Dempsy. Dempsy is comparable to Storm, Esper, Streambase, HStreaming and Apache S4. The code is released under the Apache 2 license

Dempsy is meant to solve the problem of processing large amounts of "near real time" stream data with the lowest lag possible; problems where latency is more important that "guaranteed delivery." This class of problems includes use cases such as:

  • Real time monitoring of large distributed systems
  • Processing complete rich streams of social networking data
  • Real time analytics on log information generated from widely distributed systems
  • Statistical analytics on real-time vehicle traffic information on a global basis

The important properties of Dempsy are:

  • It is Distributed. That is to say a Dempsy application can run on multiple JVMs on multiple physical machines.
  • It is Elastic. That is, it is relatively simple to scale an application to more (or fewer) nodes. This does not require code or configuration changes but done by dynamic insertion or removal of processing nodes.
  • It implements Message Processing. Dempsy is based on message passing. It moves messages between Message processors, which act on the messages to perform simple atomic operations such as enrichment, transformation, etc. In general, an application is intended to be broken down into more smaller simpler processors rather than fewer large complex processors.
  • It is a Framework. It is not an application container like a J2EE container, nor a simple library. Instead, like the Spring Framework, it is a collection of patterns, the libraries to enable those patterns, and the interfaces one must implement to use those libraries to implement the patterns.

Dempsy’ programming model is based on message processors communicating via messages and resembles a distributed actor framework . While not strictly speaking an actor framework in the sense of Erlang or Akka actors, where actors explicitely direct messages to other actors, Dempsy’s Message Processors are "actor like POJOs" similar to Processor Elements in S4 and to some extent Bolts in Storm. Message processors are similar to actors in that they operate on a single message at a time, and need not deal with concurrency directly. Unlike actors, Message Processors also are relieved of the the need to know the destination(s) for their output messages, as this is handled inside by Dempsy based on the message properties.

In short Dempsy is a framework to enable the decomposing of a large class of message processing problems into flows of messages between relatively simple processing units implemented as POJOs. 

The Dempsy Tutorial contains more information.

See the post for an interview with Dempsy’s creator, NAVTEQ Fellow Jim Carroll.

Will the “age of data” mean that applications and their code will also be viewed and processed as data? The capabilities you have are those you request for a particular data set? Would like to see topic maps on the leading (and not dragging) edge of that change.