Archive for the ‘Summingbird’ Category

Summingbird:… [VLDB 2014]

Monday, August 4th, 2014

Summingbird: A Framework for Integrating Batch and Online MapReduce Computations by Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin.

Abstract:

Summingbird is an open-source domain-specifi c language implemented in Scala and designed to integrate online and batch MapReduce computations in a single framework. Summingbird programs are written using data flow abstractions such as sources, sinks, and stores, and can run on diff erent execution platforms: Hadoop for batch processing (via Scalding/Cascading) and Storm for online processing. Different execution modes require di fferent bindings for the data flow abstractions (e.g., HDFS files or message queues for the source) but do not require any changes to the program logic. Furthermore, Summingbird can operate in a hybrid processing mode that transparently integrates batch and online results to efficiently generate up-to-date aggregations over long time spans. The language was designed to improve developer productivity and address pain points in building analytics solutions at Twitter where often, the same code needs to be written twice (once for batch processing and again for online processing) and indefi nitely maintained in parallel. Our key insight is that certain algebraic structures provide the theoretical foundation for integrating batch and online processing in a seamless fashion. This means that Summingbird imposes constraints on the types of aggregations that can be performed, although in practice we have not found these constraints to be overly restrictive for a broad range of analytics tasks at Twitter.

Heavy sledding but deeply interesting work. Particularly about “…integrating batch and online processing in a seamless fashion.”

I first saw this in a tweet by Jimmy Lin.

The Road to Summingbird:…

Sunday, January 12th, 2014

The Road to Summingbird: Stream Processing at (Every) Scale by Sam Ritchie.

Description:

Twitter’s Summingbird library allows developers and data scientists to build massive streaming MapReduce pipelines without worrying about the usual mess of systems issues that come with realtime systems at scale.

But what if your project is not quite at “scale” yet? Should you ignore scale until it becomes a problem, or swallow the pill ahead of time? Is using Summingbird overkill for small projects? I argue that it’s not. This talk will discuss the ideas and components of Summingbird that you could, and SHOULD, use in your startup’s code from day one. You’ll come away with a new appreciation for monoids and semigroups and a thirst for abstract algebra.

A slide deck that will make you regret missing the presentation.

I wasn’t able to find a video of Sam’s presentation at Data Day Texas 2014, but I did find a collection of his presentations, including some videos, at: http://sritchie.github.io/.

Valuable lessons for startups and others.

Summingbird [Twitter open sources]

Tuesday, September 3rd, 2013

Twitter open sources Storm-Hadoop hybrid called Summingbird by Derrick Harris.

I look away for a few hours to review a specification and look what pops up:

Twitter has open sourced a system that aims to mitigate the tradeoffs between batch processing and stream processing by combining them into a hybrid system. In the case of Twitter, Hadoop handles batch processing, Storm handles stream processing, and the hybrid system is called Summingbird. It’s not a tool for every job, but it sounds pretty handy for those it’s designed to address.

Twitter’s blog post announcing Summingbird is pretty technical, but the problem is pretty easy to understand if you think about how Twitter works. Services like Trending Topics and search require real-time processing of data to be useful, but they eventually need to be accurate and probably analyzed a little more thoroughly. Storm is like a hospital’s triage unit, while Hadoop is like longer-term patient care.

This description of Summingbird from the project’s wiki does a pretty good job of explaining how it works at a high level.

(…)

While the Summingbird announcement is heavy sledding, it is well written. The projects spawned by Summingbird are rife with possibilities.

I appreciate Derrick’s comment:

It’s not a tool for every job, but it sounds pretty handy for those it’s designed to address.

I don’t know of any tools “for every job,” the opinions of some graph advocates notwithstanding. 😉

If Summingbird fits your problem set, spend some serious time seeing what it has to offer.