Archive for the ‘Scalding’ Category

Open-sourcing tools for Hadoop

Saturday, November 22nd, 2014

Open-sourcing tools for Hadoop by Colin Marc.

From the post:

Stripe’s batch data infrastructure is built largely on top of Apache Hadoop. We use these systems for everything from fraud modeling to business analytics, and we’re open-sourcing a few pieces today:


Timberlake is a dashboard that gives you insight into the Hadoop jobs running on your cluster. Jeff built it as a replacement for YARN’s ResourceManager and MRv2’s JobHistory server, and it has some features we’ve found useful:

  • Map and reduce task waterfalls and timing plots
  • Scalding and Cascading awareness
  • Error tracebacks for failed jobs


Avi wrote a Scala framework for distributed learning of ensemble decision tree models called Brushfire. It’s inspired by Google’s PLANET, but built on Hadoop and Scalding. Designed to be highly generic, Brushfire can build and validate random forests and similar models from very large amounts of training data.


Sequins is a static database for serving data in Hadoop’s SequenceFile format. I wrote it to provide low-latency access to key/value aggregates generated by Hadoop. For example, we use it to give our API access to historical fraud modeling features, without adding an online dependency on HDFS.


At Stripe, we use Parquet extensively, especially in tandem with Cloudera Impala. Danielle, Jeff, and Avi wrote Herringbone (a collection of small command-line utilities) to make working with Parquet and Impala easier.

More open source tools for your Hadoop installation!

I am considering creating a list of closed source tools for Hadoop. It would be shorter and easier to maintain than a list of open source tools for Hadoop. 😉

Processing 3.0a1

Monday, July 28th, 2014

Processing 3.0a1

From the description:

3.0a1 (26 July 2014) Win 32 / Win 64 / Linux 32 / Linux 64 / Mac OS X.

The revisions cover incremental changes between releases, and are especially important to read for pre-releases.

From the revisions:

Kicking off the 3.0 release process. The focus for Processing 3 is improving the editor and the coding process, so we’ll be integrating what was formerly PDE X as the main editor.

This release also includes a number of bug fixes and changes, based on in-progress Google Summer of Code projects and a few helpful souls on Github.

Please contribute to the Processing 3 release by testing and reporting bugs. Or better yet, helping us fix them and submitting pull requests.

In case you are unfamiliar with Processing:

Processing is a programming language, development environment, and online community. Since 2001, Processing has promoted software literacy within the visual arts and visual literacy within technology. Initially created to serve as a software sketchbook and to teach computer programming fundamentals within a visual context, Processing evolved into a development tool for professionals. Today, there are tens of thousands of students, artists, designers, researchers, and hobbyists who use Processing for learning, prototyping, and production.


Scalding Source Code

Thursday, June 12th, 2014

Programming MapReduce With Scalding

Source code for PACKT book: Programming MapReduce With Scalding (June 2014).

I haven’t seen the book and there are no sample chapters, yet.

Ping me if you post comments about it.


Scalding 0.9: Get it while it’s hot!

Thursday, April 10th, 2014

Scalding 0.9: Get it while it’s hot! by P. Oscar Boykin.

From the post:

It’s been just over two years since we open sourced Scalding and today we are very excited to release the 0.9 version. Scalding at Twitter powers everything from internal and external facing dashboards, to custom relevance and ad targeting algorithms, including many graph algorithms such as PageRank, approximate user cosine similarity and many more.

Oscar covers:

  • Joins
  • Input/output
    • Parquet Format
    • Avro
    • TemplateTap
  • Hadoop counters
  • Typed API
  • Matrix API

Or if you want something a bit more visual and just as enthusiastic, see:

Basically the same content but with Oscar live!

Algebird 0.5.0 Released

Thursday, March 6th, 2014

Algebird 0.5.0

From the webpage:

Abstract algebra for Scala. This code is targeted at building aggregation systems (via Scalding or Storm). It was originally developed as part of Scalding’s Matrix API, where Matrices had values which are elements of Monoids, Groups, or Rings. Subsequently, it was clear that the code had broader application within Scalding and on other projects within Twitter.

Other links you will find helpful:

0.5.0 Release notes.

Algebird mailing list.

Algebird Wiki.

PredictionIO Guide

Sunday, October 20th, 2013

PredictionIO Guide

From the webpage:

PredictionIO is an open source Machine Learning Server. It empowers programmers and data engineers to build smart applications. With PredictionIO, you can add the following features to your apps instantly:

  • predict user behaviors
  • offer personalized video, news, deals, ads and job openings
  • help users to discover interesting events, documents, apps and restaurants
  • provide impressive match-making services
  • and more….

PredictionIO is built on top of solid open source technology. We support Hadoop, Mahout, Cascading and Scalding natively.

PredictionIO looks interesting in general but especially its Item Similarity Engine.

From the Item Similarity: Overview:

People who like this may also like….

This engine tries to suggest N items that are similar to a targeted item. By being ‘similar’, it does not necessarily mean that the two items look alike, nor they share similar attributes. The definition of similarity is independently defined by each algorithm and is usually calculated by a distance function. The built-in algorithms assume that similarity between two items means the likelihood any user would like (or buy, view etc) both of them.

The example that comes to mind is merging all “shoes” from any store and using the resulting price “occurrences” to create a price range and average for each store.

Cascading and Scalding

Tuesday, May 28th, 2013

Cascading and Scalding by Danny Bickson.

Danny has posted some links for Cascading and Scalding, alternatives to Pig.

I continue to be curious about documentation of semantics for Pig scripts or any of its alternatives.

Or for that matter, in any medium to large-sized mapreduce shop, how do you index those semantics?

“Functional Programming for…Big Data”

Wednesday, March 20th, 2013

“Functional Programming for optimization problems in Big Data” by Paco Nathan.

Interesting slide deck, even if it doesn’t start with high drama. 😉


  1. Data Science
  2. Functional Programming
  3. Workflow Abstraction
  4. Typical Use Cases
  5. Open Data Example

The reading list mentioned in these slides makes a nice self-review course in data science.

The Open Data Example is for Palo Alto but you can substitute a city with open data closer to home.

Programming Isn’t Math

Sunday, March 10th, 2013

Programming Isn’t Math by Oscar Boykin.

From the description:

Functional programming has a rich history of drawing from mathematical theory, yet in this highly entertaining talk from the Northeast Scala Symposium, Twitter data scientist Oscar Boykin make the case that programming is distinct from mathematics. This distinction is healthy and does not mean we can’t leverage many results and concepts from mathematics.

As examples, Oscar will discuss some recent work — algebird, bijection, scalding — and show cases where mathematical purity were both helpful and harmful to developing products at Twitter.

The phrase “…highly entertaining…” may be an understatement.

The type of presentation where you want to starting reading new material during the presentation but you are afraid of missing the next gold nugget!

Definitely one to start the week on!

An Overview of Scalding

Saturday, March 2nd, 2013

An Overview of Scalding by Dean Wampler.

From the description:

Dean Wampler, Ph.D., is Principal Consultant at Think Big Analytics. In this video he will cover its benefits over the Java API include a dramatic reduction in the source code required, reflecting several Scala improvements over Java, full access to “functional programming” constructs that are ideal for data problems, and a Matrix library addition to support machine learning and other algorithms. He also demonstrates the benefits of Scalding using examples and explains just enough Scala syntax so you can follow along. Dean’s philosophy is that there is no better way to write general-purpose Hadoop MapReduce programs when specialized tools like Hive and Pig aren’t quite what you need. This presentation was given on February 12th at the Nokia offices in Chicago, IL.


During this period of rapid innovation around “big data,” what interests me is the development of tools to fit problems.

As opposed to fitting problems to fixed data models and tools.

Both require a great deal of skill, but they are different skill sets.


I first saw this at Alex Popescu’s myNoSQL.

A Quick Guide to Hadoop Map-Reduce Frameworks

Thursday, February 7th, 2013

A Quick Guide to Hadoop Map-Reduce Frameworks by Alex Popescu.

Alex has assembled links to guides to MapReduce frameworks:

Thanks Alex!

Intro to Scalding by @posco and @argyris [video lecture]

Sunday, November 4th, 2012

Intro to Scalding by @posco and @argyris by Marti Hearst.

From the post:

On Thursday we learned about an alternative language for analyzing big data: Scalding. It’s built on Scala and is used extensively by the Twitter Revenue group. Oscar Boykin presented a lecture that he and Argyris Zymnis put together for us:

(video – see Marti’s post)

Because scalding is built on the functional programming language Scala, it has advantage oover Pig in that you can have the equivalent of user-defined functions directly in your code. See for the lecture notes more details. Be sure watch the video to get all the details especially since Oscar managed to make us all laugh throughout his lecture. Thanks guys!

Another great lecture from Marti’s class, “Analyzing Big Data with Twitter.”

When the revenue department of a business, at least a successful business, starts using a technology, it’s time to take notice.

Twitter’s Scalding and Algebird: Matrix and Lighweight Algebra Library

Tuesday, September 25th, 2012

Twitter’s Scalding and Algebird: Matrix and Lighweight Algebra Library by Alex Popescu.

Alex points out:

  1. Scalding now includes a type-safe Matrix API
  2. In the familiar Fields API, we’ve added the ability to add type information to fields which allows scalding to pick up Ordering instances so that grouping on almost any scala collection becomes easy.
  3. Algebird is our lightweight abstract algebra library for Scala and is targeted for building aggregation systems (such as Storm).

Of the three, I am going to take a look at Algebird first.

Scalding for the Impatient

Sunday, August 12th, 2012

Scalding for the Impatient by Sujit Pal.

From the post:

Few weeks ago, I wrote about Pig, a DSL that allows you to specify a data processing flow in terms of PigLatin operations, and results in a sequence of Map-Reduce jobs on the backend. Cascading is similar to Pig, except that it provides a (functional) Java API to specify a data processing flow. One obvious advantage is that everything can now be in a single language (no more having to worry about UDF integration issues). But there are others as well, as detailed here and here.

Cascading is well documented, and there is also a very entertaining series of articles titled Cascading for the Impatient that builds up a Cascading application to calculate TF-IDF of terms in a (small) corpus. The objective is to showcase the features one would need to get up and running quickly with Cascading.

Scalding is a Scala DSL built on top of Cascading. As you would expect, Cascading code is an order of magnitude shorter than equivalent Map-Reduce code. But because Java is not a functional language, implementing functional constructs leads to some verbosity in Cascading that is eliminated in Scalding, leading to even shorter and more readable code.

I was looking for something to try my newly acquired Scala skills on, so I hit upon the idea of building up a similar application to calculate TF-IDF for terms in a corpus. The table below summarizes the progression of the Cascading for the Impatient series. I’ve provided links to the original articles for the theory (which is very nicely explained there) and links to the source codes for both the Cascading and Scalding versions.

A very nice side by side comparison and likely to make you interested in Scalding.

Twitter’s Scalding – Scala and Hadoop hand in hand

Monday, August 6th, 2012

Twitter’s Scalding – Scala and Hadoop hand in hand by Istvan Szegedi.

From the post:

If you have read the paper published by Google’s Jeffrey Dean and Sanjay Ghemawat (MapReduce: Simplied Data Processing on Large Clusters), they revealed that their work was inspired by the concept of functional languages: “Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages….Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.”

Given the fact the Scala is a programming language that combines objective oriented and functional progarmming and runs on JVM, it is a fairly natural evolution to introduce Scala in Hadoop environment. That is what Twitter engineers did. (See more on how Scala is used at Twitter: “Twitter on Scala” and “The Why and How of Scala at Twitter“). Scala has powerful support for mapping, filtering, pattern matching (regular expressions) so it is a pretty good fit for MapReduce jobs.

Another guide to Scalding.


Saturday, August 4th, 2012

Scalding: Powerful & Concise MapReduce Programming


Scala is a functional programming language on the JVM. Hadoop uses a functional programming model to represent large-scale distributed computation. Scala is thus a very natural match for Hadoop.

In this presentation to the San Francisco Scala User Group, Dr. Oscar Boykin and Dr. Argyris Zymnis from Twitter give us some insight on Scalding DSL and provide some example jobs for common use cases.

Twitter uses Scalding for data analysis and machine learning, particularly in cases where we need more than sql-like queries on the logs, for instance fitting models and matrix processing. It scales beautifully from simple, grep-like jobs all the way up to jobs with hundreds of map-reduce pairs.

The Alice example failed (counted the different forms of Alice differently). I am reading a regex book so that may have made the problem more obvious.

Lesson: Test code/examples before presentation. 😉

See the Github repository:

Both Scalding and the presentation are worth your time.

Why Hadoop MapReduce needs Scala

Thursday, March 29th, 2012

Why Hadoop MapReduce needs Scala – A look at Scoobi and Scalding DSLs for Hadoop by Age Mooij.

Fairly sparse slide deck but enough to get you interested enough to investigate what Scoobi and Scalding have to offer.

It may just be me but I find it easier to download the PDFs if I want to view code. The font/color just isn’t readable with online slides. Suggestion: Always allow for downloads of your slides as PDF files.




Hive, Pig, Scalding, Scoobi, Scrunch and Spark

Tuesday, March 27th, 2012

Hive, Pig, Scalding, Scoobi, Scrunch and Spark by Sami Badawi.

From the post:

Comparison of Hadoop Frameworks

I had to do simple processing of log files in a Hadoop cluster. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data. There are several high level Hadoop frameworks that make Hadoop programming easier. Here is the list of Hadoop frameworks I tried:

  • Pig
  • Scalding
  • Scoobi
  • Hive
  • Spark
  • Scrunch
  • Cascalog

The task was to read log files join with other data do some statistics on arrays of doubles. Programming this without Hadoop is simple, but caused me some grief with Hadoop.

This blog post is not a full review, but my first impression of these Hadoop frameworks.

Everyone has a favorite use case.

How does your use case fare with different frameworks for Hadoop? (We won’t ever know if you don’t say.)


Friday, February 17th, 2012

Scalding by Patrick Oscar Boykin.

From the blog:

Today, we’re excited to open source Scalding, a Scala API for Cascading. Cascading is a thin Java library and API that sits on top of Apache Hadoop’s MapReduce layer. Scalding is comprised of two main components:

  • a DSL to make MapReduce computations look very similar to Scala’s collection API
  • A wrapper for Cascading to make it simpler to define the typical use cases of jobs, tests and describing data sources on a Hadoop Distributed File System (HDFS) or local disk

Interesting find since I just mentioned Cascading yesterday.