Introducing Crunch: Easy MapReduce Pipelines for Hadoop

Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 11, 2011

Filed under: Flow-Based Programming (FBP),Hadoop,MapReduce — Patrick Durusau @ 6:08 pm

Introducing Crunch: Easy MapReduce Pipelines for Hadoop

Josh Wills writes:

As a data scientist at Cloudera, I work with customers across a wide range of industries that use Hadoop to solve their business problems. Many of the solutions we create involve multi-stage pipelines of MapReduce jobs that join, clean, aggregate, and analyze enormous amounts of data. When working with log files or relational database tables, we use high-level tools like Pig and Hive for their convenient and powerful support for creating pipelines over structured and semi-structured records.

As Hadoop has spread from web companies to other industries, the variety of data that is stored in HDFS has expanded dramatically. Hadoop clusters are being used to process satellite images, time series data, audio files, and seismograms. These formats are not a natural fit for the data schemas imposed by Pig and Hive, in the same way that structured binary data in a relational database can be a bit awkward to work with. For these use cases, we either end up writing large, custom libraries of user-defined functions in Pig or Hive, or simply give up on our high-level tools and go back to writing MapReduces in Java. Either of these options is a serious drain on developer productivity.

Today, we’re pleased to introduce Crunch, a Java library that aims to make writing, testing, and running MapReduce pipelines easy, efficient, and even fun. Crunch’s design is modeled after Google’s FlumeJava, focusing on a small set of simple primitive operations and lightweight user-defined functions that can be combined to create complex, multi-stage pipelines. At runtime, Crunch compiles the pipeline into a sequence of MapReduce jobs and manages their execution.
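
To make "a small set of simple primitive operations and lightweight user-defined functions" a little more concrete, here is a minimal in-memory sketch in plain Java. It is not the Crunch API: MiniCollection, parallelDo, count, and WordCountSketch are hypothetical names chosen for illustration only. The sketch also runs eagerly in a single process, whereas Crunch, per the announcement, compiles the pipeline into a sequence of MapReduce jobs at runtime.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Hypothetical in-memory stand-in for a pipeline collection (assumption: NOT the Crunch API).
final class MiniCollection<T> {
    private final List<T> items;

    MiniCollection(List<T> items) {
        this.items = new ArrayList<>(items);
    }

    // Primitive 1: run a user-defined function that may emit zero or more outputs per input.
    <R> MiniCollection<R> parallelDo(BiConsumer<T, Consumer<R>> fn) {
        List<R> out = new ArrayList<>();
        for (T item : items) {
            fn.accept(item, out::add);
        }
        return new MiniCollection<>(out);
    }

    // Primitive 2: count occurrences of each distinct element, keeping first-seen order.
    Map<T, Long> count() {
        Map<T, Long> counts = new LinkedHashMap<>();
        for (T item : items) {
            counts.merge(item, 1L, Long::sum);
        }
        return counts;
    }
}

public class WordCountSketch {
    // A two-stage "pipeline": split lines into words, then count the words.
    static Map<String, Long> wordCount(List<String> lines) {
        return new MiniCollection<>(lines)
            .<String>parallelDo((line, emit) -> {
                for (String word : line.toLowerCase().split("\\s+")) {
                    if (!word.isEmpty()) {
                        emit.accept(word);
                    }
                }
            })
            .count();
    }

    public static void main(String[] args) {
        // prints {to=2, be=3, or=1, not=1, quick=1}
        System.out.println(wordCount(List.of("to be or not to be", "be quick")));
    }
}
```

The point of the sketch is only the shape of the code: each stage is an ordinary Java function handed to a generic primitive, and stages chain into a pipeline. A real system in this style would record the stages as a plan and decide later how to map them onto jobs.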

Sounds like DataFlow Programming… or Flow-Based Programming (FBP) to me. In which case the claim that:

It’s just Java. Crunch shares a core philosophical belief with Google’s FlumeJava: novelty is the enemy of adoption.

must be true, as FBP is over forty years old now. I doubt programmers involved in Crunch would be aware of it. Programming history started with their first programming language, at least for them.

From a vendor perspective, I would turn the phrase a bit to read: novelty is the enemy of market/mind share.

Unless you are a startup, in which case, novelty is good until you reach critical mass and then novelty loses its luster.

Unnecessary novelty, like new web programming languages for their own sake, can also be a bid for market/mind share.

Interesting to see both within days of each other.
