While reading Josh Wills post, Cloudera ML: New Open Source Libraries and Tools for Data Scientists, I saw that Apache Crunch became a top-level project at the Apache Software Foundation last month.
Congratulations to Josh and all the members of the Crunch community!
From the Apache Crunch homepage:
The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines, and is based on Google’s FlumeJava library. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
Running on top of Hadoop MapReduce, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
You may be interested in: Crunch-133 Add Aggregator support for combineValues ops on secondary keys via maps and collections. It is an “open” issue.