Apache Crunch User Guide
From the motivation section:
Let’s start with a basic question: why should you use any high-level tool for writing data pipelines, as opposed to developing against the MapReduce, Spark, or Tez APIs directly? Doesn’t adding another layer of abstraction just increase the number of moving pieces you need to worry about, ala the Law of Leaky Abstractions?
As with any decision like this, the answer is “it depends.” For a long time, the primary payoff of using a high-level tool was being able to take advantage of the work done by other developers to support common MapReduce patterns, such as joins and aggregations, without having to learn and rewrite them yourself. If you were going to need to take advantage of these patterns often in your work, it was worth the investment to learn about how to use the tool and deal with the inevitable leaks in the tool’s abstractions.
With Hadoop 2.0, we’re beginning to see the emergence of new engines for executing data pipelines on top of data stored in HDFS. In addition to MapReduce, there are new projects like Apache Spark and Apache Tez. Developers now have more choices for how to implement and execute their pipelines, and it can be difficult to know in advance which engine is best for your problem, especially since pipelines tend to evolve over time to process more data sources and larger data volumes. This choice means that there is a new reason to use a high-level tool for expressing your data pipeline: as the tools add support for new execution frameworks, you can test the performance of your pipeline on the new framework without having to rewrite your logic against new APIs.
There are many high-level tools available for creating data pipelines on top of Apache Hadoop, and they each have pros and cons depending on the developer and the use case. Apache Hive and Apache Pig define domain-specific languages (DSLs) that are intended to make it easy for data analysts to work with data stored in Hadoop, while Cascading and Apache Crunch develop Java libraries that are aimed at developers who are building pipelines and applications with a focus on performance and testability.
So which tool is right for your problem? If most of your pipeline work involves relational data and operations, than Hive, Pig, or Cascading provide lots of high-level functionality and tools that will make your life easier. If your problem involves working with non-relational data (complex records, HBase tables, vectors, geospatial data, etc.) or requires that you write lots of custom logic via user-defined functions (UDFs), then Crunch is most likely the right choice.
As topic mappers you are likely to work with both relational as well as complex non-relational data so this should be on your reading list.
I didn’t read the prior Apache Crunch documentation so I will have to take Josh Wills at his word that:
A (largely) new and (vastly) improved user guide for Apache Crunch, including details on the new Spark-based impl:
It reads well and makes a good case for investing time in learning Apache Crunch.
I first saw this in a tweet by Josh Wills.