Apache Crunch « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 29, 2014

Kite SDK 0.15.0

Filed under: Apache Crunch,Cloudera,Kite SDK — Patrick Durusau @ 6:58 pm

What’s New in Kite SDK 0.15.0? by Ryan Blue.

From the post:

Recently, Kite SDK, the open source toolset that helps developers build systems on the Apache Hadoop ecosystem, became a 0.15.0. In this post, you’ll get an overview of several new features and bug fixes.

Covered by this quick recap:

Working with Datasets by URI

…

Improved Configuration for MR and Apache Crunch Jobs

…

Parent POM for Kite Applications

…

Java Class Hints [more informative error messages]

…

More Docs and Tutorials

The last addition this release is a new user guide on kitesdk.org, where we’re adding new tutorials and background articles. We’ve also updated the examples for the new features, which is a great place to learn more about Kite.

Also, watch this technical webinar on-demand to learn more about working with datasets in Kite.

I think you are going to like this.

Comments Off

July 8, 2014

Advanced Time-Series Pipelines

Filed under: Apache Crunch,Time Series — Patrick Durusau @ 6:52 pm

How-to: Build Advanced Time-Series Pipelines in Apache Crunch by Mirko Kämpf.

From the post:

Learn how creating dataflow pipelines for time-series analysis is a lot easier with Apache Crunch.

In a previous blog post, I described a data-driven market study based on Wikipedia access data and content. I explained how useful it is to combine several public data sources, and how this approach sheds light onto the hidden correlations across Wikipedia pages.

One major task in the above was to apply structural analysis to networks reconstructed by time-series analysis techniques. In this post, I will describe a different method: the use of Apache Crunch time-series processing pipelines for large-scale correlation and dependency analysis. The results of these operations will be a matrix or network representation of the underlying data set, which can be further processed in an Apache Hadoop (CDH) cluster via GraphLab, GraphX, or even Apache Giraph.

This article assumes that you know a bit about Crunch. If not, read the Crunch user guide first. Furthermore, this short how-to explains how to extract and re-organize data that is already stored in Apache Avro files, using a Crunch pipeline. All source code for the article is available in the crunch.TS project.

Initial Situation and Goal

In our example dataset, for each measurement period, one SequenceFile was generated. Such a file is called a \u201ctime-series bucket\u201d and contains a key-value pair of types: Text (from Hadoop) and VectorWritable (from Apache Mahout). We use data types of projects, which do not guarantee stability over time, and we are dependent on Java as a programming language because others cannot read SequenceFiles.

The dependency on external libraries, such as the VectorWritable class from Mahout, should be removed from our data representation and storage layer, so it is a good idea to store the data in an Avro file. Such files can also be organized in a directory hierarchy that fits to the concept of Apache Hive partitions. Data processing will be done in Crunch, but for fast delivery of pre-calculated results, Impala will be used.

A more general approach will be possible later on if we use Kite SDK and Apache HCatalog, as well. In order to achieve interoperability between multiple analysis tools or frameworks — and I think this is a crucial aspect in data management, even in the case of an enterprise data hub — you have to think about access patterns early.

Worth your attention as an incentive to learn more about Apache Crunch. Aside from the benefit of learning more about processing time-series data.

Comments Off

January 16, 2014

Apache Crunch User Guide (new and improved)

Filed under: Apache Crunch,Hadoop,MapReduce — Patrick Durusau @ 10:13 am

Apache Crunch User Guide

From the motivation section:

Let’s start with a basic question: why should you use any high-level tool for writing data pipelines, as opposed to developing against the MapReduce, Spark, or Tez APIs directly? Doesn’t adding another layer of abstraction just increase the number of moving pieces you need to worry about, ala the Law of Leaky Abstractions?

As with any decision like this, the answer is “it depends.” For a long time, the primary payoff of using a high-level tool was being able to take advantage of the work done by other developers to support common MapReduce patterns, such as joins and aggregations, without having to learn and rewrite them yourself. If you were going to need to take advantage of these patterns often in your work, it was worth the investment to learn about how to use the tool and deal with the inevitable leaks in the tool’s abstractions.

With Hadoop 2.0, we’re beginning to see the emergence of new engines for executing data pipelines on top of data stored in HDFS. In addition to MapReduce, there are new projects like Apache Spark and Apache Tez. Developers now have more choices for how to implement and execute their pipelines, and it can be difficult to know in advance which engine is best for your problem, especially since pipelines tend to evolve over time to process more data sources and larger data volumes. This choice means that there is a new reason to use a high-level tool for expressing your data pipeline: as the tools add support for new execution frameworks, you can test the performance of your pipeline on the new framework without having to rewrite your logic against new APIs.

There are many high-level tools available for creating data pipelines on top of Apache Hadoop, and they each have pros and cons depending on the developer and the use case. Apache Hive and Apache Pig define domain-specific languages (DSLs) that are intended to make it easy for data analysts to work with data stored in Hadoop, while Cascading and Apache Crunch develop Java libraries that are aimed at developers who are building pipelines and applications with a focus on performance and testability.

So which tool is right for your problem? If most of your pipeline work involves relational data and operations, than Hive, Pig, or Cascading provide lots of high-level functionality and tools that will make your life easier. If your problem involves working with non-relational data (complex records, HBase tables, vectors, geospatial data, etc.) or requires that you write lots of custom logic via user-defined functions (UDFs), then Crunch is most likely the right choice.
…

As topic mappers you are likely to work with both relational as well as complex non-relational data so this should be on your reading list.

I didn’t read the prior Apache Crunch documentation so I will have to take Josh Wills at his word that:

A (largely) new and (vastly) improved user guide for Apache Crunch, including details on the new Spark-based impl:

It reads well and makes a good case for investing time in learning Apache Crunch.

I first saw this in a tweet by Josh Wills.

Comments Off

March 22, 2013

Apache Crunch (Top-Level)

Filed under: Apache Crunch,Hadoop,MapReduce — Patrick Durusau @ 12:34 pm

Apache Crunch (Top Level)

While reading Josh Wills post, Cloudera ML: New Open Source Libraries and Tools for Data Scientists, I saw that Apache Crunch became a top-level project at the Apache Software Foundation last month.

Congratulations to Josh and all the members of the Crunch community!

From the Apache Crunch homepage:

The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines, and is based on Google’s FlumeJava library. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Running on top of Hadoop MapReduce, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.

You may be interested in: Crunch-133 Add Aggregator support for combineValues ops on secondary keys via maps and collections. It is an “open” issue.

Comments Off

October 31, 2012

One To Watch: Apache Crunch

Filed under: Apache Crunch,Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:37 pm

One To Watch: Apache Crunch by Chris Mayer.

From the post:

Over the past few years, the Apache Software Foundation has become the hub for big data-focused projects. An array of companies have recognised the worth of housing their latest innovative projects at the ASF, with Apache Hadoop and Apache Cassandra two shining examples.

Amongst the number of projects arriving in the Apache Incubator was Apache Crunch. Crunch is a Java library created to eliminate the tedium of writing a MapReduce pipeline. It aims to take hold of the entire process, making writing, testing, and running MapReduce pipelines more efficient and “even fun” (if this Cloudera blog post is to be believed).

That’s a tall order, to make MapReduce pipelines “even fun.” On the other hand, remarkable things have emerged from Apache for decades now.

A project to definitely keep in sight.

Comments Off