From the post:

Learn how creating dataflow pipelines for time-series analysis is a lot easier with Apache Crunch.

In a previous blog post, I described a data-driven market study based on Wikipedia access data and content. I explained how useful it is to combine several public data sources, and how this approach sheds light onto the hidden correlations across Wikipedia pages.

One major task in the above was to apply structural analysis to networks reconstructed by time-series analysis techniques. In this post, I will describe a different method: the use of Apache Crunch time-series processing pipelines for large-scale correlation and dependency analysis. The results of these operations will be a matrix or network representation of the underlying data set, which can be further processed in an Apache Hadoop (CDH) cluster via GraphLab, GraphX, or even Apache Giraph.

This article assumes that you know a bit about Crunch. If not, read the Crunch user guide first. Furthermore, this short how-to explains how to extract and re-organize data that is already stored in Apache Avro files, using a Crunch pipeline. All source code for the article is available in the crunch.TS project.

Initial Situation and Goal

In our example dataset, for each measurement period, one SequenceFile was generated. Such a file is called a \u201ctime-series bucket\u201d and contains a key-value pair of types: Text (from Hadoop) and VectorWritable (from Apache Mahout). We use data types of projects, which do not guarantee stability over time, and we are dependent on Java as a programming language because others cannot read SequenceFiles.

The dependency on external libraries, such as the VectorWritable class from Mahout, should be removed from our data representation and storage layer, so it is a good idea to store the data in an Avro file. Such files can also be organized in a directory hierarchy that fits to the concept of Apache Hive partitions. Data processing will be done in Crunch, but for fast delivery of pre-calculated results, Impala will be used.

A more general approach will be possible later on if we use Kite SDK and Apache HCatalog, as well. In order to achieve interoperability between multiple analysis tools or frameworks — and I think this is a crucial aspect in data management, even in the case of an enterprise data hub — you have to think about access patterns early.