Clojure on Hadoop: A New Hope

Clojure on Hadoop: A New Hope by Chun Kuk.

From the post:

Factual’s U.S. Places dataset is built from tens of billions of signals. Our raw data is stored in HDFS and processed using Hadoop.

We’re big fans of the core Hadoop stack, however there is a dark side to using Hadoop. The traditional approach to building and running Hadoop jobs can be cumbersome. As our Director of Engineering once said, “there’s no such thing as an ad-hoc Hadoop job written in Java”.

Factual is a Clojure friendly shop, and the Clojure community led us to Cascalog. We were intrigued by its strength as an agile query language and data processing framework. It was easy to get started, which is a testament to Cascalog’s creator, Nathan Marz.

We were able to leverage Cascalog’s high-level features such as built-in joins and aggregators to abstract away the complexity of commonly performed queries and QA operations.

This article aims to illustrate Cascalog basics and core strengths. We’ll focus on how easy it is to run useful queries against data stored with different text formats such as csv, json, and even raw text.

Somehow, after that lead-in, I was disappointed by what followed.

Curious what others think. As far as it goes, it is a good article on Clojure, but it doesn’t really reach the “core strengths” of Cascalog, does it?
