Archive for the ‘Cascalog’ Category

Cascalog 2.0 In Depth

Sunday, January 4th, 2015

Cascalog 2.0 In Depth by Sam Ritchie.

From the post:

Cascalog 2.0 has been out for over a year now, and outside of a post to the mailing list and a talk at Clojure/Conj 2013 (slides here), I’ve never written up the startlingly long list of new features brought by that release. So shameful.

This post fixes that. 2.0 was a big deal. Anonymous functions make it easy to reuse your existing, non-Cascalog code. The interop story with vanilla Clojure is much better, which is huge for testing. Finally, users can access the JobConf, Cascading’s counters and other Cascading guts during operations.

Here’s a list of the features I’ll cover in this post:

  • New def*ops
  • Anonymous function support
  • Higher order functions
  • Lifting Clojure functions into Cascalog
  • expand-query
  • Using functions as implicit filters in queries
  • prepared functions, and access to Cascading’s guts

As if that weren’t enough, 2.0 adds a standalone Cascading DSL with an API similar to Scalding’s. You can move between this Cascading API and Cascalog. This makes it easy to use Cascading’s new features, like optimized joins, that haven’t bubbled up to the Cascalog DSL.

I’ll go over the Cascading DSL and the support for non-Cascading execution environments in a later post. For now, let’s get into it.

If you want to follow along, go ahead and clone the Cascalog repo, cd into the “cascalog-core” subdirectory and run “lein repl”. To try this code out in other projects, run “lein sub install” in the root directory. This will install [cascalog/cascalog-core "3.0.0-SNAPSHOT"] locally, so you can add it to your project.clj and give the code a whirl.

Belated but welcome review of the features of Cascalog 2.0!

I particularly liked the suggested “follow along” approach of the post.

Enjoy!

“Functional Programming for…Big Data”

Wednesday, March 20th, 2013

“Functional Programming for optimization problems in Big Data” by Paco Nathan.

Interesting slide deck, even if it doesn’t start with high drama. 😉

Covers:

  1. Data Science
  2. Functional Programming
  3. Workflow Abstraction
  4. Typical Use Cases
  5. Open Data Example

The reading list mentioned in these slides makes a nice self-review course in data science.

The Open Data Example is for Palo Alto but you can substitute a city with open data closer to home.

Functional Relational Programming with Cascalog

Saturday, February 4th, 2012

Functional Relational Programming with Cascalog by Stuart Sierra.

From the post:

In 2006, Ben Moseley and Peter Marks published a paper, Out of the Tar Pit, in which they coined the term Functional Relational Programming. “Out of the Tar Pit” was influential on Clojure’s design, particularly its emphasis on immutability and the separation of state from behavior. Moseley and Marks went further, however, in recommending that data be manipulated as relations. Relations are the abstract concept behind tables in a relational database or “facts” in some logic programming systems. Clojure does not enforce a relational model, but Clojure can be used for relational programming. For example, the clojure.set namespace defines relational algebra operations such as project and join.
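As an aside, the clojure.set operations mentioned above are easy to try at a REPL. This is only a minimal sketch: the `people` and `cities` relations and their attribute names are made up for illustration and do not come from Stuart's post.

```clojure
(require '[clojure.set :as set])

;; Hypothetical relations: each one is a set of maps, one map per tuple.
(def people
  #{{:name "alice" :city "Palo Alto"}
    {:name "bob"   :city "Austin"}})

(def cities
  #{{:city "Palo Alto" :state "CA"}
    {:city "Austin"    :state "TX"}})

;; project keeps only the named attributes of each tuple.
(def names (set/project people [:name]))

;; join without a key map is a natural join on the shared attribute (:city).
(def located (set/join people cities))

;; The operations compose: join first, then project down to name and state.
(def name->state (set/project located [:name :state]))
```

Because relations here are plain sets of maps, the results can be compared with `=` like any other Clojure value.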

In the early aughts, Jeffrey Dean and Sanjay Ghemawat developed the MapReduce programming model at Google to optimize the process of ranking web pages. MapReduce works well for I/O-bound problems where the computation on each record is small but the number of records is large. It specifically addresses the performance characteristics of modern commodity hardware, especially “disk is the new tape.”
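For readers new to the model, the shape of a MapReduce computation can be sketched in a few lines of plain, single-machine Clojure. This is an illustration of the idea only, not Hadoop or Cascalog code; the function names (`map-phase`, `shuffle-phase`, `reduce-phase`, `word-count`) are my own.

```clojure
(require '[clojure.string :as str])

;; Map phase: turn one record (a line of text) into [key value] pairs.
(defn map-phase [line]
  (for [word (str/split (str/lower-case line) #"\s+")
        :when (seq word)]          ; skip empty strings from leading whitespace
    [word 1]))

;; Shuffle: group the emitted values by key.
(defn shuffle-phase [pairs]
  (reduce (fn [acc [k v]] (update-in acc [k] (fnil conj []) v))
          {}
          pairs))

;; Reduce phase: collapse each key's values into a single result.
(defn reduce-phase [grouped]
  (into {} (for [[k vs] grouped] [k (reduce + vs)])))

(defn word-count [lines]
  (-> (mapcat map-phase lines)
      shuffle-phase
      reduce-phase))

(def counts (word-count ["disk is the new tape" "tape is slow"]))
```

The work done per record is tiny; the model pays off when the number of records is large and the map and reduce phases are spread across many machines.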

Stuart briefly traces the development of Cascalog and says it is an implementation of Functional Relational Programming.

What do you think?

Clojure on Hadoop: A New Hope

Friday, November 11th, 2011

Clojure on Hadoop: A New Hope by Chun Kuk.

From the post:

Factual’s U.S. Places dataset is built from tens of billions of signals. Our raw data is stored in HDFS and processed using Hadoop.

We’re big fans of the core Hadoop stack, however there is a dark side to using Hadoop. The traditional approach to building and running Hadoop jobs can be cumbersome. As our Director of Engineering once said, “there’s no such thing as an ad-hoc Hadoop job written in Java”.

Factual is a Clojure friendly shop, and the Clojure community led us to Cascalog. We were intrigued by its strength as an agile query language and data processing framework. It was easy to get started, which is a testament to Cascalog’s creator, Nathan Marz.

We were able to leverage Cascalog’s high-level features such as built-in joins and aggregators to abstract away the complexity of commonly performed queries and QA operations.

This article aims to illustrate Cascalog basics and core strengths. We’ll focus on how easy it is to run useful queries against data stored with different text formats such as csv, json, and even raw text.

Somehow, after that lead-in, I was disappointed by what followed.

Curious what others think? As far as it goes, it is a good article on Clojure, but it doesn’t really reach the “core strengths” of Cascalog, does it?

Using Lucene and Cascalog for Fast Text Processing at Scale

Monday, November 7th, 2011

Using Lucene and Cascalog for Fast Text Processing at Scale

From the post:

Here at Yieldbot we do a lot of text processing of analytics data. In order to accomplish this in a reasonable amount of time, we use Cascalog, a data processing and querying library for Hadoop, written in Clojure. Since Cascalog is Clojure, you can develop and test queries right inside of the Clojure REPL. This allows you to iteratively develop processing workflows with extreme speed. Because Cascalog queries are just Clojure code, you can access everything Clojure has to offer, without having to implement any domain specific APIs or interfaces for custom processing functions. When combined with Clojure’s awesome Java Interop, you can do quite complex things very simply and succinctly.

Many great Java libraries already exist for text processing, e.g., Lucene, OpenNLP, LingPipe, Stanford NLP. Using Cascalog allows you to take advantage of these existing libraries with very little effort, leading to much shorter development cycles.

By way of example, I will show how easy it is to combine Lucene and Cascalog to do some (simple) text processing. You can find the entire code used in the examples over on Github.  

The world of text exploration just gets better all the time!

Getting Creative with MapReduce

Friday, October 7th, 2011

Getting Creative with MapReduce

From the post:

One problem with many existing MapReduce abstraction layers is the utter difficulty of testing queries and workflows. End-to-end tests are maddening to craft in vanilla Hadoop and frustrating at best in Pig and Hive. The difficulty of testing MapReduce workflows makes it scary to change code, and destroys your desire to be creative. A proper testing suite is an absolute prerequisite to doing creative work in big data.

In this blog post, I aim to show how most of the difficulty of writing and testing MapReduce queries stems from the fact that Hadoop confounds application logic with decisions about data storage. These problems are the result of poorly implemented abstractions over the primitives of MapReduce, not problems with the core MapReduce algorithms.

The author advocates the use of Cascalog and its testing suite. Comments?

Cascalog: Clojure-based Query Language for Hadoop – Post

Saturday, February 19th, 2011

Cascalog: Clojure-based Query Language for Hadoop

From the post:

Cascalog, introduced in the linked article, is a query language for Hadoop featuring:

  • Simple – Functions, filters, and aggregators all use the same syntax. Joins are implicit and natural.
  • Expressive – Logical composition is very powerful, and you can run arbitrary Clojure code in your query with little effort.
  • Interactive – Run queries from the Clojure REPL.
  • Scalable – Cascalog queries run as a series of MapReduce jobs.
  • Query anything – Query HDFS data, database data, and/or local data by making use of Cascading’s “Tap” abstraction.
  • Careful handling of null values – Null values can make life difficult. Cascalog has a feature called “non-nullable variables” that makes dealing with nulls painless.
  • First class interoperability with Cascading – Operations defined for Cascalog can be used in a Cascading flow and vice-versa.
  • First class interoperability with Clojure – Can use regular Clojure functions as operations or filters, and since Cascalog is a Clojure DSL, you can use it in other Clojure code.

From Alex Popescu’s myNoSQL

There are a number of NoSQL query languages, which should be considered alongside TMQL4J in TMQL discussions.

Cascalog

Saturday, December 11th, 2010

Cascalog

From the website:

Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.

Most query languages, like SQL, Pig, and Hive, are custom languages — and this leads to huge amounts of accidental complexity. Constructing queries dynamically by doing string manipulation is an impedance mismatch and makes usual programming techniques like abstraction and composition difficult.

Cascalog queries are first-class within Clojure and are extremely composable. Additionally, the Datalog syntax of Cascalog is simpler and more expressive than SQL-based languages.
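The composition point can be illustrated without Cascalog at all. The sketch below is plain Clojure, not Cascalog's API; `sql-query`, `younger-than`, `lives-in`, `run-query` and the sample rows are hypothetical names and data chosen for illustration. It contrasts a query assembled from string fragments with one assembled from ordinary functions.

```clojure
(require '[clojure.string :as str])

;; String-built query: each filter is opaque text, so fragments are glued
;; together by hand and cannot be checked without a database.
(defn sql-query [table clauses]
  (str "SELECT * FROM " table " WHERE " (str/join " AND " clauses)))

;; Function-built query: each filter is an ordinary predicate, so filters
;; can be named, reused, and combined with every-pred.
(defn younger-than [n] (fn [row] (< (:age row) n)))
(defn lives-in [city] (fn [row] (= (:city row) city)))

;; Expects at least one predicate.
(defn run-query [rows & preds]
  (filter (apply every-pred preds) rows))

(def rows [{:name "alice" :age 28 :city "Palo Alto"}
           {:name "bob"   :age 33 :city "Palo Alto"}
           {:name "chris" :age 25 :city "Austin"}])

(def result (run-query rows (younger-than 30) (lives-in "Palo Alto")))
```

The second style is what makes abstraction cheap: a filter such as `(younger-than 30)` is a value that can be passed around and tested on its own.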

Follow the getting started steps, check out the tutorial, and you’ll be running Cascalog queries on your local computer within 5 minutes.

Seems like I have heard the term Datalog in TMQL discussions. 😉

I wonder what it would be like to define TMQL operators in Cascalog so that all the other capabilities of Cascalog are also available?

When the next draft appears, that will be an interesting question to explore.