Pig as Duct Tape, Part Three: TF-IDF Topics with Cassandra, Python Streaming and Flask

Pig as Duct Tape, Part Three: TF-IDF Topics with Cassandra, Python Streaming and Flask by Russell Jurney.

From the post:

Apache Pig is a dataflow oriented, scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.

But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems to enable you to process data from wherever and to wherever you like.

Working code for this post as well as setup instructions for the tools we use and their environment variables are available at https://github.com/rjurney/enron-python-flask-cassandra-pig and you can download the Enron emails we use in the example in Avro format at http://s3.amazonaws.com/rjurney.public/enron.avro. You can run our example Pig scripts in local mode (without Hadoop) with the -x local flag: pig -x local. This enables new Hadoop users to try out Pig without a Hadoop cluster.

Part one and two can get you started using Pig if you’re not familiar.

With this post in the series, “duct tape,” made it into the title.

In case you don’t know (I didn’t), Flask is a “lightweight web application framework in Python.”

Just once I would like to see a “heavyweight, cumbersome, limited and annoying web application framework in (insert language of your choice).”

Just for variety.

Rather than characterizing software, say what it does.

Sorry, I have been converting one of the most poorly edited documents I have ever seen into a csv file. Proofing will follow the conversion process but hope to finish that by the end of next week.

Comments are closed.