Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 9, 2013

A Guide to Python Frameworks for Hadoop

Filed under: Hadoop,MapReduce,Python — Patrick Durusau @ 12:03 pm

A Guide to Python Frameworks for Hadoop by Uri Laserson.

From the post:

I recently joined Cloudera after working in computational biology/genomics for close to a decade. My analytical work is primarily performed in Python, along with its fantastic scientific stack. It was quite jarring to find out that the Apache Hadoop ecosystem is primarily written in/for Java. So my first order of business was to investigate some of the options that exist for working with Hadoop from Python.

In this post, I will provide an unscientific, ad hoc review of my experiences with some of the Python frameworks that exist for working with Hadoop, including:

  • Hadoop Streaming
  • mrjob
  • dumbo
  • hadoopy
  • pydoop
  • and others

Ultimately, in my analysis, Hadoop Streaming is the fastest and most transparent option, and the best one for text processing. mrjob is best for rapidly working on Amazon EMR, but incurs a significant performance penalty. dumbo is convenient for more complex jobs (objects as keys; multistep MapReduce) without incurring as much overhead as mrjob, but it’s still slower than Streaming.

Read on for implementation details, performance comparisons, and feature comparisons.

A non-word count Hadoop example? Who would have thought? 😉

Enjoy!

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress