Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 21, 2015

Command-line tools can be 235x faster than your Hadoop cluster

Filed under: Awk,Hadoop — Patrick Durusau @ 2:57 pm

Command-line tools can be 235x faster than your Hadoop cluster by Adam Drake.

From the post:

As I was browsing the web and catching up on some sites I visit periodically, I found a cool article from Tom Hayden about using Amazon Elastic Map Reduce (EMR) and mrjob in order to compute some statistics on win/loss ratios for chess games he downloaded from the millionbase archive, and generally have fun with EMR. Since the data volume was only about 1.75GB containing around 2 million chess games, I was skeptical of using Hadoop for the task, but I can understand his goal of learning and having fun with mrjob and EMR.

Since the problem is basically just to look at the result lines of each file and aggregate the different results, it seems ideally suited to stream processing with shell commands. I tried this out, and for the same amount of data I was able to use my laptop to get the results in about 12 seconds (processing speed of about 270MB/sec), while the Hadoop processing took about 26 minutes (processing speed of about 1.14MB/sec). (emphasis added)

BTW, Adam was using twice as much data as Tom in his analysis.
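
Because the data volumes differ, the headline figure appears to compare throughput rather than elapsed time. A quick check using only the numbers quoted above (the variable names are mine, and the inputs are the rounded figures from the quote, so the result is approximate):

```python
# Figures taken from the quoted passage; all are rounded in the original.
hadoop_seconds = 26 * 60      # "about 26 minutes"
laptop_seconds = 12           # "about 12 seconds"
hadoop_mb_per_s = 1.14        # "about 1.14MB/sec"
laptop_mb_per_s = 270.0       # "about 270MB/sec"

# Ratio of elapsed times (ignores the difference in data volume).
wall_clock_ratio = hadoop_seconds / laptop_seconds

# Ratio of processing speeds (accounts for the difference in data volume).
throughput_ratio = laptop_mb_per_s / hadoop_mb_per_s

print(f"wall-clock ratio: {wall_clock_ratio:.0f}x")
print(f"throughput ratio: {throughput_ratio:.0f}x")
```

The elapsed-time ratio comes out near 130x, while the throughput ratio lands in the neighborhood of the 235x in the title.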

The lesson here is to not be a one-trick pony as a data scientist. Most solutions, Hadoop, Spark, Titan, can solve most problems. However, anyone who merits the moniker “data scientist” should be able to choose the “best” solution for a given set of circumstances. In some cases that may be simple shell scripts.
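
To make the stream-processing idea concrete, here is a minimal sketch in Python rather than shell. It is not Adam's pipeline; it only illustrates the approach the quote describes: look at the result lines and aggregate them, one line at a time, without loading the data into memory. It assumes the games are in PGN format, where each game carries a tag line such as `[Result "1-0"]`. The function name `count_results` and the sample lines are hypothetical.

```python
from collections import Counter


def count_results(lines):
    """Tally PGN result tags from any iterable of text lines.

    Only one line is held in memory at a time, so the input can be a
    file object or a pipe of arbitrary size.
    """
    counts = Counter()
    for line in lines:
        # A PGN result tag looks like: [Result "1-0"]
        if line.startswith("[Result "):
            parts = line.split('"')
            if len(parts) >= 3:
                counts[parts[1]] += 1
    return counts


# Hypothetical sample input standing in for a stream of PGN lines.
sample = [
    '[Event "Example"]\n',
    '[Result "1-0"]\n',
    '[Result "1/2-1/2"]\n',
    '[Result "0-1"]\n',
    '[Result "1-0"]\n',
]

totals = count_results(sample)
print("white:", totals["1-0"], "black:", totals["0-1"], "draw:", totals["1/2-1/2"])
```

To run the same tally over real files, pass `sys.stdin` (or an open file) to `count_results` in place of the sample list and pipe the PGN data in.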

I first saw this in a tweet by Atabey Kaygun.
