
November 13, 2014

Spark for Data Science: A Case Study

Filed under: Linux OS, Spark — Patrick Durusau @ 7:28 pm

Spark for Data Science: A Case Study by Casey Stella.

From the post:

I’m a pretty heavy Unix user and I tend to prefer doing things the Unix Way™, which is to say, composing many small command-line utilities. With composability comes power and with specialization comes simplicity. Sometimes, though, when two utilities are used together all the time, it makes sense to have either:

  • A utility that specializes in a very common use-case
  • One utility that absorbs the basic functionality of another

For example, one thing I find myself doing a lot is recursively searching a directory for files that contain an expression:

find /path/to/root -exec grep -l "search phrase" {} \;
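As an aside, GNU grep can do the same on its own: -r recurses into directories and -l prints only the names of matching files:

grep -rl "search phrase" /path/to/root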

Despite the fact that you can do this, specialized utilities such as ack have sprung up to simplify this style of querying. It turns out there’s also power in not having to consult the man pages all the time.

Another example is the interaction between uniq and sort. uniq presumes sorted data. Of course, you need not sort your data using the Unix utility sort, but often you find yourself with a flow such as this:

sort filename.dat | uniq > uniq.dat

This is so common that a -u flag was added to sort to support this flow, like so:

sort -u filename.dat > uniq.dat

Now, obviously, uniq has uses beyond simply providing distinct output from a stream, such as providing counts for each distinct occurrence (a one-liner illustrating this follows just after the list below). Even so, when you don’t need the full power of uniq, it’s nice for uniq’s minimal functionality to be part of sort. These simple motivating examples got me thinking:

  • Are there opportunities for folding one command’s basic functionality into another command as a feature (or flag), as with sort and uniq?
  • Can we answer the above question in a principled, data-driven way?
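As a quick illustration of that counting use-case (my example, not from the post), the classic pipeline prints each distinct line prefixed by its frequency, most frequent first:

sort filename.dat | uniq -c | sort -rn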

This sounds like a great challenge and an even greater opportunity to try out a new (to me) analytics platform, Apache Spark. So I’m going to take you on a little journey through some simple analysis and illustrate the general steps. We’re going to cover:

  1. Data Gathering
  2. Data Engineering
  3. Data Analysis
  4. Presentation of Results and Conclusions

We’ll close with my impressions of using Spark as an analytics platform. Hope you enjoy!

All of that is just the setup for a very cool walk-through of a data analysis example with Spark.
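To give a flavor of the kind of analysis involved, here is a minimal sketch of my own (not code from Stella’s post) that counts adjacent command pairs across a corpus of shell pipelines, runnable in the Spark shell of the 1.x era. The input file pipelines.txt, one shell one-liner per line, is a hypothetical name:

import org.apache.spark.SparkContext._  // pair-RDD implicits (needed on Spark 1.x)

val pairCounts = sc.textFile("pipelines.txt")
  .map(_.split("\\|").map(_.trim.split("\\s+").head))  // first word of each pipe stage
  .filter(_.length >= 2)                               // keep lines that actually pipe
  .flatMap(_.sliding(2).map { case Array(a, b) => ((a, b), 1) })
  .reduceByKey(_ + _)

// Frequent neighbors such as (sort,uniq) are the candidates for folding
// one command's behavior into the other as a flag.
pairCounts.sortBy(_._2, ascending = false).take(20).foreach(println)

Frequent pairs, by the post’s logic, are exactly where a new flag like sort -u might pay off.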

Enjoy!
