Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 7, 2015

Jupyter on Apache Spark [Holiday Game]

Filed under: Python,Reddit,Spark — Patrick Durusau @ 4:46 pm

Using Jupyter on Apache Spark: Step-by-Step with a Terabyte of Reddit Data by Austin Ouyang.

From the post:

The DevOps series covers how to get started with the leading open source distributed technologies. In this tutorial, we step through how to install Jupyter on your Spark cluster and use PySpark for some ad hoc analysis of reddit comment data on Amazon S3.

The following tutorial installs Jupyter on your Spark cluster in standalone mode on top of Hadoop and also walks through some transformations and queries on the reddit comment data on Amazon S3. We assume you already have an AWS EC2 cluster up with Spark 1.4.1 and Hadoop 2.7 installed. If not, you can go to our previous post on how to quickly deploy your own Spark cluster.
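To give a flavor of the ad hoc analysis the tutorial describes, here is a minimal PySpark sketch for reading the Reddit comment JSON from S3 and running a quick query. The bucket path is a placeholder, the snippet assumes the Spark 1.4.1 / Hadoop 2.7 setup from the tutorial, and in the tutorial's Jupyter notebook sc and sqlContext are typically already defined for you.

    # Minimal sketch, assuming Spark 1.4.1 on Hadoop 2.7 as in the tutorial.
    # The S3 path below is a placeholder for wherever your Reddit comment
    # dump (newline-delimited JSON) actually lives.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="reddit-comments-adhoc")  # often pre-created in a PySpark notebook
    sqlContext = SQLContext(sc)

    # read.json infers the schema from the JSON records
    comments = sqlContext.read.json("s3a://your-bucket/reddit/comments/*.json")

    # A simple ad hoc query: which subreddits produced the most comments?
    comments.groupBy("subreddit") \
            .count() \
            .orderBy("count", ascending=False) \
            .show(20)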

In "Need a Bigoted, Racist Uncle for Holiday Meal?" I mentioned the 1.6 billion Reddit comments that are the subject of this tutorial.

If you can’t find comments offensive to your guests in the Reddit comment collection, your guests are comatose and/or inanimate objects.

Big Data Holiday Game:

1. Divide into teams with at least one Jupyter/Apache Spark user on each team.

2. Play three timed rounds (time for each round depends on your local schedule) in which each team attempts to discover the Reddit comment that is most offensive to the largest number of guests (a PySpark sketch of one approach follows this list).

3. The winner gets bragging rights until next year, you get to show off your data mining skills, and you get a free pass on saying offensive things to your guests.
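One way a team might go hunting is sketched below, assuming the comments DataFrame from the earlier snippet is available and that the dump's body, score, and subreddit fields are present; the keyword list is purely illustrative, so each team should supply its own.

    # Rough sketch of one keyword-based approach; the keyword list is
    # illustrative only, and the body/score/subreddit fields are assumed
    # to exist in the Reddit comment dump.
    from pyspark.sql.functions import col

    keywords = ["politics", "religion", "in-laws"]
    pattern = "(?i)(" + "|".join(keywords) + ")"  # case-insensitive regex

    candidates = comments.filter(col("body").rlike(pattern))

    # Surface the highest-scoring matches as conversation starters (or enders).
    candidates.select("subreddit", "score", "body") \
              .orderBy("score", ascending=False) \
              .show(10)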

Watch for more formalized big data games of this nature by the 2016 holiday season!

Enjoy!

I first saw this in a tweet by Data Science Renee.
