Archive for the ‘Spark’ Category

Achieving a 300% speedup in ETL with Apache Spark

Tuesday, January 3rd, 2017

Achieving a 300% speedup in ETL with Apache Spark by Eric Maynard.

From the post:

A common design pattern often emerges when teams begin to stitch together existing systems and an EDH cluster: file dumps, typically in a format like CSV, are regularly uploaded to EDH, where they are then unpacked, transformed into optimal query format, and tucked away in HDFS where various EDH components can use them. When these file dumps are large or happen very often, these simple steps can significantly slow down an ingest pipeline. Part of this delay is inevitable; moving large files across the network is time-consuming because of physical limitations and can’t be readily sped up. However, the rest of the basic ingest workflow described above can often be improved.

Campaign finance data suffers more from complexity and obscurity than volume.

However, there are data problems where volume and not deceit is the issue. In those cases, you may find Eric’s advice quite helpful.

Interactive 3D Clusters of all 721 Pokémon Using Spark and Plotly

Wednesday, August 3rd, 2016

Interactive 3D Clusters of all 721 Pokémon Using Spark and Plotly by Max Woolf.


My screen capture falls far short of doing justice to the 3D image, not to mention it isn’t interactive. See Max’s post if you really want to appreciate it.

From the post:

There has been a lot of talk lately about Pokémon due to the runaway success of Pokémon GO (I myself am Trainer Level 18 and on Team Valor). Players revel in the nostalgia of 1996 by now having the ability catching the original 151 Pokémon in real life.

However, while players most-fondly remember the first generation, Pokémon is currently on its sixth generation, with the seventh generation beginning later this year with Pokémon Sun and Moon. As of now, there are 721 total Pokémon in the Pokédex, from Bulbasaur to Volcanion, not counting alternate Forms of several Pokémon such as Mega Evolutions.

In the meantime, I’ve seen a few interesting data visualizations which capitalize on the frenzy. A highly-upvoted post on the Reddit subreddit /r/dataisbeautiful by /u/nvvknvvk charts the Height vs. Weight of the original 151 Pokémon. Anh Le of Duke University posted a cluster analysis of the original 151 Pokémon using principal component analysis (PCA), by compressing the 6 primary Pokémon stats into 2 dimensions.

However, those visualizations think too small, and only on a small subset of Pokémon. Why not capture every single aspect of every Pokémon and violently crush that data into three dimensions?

If you need encouragement to explore the recent release of Spark 2.0, Max’s post that in abundance!

Caveat: Pokémon is popular outside of geek/IT circles. Familiarity with Pokémon may result in social interaction with others and/or interest in Pokémon. You have been warned.

Apache Spark as a Compiler:… [This is wicked cool!]

Tuesday, May 24th, 2016

Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop by Sameer Agarwal, Davies Liu and Reynold Xin.

From the post:

When our team at Databricks planned our contributions to the upcoming Apache Spark 2.0 release, we set out with an ambitious goal by asking ourselves: Apache Spark is already pretty fast, but can we make it 10x faster?

This question led us to fundamentally rethink the way we built Spark’s physical execution layer. When you look into a modern data engine (e.g. Spark or other MPP databases), a majority of the CPU cycles are spent in useless work, such as making virtual function calls or reading or writing intermediate data to CPU cache or memory. Optimizing performance by reducing the amount of CPU cycles wasted in this useless work has been a long-time focus of modern compilers.

Apache Spark 2.0 will ship with the second generation Tungsten engine. Built upon ideas from modern compilers and MPP databases and applied to data processing queries, Tungsten emits (SPARK-12795) optimized bytecode at runtime that collapses the entire query into a single function, eliminating virtual function calls and leveraging CPU registers for intermediate data. As a result of this streamlined strategy, called “whole-stage code generation,” we significantly improve CPU efficiency and gain performance.

(emphasis in original)

How much better you ask?

cost per row (in nanoseconds, single thread)

primitive Spark 1.6 Spark 2.0
filter 15 ns 1.1 ns
sum w/o group 14 ns 0.9 ns
sum w/ group 79 ns 10.7 ns
hash join 115 ns 4.0 ns
sort (8-bit entropy) 620 ns 5.3 ns
sort (64-bit entropy) 620 ns 40 ns
sort-merge join 750 ns 700 ns
Parquet decoding (single int column) 120 ns 13 ns

Don’t just stare at the numbers:

Try the whole-stage code generation notebook in Databricks Community Edition

What’s the matter?

Haven’t you ever seen a 1 billion record join in 0.8 seconds? (Down from 61.7 seconds.)

If all that weren’t impressive enough, the post walks you through the dominate (currently) query evaluation strategy as a setup to Spark 2.0 and then into why “whole-stage code generation is so powerful.”

A must read!

Advanced Data Mining with Weka – Starts 25 April 2016

Wednesday, April 6th, 2016

Advanced Data Mining with Weka by Ian Witten.

From the webpage:

This course follows on from Data Mining with Weka and More Data Mining with Weka. It provides a deeper account of specialized data mining tools and techniques. Again the emphasis is on principles and practical data mining using Weka, rather than mathematical theory or advanced details of particular algorithms. Students will analyse time series data, mine data streams, use Weka to access other data mining packages including the popular R statistical computing language, script Weka in Python, and deploy it within a cluster computing framework. The course also includes case studies of applications such as classifying tweets, functional MRI data, image classification, and signal peptide prediction.

The syllabus:

Advanced Data Mining with Weka is open for enrollment and starts 25 April 2016.

Five very intense weeks await!

Will you be there?

I first saw this in a tweet by Alyona Medelyan.

Self-Learn Yourself Apache Spark in 21 Blogs – #1

Wednesday, January 13th, 2016

Self-Learn Yourself Apache Spark in 21 Blogs – #1 by Kumar Chinnakali.

From the post:

We have received many requests from friends who are constantly reading our blogs to provide them a complete guide to sparkle in Apache Spark. So here we have come up with learning initiative called “Self-Learn Yourself Apache Spark in 21 Blogs”.

We have drilled down various sources and archives to provide a perfect learning path for you to understand and excel in Apache Spark. These 21 blogs which will be written over a course of time will be a complete guide for you to understand and work on Apache Spark quickly and efficiently.

We wish you all a Happy New Year 2016 and start the year with rich knowledge. From dataottam we wish you good luck to “ROCK Apache Spark & the New Year 2016”

I’m not sure what to say about this series of posts. The title is promising enough but it takes until post #4 before you get any substantive content and not much then. Perhaps it will pickup as time goes by.

Worth a look but too soon to be excited about it.

I first saw this in a tweet by Kirk Borne.

Jupyter on Apache Spark [Holiday Game]

Monday, December 7th, 2015

Using Jupyter on Apache Spark: Step-by-Step with a Terabyte of Reddit Data by Austin Ouyang.

From the post:

The DevOps series covers how to get started with the leading open source distributed technologies. In this tutorial, we step through how install Jupyter on your Spark cluster and use PySpark for some ad hoc analysis of reddit comment data on Amazon S3.

This following tutorial installs Jupyter on your Spark cluster in standalone mode on top of Hadoop and also walks through some transformations and queries on the reddit comment data on Amazon S3. We assume you already have an AWS EC2 cluster up with Spark 1.4.1 and Hadoop 2.7 installed. If not, you can go to our previous post on how to quickly deploy your own Spark cluster.

In Need a Bigoted, Racist Uncle for Holiday Meal? I mentioned the 1.6 billion Reddit comments that are the subject of this tutorial.

If you can’t find comments offensive to your guests in the Reddit comment collection, they are comatose and/or inanimate objects.

Big Data Holiday Game:

Divide into teams with at least one Jupyter/Apache Spark user on each team.

Play three timed rounds (time for each round dependent on your local schedule) where each team attempts to discover a Reddit comment that is the most offensive for the largest number of guests.

The winner gets bragging rights until next year, you get to show off your data mining skills, not to mention, you get a free pass on saying offensive things to your guests.

Watch for more formalized big data games of this nature by the holiday season for 2016!


I first saw this in a tweet by Data Science Renee.

Spinning up a Spark Cluster on Spot Instances: Step by Step [$0.69 for 6 hours]

Thursday, October 29th, 2015

Spinning up a Spark Cluster on Spot Instances: Step by Step by Austin Ouyang.

From the post:

The DevOps series covers how to get started with the leading open source distributed technologies. In this tutorial, we step through how to deploy a Spark Standalone cluster on AWS Spot Instances for less than $1. In a follow up post, we will show you how to use a Jupyter notebook on Spark for ad hoc analysis of reddit comment data on Amazon S3.

One of the significant hurdles in learning to build distributed systems is understanding how these various technologies are installed and their inter-dependencies. In our experience, the best way to get started with these technologies is to roll up your sleeves and build projects you are passionate about.

This following tutorial shows how you can deploy your own Spark cluster in standalone mode on top of Hadoop. Due to Spark’s memory demand, we recommend using m4.large spot instances with 200GB of magnetic hard drive space each.

m4.large spot instances are not within the free-tier package on AWS, so this tutorial will incur a small cost. The tutorial should not take any longer than a couple hours, but if we allot 6 hours for your 4 node spot cluster, the total cost should run around $0.69 depending on the region of your cluster. If you run this cluster for an entire month we can look at a bill of around $80, so be sure to spin down you cluster after you are finished using it.

How does $0.69 to improve your experience with distributed systems sound?

It’s hard to imagine a better deal.

The only reason to lack experience with distributed systems is lack of interest.

Odd I know but it does happen (or so I have heard). 😉

I first saw this in a tweet by Kirk Borne.

Congressional PageRank… [How To Avoid Bribery Charges]

Saturday, October 17th, 2015

Congressional PageRank – Analyzing US Congress With Neo4j and Apache Spark by William Lyon.

From the post:

As we saw previously, legis-graph is an open source software project that imports US Congressional data from Govtrack into the Neo4j graph database. This post shows how we can apply graph analytics to US Congressional data to find influential legislators in Congress. Using the Mazerunner open source graph analytics project we are able to use Apache Spark GraphX alongside Neo4j to run the PageRank algorithm on a collaboration graph of US Congress.

While Neo4j is a powerful graph database that allows for efficient OLTP queries and graph traversals using the Cypher query language, it is not optimized for global graph algorithms, such as PageRank. Apache Spark is a distributed in-memory large-scale data processing engine with a graph processing framework called GraphX. GraphX with Apache Spark is very efficient at performing global graph operations, like the PageRank algorithm. By using Spark alongside Neo4j we can enhance our analysis of US Congress using legis-graph.

Excellent walk-through to get you started on analyzing influence in congress, with modern data analysis tools. Getting a good grip on all these tools with be valuable.

Political scientists, among others, have studied the question of influence in Congress for decades so if you don’t want to repeat the results of others, being by consulting the American Political Science Review for prior work in this area.

An article that reports counter-intuitive results is: The Influence of Campaign Contributions on the Legislative Process by Lynda W. Powell.

From the introduction:

Do campaign donors gain disproportionate influence in the legislative process? Perhaps surprisingly, political scientists have struggled to answer this question. Much of the research has not identified an effect of contributions on policy; some political scientists have concluded that money does not matter; and this bottom line has been picked up by reporters and public intellectuals.1 It is essential to answer this question correctly because the result is of great normative importance in a democracy.

It is important to understand why so many studies find no causal link between contributions and policy outcomes. (emphasis added)

Linda cites much of the existing work on the influence of donations on process so her work makes a great starting point for further research.

As far as the lack of a “casual link between contributions and policy outcomes,” I think the answer is far simpler than Linda suspects.

The existence of a quid-pro-quo, the exchange of value for a vote on a particular bill, is the essence of the crime of public bribery. For the details (in the United States), see: 18 U.S. Code § 201 – Bribery of public officials and witnesses

What isn’t public bribery is to donate funds to an office holder on a regular basis, unrelated to any particular vote or act on the part of that official. Think of it as bribery on an installment plan.

When U.S. officials, such as former Secretary of State Hillary Clinton complain of corruption in other governments, they are criticizing quid-pro-quo bribery and not installment plan bribery as it is practiced in the United States.

Regular contributions gains ready access to legislators and, not surprisingly, more votes will go in your favor than random chance would allow.

Regular contributions are more expensive than direct bribes but avoiding the “causal link” is essential for all involved.

Announcing Spark 1.5

Sunday, September 20th, 2015

Announcing Spark 1.5 by Reynold Xin and Patrick Wendell.

From the post:

Today we are happy to announce the availability of Apache Spark’s 1.5 release! In this post, we outline the major development themes in Spark 1.5 and some of the new features we are most excited about. In the coming weeks, our blog will feature more detailed posts on specific components of Spark 1.5. For a comprehensive list of features in Spark 1.5, you can also find the detailed Apache release notes below.

Many of the major changes in Spark 1.5 are under-the-hood changes to improve Spark’s performance, usability, and operational stability. Spark 1.5 ships major pieces of Project Tungsten, an initiative focused on increasing Spark’s performance through several low-level architectural optimizations. The release also adds operational features for the streaming component, such as backpressure support. Another major theme of this release is data science: Spark 1.5 ships several new machine learning algorithms and utilities, and extends Spark’s new R API.

One interesting tidbit is that in Spark 1.5, we have crossed the 10,000 mark for JIRA number (i.e. more than 10,000 tickets have been filed to request features or report bugs). Hopefully the added digit won’t slow down our development too much!

It’s time to upgrade your Spark installation again!


Spark Release 1.5.0

Thursday, September 10th, 2015

Spark Release 1.5.0

From the post:

Spark 1.5.0 is the sixth release on the 1.x line. This release represents 1400+ patches from 230+ contributors and 80+ institutions. To download Spark 1.5.0 visit the downloads page.

You can consult JIRA for the detailed changes. We have curated a list of high level changes here:

Time for your Fall Spark Upgrade!


Get Smarter About Apache Spark

Tuesday, July 21st, 2015

Get Smarter About Apache Spark by Luis Arellano.

From the post:

We often forget how new Spark is. While it was invented much earlier, Apache Spark only became a top-level Apache project in February 2014 (generally indicating it’s ready for anyone to use), which is just 18 months ago. I might have a toothbrush that is older than Apache Spark!

Since then, Spark has generated tremendous interest because the new data processing platforms scales so well, is high performance (up to 100 times faster than alternatives), and is more flexible than other alternatives, both open source and commercial. (If you’re interested, see the trends on both Google searches and Indeed job postings.)

Spark gives the Data Scientist, Business Analyst, and Developer a new platform to manage data and build services as it provides the ability to compute in real-time via in-memory processing. The project is extremely active with ongoing development, and has serious investment from IBM and key players in Silicon Valley.

Luis has collected up links for absolute beginners, understanding the basics, intermediate learning and finally reaching the expert level.

None of the lists are overwhelming so give them a try.


Friday, July 10th, 2015

SparkHub: A Community Site for Apache Spark

The start of a community site for Apache Spark.

I say the “start of” because under videos you will find only Spark Summit 2015 and Spark Summit East 2015.

No promises on accuracy but searching for “Apache Spark” at YouTube results in 4,900 “hits” as of today (10 July 2015).

Make no mistake, Databricks (the force behind SparkHub) has a well deserved and formidable reputation when it comes to Spark.

And the site was just released today, but still, one would expect more diverse content than is found on the site today.

Given the widespread interest in Spark, widespread if mailing list traffic is any indication of interest, a curated site that does more than list videos, articles and post by title would be a real community asset.

Think of a Spark site that identifies issues by time marks in videos and links those snippets to bug/improvement issues and any discussion.

Make Spark discussions a seamless web as opposed to a landscape of dead-ends, infinitely deep potholes and diamonds, diamonds that once discovered, are covered back up again.

It doesn’t have to be that way.

Spark in Clojure

Tuesday, June 30th, 2015

Spark in Clojure by Mykhailo Kozik.

From the post:

Apache Spark is a fast and general engine for large-scale data processing.

100 times faster than Hadoop.

Everyone knows SQL. But traditional databases are not good in hadling big amount of data. Nevertheless, SQL is a good DSL for data processing and it is much easier to understand Spark if you have similar query implemented in SQL.

This article shows how common SQL queries implemented in Spark.

Another long holiday weekend appropriate posting.

Good big data practice too.

15 Easy Solutions To Your Data Frame Problems In R

Monday, June 15th, 2015

15 Easy Solutions To Your Data Frame Problems In R.

From the post:

R’s data frames regularly create somewhat of a furor on public forums like Stack Overflow and Reddit. Starting R users often experience problems with the data frame in R and it doesn’t always seem to be straightforward. But does it really need to be so?

Well, not necessarily.

With today’s post, DataCamp wants to show you that data frames don’t need to be hard: we offer you 15 easy, straightforward solutions to the most frequently occurring problems with data.frame. These issues have been selected from the most recent and sticky or upvoted Stack Overflow posts. If, however, you are more interested in getting an elaborate introduction to data frames, you might consider taking a look at our Introduction to R course.

If you are having trouble with frames in R, you are going to have trouble with frames in Spark.

Questions and solutions you will see here:

  • How To Create A Simple Data Frame in R
  • How To Change A Data Frame’s Row And Column Names
  • How To Check A Data Frame’s Dimensions
  • How To Access And Change A Data Frame’s Values …. Through The Variable Names
  • … Through The [,] and $ Notations
  • Why And How To Attach Data Frames
  • How To Apply Functions To Data Frames
  • How To Create An Empty Data Frame
  • How To Extract Rows And Colums, Subseting Your Data Frame
  • How To Remove Columns And Rows From A Data Frame
  • How To Add Rows And Columns To A Data Frame
  • Why And How To Reshape A Data Frame From Wide To Long Format And Vice Versa
  • Using stack() For Simply Structured Data Frames
  • Using reshape() For Complex Data Frames
  • Reshaping Data Frames With tidyr
  • Reshaping Data Frames With reshape2
  • How To Sort A Data Frame
  • How To Merge Data Frames
  • Merging Data Frames On Row Names
  • How To Remove Data Frames’ Rows And Columns With NA-Values
  • How To Convert Lists Or Matrices To Data Frames And Back
  • Changing A Data Frame To A Matrix Or List

Rather than looking for a “cheatsheet” on data frames, suggest you work your way through these solutions, more than once. Over time you will learn the ones relevant to your particular domain.


Announcing SparkR: R on Spark [Spark Summit next week – free live streaming]

Wednesday, June 10th, 2015

Announcing SparkR: R on Spark by Shivaram Venkataraman.

From the post:

I am excited to announce that the upcoming Apache Spark 1.4 release will include SparkR, an R package that allows data scientists to analyze large datasets and interactively run jobs on them from the R shell.

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the runtime is single-threaded and can only process data sets that fit in a single machine’s memory. SparkR, an R package initially developed at the AMPLab, provides an R frontend to Apache Spark and using Spark’s distributed computation engine allows us to run large scale data analysis from the R shell.

The short news here or go to the Spark Summit to get the full story. (Code Databricks20 gets a 20% discount) (That’s next week, June 15 – 17, San Francisco. You need to act quickly.)

BTW, you can register for free live streaming!

Looking forward to this!

Apache Spark on HDP: Learn, Try and Do

Thursday, June 4th, 2015

Apache Spark on HDP: Learn, Try and Do by Jules S. Damji.

I wanted to leave you with something fun to enjoy this evening. I am off to read a forty-eight (48) page bill that would make your ninth (9th) grade English teacher hurl. It’s really grim stuff that boils down to a lot of nothing but you have to parse through it to make that evident. More on that tomorrow.

From the post:

Not a day passes without someone tweeting or re-tweeting a blog on the virtues of Apache Spark.

At a Memorial Day BBQ, an old friend proclaimed: “Spark is the new rub, just as Java was two decades ago. It’s a developers’ delight.”

Spark as a distributed data processing and computing platform offers much of what developers’ desire and delight—and much more. To the ETL application developer Spark offers expressive APIs for transforming data; to the data scientists it offers machine libraries, MLlib component; and to data analysts it offers SQL capabilities for inquiry.

In this blog, I summarize how you can get started, enjoy Spark’s delight, and commence on a quick journey to Learn, Try, and Do Spark on HDP, with a set of tutorials.

I don’t know which is more disturbing. That Spark was being discussed at a Memorial Day BBQ or that anyone was sober enough to remember it. Life seems to change when you are older than the average cardiologist.

Sorry! Where were we, oh, yes, Saptak Sen has collected a set of tutorials to introduce you to Spark on the HDP Sandbox.

Near the bottom of the page, Apache Zeppelin (incubating) is mentioned along with Spark. Could use it to enable exploration of a data set. Could also use it so that users “discover” on their own that your analysis of the data is indeed correct. 😉

Due diligence means not only seeing the data as processed but the data from where that data was drawn, what pre-processing was done on that data, the circumstances under which the “original” data came into being, the algorithms applied at all stages, to name only a few considerations.

The demonstration of a result merits, “that’s interesting” until you have had time to verify it. “Trust” comes after verification.

Statistical and Mathematical Functions with DataFrames in Spark

Tuesday, June 2nd, 2015

Statistical and Mathematical Functions with DataFrames in Spark by Burak Yavuz and Reynold Xin.

From the post:

We introduced DataFrames in Spark 1.3 to make Apache Spark much easier to use. Inspired by data frames in R and Python, DataFrames in Spark expose an API that’s similar to the single-node data tools that data scientists are already familiar with. Statistics is an important part of everyday data science. We are happy to announce improved support for statistical and mathematical functions in the upcoming 1.4 release.

In this blog post, we walk through some of the important functions, including:

  1. Random data generation
  2. Summary and descriptive statistics
  3. Sample covariance and correlation
  4. Cross tabulation (a.k.a. contingency table)
  5. Frequent items
  6. Mathematical functions

We use Python in our examples. However, similar APIs exist for Scala and Java users as well.

You do know you have to build Spark yourself to find these features before the release of 1.4. Yes? For that:

Have you ever heard the expression “used in anger?”

That’s what Spark and its components deserve, to be “used in anger.”


Data Science on Spark

Tuesday, June 2nd, 2015

Databricks Launches MOOC: Data Science on Spark by Ameet Talwalkar and Anthony Joseph.

From the post:

For the past several months, we have been working in collaboration with professors from the University of California Berkeley and University of California Los Angeles to produce two freely available Massive Open Online Courses (MOOCs). We are proud to announce that both MOOCs will launch in June on the edX platform!

The first course, called Introduction to Big Data with Apache Spark, begins today [June 1, 2015] and teaches students about Apache Spark and performing data analysis. The second course, called Scalable Machine Learning, will begin on June 29th and will introduce the underlying statistical and algorithmic principles required to develop scalable machine learning pipelines, and provides hands-on experience using Spark. Both courses will be freely available on the edX MOOC platform, and edX Verified Certificates are also available for a fee.

Both courses are available for free on the edX website, and you can sign up for them today:

  1. Introduction to Big Data with Apache Spark
  2. Scalable Machine Learning

It is our mission to enable data scientists and engineers around the world to leverage the power of Big Data, and an important part of this mission is to educate the next generation.

If you believe in the wisdom of crowds, some 80K enrolled students as of yesterday.

So, what are you waiting for?


Announcing KeystoneML

Saturday, May 30th, 2015

Announcing KeystoneML

From the post:

We’ve written about machine learning pipelines in this space in the past. At the AMPLab Retreat this week, we released (live, on stage!) KeystoneML, a software framework designed to simplify the construction of large scale, end-to-end, machine learning pipelines in Apache Spark. KeystoneML is alpha software, but we’re releasing it now to get feedback from users and to collect more use cases.

Included in the package is a type-safe API for building robust pipelines and example operators used to construct them in the domains of natural language processing, computer vision, and speech. Additionally, we’ve included and linked to several scalable and robust statistical operators and machine learning algorithms which can be reused by many workflows.

Also included in the code are several example pipelines that demonstrate how to use the software to reproduce recent academic results in computer vision, natural language processing, and speech processing….

In case you don’t have plans for the rest of the weekend! 😉

Being mindful of Emmett McQuinn’s post, Amazon Machine Learning is not for your average developer – yet, doesn’t mean you have to remain an “average” developer.

You can wait for a cookie cutter solution from Amazon or you can get ahead of the curve. Your call.

First experiments with Apache Spark at Snowplow

Friday, May 22nd, 2015

First experiments with Apache Spark at Snowplow by Justin Courty.

From the post:

As we talked about in our May post on the Spark Example Project release, at Snowplow we are very interested in Apache Spark for three things:

  1. Data modeling i.e. applying business rules to aggregate up event-level data into a format suitable for ingesting into a business intelligence / reporting / OLAP tool
  2. Real-time aggregation of data for real-time dashboards
  3. Running machine-learning algorithms on event-level data

We’re just at the beginning of our journey getting familiar with Apache Spark. I’ve been using Spark for the first time over the past few weeks. In this post I’ll share back with the community what I’ve learnt, and will cover:

  1. Loading Snowplow data into Spark
  2. Performing simple aggregations on Snowplow data in Spark
  3. Performing funnel analysis on Snowplow data

I’ve tried to write the post in a way that’s easy to follow-along for other people interested in getting up the Spark learning curve.

What a great post to find just before the weekend!

You will enjoy this one and others in this series.

Have you every considered aggregation into business dashboard to include what is known about particular subjects? We have all seen the dashboards with increasing counts, graphs, charts, etc. but what about non-tabular data?

A non-tabular dashboard?

Running Spark GraphX algorithms on Library of Congress subject heading SKOS

Monday, May 4th, 2015

Running Spark GraphX algorithms on Library of Congress subject heading SKOS by Bob Ducharme.

From the post:

Well, one algorithm, but a very cool one.

Last month, in Spark and SPARQL; RDF Graphs and GraphX, I described how Apache Spark has emerged as a more efficient alternative to MapReduce for distributing computing jobs across clusters. I also described how Spark’s GraphX library lets you do this kind of computing on graph data structures and how I had some ideas for using it with RDF data. My goal was to use RDF technology on GraphX data and vice versa to demonstrate how they could help each other, and I demonstrated the former with a Scala program that output some GraphX data as RDF and then showed some SPARQL queries to run on that RDF.

Today I’m demonstrating the latter by reading in a well-known RDF dataset and executing GraphX’s Connected Components algorithm on it. This algorithm collects nodes into groupings that connect to each other but not to any other nodes. In classic Big Data scenarios, this helps applications perform tasks such as the identification of subnetworks of people within larger networks, giving clues about which products or cat videos to suggest to those people based on what their friends liked.

As so typically happens when you are reading one Bob DuCharme post, you see another that one requires reading!

Bob covers storing RDF in RDD (Resilient Distributed Dataset), the basic Spark data structure, creating the report on connected components and ends with heavily commented code for his program.

Sadly the “related” values assigned by the Library of Congress don’t say how or why the values are related, such as:

“Hiding places”





Related values could be useful in some cases but if I am searching on “privacy,” as in the sense of being free from government intrusion, then “solitude,” “loneliness,” and “hiding places” aren’t likely to be helpful.

That’s not a problem with Spark or SKOS, but a limitation of the data being provided.

Distributed Machine Learning with Apache Mahout

Monday, May 4th, 2015

Distributed Machine Learning with Apache Mahout by Ian Pointer and Dr. Ir. Linda Terlouw.

The Refcard for Mahout takes a different approach from many other DZone Refcards.

Instead of a plethora of switches and commands, it covers two basis tasks:

  • Training and testing a Random Forest for handwriting recognition using Amazon Web Services EMR
  • Running a recommendation engine on a standalone Spark cluster

Different style from the usual Refcard but a welcome addition to the documentation available for Apache Mahout!


Getting Started with Spark (in Python)

Sunday, April 26th, 2015

Getting Started with Spark (in Python) by Benjamin Bengfort.

From the post:

Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap commodity hardware to do computing at supercomputer scale. Two ideas from Google in 2003 and 2004 made Hadoop possible: a framework for distributed storage (The Google File System), which is implemented as HDFS in Hadoop, and a framework for distributed computing (MapReduce).

These two ideas have been the prime drivers for the advent of scaling analytics, large scale machine learning, and other big data appliances for the last ten years! However, in technology terms, ten years is an incredibly long time, and there are some well-known limitations that exist, with MapReduce in particular. Notably, programming MapReduce is difficult. You have to chain Map and Reduce tasks together in multiple steps for most analytics. This has resulted in specialized systems for performing SQL-like computations or machine learning. Worse, MapReduce requires data to be serialized to disk between each step, which means that the I/O cost of a MapReduce job is high, making interactive analysis and iterative algorithms very expensive; and the thing is, almost all optimization and machine learning is iterative.

To address these problems, Hadoop has been moving to a more general resource management framework for computation, YARN (Yet Another Resource Negotiator). YARN implements the next generation of MapReduce, but also allows applications to leverage distributed resources without having to compute with MapReduce. By generalizing the management of the cluster, research has moved toward generalizations of distributed computation, expanding the ideas first imagined in MapReduce.

Spark is the first fast, general purpose distributed computing paradigm resulting from this shift and is gaining popularity rapidly. Spark extends the MapReduce model to support more types of computations using a functional programming paradigm, and it can cover a wide range of workflows that previously were implemented as specialized systems built on top of Hadoop. Spark uses in-memory caching to improve performance and, therefore, is fast enough to allow for interactive analysis (as though you were sitting on the Python interpreter, interacting with the cluster). Caching also improves the performance of iterative algorithms, which makes it great for data theoretic tasks, especially machine learning.

In this post we will first discuss how to set up Spark to start easily performing analytics, either simply on your local machine or in a cluster on EC2. We then will explore Spark at an introductory level, moving towards an understanding of what Spark is and how it works (hopefully motivating further exploration). In the last two sections we will start to interact with Spark on the command line and then demo how to write a Spark application in Python and submit it to the cluster as a Spark job.

Be forewarned, this post uses the “F” word (functional) to describe the programming paradigm of Spark. Just so you know. 😉

If you aren’t already using Spark, this is about as easy a learning curve as can be expected.


I first saw this in a tweet by DataMining.

Apache Spark, Now GA on Hortonworks Data Platform

Tuesday, April 14th, 2015

Apache Spark, Now GA on Hortonworks Data Platform by Vinay Shukla.

From the post:

Hortonworks is pleased to announce the general availability of Apache Spark in Hortonworks Data Platform (HDP)— now available on our downloads page. With HDP 2.2.4 Hortonworks now offers support for your developers and data scientists using Apache Spark 1.2.1.

HDP’s YARN-based architecture enables multiple applications to share a common cluster and dataset while ensuring consistent levels of service and response. Now Spark is one of the many data access engines that works with YARN and that is supported in an HDP enterprise data lake. Spark provides HDP subscribers yet another way to derive value from any data, any application, anywhere.

What more need I say?

Get thee to the downloads page!

Using Spark DataFrames for large scale data science

Friday, March 27th, 2015

Using Spark DataFrames for large scale data science by Reynold Xin.

From the post:

When we first open sourced Spark, we aimed to provide a simple API for distributed data processing in general-purpose programming languages (Java, Python, Scala). Spark enabled distributed data processing through functional transformations on distributed collections of data (RDDs). This was an incredibly powerful API—tasks that used to take thousands of lines of code to express could be reduced to dozens.

As Spark continues to grow, we want to enable wider audiences beyond big data engineers to leverage the power of distributed processing. The new DataFrame API was created with this goal in mind. This API is inspired by data frames in R and Python (Pandas), but designed from the ground up to support modern big data and data science applications. As an extension to the existing RDD API, DataFrames feature:

  • Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
  • Support for a wide array of data formats and storage systems
  • State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
  • Seamless integration with all big data tooling and infrastructure via Spark
  • APIs for Python, Java, Scala, and R (in development via SparkR)

For new users familiar with data frames in other programming languages, this API should make them feel at home. For existing Spark users, this extended API will make Spark easier to program, and at the same time improve performance through intelligent optimizations and code-generation.

If you don’t know Spark DataFrames, you are missing out on important Spark capabilities! This post will have to well on the way to recovery.

Even though the reading of data from other sources is “easy” in many cases and support for more is growing, I am troubled by statements like:

DataFrames’ support for data sources enables applications to easily combine data from disparate sources (known as federated query processing in database systems). For example, the following code snippet joins a site’s textual traffic log stored in S3 with a PostgreSQL database to count the number of times each user has visited the site.

That goes well beyond reading data and introduces the concept of combining data, which isn’t the same thing.

For any two data sets that are trivially transparent to you (caveat what is transparent to you may/may not be transparent to others), that example works.

That example fails where data scientists spend 50 to 80 percent of their time: “collecting and preparing unruly digital data.” For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights.

If your handlers are content to spend 50 to 80 percent of your time munging data, enjoy. Not that munging data will ever go away, but documenting the semantics of your data can enable you to spend less time munging and more time on enjoyable tasks.

How to install Spark 1.2 on Azure HDInsight clusters

Friday, March 20th, 2015

How to install Spark 1.2 on Azure HDInsight clusters by Maxim Lukiyanov.

From the post:

Today we are pleased to announce the refresh of the Apache Spark support on Azure HDInsight clusters. Spark is available on HDInsight through custom script action and today we are updating it to support the latest version of Spark 1.2. The previous version supported version 1.0. This update also adds Spark SQL support to the package.

Spark 1.2 script action requires latest version of HDInsight clusters 3.2. Older HDInsight clusters will get previous version of Spark 1.0 when customized with Spark script action.

Follow the below steps to create Spark cluster using Azure Portal:

The only remaining questions are: How good are you with Spark? and How big of a Spark cluster do you neeed? (or can afford).


Can Spark Streaming survive Chaos Monkey?

Tuesday, March 17th, 2015

Can Spark Streaming survive Chaos Monkey? by Bharat Venkat, Prasanna Padmanabhan, Antony Arokiasamy, Raju Uppalap.

From the post:

Netflix is a data-driven organization that places emphasis on the quality of data collected and processed. In our previous blog post, we highlighted our use cases for real-time stream processing in the context of online recommendations and data monitoring. With Spark Streaming as our choice of stream processor, we set out to evaluate and share the resiliency story for Spark Streaming in the AWS cloud environment. A Chaos Monkey based approach, which randomly terminated instances or processes, was employed to simulate failures.

Spark on Amazon Web Services (AWS) is relevant to us as Netflix delivers its service primarily out of the AWS cloud. Stream processing systems need to be operational 24/7 and be tolerant to failures. Instances on AWS are ephemeral, which makes it imperative to ensure Spark’s resiliency.

If Spark was commercial product this is where you would see in bold, not a vendor report, from a customer.

You need to see the post for the details but so you know what to expect:

Behaviour on Component Failure
Client Mode: The entire application is killed
Cluster Mode with supervise: The Driver is restarted on a different Worker node
Single Master: The entire application is killed
Multi Master: A STANDBY master is elected ACTIVE
Worker Process
All child processes (executor or driver) are also terminated and a new worker process is launched
A new executor is launched by the Worker process
Same as Executor as they are long running tasks inside the Executor
Worker Node
Worker, Executor and Driver processes run on Worker nodes and the behavior is same as killing them individually

I can think of few things more annoying that software that works, sometimes. If you want users to rely upon you, then your service will have to be reliable.

A performance post by Netflix is rumored to be in the offing!


Announcing Spark 1.3!

Saturday, March 14th, 2015

Announcing Spark 1.3! by Patrick Wendell.

From the post:

Today I’m excited to announce the general availability of Spark 1.3! Spark 1.3 introduces the widely anticipated DataFrame API, an evolution of Spark’s RDD abstraction designed to make crunching large datasets simple and fast. Spark 1.3 also boasts a large number of improvements across the stack, from Streaming, to ML, to SQL. The release has been posted today on the Apache Spark website.

We’ll be publishing in depth overview posts covering Spark’s new features over the coming weeks. Some of the salient features of this release are:

A new DataFrame API

Spark SQL Graduates from Alpha

Built-in Support for Spark Packages

Lower Level Kafka Support in Spark Streaming

New Algorithms in MLlib

See Patrick’s post and/or the release notes for full details!

BTW, Patrick promises more posts to follow covering Spark 1.3 in detail.

I first saw this in a tweet by Vidya.

Getting Started with Apache Spark and Neo4j Using Docker Compose

Wednesday, March 11th, 2015

Getting Started with Apache Spark and Neo4j Using Docker Compose by Kenny Bastani.

From the post:

I’ve received a lot of interest since announcing Neo4j Mazerunner. People from around the world have reached out to me and are excited about the possibilities of using Apache Spark and Neo4j together. From authors who are writing new books about big data to PhD researchers who need it to solve the world’s most challenging problems.

I’m glad to see such a wide range of needs for a simple integration like this. Spark and Neo4j are two great open source projects that are focusing on doing one thing very well. Integrating both products together makes for an awesome result.

Less is always more, simpler is always better.

Both Apache Spark and Neo4j are two tremendously useful tools. I’ve seen how both of these two tools give their users a way to transform problems that start out both large and complex into problems that become simpler and easier to solve. That’s what the companies behind these platforms are getting at. They are two sides of the same coin.

One tool solves for scaling the size, complexity, and retrieval of data, while the other is solving for the complexity of processing the enormity of data by distributed computation at scale. Both of these products are achieving this without sacrificing ease of use.

Inspired by this, I’ve been working to make the integration in Neo4j Mazerunner easier to install and deploy. I believe I’ve taken a step forward in this and I’m excited to announce it in this blog post.


Now for something a bit more esoteric than CSV. 😉

This guide will get you into Docker land as well.

Please share and forward.



Wednesday, March 11th, 2015


From the webpage:

Convert text from a file or from stdin into SQL table and query it instantly. Uses sqlite as backend. The idea is to make SQL into a tool on the command line or in scripts.

Online manual:

So what can it do?

  • convert text/CSV files into sqlite database/table
  • work on stdin data on-the-fly
  • it can be used as swiss army knife kind of tool for extracting information from other processes that send their information to termsql via a pipe on the command line or in scripts
  • termsql can also pipe into another termsql of course
  • you can quickly sort and extract data
  • creates string/integer/float column types automatically
  • gives you the syntax and power of SQL on the command line

Sometimes you need the esoteric and sometimes not!


I first saw this in a tweet by Christophe Lalanne.