Statistical and Mathematical Functions with DataFrames in Spark by Burak Yavuz and Reynold Xin.
From the post:
We introduced DataFrames in Spark 1.3 to make Apache Spark much easier to use. Inspired by data frames in R and Python, DataFrames in Spark expose an API that’s similar to the single-node data tools that data scientists are already familiar with. Statistics is an important part of everyday data science. We are happy to announce improved support for statistical and mathematical functions in the upcoming 1.4 release.
In this blog post, we walk through some of the important functions, including:
- Random data generation
- Summary and descriptive statistics
- Sample covariance and correlation
- Cross tabulation (a.k.a. contingency table)
- Frequent items
- Mathematical functions
We use Python in our examples. However, similar APIs exist for Scala and Java users as well.
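To make a few of the listed items concrete, here is a minimal plain-Python sketch of what sample covariance, correlation, and a cross tabulation compute. It is not code from the post and does not use Spark. The function names `sample_cov`, `pearson_corr`, and `crosstab` are hypothetical helpers chosen for illustration, and Pearson correlation is assumed. In the DataFrame API these operations are exposed under `df.stat` (`cov`, `corr`, `crosstab`).

```python
from collections import Counter
from math import sqrt


def sample_cov(xs, ys):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_corr(xs, ys):
    """Pearson correlation: the covariance scaled by both standard deviations."""
    return sample_cov(xs, ys) / sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))


def crosstab(pairs):
    """Contingency table as a mapping from (a, b) pairs to their counts."""
    return Counter(pairs)


# Illustrative data only (hypothetical values, not taken from the post).
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
purchases = [("Alice", "milk"), ("Bob", "bread"), ("Alice", "milk")]

cov_xy = sample_cov(xs, ys)
corr_xy = pearson_corr(xs, ys)
table = crosstab(purchases)
```

Because `ys` is an exact multiple of `xs`, the correlation comes out as 1 (up to floating-point rounding), and the table counts the pair `("Alice", "milk")` twice.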
You do know that you have to build Spark yourself to get these features before 1.4 is released, yes? For that, see: https://github.com/apache/spark/tree/branch-1.4.
Have you ever heard the expression “used in anger”?
That’s what Spark and its components deserve, to be “used in anger.”
Enjoy!