Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 2, 2015

Statistical and Mathematical Functions with DataFrames in Spark

Filed under: Data Frames,Python,Spark — Patrick Durusau @ 2:59 pm

Statistical and Mathematical Functions with DataFrames in Spark by Burak Yavuz and Reynold Xin.

From the post:

We introduced DataFrames in Spark 1.3 to make Apache Spark much easier to use. Inspired by data frames in R and Python, DataFrames in Spark expose an API that’s similar to the single-node data tools that data scientists are already familiar with. Statistics is an important part of everyday data science. We are happy to announce improved support for statistical and mathematical functions in the upcoming 1.4 release.

In this blog post, we walk through some of the important functions, including:

  1. Random data generation
  2. Summary and descriptive statistics
  3. Sample covariance and correlation
  4. Cross tabulation (a.k.a. contingency table)
  5. Frequent items
  6. Mathematical functions

We use Python in our examples. However, similar APIs exist for Scala and Java users as well.

You do know that you have to build Spark yourself to get these features before the 1.4 release, yes? For that, see: https://github.com/apache/spark/tree/branch-1.4.
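If you don't have a 1.4 build handy, here is a rough plain-Python sketch of what items 3 and 4 in the list compute: sample covariance, Pearson correlation, and a cross tabulation. This is not Spark code and not taken from the post. The helper names (`sample_cov`, `pearson_corr`, `crosstab`) and the toy data are my own, purely for illustration. In Spark 1.4 the corresponding calls should be along the lines of `df.stat.cov`, `df.stat.corr`, and `df.stat.crosstab`, but check the post and the API docs for the exact usage.

```python
from collections import Counter
from math import sqrt


def sample_cov(xs, ys):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    return sample_cov(xs, ys) / sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))


def crosstab(rows, cols):
    """Contingency table as a Counter keyed by (row value, column value) pairs."""
    return Counter(zip(rows, cols))


# Hypothetical toy data, made up for this sketch.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
cov_xy = sample_cov(xs, ys)
corr_xy = pearson_corr(xs, ys)

names = ["Alice", "Bob", "Alice", "Bob", "Alice"]
items = ["milk", "bread", "milk", "milk", "bread"]
table = crosstab(names, items)
```

The point of the DataFrame versions is that the same quantities are computed over distributed columns rather than in-memory lists, so treat this only as a reminder of the definitions.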

Have you ever heard the expression “used in anger”?

That’s what Spark and its components deserve, to be “used in anger.”

Enjoy!
