Archive for the ‘Data Frames’ Category

Spark Release 1.5.0

Thursday, September 10th, 2015

Spark Release 1.5.0

From the post:

Spark 1.5.0 is the sixth release on the 1.x line. This release represents 1400+ patches from 230+ contributors and 80+ institutions. To download Spark 1.5.0 visit the downloads page.

You can consult JIRA for the detailed changes. We have curated a list of high level changes here:

Time for your Fall Spark Upgrade!

Enjoy!

15 Easy Solutions To Your Data Frame Problems In R

Monday, June 15th, 2015

15 Easy Solutions To Your Data Frame Problems In R.

From the post:

R’s data frames regularly create somewhat of a furor on public forums like Stack Overflow and Reddit. Starting R users often experience problems with the data frame in R and it doesn’t always seem to be straightforward. But does it really need to be so?

Well, not necessarily.

With today’s post, DataCamp wants to show you that data frames don’t need to be hard: we offer you 15 easy, straightforward solutions to the most frequently occurring problems with data.frame. These issues have been selected from the most recent and sticky or upvoted Stack Overflow posts. If, however, you are more interested in getting an elaborate introduction to data frames, you might consider taking a look at our Introduction to R course.

If you are having trouble with frames in R, you are going to have trouble with frames in Spark.

Questions and solutions you will see here:

  • How To Create A Simple Data Frame in R
  • How To Change A Data Frame’s Row And Column Names
  • How To Check A Data Frame’s Dimensions
  • How To Access And Change A Data Frame’s Values …. Through The Variable Names
  • … Through The [,] and $ Notations
  • Why And How To Attach Data Frames
  • How To Apply Functions To Data Frames
  • How To Create An Empty Data Frame
  • How To Extract Rows And Colums, Subseting Your Data Frame
  • How To Remove Columns And Rows From A Data Frame
  • How To Add Rows And Columns To A Data Frame
  • Why And How To Reshape A Data Frame From Wide To Long Format And Vice Versa
  • Using stack() For Simply Structured Data Frames
  • Using reshape() For Complex Data Frames
  • Reshaping Data Frames With tidyr
  • Reshaping Data Frames With reshape2
  • How To Sort A Data Frame
  • How To Merge Data Frames
  • Merging Data Frames On Row Names
  • How To Remove Data Frames’ Rows And Columns With NA-Values
  • How To Convert Lists Or Matrices To Data Frames And Back
  • Changing A Data Frame To A Matrix Or List

Rather than looking for a “cheatsheet” on data frames, suggest you work your way through these solutions, more than once. Over time you will learn the ones relevant to your particular domain.

Enjoy!

Statistical and Mathematical Functions with DataFrames in Spark

Tuesday, June 2nd, 2015

Statistical and Mathematical Functions with DataFrames in Spark by Burak Yavuz and Reynold Xin.

From the post:

We introduced DataFrames in Spark 1.3 to make Apache Spark much easier to use. Inspired by data frames in R and Python, DataFrames in Spark expose an API that’s similar to the single-node data tools that data scientists are already familiar with. Statistics is an important part of everyday data science. We are happy to announce improved support for statistical and mathematical functions in the upcoming 1.4 release.

In this blog post, we walk through some of the important functions, including:

  1. Random data generation
  2. Summary and descriptive statistics
  3. Sample covariance and correlation
  4. Cross tabulation (a.k.a. contingency table)
  5. Frequent items
  6. Mathematical functions

We use Python in our examples. However, similar APIs exist for Scala and Java users as well.

You do know you have to build Spark yourself to find these features before the release of 1.4. Yes? For that: https://github.com/apache/spark/tree/branch-1.4.

Have you ever heard the expression “used in anger?”

That’s what Spark and its components deserve, to be “used in anger.”

Enjoy!

On The Bleeding Edge – PySpark, DataFrames, and Cassandra

Saturday, May 2nd, 2015

On The Bleeding Edge – PySpark, DataFrames, and Cassandra.

From the post:

A few months ago I wrote a post on Getting Started with Cassandra and Spark.

I’ve worked with Pandas for some small personal projects and found it very useful. The key feature is the data frame, which comes from R. Data Frames are new in Spark 1.3 and was covered in this blog post. Till now I’ve had to write Scala in order to use Spark. This has resulted in me spending a lot of time looking for libraries that would normally take me less than a second to recall the proper Python library (JSON being an example) since I don’t know Scala very well.

If you need help deciding whether to read this post, take a look at Spark SQL and DataFrame Guide to see what you stand to gain.

Enjoy!

Using Spark DataFrames for large scale data science

Friday, March 27th, 2015

Using Spark DataFrames for large scale data science by Reynold Xin.

From the post:

When we first open sourced Spark, we aimed to provide a simple API for distributed data processing in general-purpose programming languages (Java, Python, Scala). Spark enabled distributed data processing through functional transformations on distributed collections of data (RDDs). This was an incredibly powerful API—tasks that used to take thousands of lines of code to express could be reduced to dozens.

As Spark continues to grow, we want to enable wider audiences beyond big data engineers to leverage the power of distributed processing. The new DataFrame API was created with this goal in mind. This API is inspired by data frames in R and Python (Pandas), but designed from the ground up to support modern big data and data science applications. As an extension to the existing RDD API, DataFrames feature:

  • Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
  • Support for a wide array of data formats and storage systems
  • State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
  • Seamless integration with all big data tooling and infrastructure via Spark
  • APIs for Python, Java, Scala, and R (in development via SparkR)

For new users familiar with data frames in other programming languages, this API should make them feel at home. For existing Spark users, this extended API will make Spark easier to program, and at the same time improve performance through intelligent optimizations and code-generation.

If you don’t know Spark DataFrames, you are missing out on important Spark capabilities! This post will have to well on the way to recovery.

Even though the reading of data from other sources is “easy” in many cases and support for more is growing, I am troubled by statements like:


DataFrames’ support for data sources enables applications to easily combine data from disparate sources (known as federated query processing in database systems). For example, the following code snippet joins a site’s textual traffic log stored in S3 with a PostgreSQL database to count the number of times each user has visited the site.

That goes well beyond reading data and introduces the concept of combining data, which isn’t the same thing.

For any two data sets that are trivially transparent to you (caveat what is transparent to you may/may not be transparent to others), that example works.

That example fails where data scientists spend 50 to 80 percent of their time: “collecting and preparing unruly digital data.” For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights.

If your handlers are content to spend 50 to 80 percent of your time munging data, enjoy. Not that munging data will ever go away, but documenting the semantics of your data can enable you to spend less time munging and more time on enjoyable tasks.

Introducing DataFrames in Spark for Large Scale Data Science

Thursday, February 19th, 2015

Introducing DataFrames in Spark for Large Scale Data Science by Reynold Xin, Michael Armbrust and Davies Liu.

From the post:

Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience.

When we first open sourced Spark, we aimed to provide a simple API for distributed data processing in general-purpose programming languages (Java, Python, Scala). Spark enabled distributed data processing through functional transformations on distributed collections of data (RDDs). This was an incredibly powerful API: tasks that used to take thousands of lines of code to express could be reduced to dozens.

As Spark continues to grow, we want to enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing. The new DataFrames API was created with this goal in mind. This API is inspired by data frames in R and Python (Pandas), but designed from the ground-up to support modern big data and data science applications. As an extension to the existing RDD API, DataFrames feature:

  • Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
  • Support for a wide array of data formats and storage systems
  • State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
  • Seamless integration with all big data tooling and infrastructure via Spark
  • APIs for Python, Java, Scala, and R (in development via SparkR)

For new users familiar with data frames in other programming languages, this API should make them feel at home. For existing Spark users, this extended API will make Spark easier to program, and at the same time improve performance through intelligent optimizations and code-generation.

The dataframe API will be released for Spark 1.3 in early March.

BTW, the act of using a dataframe creates a new subject, yes? How are you going to document the semantics of such subjects? I didn’t notice a place to write down that information.

That’s a good question of ask of many of the emerging big/large/ginormous data tools. I have trouble remembering what I meant from notes yesterday and that’s not an uncommon experience. Imagine six months from now. Or when you are at your third client this month and the first one calls for help.

Remember: To many eyes undocumented subjects are opaque.

I first saw this in a tweet by Sebastian Rascha