Using Spark DataFrames for large scale data science by Reynold Xin.
From the post:
When we first open sourced Spark, we aimed to provide a simple API for distributed data processing in general-purpose programming languages (Java, Python, Scala). Spark enabled distributed data processing through functional transformations on distributed collections of data (RDDs). This was an incredibly powerful API—tasks that used to take thousands of lines of code to express could be reduced to dozens.
As Spark continues to grow, we want to enable wider audiences beyond big data engineers to leverage the power of distributed processing. The new DataFrame API was created with this goal in mind. This API is inspired by data frames in R and Python (Pandas), but designed from the ground up to support modern big data and data science applications. As an extension to the existing RDD API, DataFrames feature:
- Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
- Support for a wide array of data formats and storage systems
- State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
- Seamless integration with all big data tooling and infrastructure via Spark
- APIs for Python, Java, Scala, and R (in development via SparkR)
For new users familiar with data frames in other programming languages, this API should make them feel at home. For existing Spark users, this extended API will make Spark easier to program, and at the same time improve performance through intelligent optimizations and code-generation.
If you don’t know Spark DataFrames, you are missing out on important Spark capabilities! This post will have you well on the way to recovery.
Even though reading data from other sources is “easy” in many cases, and support for more sources is growing, I am troubled by statements like:
DataFrames’ support for data sources enables applications to easily combine data from disparate sources (known as federated query processing in database systems). For example, the following code snippet joins a site’s textual traffic log stored in S3 with a PostgreSQL database to count the number of times each user has visited the site.
That goes well beyond reading data and introduces the concept of combining data, which isn’t the same thing.
For any two data sets that are trivially transparent to you, that example works. (Caveat: what is transparent to you may or may not be transparent to others.)
That example fails where data scientists spend 50 to 80 percent of their time: “collecting and preparing unruly digital data.” (Quoted from For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights.)
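To make the objection concrete, here is a minimal sketch in plain Python (deliberately not the Spark API, and not the snippet from the quoted post). The records, the field names, and the `visits_per_user` helper are all hypothetical. It assumes the traffic log stores `userId` as text while the users table stores it as an integer, which is the sort of undocumented difference that makes "combining" harder than "reading."

```python
from collections import Counter

# Hypothetical stand-ins for the two sources in the quoted example:
# a traffic log (one record per visit) and a users table.
traffic_log = [
    {"userId": "42", "url": "/home"},     # log stores ids as text
    {"userId": "42", "url": "/pricing"},
    {"userId": "7", "url": "/home"},
]
users = [
    {"userId": 42, "name": "alice"},      # table stores ids as integers
    {"userId": 7, "name": "bob"},
]

def visits_per_user(log, users, normalize=lambda key: key):
    """Join users to the log on userId and count visits per user.

    `normalize` stands in for documented key semantics: a rule that
    maps both sources' ids onto one comparable form.
    """
    counts = Counter(normalize(rec["userId"]) for rec in log)
    return {u["name"]: counts.get(normalize(u["userId"]), 0) for u in users}

# Without documented semantics the join runs but matches nothing.
naive = visits_per_user(traffic_log, users)

# With a documented rule ("userId is compared as text") it matches.
documented = visits_per_user(traffic_log, users, normalize=str)

print(naive)
print(documented)
```

The naive join raises no error; it simply reports zero visits for every user. That silent failure is the munging problem: the fix is a single line, but only once someone has written down what `userId` means in each source.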
If your handlers are content to spend 50 to 80 percent of your time munging data, enjoy. Not that munging data will ever go away, but documenting the semantics of your data can enable you to spend less time munging and more time on enjoyable tasks.