Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 19, 2015

Introducing DataFrames in Spark for Large Scale Data Science

Filed under: Data Frames, Spark — Patrick Durusau @ 10:12 am

Introducing DataFrames in Spark for Large Scale Data Science by Reynold Xin, Michael Armbrust and Davies Liu.

From the post:

Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience.

When we first open sourced Spark, we aimed to provide a simple API for distributed data processing in general-purpose programming languages (Java, Python, Scala). Spark enabled distributed data processing through functional transformations on distributed collections of data (RDDs). This was an incredibly powerful API: tasks that used to take thousands of lines of code to express could be reduced to dozens.

As Spark continues to grow, we want to enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing. The new DataFrames API was created with this goal in mind. This API is inspired by data frames in R and Python (Pandas), but designed from the ground-up to support modern big data and data science applications. As an extension to the existing RDD API, DataFrames feature:

  • Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
  • Support for a wide array of data formats and storage systems
  • State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
  • Seamless integration with all big data tooling and infrastructure via Spark
  • APIs for Python, Java, Scala, and R (in development via SparkR)

For new users familiar with data frames in other programming languages, this API should make them feel at home. For existing Spark users, this extended API will make Spark easier to program, and at the same time improve performance through intelligent optimizations and code-generation.

The DataFrame API will be released with Spark 1.3 in early March.
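To get a feel for what the post is describing, here is a minimal PySpark sketch of the DataFrame API. It assumes the Spark 1.3 SQLContext entry point and a hypothetical people.json input file; method names follow the 1.3 API and may differ in later releases.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="dataframe-demo")
    sqlContext = SQLContext(sc)

    # Load a JSON file into a DataFrame (people.json is a hypothetical input).
    df = sqlContext.jsonFile("people.json")

    # Relational-style operations; the Catalyst optimizer plans the execution.
    adults = df.filter(df.age > 21).select(df.name, df.age)

    # Aggregations read much like data frames in R or Pandas.
    adults.groupBy("age").count().show()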

BTW, the act of using a DataFrame creates a new subject, yes? How are you going to document the semantics of such subjects? I didn’t notice a place to write down that information.

That’s a good question to ask of many of the emerging big/large/ginormous data tools. I have trouble remembering what I meant from yesterday’s notes, and that’s not an uncommon experience. Imagine six months from now. Or when you are at your third client this month and the first one calls for help.
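For what it’s worth, one place such documentation could live is the metadata parameter on pyspark.sql.types.StructField. A minimal sketch, assuming the 1.3 createDataFrame API; the "doc" key is my own convention, not anything Spark defines:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Record the intended meaning of each column so the semantics travel
    # with the schema instead of living only in someone's head.
    # "doc" is a hypothetical key chosen for this sketch.
    schema = StructType([
        StructField("name", StringType(), True,
                    metadata={"doc": "name as it appears in the source HR feed"}),
        StructField("age", IntegerType(), True,
                    metadata={"doc": "age in whole years at the time of export"}),
    ])

    df = sqlContext.createDataFrame([("Alice", 34), ("Bob", 45)], schema)

    # Read the documentation back out of the schema.
    for field in df.schema.fields:
        print(field.name, "->", field.metadata.get("doc", "undocumented"))

That gives you somewhere to write the information down; it does nothing, by itself, to reconcile what two different analysts mean by “age.”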

Remember: To many eyes, undocumented subjects are opaque.

I first saw this in a tweet by Sebastian Raschka.
