Data Frames « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 10, 2015

Spark Release 1.5.0

Filed under: Data Frames,GraphX,Machine Learning,R,Spark,Streams — Patrick Durusau @ 1:42 pm

From the post:

Spark 1.5.0 is the sixth release on the 1.x line. This release represents 1400+ patches from 230+ contributors and 80+ institutions. To download Spark 1.5.0 visit the downloads page.

You can consult JIRA for the detailed changes. We have curated a list of high level changes here:

APIs: RDD, DataFrame and SQL

Backend Execution: DataFrame and SQL

Integrations: Data Sources, Hive, Hadoop, Mesos and Cluster Management

R Language

Machine Learning and Advanced Analytics

Spark Streaming

Deprecations, Removals, Configs, and Behavior Changes: Spark Core, Spark SQL & DataFrames, Spark Streaming, MLlib

Known Issues: SQL/DataFrame, Streaming

Credits

…

Time for your Fall Spark Upgrade!

Enjoy!


June 15, 2015

15 Easy Solutions To Your Data Frame Problems In R

Filed under: Data Frames,R,Spark — Patrick Durusau @ 3:40 pm

15 Easy Solutions To Your Data Frame Problems In R.

From the post:

R’s data frames regularly create somewhat of a furor on public forums like Stack Overflow and Reddit. Starting R users often experience problems with the data frame in R and it doesn’t always seem to be straightforward. But does it really need to be so?

Well, not necessarily.

With today’s post, DataCamp wants to show you that data frames don’t need to be hard: we offer you 15 easy, straightforward solutions to the most frequently occurring problems with data.frame. These issues have been selected from the most recent and sticky or upvoted Stack Overflow posts. If, however, you are more interested in getting an elaborate introduction to data frames, you might consider taking a look at our Introduction to R course.

If you are having trouble with frames in R, you are going to have trouble with frames in Spark.

Questions and solutions you will see here:

How To Create A Simple Data Frame in R
How To Change A Data Frame’s Row And Column Names
How To Check A Data Frame’s Dimensions
How To Access And Change A Data Frame’s Values
… Through The Variable Names
… Through The [,] and $ Notations
Why And How To Attach Data Frames
How To Apply Functions To Data Frames
How To Create An Empty Data Frame
How To Extract Rows And Columns, Subsetting Your Data Frame
How To Remove Columns And Rows From A Data Frame
How To Add Rows And Columns To A Data Frame
Why And How To Reshape A Data Frame From Wide To Long Format And Vice Versa
Using stack() For Simply Structured Data Frames
Using reshape() For Complex Data Frames
Reshaping Data Frames With tidyr
Reshaping Data Frames With reshape2
How To Sort A Data Frame
How To Merge Data Frames
Merging Data Frames On Row Names
How To Remove Data Frames’ Rows And Columns With NA-Values
How To Convert Lists Or Matrices To Data Frames And Back
Changing A Data Frame To A Matrix Or List

Rather than looking for a “cheatsheet” on data frames, I suggest you work your way through these solutions more than once. Over time you will learn the ones relevant to your particular domain.

Enjoy!


June 2, 2015

Statistical and Mathematical Functions with DataFrames in Spark

Filed under: Data Frames,Python,Spark — Patrick Durusau @ 2:59 pm

Statistical and Mathematical Functions with DataFrames in Spark by Burak Yavuz and Reynold Xin.

From the post:

We introduced DataFrames in Spark 1.3 to make Apache Spark much easier to use. Inspired by data frames in R and Python, DataFrames in Spark expose an API that’s similar to the single-node data tools that data scientists are already familiar with. Statistics is an important part of everyday data science. We are happy to announce improved support for statistical and mathematical functions in the upcoming 1.4 release.

In this blog post, we walk through some of the important functions, including:

Random data generation

Summary and descriptive statistics

Sample covariance and correlation

Cross tabulation (a.k.a. contingency table)

Frequent items

Mathematical functions

We use Python in our examples. However, similar APIs exist for Scala and Java users as well.
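
For a feel of what several of the functions on that list compute, here is a minimal sketch in plain Python. It uses only the standard library and is not the Spark API; the sample data and the helper names (sample_cov, pearson_corr, crosstab) are made up for illustration.

```python
import math
import random
from collections import Counter


def sample_cov(xs, ys):
    """Sample covariance, using the n - 1 denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_corr(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))


def crosstab(rows, col_a, col_b):
    """Contingency table as a Counter keyed by (col_a value, col_b value)."""
    return Counter((row[col_a], row[col_b]) for row in rows)


# Random data generation: a seeded generator keeps the run repeatable.
rng = random.Random(42)
uniform = [rng.random() for _ in range(1000)]
normal = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Summary and descriptive statistics for one column.
summary = {
    "count": len(uniform),
    "mean": sum(uniform) / len(uniform),
    "min": min(uniform),
    "max": max(uniform),
}
print(summary)

# A column is perfectly correlated with a scaled copy of itself,
# and only weakly correlated with independent noise.
doubled = [2 * x for x in uniform]
print(pearson_corr(uniform, doubled))
print(pearson_corr(uniform, normal))

# Cross tabulation and frequent items on a tiny made-up table.
rows = [
    {"name": "Alice", "item": "milk"},
    {"name": "Bob", "item": "bread"},
    {"name": "Alice", "item": "milk"},
    {"name": "Bob", "item": "milk"},
]
table = crosstab(rows, "name", "item")
frequent = Counter(row["item"] for row in rows).most_common(1)
print(table)
print(frequent)
```

The point of DataFrames in Spark is that the same questions can be asked of data spread across a cluster; the arithmetic itself is no different from the above.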

You do know that you have to build Spark yourself to try these features before 1.4 is released, yes? For that, see: https://github.com/apache/spark/tree/branch-1.4.

Have you ever heard the expression “used in anger”?

That’s what Spark and its components deserve, to be “used in anger.”

Enjoy!


May 2, 2015

On The Bleeding Edge – PySpark, DataFrames, and Cassandra

Filed under: Cassandra,Data Frames,Python — Patrick Durusau @ 8:17 pm

On The Bleeding Edge – PySpark, DataFrames, and Cassandra.

From the post:

A few months ago I wrote a post on Getting Started with Cassandra and Spark.

I’ve worked with Pandas for some small personal projects and found it very useful. The key feature is the data frame, which comes from R. Data Frames are new in Spark 1.3 and were covered in this blog post. Till now I’ve had to write Scala in order to use Spark. As a result, I have spent a lot of time hunting for Scala libraries where I could recall the proper Python library in less than a second (JSON being an example), since I don’t know Scala very well.
…

If you need help deciding whether to read this post, take a look at the Spark SQL and DataFrame Guide to see what you stand to gain.

Enjoy!


March 27, 2015

Using Spark DataFrames for large scale data science

Filed under: BigData,Data Frames,Spark — Patrick Durusau @ 7:33 pm

Using Spark DataFrames for large scale data science by Reynold Xin.

From the post:

When we first open sourced Spark, we aimed to provide a simple API for distributed data processing in general-purpose programming languages (Java, Python, Scala). Spark enabled distributed data processing through functional transformations on distributed collections of data (RDDs). This was an incredibly powerful API—tasks that used to take thousands of lines of code to express could be reduced to dozens.

As Spark continues to grow, we want to enable wider audiences beyond big data engineers to leverage the power of distributed processing. The new DataFrame API was created with this goal in mind. This API is inspired by data frames in R and Python (Pandas), but designed from the ground up to support modern big data and data science applications. As an extension to the existing RDD API, DataFrames feature:

Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster

Support for a wide array of data formats and storage systems

State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer

Seamless integration with all big data tooling and infrastructure via Spark

APIs for Python, Java, Scala, and R (in development via SparkR)

For new users familiar with data frames in other programming languages, this API should make them feel at home. For existing Spark users, this extended API will make Spark easier to program, and at the same time improve performance through intelligent optimizations and code-generation.

If you don’t know Spark DataFrames, you are missing out on important Spark capabilities! This post will have you well on the way to recovery.

Even though the reading of data from other sources is “easy” in many cases and support for more is growing, I am troubled by statements like:

…
DataFrames’ support for data sources enables applications to easily combine data from disparate sources (known as federated query processing in database systems). For example, the following code snippet joins a site’s textual traffic log stored in S3 with a PostgreSQL database to count the number of times each user has visited the site.
…

That goes well beyond reading data and introduces the concept of combining data, which isn’t the same thing.

For any two data sets that are trivially transparent to you (caveat: what is transparent to you may or may not be transparent to others), that example works.

That example fails where data scientists spend 50 to 80 percent of their time: “collecting and preparing unruly digital data.” (See For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights.)

If your handlers are content to spend 50 to 80 percent of your time munging data, enjoy. Not that munging data will ever go away, but documenting the semantics of your data can enable you to spend less time munging and more time on enjoyable tasks.
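
To make the difference between reading and combining concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical (the visits and users records, the count_visits helper, the normalization rule); none of it comes from the Databricks post. The join runs without complaint either way, but the counts are only right once someone knows what a user identifier means in each source.

```python
from collections import Counter

# Hypothetical traffic log: one record per visit, keyed by a login name
# exactly as it was typed.
visits = [
    {"user": "alice"},
    {"user": "Alice"},   # same person, different capitalization
    {"user": "bob "},    # same person, trailing space
    {"user": "bob"},
]

# Hypothetical users table kept by a different system.
users = [
    {"login": "alice", "name": "Alice A."},
    {"login": "bob", "name": "Bob B."},
]


def count_visits(visits, users, normalize=lambda s: s):
    """Inner-join visits to users on the login key and count visits per user.

    `normalize` encodes what the key means: which spellings count as
    the same user. Visits that match no user are silently dropped.
    """
    known = {normalize(u["login"]): u["name"] for u in users}
    counts = Counter()
    for visit in visits:
        key = normalize(visit["user"])
        if key in known:
            counts[known[key]] += 1
    return dict(counts)


# Joining on the raw key "works" but loses half of the visits.
naive = count_visits(visits, users)
print(naive)    # {'Alice A.': 1, 'Bob B.': 1}

# Joining after applying the (documented!) key semantics.
cleaned = count_visits(visits, users, normalize=lambda s: s.strip().lower())
print(cleaned)  # {'Alice A.': 2, 'Bob B.': 2}
```

The one-line normalize rule is the semantics of the key. If it lives only in someone's head, the next person to join these sources gets to rediscover it by munging.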


February 19, 2015

Introducing DataFrames in Spark for Large Scale Data Science

Filed under: Data Frames,Spark — Patrick Durusau @ 10:12 am

Introducing DataFrames in Spark for Large Scale Data Science by Reynold Xin, Michael Armbrust and Davies Liu.

From the post:

Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience.

When we first open sourced Spark, we aimed to provide a simple API for distributed data processing in general-purpose programming languages (Java, Python, Scala). Spark enabled distributed data processing through functional transformations on distributed collections of data (RDDs). This was an incredibly powerful API: tasks that used to take thousands of lines of code to express could be reduced to dozens.

As Spark continues to grow, we want to enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing. The new DataFrames API was created with this goal in mind. This API is inspired by data frames in R and Python (Pandas), but designed from the ground-up to support modern big data and data science applications. As an extension to the existing RDD API, DataFrames feature:

Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster

Support for a wide array of data formats and storage systems

State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer

Seamless integration with all big data tooling and infrastructure via Spark

APIs for Python, Java, Scala, and R (in development via SparkR)

For new users familiar with data frames in other programming languages, this API should make them feel at home. For existing Spark users, this extended API will make Spark easier to program, and at the same time improve performance through intelligent optimizations and code-generation.

…

The dataframe API will be released for Spark 1.3 in early March.

BTW, the act of using a dataframe creates a new subject, yes? How are you going to document the semantics of such subjects? I didn’t notice a place to write down that information.

That’s a good question to ask of many of the emerging big/large/ginormous data tools. I have trouble remembering what I meant from notes yesterday and that’s not an uncommon experience. Imagine six months from now. Or when you are at your third client this month and the first one calls for help.

Remember: To many eyes undocumented subjects are opaque.

I first saw this in a tweet by Sebastian Raschka.
