Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 26, 2018

pandas: powerful Python data analysis toolkit & Data Skepticism

Filed under: Pandas,Python,Skepticism — Patrick Durusau @ 12:52 pm

pandas: powerful Python data analysis toolkit

From the webpage:

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

[if you need more enticement]

Here are just a few of the things that pandas does well:

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
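
To make a few of those bullets concrete, here is a minimal sketch of automatic label alignment, NaN handling and a group by. The series, station names and readings are all made up for illustration; none of them come from the pandas webpage.

```python
import numpy as np
import pandas as pd

# Hypothetical data: labels and values are invented for this example.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Automatic alignment: values are matched by label, not by position.
# "x" has no partner in b, so the sum at "x" is NaN.
total = a + b

df = pd.DataFrame({
    "station": ["north", "north", "south", "south"],
    "reading": [1.0, np.nan, 3.0, 5.0],
})

# Split-apply-combine: mean reading per station.
# The NaN in "north" is skipped by default, not treated as zero.
means = df.groupby("station")["reading"].mean()

print(total)
print(means)
```

Note that the missing reading is dropped from the "north" mean without any warning, which is convenient but also a choice the library makes on your behalf.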

I need to spend more time with pandas but have to confess that meta-issues with data interest me more than “alleged” data distributed by governments, corporations and others.

I say “alleged” data because unless you know the means by which it was collected, the criteria for that collection, what was available but excluded from collection, plus a host of other questions about any data set, about all you know is that X claims the “alleged” data means “something.”

The “something” claimed for data varies depending on who is reporting it and what purpose they have in telling you. I immediately discount explanations that involve my or the public’s benefit. No, rather say the data was released in hopes that I or the public would see it as a benefit. A bit closer to the truth.

All that said, there are any number of interesting ways that processing data shades it too, so a deep appreciation for pandas will help you spot those tricks as well.
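
As one small illustration of processing shading data, the sketch below reports the “average” of the same column three ways, differing only in how the missing values are treated. The scores are invented for the example, and the three treatments are just common choices, not an exhaustive list.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with two missing entries (made-up values).
scores = pd.Series([10.0, np.nan, np.nan, 40.0])

# 1. Default behaviour: missing values are skipped.
skipped = scores.mean()

# 2. Missing values treated as zero before averaging.
as_zero = scores.fillna(0).mean()

# 3. Missing values filled with the last observed value.
carried = scores.ffill().mean()

print(skipped, as_zero, carried)  # 25.0 12.5 17.5
```

Each of the three numbers can honestly be reported as “the mean” of the column, which is why the processing steps deserve as much skepticism as the collection.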

PS: I don’t mean to contend we can ever be bias free, but I do think we can aspire to expose the biases of others.

I first saw this in a tweet by Kirk Borne
