Archive for the ‘Data Analysis’ Category

The Ethics of Data Analytics

Sunday, August 21st, 2016

The Ethics of Data Analytics by Kaiser Fung.

Twenty-one slides on ethics by Kaiser Fung, author of: Junk Charts (data visualization blog), and Big Data, Plainly Spoken (comments on media use of statistics).

Fung challenges you to reach your own ethical decisions and acknowledges there are a number of guides to such decision making.

Unfortunately, Fung does not include professional responsibility requirements, such as the now out-dated Canon 7 of the ABA Model Code Of Professional Responsibility:

A Lawyer Should Represent a Client Zealously Within the Bounds of the Law

That canon has a much storied history, which is capably summarized in Whatever Happened To ‘Zealous Advocacy’? by Paul C. Sanders.

In what became known as Queen Caroline’s Case, the House of Lords sought to dissolve the marriage of King George IV and Queen Caroline on the grounds of her adultery, effectively removing her as queen of England.

Queen Caroline was represented by Lord Brougham, who had evidence of a secret prior marriage by King George IV to a Catholic, Mrs. Fitzherbert (such a marriage was illegal).

Brougham’s speech is worth your reading in full but the portion most often cited for zealous defense reads as follows:

I once before took leave to remind your lordships — which was unnecessary, but there are many whom it may be needful to remind — that an advocate, by the sacred duty of his connection with his client, knows, in the discharge of that office, but one person in the world, that client and none other. To save that client by all expedient means — to protect that client at all hazards and costs to all others, and among others to himself — is the highest and most unquestioned of his duties; and he must not regard the alarm, the suffering, the torment, the destruction, which he may bring upon any other; nay, separating even the duties of a patriot from those of an advocate, he must go on reckless of the consequences, if his fate it should unhappily be, to involve his country in confusion for his client.

The name Mrs. Fitzherbert never slips Lord Brougham’s lips but the House of Lords has been warned that may not remain the case, should it choose to proceed. The House of Lords did grant the divorce but didn’t enforce it. Saving face, one supposes. Queen Caroline died less than a month after the coronation of George IV.

For data analysis, cybersecurity, or any of the other topics I touch on in this blog, I take the last line of Lord Brougham’s speech:

To save that client by all expedient means — to protect that client at all hazards and costs to all others, and among others to himself — is the highest and most unquestioned of his duties; and he must not regard the alarm, the suffering, the torment, the destruction, which he may bring upon any other; nay, separating even the duties of a patriot from those of an advocate, he must go on reckless of the consequences, if his fate it should unhappily be, to involve his country in confusion for his client.

as the height of professionalism.

Post-engagement of course.

If ethics are your concern, have that discussion with your prospective client before you are hired.

Otherwise, clients have goals and the task of a professional is how to achieve them. Nothing more.

Volumetric Data Analysis – yt

Friday, June 17th, 2016

One of those rotating homepages:

Volumetric Data Analysis – yt

yt is a python package for analyzing and visualizing volumetric, multi-resolution data from astrophysical simulations, radio telescopes, and a burgeoning interdisciplinary community.

Quantitative Analysis and Visualization

yt is more than a visualization package: it is a tool to seamlessly handle simulation output files to make analysis simple. yt can easily knit together volumetric data to investigate phase-space distributions, averages, line integrals, streamline queries, region selection, halo finding, contour identification, surface extraction and more.

Many formats, one language

yt aims to provide a simple uniform way of handling volumetric data, regardless of where it is generated. yt currently supports FLASH, Enzo, Boxlib, Athena, arbitrary volumes, Gadget, Tipsy, ART, RAMSES and MOAB. If your data isn’t already supported, why not add it?

From the non-rotating part of the homepage:

To get started using yt to explore data, we provide resources including documentation, workshop material, and even a fully-executable quick start guide demonstrating many of yt’s capabilities.

But if you just want to dive in and start using yt, we have a long list of recipes demonstrating how to do various tasks in yt. We even have sample datasets from all of our supported codes on which you can test these recipes. While yt should just work with your data, here are some instructions on loading in datasets from our supported codes and formats.

Professional astronomical data and tools like yt put exploration of the universe at your fingertips!

Enjoy!

Where You Look – Determines What You See

Friday, April 22nd, 2016

Abstract:

This article argues that maps of the Web’s structure based solely on technical infrastructure such as hyperlinks may bear little resemblance to maps based on Web usage, as cultural factors drive the latter to a larger extent. To test this thesis, the study constructs two network maps of 1000 globally most popular Web domains, one based on hyperlinks and the other using an “audience-centric” approach with ties based on shared audience traffic between these domains. Analyses of the two networks reveal that unlike the centralized structure of the hyperlink network with few dominant “core” Websites, the audience network is more decentralized and clustered to a larger extent along geo-linguistic lines.

Apologies but the article is behind a paywall.

A good example of how what you look for determines your results. And an example of how paywalls prevent meaningful discussion of such research.

Unless you know of a site like sci-hub.io of course.

Enjoy!

PS: This is what an audience-centric web mapping looks like:

Impressive work!
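The "audience-centric" idea from the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the domain names and visitor sets are made up, and the tie weight used here (Jaccard overlap of visitor sets) is my stand-in, not necessarily the measure the authors used.

```python
from itertools import combinations

# Hypothetical audiences: domain -> set of visitor ids (made-up data).
audiences = {
    "news.example": {1, 2, 3, 4},
    "video.example": {3, 4, 5},
    "shop.example": {6, 7},
}


def audience_ties(audiences, threshold=0.0):
    """Return {(a, b): overlap} for domain pairs that share visitors."""
    ties = {}
    for a, b in combinations(sorted(audiences), 2):
        shared = audiences[a] & audiences[b]
        union = audiences[a] | audiences[b]
        # Jaccard overlap: shared visitors divided by all visitors of either site.
        overlap = len(shared) / len(union) if union else 0.0
        if overlap > threshold:
            ties[(a, b)] = overlap
    return ties


ties = audience_ties(audiences)
```

Two sites end up connected only when people actually visit both, whether or not either links to the other. That is the sense in which where you look determines what you see.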

Using ‘R’ for betting analysis [Data Science For The Rest Of Us]

Wednesday, January 13th, 2016

From the post:

Gaining an edge in betting often boils down to intelligent data analysis, but faced with daunting amounts of data it can be hard to know where to start. If this sounds familiar, R – an increasingly popular statistical programming language widely used for data analysis – could be just what you’re looking for.

What is R?

R is a statistical programming language that is used to visualize and analyse data. Okay, this sounds a little intimidating but actually it isn’t as scary as it may appear. Its creators – two professors from New Zealand – wanted an intuitive statistical platform that their students could use to slice and dice data and create interesting visual representation like 3D graphs.

Given its relative simplicity but endless scope for applications (packages) R has steadily gained momentum amongst the world’s brightest statisticians and data scientists. Facebook use R for statistical analysis of status updates and many of the complex word clouds you might see online are powered by R.

There are now thousands of user created libraries to enhance R functionality and given how much successful betting boils down to effective data analysis, packages are being created to perform betting related analysis and strategies.

Extracting insights from the shape of complex data using topology

Thursday, November 6th, 2014

Extracting insights from the shape of complex data using topology by P. Y. Lum, et al. (Scientific Reports 3, Article number: 1236 doi:10.1038/srep01236)

Abstract:

This paper applies topological methods to study complex high dimensional data sets by extracting shapes (patterns) and obtaining insights about them. Our method combines the best features of existing standard methodologies such as principal component and cluster analyses to provide a geometric representation of complex data sets. Through this hybrid method, we often find subgroups in data sets that traditional methodologies fail to find. Our method also permits the analysis of individual data sets as well as the analysis of relationships between related data sets. We illustrate the use of our method by applying it to three very different kinds of data, namely gene expression from breast tumors, voting data from the United States House of Representatives and player performance data from the NBA, in each case finding stratifications of the data which are more refined than those produced by standard methods.

In order to identify subjects you must first discover them.

Does the available financial contribution data on members of the United States House of Representatives correspond with the clustering analysis here? (Asking because I don’t know but would be interested in finding out.)

I first saw this in a tweet by Stian Danenbarger.

Intriguing properties of neural networks [Gaming Neural Networks]

Thursday, October 9th, 2014

Intriguing properties of neural networks by Christian Szegedy, et al.

Abstract:

Deep neural networks are highly expressive models that have recently achieved state of the art performance on speech and visual recognition tasks. While their expressiveness is the reason they succeed, it also causes them to learn uninterpretable solutions that could have counter-intuitive properties. In this paper we report two such properties.

First, we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. It suggests that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks.

Second, we find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent. Specifically, we find that we can cause the network to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network’s prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input.

Both findings are of interest but the discovery of “adversarial examples” that can cause a trained network to misclassify images, is the more intriguing of the two.

How do you validate a result from a neural network? Possessing the same network and data isn’t going to help if the inputs include “adversarial examples.” I suppose you could “spot” a misclassification but one assumes a neural network is being used because physical inspection by a person isn’t feasible.

What “adversarial examples” work best against particular neural networks? How to best generate such examples?

How do users of off-the-shelf neural networks guard against “adversarial examples?” (One of those cases where “shrink-wrap” data services may not be a good choice.)
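To make the idea of a small perturbation flipping a classification concrete, here is a toy sketch. It is not the method from the paper: the model is a hypothetical four-weight linear classifier, and the perturbation is a simple step against the sign of each weight rather than the optimization the authors describe. The weights, input, and epsilon are all made up for illustration.

```python
def score(weights, bias, x):
    """Linear score; positive means class 1, otherwise class 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    return 1 if score(weights, bias, x) > 0 else 0


def sign(value):
    return (value > 0) - (value < 0)


def perturb(weights, x, label, epsilon):
    """Shift every coordinate by epsilon in the direction that hurts `label`."""
    # For a linear score the gradient with respect to x is just the weights.
    direction = -1 if label == 1 else 1
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]


# Hypothetical toy model and input (not taken from the paper).
weights = [0.5, -0.25, 0.75, -0.5]
bias = 0.0
x = [0.25, 0.25, 0.25, 0.25]

original = predict(weights, bias, x)
x_adv = perturb(weights, x, original, epsilon=0.125)
adversarial = predict(weights, bias, x_adv)
```

No coordinate moves by more than epsilon, yet the predicted class changes. With thousands of input dimensions, the per-coordinate change needed to flip the score can be far smaller, which is one intuition for why such perturbations can be imperceptible.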

I first saw this in a tweet by Xavier Amatriain.

MRAPs And Bayonets: What We Know About The Pentagon’s 1033 Program

Thursday, September 11th, 2014

MRAPs And Bayonets: What We Know About The Pentagon’s 1033 Program by Arezou Rezvani, Jessica Pupovac, David Eads, and Tyler Fisher. (NPR)

From the post:

Amid widespread criticism of the deployment of military-grade weapons and vehicles by police officers in Ferguson, Mo., President Obama recently ordered a review of federal efforts supplying equipment to local law enforcement agencies across the country.

So, we decided to take a look at what the president might find.

NPR obtained data from the Pentagon on every military item sent to local, state and federal agencies through the Pentagon’s Law Enforcement Support Office — known as the 1033 program — from 2006 through April 23, 2014. The Department of Defense does not publicly report which agencies receive each piece of equipment, but they have identified the counties that the items were shipped to, a description of each, and the amount the Pentagon initially paid for them.

We took the raw data, analyzed it and have organized it to make it more accessible. We are making that data set available to the public today.

This is a data set that raises more questions than it answers, as the post points out.

The top ten categories of items distributed (valued in millions of dollars): vehicles, aircraft, comm. & detection, clothing, construction, fire control, weapons, electric wire, medical equipment, and tractors.

Tractors? I can understand the military having tractors since it is entirely self-reliant during military operations. Why any local law enforcement office needs a tractor is less clear. Or bayonets (11,959 of them).

The NPR post does a good job of raising questions but since there are 3,143 counties or their equivalents in the United States, connecting the dots with particular local agencies, uses, etc. falls on your shoulders.

Could be quite interesting. Is your local sheriff “training” on an amphibious vehicle to reach his deer blind during hunting season? (Utter speculation on my part. I don’t know if your local sheriff likes to hunt deer.)

FCC Net Neutrality Plan – 800,000 Comments

Wednesday, September 3rd, 2014

What can we learn from 800,000 public comments on the FCC’s net neutrality plan? by Bob Lannon and Andrew Pendleton.

From the post:

On Aug. 5, the Federal Communications Commission announced the bulk release of the comments from its largest-ever public comment collection. We’ve spent the last three weeks cleaning and preparing the data and leveraging our experience in machine learning and natural language processing to try and make sense of the hundreds-of-thousands of comments in the docket. Here is a high-level overview, as well as our cleaned version of the full corpus which is available for download in the hopes of making further research easier.

A great story of cleaning dirty data. Beyond eliminating both Les Misérables and War and Peace as comments, the authors detected statements by experts, form letters, etc.

If you’re interested in doing your own analysis with this data, you can download our cleaned-up versions below. We’ve taken the six XML files released by the FCC and split them out into individual files in JSON format, one per comment, then compressed them into archives, one for each XML file. Additionally, we’ve taken several individual records from the FCC data that represented multiple submissions grouped together, and split them out into individual files (these JSON files will have hyphens in their filenames, where the value before the hyphen represents the original record ID). This includes email messages to openinternet@fcc.gov, which had been aggregated into bulk submissions, as well as mass submissions from CREDO Mobile, Sen. Bernie Sanders’ office and others. We would be happy to answer any questions you may have about how these files were generated, or how to use them.
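If you want to reassemble the split-out submissions, the filename convention described above is enough to group them. A minimal sketch, assuming the files end in `.json` and that whatever follows the hyphen is just a sequence marker; the filenames below are hypothetical, not real record IDs from the archives.

```python
from collections import defaultdict
from pathlib import Path


def group_by_record(filenames):
    """Map original record ID -> filenames that were split out of that record."""
    groups = defaultdict(list)
    for name in filenames:
        stem = Path(name).stem
        if "-" in stem:
            # The value before the hyphen is the original FCC record ID.
            record_id = stem.split("-", 1)[0]
            groups[record_id].append(name)
    return dict(groups)


# Hypothetical filenames; real IDs in the archives will differ.
names = ["1001.json", "2002-1.json", "2002-2.json", "3003-1.json"]
groups = group_by_record(names)
```

Files without a hyphen are left out of the result, since they were single comments to begin with.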

All the code used in the project is available at: https://github.com/sunlightlabs/fcc-net-neutrality-comments

I first saw this in a tweet by Scott Chamberlain.

Test Your Analysis With Random Numbers

Tuesday, August 26th, 2014

A critical reanalysis of the relationship between genomics and well-being by Nicholas J. L. Brown, et al. (doi: 10.1073/pnas.1407057111)

Abstract:

Fredrickson et al. [Fredrickson BL, et al. (2013) Proc Natl Acad Sci USA 110(33):13684–13689] claimed to have observed significant differences in gene expression related to hedonic and eudaimonic dimensions of well-being. Having closely examined both their claims and their data, we draw substantially different conclusions. After identifying some important conceptual and methodological flaws in their argument, we report the results of a series of reanalyses of their dataset. We first applied a variety of exploratory and confirmatory factor analysis techniques to their self-reported well-being data. A number of plausible factor solutions emerged, but none of these corresponded to Fredrickson et al.’s claimed hedonic and eudaimonic dimensions. We next examined the regression analyses that purportedly yielded distinct differential profiles of gene expression associated with the two well-being dimensions. Using the best-fitting two-factor solution that we identified, we obtained effects almost twice as large as those found by Fredrickson et al. using their questionable hedonic and eudaimonic factors. Next, we conducted regression analyses for all possible two-factor solutions of the psychometric data; we found that 69.2% of these gave statistically significant results for both factors, whereas only 0.25% would be expected to do so if the regression process was really able to identify independent differential gene expression effects. Finally, we replaced Fredrickson et al.’s psychometric data with random numbers and continued to find very large numbers of apparently statistically significant effects. We conclude that Fredrickson et al.’s widely publicized claims about the effects of different dimensions of well-being on health-related gene expression are merely artifacts of dubious analyses and erroneous methodology. (emphasis added)

To see the details you will need a subscription to the Proceedings of the National Academy of Sciences.

However, you can take this data analysis lesson from the abstract:

If your data can be replaced with random numbers and still yield statistically significant results, stop the publication process. Something is seriously wrong with your methodology.
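A rough sketch of that check in Python follows. It is not the reanalysis from the paper. The "analysis" under test is a deliberately simple one of my own choosing: correlate some number of predictors with an outcome and report the strongest correlation. The sample size of 20, the trial counts, and the cutoff of 0.444 (roughly the 5% two-sided critical value of Pearson's r for n = 20) are all assumptions made for illustration.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def placebo_rate(outcome, n_predictors, trials, rng, cutoff=0.444):
    """Share of trials in which pure-noise predictors look 'significant'."""
    hits = 0
    for _ in range(trials):
        # Replace the real predictors with random numbers.
        noise = [[rng.gauss(0, 1) for _ in outcome] for _ in range(n_predictors)]
        # The analysis under test: report the strongest correlation found.
        best = max(abs(pearson(col, outcome)) for col in noise)
        hits += best > cutoff
    return hits / trials


rng = random.Random(0)
outcome = [rng.gauss(0, 1) for _ in range(20)]  # stand-in outcome, n = 20
honest = placebo_rate(outcome, n_predictors=1, trials=200, rng=rng)
fishing = placebo_rate(outcome, n_predictors=50, trials=200, rng=rng)
```

With a single predictor, random numbers should come up "significant" only about one time in twenty. When the analysis picks the best of fifty predictors, random numbers succeed most of the time, and that is the signal to stop and look at the methodology.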

I first saw this in a tweet by WvSchaik.

Awesome Machine Learning

Wednesday, July 30th, 2014

Awesome Machine Learning by Joseph Misiti.

From the webpage:

A curated list of awesome machine learning frameworks, libraries and software (by language). Inspired by awesome-php. Other awesome lists can be found in the awesome-awesomeness list.

If you want to contribute to this list (please do), send me a pull request or contact me @josephmisiti

Not strictly limited to “machine learning” as it offers resources on data analysis, visualization, etc.

With a list of 576 resources, I am sure you will find something new!

Advanced Data Analysis from an Elementary Point of View (update)

Friday, July 25th, 2014

Advanced Data Analysis from an Elementary Point of View by Cosma Rohilla Shalizi. (8 January 2014)

From the introduction:

These are the notes for 36-402, Advanced Data Analysis, at Carnegie Mellon. If you are not enrolled in the class, you should know that it’s the methodological capstone of the core statistics sequence taken by our undergraduate majors (usually in their third year), and by students from a range of other departments. By this point, they have taken classes in introductory statistics and data analysis, probability theory, mathematical statistics, and modern linear regression (“401”). This class does not presume that you have learned but forgotten the material from the pre-requisites; it presumes that you know that material and can go beyond it. The class also presumes a firm grasp on linear algebra and multivariable calculus, and that you can read and write simple functions in R. If you are lacking in any of these areas, now would be an excellent time to leave.

I last reported on this draft in 2012 at: Advanced Data Analysis from an Elementary Point of View

Looking forward to this work’s publication by Cambridge University Press.

I first saw this in a tweet by Mark Patterson.

First complex, then simple

Saturday, July 19th, 2014

First complex, then simple by James D Malley and Jason H Moore. (BioData Mining 2014, 7:13)

Abstract:

At the start of a data analysis project it is often suggested that the researcher look first at multiple simple models. That is, always begin with simple, one variable at a time analyses, such as multiple single-variable tests for association or significance. Then, later, somehow (how?) pull all the separate pieces together into a single comprehensive framework, an inclusive data narrative. For detecting true compound effects with more than just marginal associations, this is easily defeated with simple examples. But more critically, it is looking through the data telescope from the wrong end.

That is, the authors start with no formal theories about what the data will prove and, upon finding signals in the data, then generate simple models to explain those signals.

I am sure their questions of the data are driven by a suspicion of what the data may prove, but that isn’t the same thing as asking questions designed to prove a model generated before the data is queried.
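One simple example of the kind the abstract alludes to (my choice, not necessarily the authors') is an outcome that is the exclusive-or of two binary variables. Looked at one variable at a time there is nothing to see; looked at together, the outcome is fully determined. A small sketch:

```python
from itertools import product

# Every combination of two binary variables; the outcome is their XOR.
rows = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]


def mean_outcome(rows, **fixed):
    """Average outcome over the rows matching the fixed variable values."""
    idx = {"a": 0, "b": 1}
    picked = [r[2] for r in rows if all(r[idx[k]] == v for k, v in fixed.items())]
    return sum(picked) / len(picked)


# One variable at a time: neither variable shows any marginal effect.
marginal_a = mean_outcome(rows, a=1) - mean_outcome(rows, a=0)
marginal_b = mean_outcome(rows, b=1) - mean_outcome(rows, b=0)

# Both variables together: the outcome is completely determined.
joint = {(a, b): mean_outcome(rows, a=a, b=b) for a, b, _ in rows}
```

Any procedure that screens variables singly would discard both `a` and `b` here, and no later "pulling the pieces together" could recover the effect.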

Introduction to Python for Econometrics, Statistics and Data Analysis

Tuesday, July 1st, 2014

Introduction to Python for Econometrics, Statistics and Data Analysis by Kevin Sheppard.

From the introduction:

These notes are designed for someone new to statistical computing wishing to develop a set of skills necessary to perform original research using Python. They should also be useful for students, researchers or practitioners who require a versatile platform for econometrics, statistics or general numerical analysis (e.g. numeric solutions to economic models or model simulation).

Python is a popular general purpose programming language which is well suited to a wide range of problems. Recent developments have extended Python’s range of applicability to econometrics, statistics and general numerical analysis. Python – with the right set of add-ons – is comparable to domain-specific languages such as MATLAB and R. If you are wondering whether you should bother with Python (or another language), a very incomplete list of considerations includes:

One of the more even-handed introductions I have read in a long time.

Enough examples and exercises to build some keyboard memory into your fingers! 😉

Bookmark this text so you can forward the link to others.

I first saw this in a tweet by yhat.

Saturday, June 21st, 2014

The Analyst’s Toolbox by Simon Raper.

From the post:

There are hundreds, maybe thousands, of open source/free/online tools out there that form part of the analyst’s toolbox. Here’s what I have on my mac for day to day work. Click on the leaf node labels to be redirected to the relevant sites. Visualisation in D3.

Tools in day to day use by a live data analyst. Nice presentation as well.

…Data Analytics Hackathon

Saturday, May 24th, 2014

Elasticsearch Teams up with MIT Sloan for Data Analytics Hackathon by Sejal Korenromp.

From the post:

Following from the success and popularity of the Hopper Hackathon we participated in late last year, last week we sponsored the MIT Sloan Data Analytics Club Hackathon for our latest offering to Elasticsearch aficionados. More than 50 software engineers, business students and other open source software enthusiasts signed up to participate, and on a Saturday to boot! The full day’s festivities included access to a huge storage and computing cluster, and everyone was set free to create something awesome using Elasticsearch.

Hacks from the finalists:

• Quimbly – A Digital Library
• Brand Sentiment Analysis
• Conference Data
• Statistics on Movies and Wikipedia

See Sejal’s post for the details of each hack and the winner.

I noticed several very good ideas in these hacks; no doubt you will notice even more.

Enjoy!

Data Analytics Handbook

Friday, May 23rd, 2014

Data Analytics Handbook

The “handbook” appears in three parts, the first of which you download, while links to parts 2 and 3 are emailed to you for participating in a short survey. The survey collects your name, email address, educational background (STEM or not), and whether you are interested in a new resource that is being created to teach data analysis.

Let’s be clear up front that this is NOT a technical handbook.

Rather, all three parts are interviews with:

Part 1: Data Analysts + Data Scientists

Part 2: CEO’s + Managers

Technical handbooks abound but this is one of the few (only?) books that covers the “soft” side of data analytics. By the “soft” side I mean the people and personal relationships that make up the data analytics industry. Technical knowledge is a must but being able to work well with others is as important, if not more so.

The interviews are wide ranging and don’t attempt to provide cut-n-dried answers. Readers will need to be inspired by and adapt the reported experiences to their own circumstances.

Of all the features of the books, I suspect I liked the “Top 5 Take Aways” the best.

In the interest of full disclosure, that may be because part 1 reported:

2. The biggest challenge for a data analyst isn’t modeling, it’s cleaning and collecting

Data analysts spend most of their time collecting and cleaning the data required for analysis. Answering questions like “where do you collect the data?”, “how do you collect the data?”, and “how should you clean the data?”, require much more time than the actual analysis itself.

Well, when someone puts your favorite hobby horse at #2, see how you react. 😉

I first saw this in a tweet by Marin Dimitrov.