Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 10, 2012

Factor Analysis at 100:… [Two Subjects – One Name]

Filed under: Factor Analysis,Statistics — Patrick Durusau @ 3:53 am

Factor Analysis at 100: Historical Developments And Future Directions (Cudeck and MacCallum, Lawrence Erlbaum Associates, 2007, 384 pp.) was mentioned by Christophe Lalanne in Some Random Notes as one of his recent book acquisitions.

While searching for that volume, I encountered a conference with the same name: Factor Analysis at 100: Historical Developments And Future Directions [Conference, 2004].

At the conference site you will find links to materials from thirteen speakers, plus a “Factor Analysis Genealogy” and “Factor Analysis Timeline.”

The presentations from the conference became papers that appear in the volume Christophe recently purchased.

Charles Spearman’s paper, “General Intelligence, Objectively Determined and Measured,” in the American Journal of Psychology [PDF version] [HTML version] (1904) was posted to the conference homepage.


The relevant subject identifiers are obvious. What else would you add to topics representing these subjects? Why?

August 3, 2012

Column Statistics in Hive

Filed under: Cloudera,Hive,Merging,Statistics — Patrick Durusau @ 2:48 pm

Column Statistics in Hive by Shreepadma Venugopalan.

From the post:

Over the last couple of months the Hive team at Cloudera has been working hard to bring a bunch of exciting new features to Hive. In this blog post, I’m going to talk about one such feature – Column Statistics in Hive – and how Hive’s query processing engine can benefit from it. The feature is currently a work in progress but we expect it to be available for review imminently.

Motivation

While there are many possible execution plans for a query, some plans are more optimal than others. The query optimizer is responsible for generating an efficient execution plan for a given SQL query from the space of all possible plans. Currently, Hive’s query optimizer uses rules of thumb to generate an efficient execution plan for a query. While such rules of thumb optimizations transform the query plan into a more efficient one, the resulting plan is not always the most efficient execution plan.

In contrast, the query optimizer in a traditional RDBMS is cost based; it uses the statistical properties of the input column values to estimate the cost of alternative query plans and chooses the plan with the lowest cost. The cost model for query plans assigns an estimated execution cost to the plans. The cost model is based on the CPU and I/O costs of query execution for every operator in the query plan.

As an example, consider a query that represents a join among {A, B, C} with the predicate {A.x == B.x == C.x}. Assume table A has a total of 500 records, table B has a total of 6000 records, and table C has a total of 1000 records. In the absence of cost based query optimization, the system picks the join order specified by the user. In our example, let us further assume that the result of joining A and B yields 2000 records and the result of joining A and C yields 50 records. Hence the cost of performing the join between A, B and C, without join reordering, is the cost of joining A and B + the cost of joining the output of A Join B with C. In our example this would result in a cost of (500 * 6000) + (2000 * 1000). On the other hand, a cost based optimizer (CBO) in an RDBMS would pick the more optimal alternate order [(A Join C) Join B], thus resulting in a cost of (500 * 1000) + (50 * 6000). However, in order to pick the more optimal join order the CBO needs cardinality estimates on the join column.

Today, Hive supports statistics at the table and partition level – count of files, raw data size, count of rows, etc. – but doesn’t support statistics on column values. These table and partition level statistics are insufficient for the purpose of building a CBO because they don’t provide any information about the individual column values. Hence obtaining the statistical summary of the column values is the first step towards building a CBO for Hive.

In addition to join reordering, Hive’s query optimizer will be able to take advantage of column statistics to decide whether to perform a map side aggregation as well as estimate the cardinality of operators in the execution plan better.
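
To make the join reordering arithmetic concrete, here is a toy sketch in plain Python (not Hive code) of the cost comparison from the quoted example; the function name and the simple cost model are mine, chosen only to mirror the quoted numbers.

```python
# Toy sketch of the two-step join cost model from the quoted example:
# cost = |left| * |right| for the first join, plus
#        |first join result| * |third table| for the second join.

def two_step_join_cost(left_size, right_size, first_result_size, third_size):
    """Cost of (X Join Y) Join Z under the simple cost model in the post."""
    return left_size * right_size + first_result_size * third_size

# Cardinalities from the example: |A| = 500, |B| = 6000, |C| = 1000,
# |A Join B| = 2000, |A Join C| = 50.
cost_ab_then_c = two_step_join_cost(500, 6000, 2000, 1000)  # (A Join B) Join C
cost_ac_then_b = two_step_join_cost(500, 1000, 50, 6000)    # (A Join C) Join B

print(cost_ab_then_c)  # 5000000
print(cost_ac_then_b)  # 800000
# Without column statistics the optimizer cannot estimate |A Join B| or
# |A Join C|, so it cannot tell these two orders apart.
```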

Some days I wonder where improvements to algorithms and data structures are going to lead.

Other days, I just enjoy the news.

Today is one of the latter.

PS: What would a cost based optimizer (CBO) look like for merging operations? Or perhaps better, a merge cost estimator (MCE)? Metered merging, anyone?

July 27, 2012

Anaconda: Scalable Python Computing

Filed under: Anaconda,Data Analysis,Machine Learning,Python,Statistics — Patrick Durusau @ 10:19 am

Anaconda: Scalable Python Computing

Easy, Scalable Distributed Data Analysis

Anaconda is a distribution that combines the most popular Python packages for data analysis, statistics, and machine learning. It has several tools for a variety of types of cluster computations, including MapReduce batch jobs, interactive parallelism, and MPI.

All of the packages in Anaconda are built, tested, and supported by Continuum. Having a unified runtime for distributed data analysis makes it easier for the broader community to share code, examples, and best practices — without getting tangled in a mess of versions and dependencies.

Good way to avoid dependency issues!

On scaling, I am reminded of a developer who designed a Python application to require upgrading for “heavy” use. Much to their disappointment, Python scaled under “heavy” use with no need for an upgrade. 😉

I saw this in Christophe Lalanne’s Bag of Tweets for July 2012.

July 15, 2012

Categorization of interestingness measures for knowledge extraction

Filed under: Knowledge Capture,Statistics — Patrick Durusau @ 7:57 pm

Categorization of interestingness measures for knowledge extraction by Sylvie Guillaume, Dhouha Grissa, and Engelbert Mephu Nguifo.

Abstract:

Finding interesting association rules is an important and active research field in data mining. The algorithms of the Apriori family are based on two rule extraction measures, support and confidence. Although these two measures have the virtue of being algorithmically fast, they generate a prohibitive number of rules most of which are redundant and irrelevant. It is therefore necessary to use further measures which filter uninteresting rules. Many synthesis studies were then realized on the interestingness measures according to several points of view. Different reported studies have been carried out to identify “good” properties of rule extraction measures and these properties have been assessed on 61 measures. The purpose of this paper is twofold. First to extend the number of the measures and properties to be studied, in addition to the formalization of the properties proposed in the literature. Second, in the light of this formal study, to categorize the studied measures. This paper leads then to identify categories of measures in order to help the users to efficiently select an appropriate measure by choosing one or more measure(s) during the knowledge extraction process. The properties evaluation on the 61 measures has enabled us to identify 7 classes of measures, classes that we obtained using two different clustering techniques.
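
Since the abstract turns on the support and confidence measures of the Apriori family, a minimal sketch of how those two measures are computed may help; the toy transactions and function names below are mine, for illustration only.

```python
# Minimal sketch of support and confidence over a toy set of transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent with consequent) divided by support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"bread", "milk"}))       # 0.5
print(confidence({"bread"}, {"milk"}))  # 0.666...
# Rules can score well on both measures and still be redundant or
# uninteresting, which is why the paper studies further measures.
```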

It will take some time to run down the original papers, but in the meantime I am curious if:

  1. Anyone agrees or disagrees with the reduction of measures as having different names (page 10)?
  2. Anyone agrees or disagrees with the classification of measures into seven groups (pages 10-11)?

July 14, 2012

Finding Structure in Text, Genome and Other Symbolic Sequences

Filed under: Genome,Statistics,Symbol,Text Analytics,Text Corpus,Text Mining — Patrick Durusau @ 8:58 am

Finding Structure in Text, Genome and Other Symbolic Sequences by Ted Dunning. (thesis, 1998)

Abstract:

The statistical methods derived and described in this thesis provide new ways to elucidate the structural properties of text and other symbolic sequences. Generically, these methods allow detection of a difference in the frequency of a single feature, the detection of a difference between the frequencies of an ensemble of features and the attribution of the source of a text. These three abstract tasks suffice to solve problems in a wide variety of settings. Furthermore, the techniques described in this thesis can be extended to provide a wide range of additional tests beyond the ones described here.

A variety of applications for these methods are examined in detail. These applications are drawn from the area of text analysis and genetic sequence analysis. The textually oriented tasks include finding interesting collocations and cooccurent phrases, language identification, and information retrieval. The biologically oriented tasks include species identification and the discovery of previously unreported long range structure in genes. In the applications reported here where direct comparison is possible, the performance of these new methods substantially exceeds the state of the art.

Overall, the methods described here provide new and effective ways to analyse text and other symbolic sequences. Their particular strength is that they deal well with situations where relatively little data are available. Since these methods are abstract in nature, they can be applied in novel situations with relative ease.

Recently posted but dating from 1998.

Older materials are interesting because the careers of their authors can be tracked, say at DBLP: Ted Dunning.

Or it can lead you to check an author in CiteSeer:

Accurate Methods for the Statistics of Surprise and Coincidence (1993)

Abstract:

Much work has been done on the statistical analysis of text. In some cases reported in the literature, inappropriate statistical methods have been used, and statistical significance of results have not been addressed. In particular, asymptotic normality assumptions have often been used unjustifiably, leading to flawed results. This assumption of normal distribution limits the ability to analyze rare events. Unfortunately rare events do make up a large fraction of real text. However, more applicable methods based on likelihood ratio tests are available that yield good results with relatively small samples. These tests can be implemented efficiently, and have been used for the detection of composite terms and for the determination of domain-specific terms. In some cases, these measures perform much better than the methods previously used. In cases where traditional contingency table methods work well, the likelihood ratio tests described here are nearly identical. This paper describes the basis of a measure based on likelihood ratios that can be applied to the analysis of text.
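
As a rough illustration of the likelihood ratio tests the abstract refers to, here is a minimal sketch of the 2x2 log-likelihood ratio (G²) statistic commonly used to score candidate collocations; the counts and function name are invented for the example, not taken from the paper.

```python
import math

def g2(k11, k12, k21, k22):
    """2x2 log-likelihood ratio statistic: 2 * sum(O * ln(O / E)).

    k11: bigram (word1, word2) count
    k12: word1 without word2
    k21: word2 without word1
    k22: neither word
    """
    total = k11 + k12 + k21 + k22
    observed = [k11, k12, k21, k22]
    expected = [
        (k11 + k12) * (k11 + k21) / total,
        (k11 + k12) * (k12 + k22) / total,
        (k21 + k22) * (k11 + k21) / total,
        (k21 + k22) * (k12 + k22) / total,
    ]
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Hypothetical counts for a candidate bigram in a small corpus; a large
# value marks a surprisingly frequent pairing even with sparse data.
print(round(g2(k11=30, k12=970, k21=70, k22=98930), 2))
```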

Which has over 600 citations, only one of which is from the author. (I could comment about a well known self-citing ontologist but I won’t.)

The observations in the thesis about “large” data sets are dated but it merits your attention as fundamental work in the field of textual analysis.

As a bonus, it is quite well written and makes an enjoyable read.

July 8, 2012

statistics.com The Institute for Statistics Education

Filed under: Education,R,Statistics — Patrick Durusau @ 10:22 am

statistics.com The Institute for Statistics Education

The spread of R made me curious about certification in R.

The first “hit” on the subject was statistics.com The Institute for Statistics Education.

From their homepage:

Certificate Programs

Programs in Analytics and Statistical Studies (PASS)

From in-depth clinical trial design and analysis to data mining skills that help you make smarter business decisions, our unique programs focus on practical applications and help you master the software skills you need to stay a step ahead in your field.

http://www.statistics.com/

Biostatistics – Epidemiology

Biostatistics – Controlled Trials

Business Analytics

Data Mining

Social Science

Environmental Science

Engineering Statistics

Using R

Not with the same group or even the same subject (NetWare several versions ago), but I have had good experiences with this type of program.

Self study is always possible and sometimes the only option.

But, a good instructor can keep your interest in a specific body of material long enough to earn a certification.

Any suggestions of other certification programs that would be of interest to data mining, machine learning, and big data worker bees?

PS: If the courses sound pricey, slide on over to the University of Washington's 3-course certificate in computational finance, at a little over $10K for 9 months.

July 6, 2012

The R Journal 4/1 June, 2012

Filed under: R,Statistics — Patrick Durusau @ 12:47 pm

The R Journal 4/1 June, 2012

I am sure you will find something interesting to read:

July 5, 2012

Special Volume: Graphical User Interfaces for R (Journal of Statistical Software, Vol. 49)

Filed under: R,Statistics — Patrick Durusau @ 12:38 pm

Special Volume: Graphical User Interfaces for R (Journal of Statistical Software, Vol. 49)

From the table of contents:

June 26, 2012

Journal of Statistical Software

Filed under: Mathematica,Mathematics,R,Statistics — Patrick Durusau @ 12:53 pm

Journal of Statistical Software

From the homepage:

Established in 1996, the Journal of Statistical Software publishes articles, book reviews, code snippets, and software reviews on the subject of statistical software and algorithms. The contents are freely available on-line. For both articles and code snippets the source code is published along with the paper.

Statistical software is the key link between statistical methods and their application in practice. Software that makes this link is the province of the journal, and may be realized as, for instance, tools for large scale computing, database technology, desktop computing, distributed systems, the World Wide Web, reproducible research, archiving and documentation, and embedded systems.

We attempt to present research that demonstrates the joint evolution of computational and statistical methods and techniques. Implementations can use languages such as C, C++, S, Fortran, Java, PHP, Python and Ruby or environments such as Mathematica, MATLAB, R, S-PLUS, SAS, Stata, and XLISP-STAT.

There are currently 518 articles, 34 code snippets, 104 book reviews, 6 software reviews, and 13 special volumes in our archives. These can be browsed or searched. You can also subscribe for notification of new articles.

I was running down resources used in Wordcloud of the Arizona et al. v. United States opinion when I encountered this wonderful site.

I have only skimmed the surface, looking for an article or two in particular, so I can’t begin to describe the breadth of material you will find here.

I am sure I will be returning to this site time and time again. If you are interested in statistical manipulation of data, I suggest you do the same.

June 24, 2012

Predictive Analytics: Evaluate Model Performance

Filed under: Predictive Analytics,Statistics — Patrick Durusau @ 4:19 pm

Predictive Analytics: Evaluate Model Performance by Ricky Ho.

Ricky finishes his multi-part series on models for machine learning with the one question left hanging:

OK, so which model should I use?

In previous posts, we have discussed various machine learning techniques including linear regression with regularization, SVM, Neural network, Nearest neighbor, Naive Bayes, Decision Tree and Ensemble models. How do we pick which model is the best? Or even whether the model we pick is better than a random guess? In this post, we will cover how we evaluate the performance of the model and what we can do next to improve the performance.

Best guess with no model

First of all, we need to understand the goal of our evaluation. Are we trying to pick the best model? Are we trying to quantify the improvement of each model? Regardless of our goal, I found it is always useful to think about what the baseline should be. Usually the baseline is your best guess if you don’t have a model.

For a classification problem, one approach is to make a random guess (with uniform probability), but a better approach is to guess the output class that has the largest proportion in the training samples. For a regression problem, the best guess will be the mean of the output of the training samples.
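
A minimal sketch of the baselines Ricky describes: for classification, always guess the majority class from the training data; for regression, always guess the training mean. The data and variable names below are made up for illustration.

```python
from collections import Counter

# Hypothetical training and test outputs, for illustration only.
train_labels = ["spam", "ham", "ham", "ham", "spam", "ham"]
test_labels = ["ham", "spam", "ham", "ham"]

# Classification baseline: always predict the most common training class.
majority_class = Counter(train_labels).most_common(1)[0][0]
baseline_accuracy = sum(label == majority_class for label in test_labels) / len(test_labels)
print(majority_class, baseline_accuracy)  # ham 0.75

# Regression baseline: always predict the mean of the training outputs.
train_y = [2.0, 3.5, 4.0, 5.5]
test_y = [3.0, 6.0]
mean_prediction = sum(train_y) / len(train_y)
baseline_rmse = (sum((y - mean_prediction) ** 2 for y in test_y) / len(test_y)) ** 0.5
print(round(mean_prediction, 2), round(baseline_rmse, 2))

# Any model worth keeping should beat these numbers on held-out data.
```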

Ricky walks you through the steps and code to make an evaluation of each model.

It is always better to have evidence that your choices were better than a coin flip.

Although I am mindful of the wealth adviser story in “Thinking, Fast and Slow” by Daniel Kahneman, where he was given eight years of investment outcomes for 28 wealth advisers. The results indicated there was no correlation between “skill” and the outcomes. Luck, not skill, was being rewarded with bonuses.

The results were ignored by both management and advisers as inconsistent with their “…personal impressions from experience.” (pp. 215-216)

Do you think the same can be said of search results? Just curious.

May 29, 2012

Statistics for Genomics (Spring 2012)

Filed under: Genome,R,Statistics — Patrick Durusau @ 6:27 pm

Statistics for Genomics (Spring 2012) by Rafael Irizarry.

Rafael is in the process of posting lectures from his statistics for genomics course online.

Updates:

RafaLab’s Facebook page

Twitter feed

Good way to learn R, statistics and a good bit about genomics.

May 17, 2012

How to Visualize and Compare Distributions

Filed under: Graphics,R,Statistics,Visualization — Patrick Durusau @ 3:08 pm

How to Visualize and Compare Distributions by Nathan Yau.

Nathan writes:

Single data points from a large dataset can make it more relatable, but those individual numbers don’t mean much without something to compare to. That’s where distributions come in.

There are a lot of ways to show distributions, but for the purposes of this tutorial, I’m only going to cover the more traditional plot types like histograms and box plots. Otherwise, we could be here all night. Plus the basic distribution plots aren’t exactly well-used as it is.

Before you get into plotting in R though, you should know what I mean by distribution. It’s basically the spread of a dataset. For example, the median of a dataset is the half-way point. Half of the values are less than the median, and the other half are greater than. That’s only part of the picture.

What happens in between the maximum value and median? Do the values cluster towards the median and quickly increase? Are there are lot of values clustered towards the maximums and minimums with nothing in between? Sometimes the variation in a dataset is a lot more interesting than just mean or median. Distribution plots help you see what’s going on.
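
Nathan's tutorial is in R; as a rough Python analogue (the sample data and plot choices are mine), a histogram and a box plot of the same skewed sample show what a single mean or median hides.

```python
import random
import matplotlib.pyplot as plt

random.seed(42)
# A right-skewed sample: the mean and median say little about the long tail.
values = [random.expovariate(1 / 30.0) for _ in range(1000)]

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
ax_hist.hist(values, bins=30)       # overall shape of the distribution
ax_hist.set_title("Histogram")
ax_box.boxplot(values, vert=False)  # median, quartiles and outliers at a glance
ax_box.set_title("Box plot")
plt.tight_layout()
plt.show()
```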

You will find distributions useful in many aspects of working with topic maps.

The most obvious use is the end-user display of data in a delivery situation. But distributions can also help you decide what areas of a data set look more “interesting” than others.

Nathan does his typically great job explaining distributions and you will learn a bit of R in the process. Not a bad evening at all.

May 11, 2012

Nuts and Bolts of Data Mining: Correlation & Scatter Plots

Filed under: Correlation,Statistics — Patrick Durusau @ 4:19 pm

Nuts and Bolts of Data Mining: Correlation & Scatter Plots by Tim Graettinger.

From the post:

In this article, I continue the “Nuts and Bolts of Data Mining” series. We will tackle two, intertwined tools/topics this time: correlation and scatter plots. These tools are fundamental for gauging the relationship (if any) between pairs of data elements. For instance, you might want to view the relationship between the age and income of your customers as a scatter plot. Or, you might compute a number that is the correlation between these two customer demographics. As we’ll soon see, there are good, bad, and ugly things that can happen when you apply a purely computational method like correlation. My goal is to help you avoid the usual pitfalls, so that you can use correlation and scatter plots effectively in your own work.
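
In the same spirit as Tim's examples, here is a minimal sketch (with made-up data) of computing a correlation and then sanity-checking it with a scatter plot; the variable names are mine, not Tim's.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Made-up "age vs. income" style data with a roughly linear relationship.
age = rng.uniform(20, 65, size=200)
income = 1000 * age + rng.normal(0, 15000, size=200)

r = np.corrcoef(age, income)[0, 1]  # Pearson correlation coefficient
print(round(r, 2))

plt.scatter(age, income, s=10)
plt.xlabel("age")
plt.ylabel("income")
plt.title(f"r = {r:.2f}")
plt.show()
# The scatter plot is the sanity check: very different shapes (nonlinearity,
# outliers, clusters) can produce the same correlation coefficient.
```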

You will smile at the examples but if the popular press is any indication, correlation is no laughing matter!

Tim’s post won’t turn the tide, but it is short enough to forward to the local broadside folks.

April 28, 2012

Workflow for statistical data analysis

Filed under: Data Analysis,R,Statistics — Patrick Durusau @ 6:06 pm

Workflow for statistical data analysis by Christophe Lalanne.

A short summary of Oliver Kirchkamp’s Workflow of statistical data analysis, which takes the reader from data to paper.

Christophe says a more detailed review is likely to follow but at eighty-six (86) pages, you could read it yourself and make detailed comments as well.

April 23, 2012

ICDM 2012

ICDM 2012 Brussels, Belgium | December 10 – 13, 2012

From the webpage:

The IEEE International Conference on Data Mining series (ICDM) has established itself as the world’s premier research conference in data mining. It provides an international forum for presentation of original research results, as well as exchange and dissemination of innovative, practical development experiences. The conference covers all aspects of data mining, including algorithms, software and systems, and applications.

ICDM draws researchers and application developers from a wide range of data mining related areas such as statistics, machine learning, pattern recognition, databases and data warehousing, data visualization, knowledge-based systems, and high performance computing. By promoting novel, high quality research findings, and innovative solutions to challenging data mining problems, the conference seeks to continuously advance the state-of-the-art in data mining. Besides the technical program, the conference features workshops, tutorials, panels and, since 2007, the ICDM data mining contest.

Important Dates:

ICDM contest proposals: April 30
Conference full paper submissions: June 18
Demo and tutorial proposals: August 10
Workshop paper submissions: August 10
PhD Forum paper submissions: August 10
Conference paper, tutorial, demo notifications: September 18
Workshop paper notifications: October 1
PhD Forum paper notifications: October 1
Camera-ready copies and copyright forms: October 15

April 22, 2012

AI & Statistics 2012

Filed under: Artificial Intelligence,Machine Learning,Statistical Learning,Statistics — Patrick Durusau @ 7:08 pm

AI & Statistics 2012 (La Palma, Canary Islands)

Proceedings:

http://jmlr.csail.mit.edu/proceedings/papers/v22/

As one big file:

http://jmlr.csail.mit.edu/proceedings/papers/v22/v22.tar.gz

Why you should care:

The fifteenth international conference on Artificial Intelligence and Statistics (AISTATS 2012) will be held on La Palma in the Canary Islands. AISTATS is an interdisciplinary gathering of researchers at the intersection of computer science, artificial intelligence, machine learning, statistics, and related areas. Since its inception in 1985, the primary goal of AISTATS has been to broaden research in these fields by promoting the exchange of ideas among them. We encourage the submission of all papers which are in keeping with this objective.

The conference runs April 21 – 23, 2012. Sorry!

You will enjoy looking over the papers!

April 19, 2012

Contra: Search Engine Land’s Mediocre Post on Local Search

Filed under: Searching,Statistics — Patrick Durusau @ 7:19 pm

Search Engine Land’s Mediocre Post on Local Search

Matthew Hurst writes:

A colleague brought to my attention a post on the influential search blog Search Engine Land which makes claims about the quality of local data found on search engines and local verticals: Yellow Pages Sites Beat Google In Local Data Accuracy Test. The author describes surprise at the outcome reported – that Yellow Pages sites are better at local search than Google. Rather, we should express surprise at how poorly this article is written and at the intentionally misleading nature of the title.

What surprises me is how far Matthew had to go to find something “misleading.”

You may not agree with the definition of “local businesses” but it was clearly stated, so if the results are “misleading,” it is because readers did not appreciate the definition of “local businesses.” Since it was stated, whose fault is that?

As far as “…swinging back to bad reporting…” goes (I didn’t see any bad reporting up to this point, but it is his post), the objection is to the last table, whose “coverage of an attribute” says nothing about quality.

If you can find where the Search Engine Land post ever said anything about the quality of “additional information” I would appreciate a pointer.

Granted, the “additional information” category is fairly vacuous, but that wasn’t hidden from the reader. Or claimed to be something it wasn’t.

The original post did not follow Matthew’s preferences. That’s my take away from Matthew’s post.

Choices of variable and their definitions always, always favor a particular outcome.

What other reason is there to choose a variable and its definition?

Gapminder

Filed under: Graphics,Statistics,Visualization — Patrick Durusau @ 7:19 pm

Gapminder by Hans Rosling.

If you don’t know the name, Hans Rosling, you should.

A promoter of the use of statistics (and their illustration) to make sense of a complex and changing world.

Hans sees the world from the perspective of a public health expert.

Statistics are used to measure the effectiveness of public health programs.

The most impressive aspect of the site is its ability to create animated graphs on the fly from the data sets, for your viewing and manipulation.

Knoema Launches the World’s First Knowledge Platform Leveraging Data

Filed under: Data,Data Analysis,Data as Service (DaaS),Data Mining,Knoema,Statistics — Patrick Durusau @ 7:13 pm

Knoema Launches the World’s First Knowledge Platform Leveraging Data

From the post:

DEMO Spring 2012 conference — Today at DEMO Spring 2012, Knoema launched publicly the world’s first knowledge platform that leverages data and offers tools to its users to harness the knowledge hidden within the data. Search and exploration of public data, its visualization and analysis have never been easier. With more than 500 datasets on various topics, gallery of interactive, ready to use dashboards and its user friendly analysis and visualization tools, Knoema does for data what YouTube did to videos.

Millions of users interested in data, like analysts, students, researchers and journalists, struggle to satisfy their data needs. At the same time there are many organizations, companies and government agencies around the world collecting and publishing data on various topics. But still getting access to relevant data for analysis or research can take hours with final outcomes in many formats and standards that can take even longer to get it to a shape where it can be used. This is one of the issues that the search engines like Google or Bing face even after indexing the entire Internet due to the nature of statistical data and diversity and complexity of sources.

One-stop shop for data. Knoema, with its state of the art search engine, makes it a matter of minutes if not seconds to find statistical data on almost any topic in easy to ingest formats. Knoema’s search instantly provides highly relevant results with chart previews and actual numbers. Search results can be further explored with Dataset Browser tool. In Dataset Browser tool, users can get full access to the entire public data collection, explore it, visualize data on tables/charts and download it as Excel/CSV files.

Numbers made easier to understand and use. Knoema enables end-to-end experience for data users, allowing creation of highly visual, interactive dashboards with a combination of text, tables, charts and maps. Dashboards built by users can be shared to other people or on social media, exported to Excel or PowerPoint and embedded to blogs or any other web site. All public dashboards made by users are available in dashboard gallery on home page. People can collaborate on data related issues participating in discussions, exchanging data and content.

Excellent!!!

When “other” data becomes available, users will want to integrate it with their data.

But “other” data will have different or incompatible semantics.

So much for attempts to wrestle semantics to the ground (W3C) or build semantic prisons (unnamed vendors).

What semantics are useful to you today? (patrick@durusau.net)

April 16, 2012

The Statistical Core Vocabulary (scovo)

Filed under: Statistical Core Vocabulary (scovo),Statistics,Vocabularies — Patrick Durusau @ 7:12 pm

The Statistical Core Vocabulary (scovo)

From the webpage:

This document specifies an [RDF-Schema] vocabulary for representing statistical data on the Web. It is normatively encoded in [XHTML+RDFa], that is embedded in this page.

The homepage reports this vocabulary as deprecated, but it is cited as a namespace in the RDF Data Cube Vocabulary (1.6).

I don’t have any numbers on the actual use of this vocabulary but you probably need to be aware of it.

The RDF Data Cube Vocabulary

Filed under: RDF Data Cube Vocabulary,Statistics — Patrick Durusau @ 7:11 pm

The RDF Data Cube Vocabulary

A new draft from the W3C, adapting existing data cube vocabularies into an RDF representation.

The proposal re-uses several other vocabularies that I will be covering separately.

There are several open issues so read carefully.


What do you make of The RDF Data Cube Vocabulary? I haven’t run diffs on it yet.

April 11, 2012

Calculating Word and N-Gram Statistics from the Gutenberg Corpus

Filed under: Gutenberg Corpus,N-Gram,NLTK,Statistics — Patrick Durusau @ 6:16 pm

Calculating Word and N-Gram Statistics from the Gutenberg Corpus by Richard Marsden.

From the post:

Following on from the previous article about scanning text files for word statistics, I shall extend this to use real large corpora. First we shall use this script to create statistics for the entire Gutenberg English language corpus. Next I shall do the same with the entire English language Wikipedia.

A “get your feet wet” sort of exercise with the script included.

The Gutenberg project isn’t “big data” but it is more than your usual inbox.

Think of it as learning about the data set for application of more sophisticated algorithms.
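
For a sense of what such a script does, here is a minimal sketch of word and bigram counting over a plain-text file; the file name and the crude tokenizer are placeholders, not Richard's actual code.

```python
import re
from collections import Counter

# Placeholder path: any plain-text Gutenberg file will do.
with open("gutenberg_book.txt", encoding="utf-8") as f:
    text = f.read().lower()

tokens = re.findall(r"[a-z']+", text)  # crude tokenizer, good enough for counts
word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(word_counts.most_common(10))
print(bigram_counts.most_common(10))
```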

March 31, 2012

Automated science, deep data and the paradox of information – Data As Story

Filed under: BigData,Epistemology,Information Theory,Modeling,Statistics — Patrick Durusau @ 4:09 pm

Automated science, deep data and the paradox of information…

Bradley Voytek writes:

A lot of great pieces have been written about the relatively recent surge in interest in big data and data science, but in this piece I want to address the importance of deep data analysis: what we can learn from the statistical outliers by drilling down and asking, “What’s different here? What’s special about these outliers and what do they tell us about our models and assumptions?”

The reason that big data proponents are so excited about the burgeoning data revolution isn’t just because of the math. Don’t get me wrong, the math is fun, but we’re excited because we can begin to distill patterns that were previously invisible to us due to a lack of information.

That’s big data.

Of course, data are just a collection of facts; bits of information that are only given context — assigned meaning and importance — by human minds. It’s not until we do something with the data that any of it matters. You can have the best machine learning algorithms, the tightest statistics, and the smartest people working on them, but none of that means anything until someone makes a story out of the results.

And therein lies the rub.

Do all these data tell us a story about ourselves and the universe in which we live, or are we simply hallucinating patterns that we want to see?

I reformulate Bradley’s question into:

We use data to tell stories about ourselves and the universe in which we live.

Which means that his rules of statistical methods:

  1. The more advanced the statistical methods used, the fewer critics are available to be properly skeptical.
  2. The more advanced the statistical methods used, the more likely the data analyst will be to use math as a shield.
  3. Any sufficiently advanced statistics can trick people into believing the results reflect truth.

are sources of other stories “about ourselves and the universe in which we live.”

If you prefer Bradley’s original question:

Do all these data tell us a story about ourselves and the universe in which we live, or are we simply hallucinating patterns that we want to see?

I would answer: And the difference would be?

March 26, 2012

We’re Not Very Good Statisticians

Filed under: Analytics,Statistics — Patrick Durusau @ 6:36 pm

We’re Not Very Good Statisticians by Steve Miller.

From the post:

I’ve received several emails/comments about my recent series of blogs on Duncan Watts’ interesting book “Everything is Obvious: *Once You Know the Answer — How Common Sense Fails Us.” Watts’ thesis is that the common sense that generally guides us well for life’s simple, mundane tasks often fails miserably when decisions get more complicated.

Three of the respondents suggested I take a look at “Thinking Fast and Slow,” by psychologist Daniel Kahneman, who along with the late economist Amos Tversky, was awarded the Nobel Prize in Economic Sciences for “seminal work in psychology that challenged the rational model of judgment and decision making.”

Steve’s post and the ones to follow are worth a close read.

When data, statistical or otherwise, agrees with me, I take that as a sign to evaluate it very carefully. Your mileage may vary.

The Difference Between Interaction and Association

Filed under: Mathematics,Statistics — Patrick Durusau @ 6:35 pm

The Difference Between Interaction and Association by Karen Grace-Martin.

From the post:

It’s really easy to mix up the concepts of association (a.k.a. correlation) and interaction. Or to assume if two variables interact, they must be associated. But it’s not actually true.

In statistics, they have different implications for the relationships among your variables, especially when the variables you’re talking about are predictors in a regression or ANOVA model.

Association

Association between two variables means the values of one variable relate in some way to the values of the other. Association is usually measured by correlation for two continuous variables and by cross tabulation and a Chi-square test for two categorical variables.

Unfortunately, there is no nice, descriptive measure for association between one categorical and one continuous variable, but either one-way analysis of variance or logistic regression can test an association (depending upon whether you think of the categorical variable as the independent or the dependent variable).

Essentially, association means the values of one variable generally co-occur with certain values of the other.

Interaction

Interaction is different. Whether two variables are associated says nothing about whether they interact in their effect on a third variable. Likewise, if two variables interact, they may or may not be associated.

An interaction between two variables means the effect of one of those variables on a third variable is not constant—the effect differs at different values of the other.
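
A small numeric sketch (made-up data) of Karen's point: two predictors can be completely unassociated while the effect of one on the outcome still depends on the other.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=5000)
x2 = rng.normal(size=5000)                 # independent of x1: no association
y = 2.0 * x1 * x2 + rng.normal(size=5000)  # pure interaction effect on y

# Association between the predictors is essentially zero.
print(round(np.corrcoef(x1, x2)[0, 1], 3))

# But the effect of x1 on y depends on the value of x2 (the slope flips sign).
low, high = x2 < 0, x2 >= 0
slope_low = np.polyfit(x1[low], y[low], 1)[0]
slope_high = np.polyfit(x1[high], y[high], 1)[0]
print(round(slope_low, 2), round(slope_high, 2))
```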

You will most likely be using statistics, or at least discussing topic maps with analysts who use statistics, so be prepared to distinguish “association” in the statistics sense from association when you use it in the topic maps sense. They are pronounced the same way. 😉

Depending upon the subject matter of your topic map, you may well be describing “interaction,” but again, not in the sense that Karen illustrates in her post.

The world of semantics is a big place so be careful out there.

March 23, 2012

Statistical Analysis: Common Mistakes

Filed under: Statistics — Patrick Durusau @ 7:24 pm

Statistical Analysis: Common Mistakes by Sandro Saitta.

The post cites the following example from the paper:

“Imagine you are a regional sales head for a major retailer in U.S. and you want to know what drives sales in your top performing stores. Your research team comes back with a revealing insight – the most significant predictor in their model is the average number of cars present in stores’ parking lots.”

A good paper to re-read from time to time.

March 11, 2012

“All Models are Right, Most are Useless”

Filed under: Modeling,Regression,Statistics — Patrick Durusau @ 8:09 pm

“All Models are Right, Most are Useless”

A counter to George Box’s saying “all models are wrong, some are useful,” by Thad Tarpey. The link is a pointer to slides for the presentation.

Covers the fallacy of “reification” (in the modeling sense) among other amusements.

Useful to remember that maps are approximations as well.

March 2, 2012

An essay on why programmers need to learn statistics

Filed under: Data,Statistics — Patrick Durusau @ 8:05 pm

An essay on why programmers need to learn statistics from Simply Statistics.

Truly an amazing post!

But it doesn’t apply just to programmers: anyone evaluating data needs to understand statistics and, perhaps more importantly, have the ability to know when the data isn’t quite right. The math is correct but the data is too clean, too good, too …, something that makes you uneasy with the data.

Consider the Duke Saga for example.

February 28, 2012

OECD Homepage

Filed under: Government Data,Statistics — Patrick Durusau @ 8:41 pm

OECD Homepage

More about how I got to this site in a moment but it is a wealth of statistical information.

From the about page:

The mission of the Organisation for Economic Co-operation and Development (OECD) is to promote policies that will improve the economic and social well-being of people around the world.

The OECD provides a forum in which governments can work together to share experiences and seek solutions to common problems. We work with governments to understand what drives economic, social and environmental change. We measure productivity and global flows of trade and investment. We analyse and compare data to predict future trends. We set international standards on a wide range of things, from agriculture and tax to the safety of chemicals.

We look, too, at issues that directly affect the lives of ordinary people, like how much they pay in taxes and social security, and how much leisure time they can take. We compare how different countries’ school systems are readying their young people for modern life, and how different countries’ pension systems will look after their citizens in old age.

Drawing on facts and real-life experience, we recommend policies designed to make the lives of ordinary people better. We work with business, through the Business and Industry Advisory Committee to the OECD, and with labour, through the Trade Union Advisory Committee. We have active contacts as well with other civil society organisations. The common thread of our work is a shared commitment to market economies backed by democratic institutions and focused on the wellbeing of all citizens. Along the way, we also set out to make life harder for the terrorists, tax dodgers, crooked businessmen and others whose actions undermine a fair and open society.

I got to the site by following a link to OECD.StatExtracts, a beta page reported in Christophe Lalanne’s A bag of tweets / Feb 2012.

I am sure comments (helpful ones in particular) would be appreciated on the beta pages.

My personal suspicion is that eventually very little data will be transferred in bulk; most large data sets will admit both pre-programmed and ad hoc processing requests. That is already quite common in astronomy (both optical and radio).

StatLib

Filed under: Data,Dataset,Statistics — Patrick Durusau @ 8:41 pm

StatLib

From the webpage:

Welcome to StatLib, a system for distributing statistical software, datasets, and information by electronic mail, FTP and WWW. StatLib started out as an e-mail service and some of the organization still reflects that heritage. We hope that this document will give you sufficient guidance to navigate through the archives. For your convenience there are several sites around the world which serve as full or partial mirrors to StatLib.

An amazing source of software and data, including sets of webpages for clustering analysis, etc.

Was mentioned in the first R-Podcast episode.
