Detecting Novel Associations in Large Data Sets, by David N. Reshef, Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, Pardis C. Sabeti.
Abstract:
Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.
Lay version: Tool detects patterns hidden in vast data sets, by Haley Bridger.
Data and software: http://exploredata.net/.
From the article:
Imagine a data set with hundreds of variables, which may contain important, undiscovered relationships. There are tens of thousands of variable pairs—far too many to examine manually. If you do not already know what kinds of relationships to search for, how do you efficiently identify the important ones? Data sets of this size are increasingly common in fields as varied as genomics, physics, political science, and economics, making this question an important and growing challenge (1, 2).
One way to begin exploring a large data set is to search for pairs of variables that are closely associated. To do this, we could calculate some measure of dependence for each pair, rank the pairs by their scores, and examine the top-scoring pairs. For this strategy to work, the statistic we use to measure dependence should have two heuristic properties: generality and equitability.
By generality, we mean that with sufficient sample size the statistic should capture a wide range of interesting associations, not limited to specific function types (such as linear, exponential, or periodic), or even to all functional relationships (3). The latter condition is desirable because not only do relationships take many functional forms, but many important relationships—for example, a superposition of functions—are not well modeled by a function (4–7).
By equitability, we mean that the statistic should give similar scores to equally noisy relationships of different types. For example, we do not want noisy linear relationships to drive strong sinusoidal relationships from the top of the list. Equitability is difficult to formalize for associations in general but has a clear interpretation in the basic case of functional relationships: An equitable statistic should give similar scores to functional relationships with similar R² values (given sufficient sample size).
Here, we describe an exploratory data analysis tool, the maximal information coefficient (MIC), that satisfies these two heuristic properties. We establish MIC’s generality through proofs, show its equitability on functional relationships through simulations, and observe that this translates into intuitively equitable behavior on more general associations. Furthermore, we illustrate that MIC gives rise to a larger family of statistics, which we refer to as MINE, or maximal information-based nonparametric exploration. MINE statistics can be used not only to identify interesting associations, but also to characterize them according to properties such as nonlinearity and monotonicity. We demonstrate the application of MIC and MINE to data sets in health, baseball, genomics, and the human microbiota. (footnotes omitted)
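The pair-ranking strategy the authors describe (score every pair of variables, sort, look at the top of the list) is easy to try for yourself. Here is a minimal Python sketch. Be warned that mic_approx below is only my crude stand-in for MIC: it searches equal-frequency grids up to roughly the n^0.6 cell bound and normalizes mutual information by log2 of the smaller grid dimension, whereas the authors’ algorithm optimizes the grid boundaries dynamically. Grab their software from the link above for the real thing.

from itertools import combinations
import numpy as np

def mic_approx(x, y, alpha=0.6):
    """Crude MIC-style score: maximum normalized mutual information over
    equal-frequency grids with at most n**alpha cells (a rough version of
    the paper's B(n) bound)."""
    n = len(x)
    max_cells = max(int(n ** alpha), 4)
    best = 0.0
    for nx in range(2, max_cells + 1):
        for ny in range(2, max_cells + 1):
            if nx * ny > max_cells:
                continue
            # Partition each axis into equal-frequency bins.
            xedges = np.quantile(x, np.linspace(0, 1, nx + 1))
            yedges = np.quantile(y, np.linspace(0, 1, ny + 1))
            joint, _, _ = np.histogram2d(x, y, bins=[xedges, yedges])
            p = joint / joint.sum()
            px = p.sum(axis=1, keepdims=True)
            py = p.sum(axis=0, keepdims=True)
            nz = p > 0
            mi = np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz]))
            best = max(best, mi / np.log2(min(nx, ny)))
    return best

# Toy data set: one driver variable, a noisy linear response, a noisy
# sinusoid, and an unrelated noise column.
rng = np.random.default_rng(0)
a = rng.uniform(-3, 3, 500)
data = {
    "a": a,
    "linear": 2 * a + rng.normal(scale=0.5, size=500),
    "sinusoid": np.sin(3 * a) + rng.normal(scale=0.2, size=500),
    "noise": rng.normal(size=500),
}

# Score every pair, rank, and print; the related pairs should float to the top.
scores = {(u, v): mic_approx(data[u], data[v]) for u, v in combinations(data, 2)}
for (u, v), s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{u:8s} vs {v:8s}  {s:.2f}")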
As you can imagine, the line:
MINE statistics can be used not only to identify interesting associations, but also to characterize them according to properties such as nonlinearity and monotonicity.
caught my eye.
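To give a feel for what that characterization could look like: one of the paper’s characterization scores is nonlinearity, measured as MIC minus the squared Pearson correlation (MIC - ρ²), which stays near zero for a clean linear relationship and grows for strongly nonlinear ones (the paper’s MAS statistic plays the analogous role for monotonicity). A rough sketch, reusing the mic_approx stand-in from the snippet above, so treat the numbers as illustrative only:

import numpy as np

def nonlinearity(x, y):
    """Nonlinearity in the spirit of the paper: MIC minus squared Pearson r.
    mic_approx is the crude stand-in defined in the earlier sketch."""
    r = np.corrcoef(x, y)[0, 1]
    return mic_approx(x, y) - r ** 2

rng = np.random.default_rng(1)
t = rng.uniform(-3, 3, 500)
linear = 2 * t + rng.normal(scale=0.3, size=500)     # strong, linear
parabola = t ** 2 + rng.normal(scale=0.3, size=500)  # strong, nonlinear

print("linear   ", round(nonlinearity(t, linear), 2))    # near zero
print("parabola ", round(nonlinearity(t, parabola), 2))  # clearly positive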
I usually don’t post until the evening, but this looks very important. I wanted everyone to have a chance to grab the data and software before the weekend.
New acronyms:
MIC – maximal information coefficient
MINE – maximal information-based nonparametric exploration
Good thing they chose acronyms we would not be likely to confuse with other usages. 😉
Full citation:
Science, 16 December 2011, Vol. 334, No. 6062, pp. 1518-1524. DOI: 10.1126/science.1205438