Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 18, 2014

Finding correlations in complex datasets

Filed under: Interface Research/Design,Visualization — Patrick Durusau @ 3:02 pm

Finding correlations in complex datasets by Andrés Colubri.

From the post:

It is now almost three years since I moved to Boston to start working at Fathom Information Design and the Sabeti Lab at Harvard. As I noted back then, one of the goals of this work was to create new tools for exploring complex datasets (mainly of epidemiological and health data) which could potentially contain up to thousands of different variables. After a process that involved researching visual metaphors suitable for exploring these kinds of datasets interactively, learning statistical techniques that can be used to quantify general correlations (not necessarily linear or between numerical quantities), and going over several iterations of internal prototypes, we finally released the 1.0 version of a tool called “Mirador” (Spanish word for lookout), which attempts to bridge the space between raw data and statistical modeling. Please jump to Mirador’s homepage to access the software and its user manual, and continue reading below for some more details about the development and design process.

The first step in building a narrative out of data is arguably finding correlations between different magnitudes or variables in the data. For instance, the placement of roads is highly correlated with the anthropogenic and geographical features of a territory. A new, unexpected, intuition-defying, or polemic correlation would probably result in an appealing narrative. Furthermore, a visual representation (of the correlation) that succeeds in its aesthetic language or conceptual clarity is also part of an appealing “data-driven” narrative. Within the scientific domains, these narratives are typically expressed in the form of a model that can be used by the researchers to make predictions. Although fields like Machine Learning and Bayesian Statistics have grown enormously in the past decades and offer techniques that allow the computer to infer predictive models from data, these techniques require careful calibration and overall supervision from the expert users who run these learning and inference algorithms. A key consideration is what variables to include in the inference process, since too few variables might result in a highly biased model, while too many of them would lead to overfitting and large variance on new data (what is called the bias-variance dilemma).

Leaving aside model building, an exploratory overview of the correlations in a dataset is also important in situations where one needs to quickly survey association patterns in order to understand ongoing processes, for example, the spread of an infectious disease or the relationship between individual behaviors and health indicators. The early identification of (statistically significant) associations can inform decision making and eventually help to save lives and improve public policy.

With this background in mind, three years ago we embarked on the task of creating a tool that could assist data exploration and model building by providing a visual interface to find and inspect correlations in general datasets, while having a focus on public health and epidemiological data. The thesis work from David Reshef with his tool VisuaLyzer was our starting point. Once we were handed the initial VisuaLyzer prototype, we carried out a number of development and design iterations at Fathom, which redefined the overall workspace in VisuaLyzer but kept its main visual metaphor for data representation intact. Within this metaphor, the data is presented in “stand-alone” views such as scatter plots, histograms, and maps where several “encodings” can be defined at once. An encoding is a mapping between the values of a variable in the dataset and a visual parameter, for example X and Y coordinates, size, color and opacity of the circles representing data instances, etc. This approach of defining multiple encodings in a single “large” data view is similar to what the Gapminder World visualization does.

Mirador self-describes at its homepage:

Mirador is a tool for visual exploration of complex datasets which enables users to infer new hypotheses from the data and discover correlation patterns.

Whether you call them “correlations” or “association patterns” (note the small “a” in associations), these relationships could in fact be modeled by Associations (note the capital “A” in Associations) in a topic map.

An important point for several reasons:

  • In this use case, there may be thousands of variables that contribute to an association pattern.
  • Associations can be discovered in data as opposed to being composed in an authored artifact.
  • Associations give us the tools to talk about not just the role players identified by data analysis but also potential roles and how they compose an association.
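
To make the second and third points concrete, here is a minimal sketch of turning a discovered correlation into an Association with typed roles. Everything in it is an assumption for illustration: the `Role` and `Association` classes, the `discover_associations` helper, the "correlated-with" and "correlate" type names, and the 0.8 threshold are hypothetical, and plain Pearson correlation stands in for whatever statistic a tool like Mirador actually computes. It is not a topic map engine, only the shape of the idea.

```python
from dataclasses import dataclass
from itertools import combinations
from math import sqrt


@dataclass(frozen=True)
class Role:
    """A typed slot in an association, filled by a topic (here: a variable name)."""
    role_type: str
    player: str


@dataclass(frozen=True)
class Association:
    """A typed relationship among role players, with the evidence that produced it."""
    assoc_type: str
    roles: tuple
    strength: float


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        return 0.0
    return cov / (dev_x * dev_y)


def discover_associations(columns, threshold=0.8):
    """Emit one association per pair of variables whose |r| meets the threshold.

    `columns` maps a variable name to a list of numeric values of equal length.
    """
    found = []
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            roles = (Role("correlate", name_a), Role("correlate", name_b))
            found.append(Association("correlated-with", roles, r))
    return found


if __name__ == "__main__":
    # Made-up toy data, not taken from any real dataset.
    data = {
        "age": [1, 2, 3, 4, 5],
        "dose": [2, 4, 6, 8, 10],
        "noise": [3, 1, 4, 1, 5],
    }
    for assoc in discover_associations(data):
        players = ", ".join(role.player for role in assoc.roles)
        print(f"{assoc.assoc_type}: {players} (r = {assoc.strength:.2f})")
```

Both players here share the generic "correlate" role because correlation is symmetric. Once analysis or domain knowledge suggests a direction, the same structure can carry more specific role types, which is where the vocabulary of roles starts to pay off.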

Happy hunting!
