Archive for the ‘Chordalysis’ Category

Association Discovery Beyond Ten Variables – Prioritized Chordalysis

Monday, May 4th, 2015

Scaling log-linear analysis to datasets with thousands of variables by François Petitjean and Geoffrey I. Webb.

Abstract:

Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We have recently shown that, if we ensure that the graph supporting the log-linear model is chordal, log-linear analysis can be applied to datasets with hundreds of variables without sacrificing the statistical soundness [21]. However, further scalability remained limited, because state-of-the-art techniques have to examine every edge at every step of the search. This paper makes the following contributions: 1) we prove that only a very small subset of edges has to be considered at each step of the search; 2) we demonstrate how to efficiently find this subset of edges and 3) we show how to efficiently keep track of the best edges to be subsequently added to the initial model. Our experiments, carried out on real datasets with up to 2000 variables, show that our contributions make it possible to gain about 4 orders of magnitude, making log-linear analysis of datasets with thousands of variables possible in seconds instead of days.

The authors reduce the number of edges required to be examined in one example from 10,000,000 to 10,000, with corresponding savings in computation time. That was not an artifact of the data set but has been generalized by the authors and released as open source code: Chordalysis (GitHub).

If you prefer numbers, the analysis of a data set with 10,000,000 edges went from 39 hours to 27 seconds, a speedup of more than 5200X.

Definitely an addition to your data mining toolkit!

Chordalysis: a new method to discover the structure of data

Thursday, November 28th, 2013

Chordalysis: a new method to discover the structure of data by Francois Petitjean.

From the post:

…you can’t use log-linear analysis if your dataset has more than, say, 10 variables! This is because the process is exponential in the number of variables. That is where our new work makes a difference. The question was: how can we keep the rigorous statistical foundations of classical log-linear analysis but make it work for datasets with hundreds of variables?

The main part of the answer is “chordal graphs”, which are the graphs made of triangular structures. We showed that for this class of models, the theory is scalable for high-dimensional datasets. The rest of the solution involved melding the classical statistical machinery with advanced data mining techniques from association discovery and graphical modelling.

The result is Chordalysis: a log-linear analysis method for high-dimensional data. Chordalysis makes it possible to discover the structure of datasets with hundreds of variables on a standard computer. So far we’ve applied it successfully to datasets with up to 750 variables. (emphasis added)

Software: https://sourceforge.net/projects/chordalysis/

Scaling log-linear analysis to high-dimensional data (PDF), by Francois Petitjean, Geoffrey I. Webb and Ann E. Nicholson.

Abstract:

Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We develop an efficient approach to log-linear analysis that scales to hundreds of variables by melding the classical statistical machinery of log-linear analysis with advanced data mining techniques from association discovery and graphical modeling.

Being curious about what was meant by “…a standard computer…” I searched the paper to find:

The conjunction of these features makes it possible to scale log-linear analysis to hundreds of variables on a standard desktop computer. (page 3 of the PDF, the pages are unnumbered)

Not a lot clearer but certainly encouraging!

The data used in the paper can be found at: http://www.icpsr.umich.edu/icpsrweb/NACDA/studies/09915.

The Chordalysis wiki looks helpful.

So, are your clients going to be limited to 10 variables or a somewhat higher number?