Archive for the ‘Multivariate Statistics’ Category

Tesseract – Fast Multidimensional Filtering for Coordinated Views

Sunday, March 25th, 2012

Tesseract – Fast Multidimensional Filtering for Coordinated Views

From the post:

Tesseract is a JavaScript library for filtering large multivariate datasets in the browser. Tesseract supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records; we built it to power analytics for Square Register, allowing merchants to slice and dice their payment history fluidly.

Since most interactions only involve a single dimension, and then only small adjustments are made to the filter values, incremental filtering and reducing is significantly faster than starting from scratch. Tesseract uses sorted indexes (and a few bit-twiddling hacks) to make this possible, dramatically increasing the performance of live histograms and top-K lists. For more details on how Tesseract works, see the API reference.
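The incremental idea can be sketched in a few lines. This is an illustrative Python toy, not Tesseract's actual JavaScript implementation, and the `Dimension` class and record layout are invented names: a dimension keeps a sorted index, and moving a filter bound only inspects the records between the old and new bounds, located by binary search.

```python
import bisect

class Dimension:
    """Toy sorted-index dimension (illustrative only, not Tesseract's code)."""
    def __init__(self, records, key):
        # Sort record positions once by this dimension's value.
        self.order = sorted(range(len(records)), key=lambda i: key(records[i]))
        self.values = [key(records[i]) for i in self.order]
        self.lo, self.hi = 0, len(records)  # current filter, as bounds into the sorted order

    def filter_range(self, low, high):
        """Adjust the filter to [low, high]; return (entered, left) record positions.

        Binary search finds the new bounds, so only records between the old
        and new bounds are touched (this sketch assumes the old and new
        ranges overlap, as in typical brushing interactions).
        """
        new_lo = bisect.bisect_left(self.values, low)
        new_hi = bisect.bisect_right(self.values, high)
        entered, left = [], []
        if new_lo < self.lo:
            entered += self.order[new_lo:self.lo]
        else:
            left += self.order[self.lo:new_lo]
        if new_hi > self.hi:
            entered += self.order[self.hi:new_hi]
        else:
            left += self.order[new_hi:self.hi]
        self.lo, self.hi = new_lo, new_hi
        return entered, left

payments = [{"amount": a} for a in (5, 12, 7, 30, 18, 3)]
dim = Dimension(payments, key=lambda r: r["amount"])
entered, left = dim.filter_range(5, 20)  # keep amounts in [5, 20]
# left == [5, 3]: positions of amounts 3 and 30, which fall outside the range
```

A live histogram would then decrement its bins for the `left` records and increment them for the `entered` ones, rather than rescanning the full dataset.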

Are you ready to “slice and dice” your data set?

Multivariate Statistical Analysis: Old School

Monday, February 27th, 2012

Multivariate Statistical Analysis: Old School by John Marden.

From the preface:

The goal of this text is to give the reader a thorough grounding in old-school multivariate statistical analysis. The emphasis is on multivariate normal modeling and inference, both theory and implementation. Linear models form a central theme of the book. Several chapters are devoted to developing the basic models, including multivariate regression and analysis of variance, and especially the “both-sides models” (i.e., generalized multivariate analysis of variance models), which allow modeling relationships among individuals as well as variables. Growth curve and repeated measure models are special cases.

The linear models are concerned with means. Inference on covariance matrices covers testing equality of several covariance matrices, testing independence and conditional independence of (blocks of) variables, factor analysis, and some symmetry models. Principal components, though mainly a graphical/exploratory technique, also lends itself to some modeling.
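As a small illustration of the principal-components technique mentioned above, here is a minimal sketch in Python on simulated data (the book itself works in R; the data below are invented for the example). The components are the eigenvectors of the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 observations on 3 correlated variables (illustrative only).
z = rng.standard_normal((200, 3))
x = z @ np.array([[2.0, 0.5, 0.0],
                  [0.0, 1.0, 0.3],
                  [0.0, 0.0, 0.2]])

# Principal components: eigenvectors of the sample covariance matrix.
s = np.cov(x, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(s)       # eigh returns ascending order
order = eigvals.argsort()[::-1]            # reorder largest-variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = (x - x.mean(axis=0)) @ eigvecs    # component scores per observation
explained = eigvals / eigvals.sum()        # proportion of variance explained
```

The sample variance of each score column equals the corresponding eigenvalue, which is what makes the leading components useful for graphical exploration.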

Classification and clustering are related areas. Both attempt to categorize individuals. Classification tries to classify individuals based upon a previous sample of observed individuals and their categories. In clustering, there is no observed categorization, nor often even knowledge of how many categories there are. These must be estimated from the data.
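The distinction drawn above can be made concrete in a short sketch on invented two-group data: classification builds a rule from observed labels (here, nearest group centroid), while clustering must estimate the grouping without labels (here, a bare-bones k-means). None of this is the book's code; it is only an illustration of the two settings.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two well-separated toy groups in the plane (illustrative data only).
a = rng.normal([0, 0], 0.4, size=(50, 2))
b = rng.normal([5, 5], 0.4, size=(50, 2))
x = np.vstack([a, b])

# Classification: category labels of a training sample are observed,
# so we can build a rule -- here, assign to the nearest group centroid.
labels = np.array([0] * 50 + [1] * 50)
centroids = np.array([x[labels == k].mean(axis=0) for k in (0, 1)])
def classify(point):
    return int(np.argmin(((centroids - point) ** 2).sum(axis=1)))

# Clustering: no labels observed; k-means estimates the categories.
# Centers are initialized from two data points here for determinism;
# practical k-means picks starting points at random.
centers = x[[0, 50]].copy()
for _ in range(10):
    assign = np.argmin(((x[:, None] - centers) ** 2).sum(axis=2), axis=1)
    centers = np.array([x[assign == k].mean(axis=0) for k in (0, 1)])
```

With well-separated groups the estimated clusters recover the true categories; with overlapping groups, both the number of clusters and their membership become genuinely uncertain, which is the estimation problem the preface alludes to.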

Other useful multivariate techniques include biplots, multidimensional scaling, and canonical correlations.

The bulk of the results here are mathematically justified, but I have tried to arrange the material so that the reader can learn the basic concepts and techniques while plunging as much or as little as desired into the details of the proofs.

Practically all the calculations and graphics in the examples are implemented using the statistical computing environment R [R Development Core Team, 2010]. Throughout the notes we have scattered some of the actual R code we used. Many of the data sets and original R functions can be found in the file; for others we refer to available R packages.

This is “old school.” A preface that contains useful information and outlines what the reader may find? Definitely “old school.”

Found thanks to: Christophe Lalanne’s A bag of tweets / Feb 2012.


Musimetrics

Sunday, September 25th, 2011

Musimetrics by Vilson Vieira, Renato Fabbri, and Luciano da Fontoura Costa.

From the abstract:

Can the arts be analyzed in a quantitative manner? We propose a methodology to study music development by applying multivariate statistics to composers' characteristics. Seven representative composers were considered in terms of eight main musical features. Grades were assigned to each characteristic and their correlations were analyzed. A bootstrap method was applied to simulate hundreds of artificial composers influenced by the seven representatives chosen. Applying dimensionality reduction, we obtained a planar space used to quantify non-numeric relations like dialectics, opposition and innovation. Composers' differences in style and technique were represented as geometrical distances in the planar space, making it possible to quantify, for example, how much Bach and Stockhausen differ from other composers or how much Beethoven influenced Brahms. In addition, we compared the results with a prior investigation on philosophy. The influence of dialectics, strong in philosophy, was not remarkable in music. Instead, supporting an observation already considered by music theorists, strong influences were identified between subsequent composers, implying inheritance and suggesting a stronger master-disciple evolution when compared to the philosophy analysis.
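The pipeline the abstract describes (grade characteristics, reduce to a plane, read differences as distances) can be sketched as follows. The grades and feature names below are made up for illustration and are not the paper's data:

```python
import numpy as np

# Hypothetical grades (0-10), for illustration only -- not the paper's data.
composers = ["Bach", "Haydn", "Beethoven", "Brahms"]
features = ["counterpoint", "harmony", "rhythm", "form"]
grades = np.array([[10, 8, 6, 9],
                   [ 7, 7, 6, 9],
                   [ 8, 9, 8, 9],
                   [ 9, 9, 7, 8]], dtype=float)

# Project onto the first two principal components (the "planar space").
centered = grades - grades.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
plane = centered @ vt[:2].T              # 2-D coordinates per composer

# Stylistic difference read as geometric distance in the plane.
diff = np.linalg.norm(plane[:, None] - plane[None, :], axis=2)
```

A bootstrap along the abstract's lines would resample and perturb such rows to simulate artificial composers before projecting; the entries of `diff` then quantify stylistic separation between pairs.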

The article concludes:

While taking the first steps in the direction of a quantitative approach to arts and philosophy, we believe that an understanding of the creative process could also eventually be quantified. We want to end this work going back to Webern, who early envisioned these relations: “It is clear that where relatedness and unity are omnipresent, comprehensibility is also guaranteed. And all the rest is dilettantism, nothing else, for all time, and always has been. That’s so not only in music but everywhere.”

You are going to encounter multivariate statistics in a number of contexts. Where are the weak points in this paper? What questions would you ask? (Hint: they don’t involve expertise in music history or theory.) If you are familiar with multivariate statistics, what are the common weak points of that type of analysis?

I remember multivariate statistics from their use in the 1960s and ’70s in attempts to predict Supreme Court (US) behavior. The Court was quite safe from those predictions, and I think the same can be said for composers in the Western canon.