Archive for the ‘Sparse Learning’ Category

Statistical Learning with Sparsity: The Lasso and Generalizations (Free Book!)

Wednesday, January 6th, 2016

Statistical Learning with Sparsity: The Lasso and Generalizations by Trevor Hastie, Robert Tibshirani, and Martin Wainwright.

From the introduction:

I never keep a scorecard or the batting averages. I hate statistics. What I got to know, I keep in my head.

This is a quote from baseball pitcher Dizzy Dean, who played in the major leagues from 1930 to 1947.

How the world has changed in the 75 or so years since that time! Now large quantities of data are collected and mined in nearly every area of science, entertainment, business, and industry. Medical scientists study the genomes of patients to choose the best treatments, to learn the underlying causes of their disease. Online movie and book stores study customer ratings to recommend or sell them new movies or books. Social networks mine information about members and their friends to try to enhance their online experience. And yes, most major league baseball teams have statisticians who collect and analyze detailed information on batters and pitchers to help team managers and players make better decisions.

Thus the world is awash with data. But as Rutherford D. Roger (and others) has said:

We are drowning in information and starving for knowledge.

There is a crucial need to sort through this mass of information, and pare it down to its bare essentials. For this process to be successful, we need to hope that the world is not as complex as it might be. For example, we hope that not all of the 30,000 or so genes in the human body are directly involved in the process that leads to the development of cancer. Or that the ratings by a customer on perhaps 50 or 100 different movies are enough to give us a good idea of their tastes. Or that the success of a left-handed pitcher against left-handed batters will be fairly consistent for different batters. This points to an underlying assumption of simplicity. One form of simplicity is sparsity, the central theme of this book. Loosely speaking, a sparse statistical model is one in which only a relatively small number of parameters (or predictors) play an important role. In this book we study methods that exploit sparsity to help recover the underlying signal in a set of data.

The delightful style of the authors had me going until they said:

…we need to hope that the world is not as complex as it might be.

What? “…not as complex as it might be”?

Law school and academia both train you to look for complexity, so “…not as complex as it might be” is as close to apostasy as any statement I can imagine. 😉 (At least I can say I am honest about my prejudices. Some of them at any rate.)

Not for the mathematically faint of heart, but it may well serve as a counter to the intelligence community’s mania for collecting every scrap of data.

Finding a needle in a smaller haystack could be less costly and more effective. Both of those principles run counter to well-established government customs, but there are those in government who wish to be effective. (Article of faith on my part.)
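For readers who want to see what “only a relatively small number of parameters (or predictors) play an important role” looks like in practice, here is a minimal sketch of the lasso fitted by cyclic coordinate descent. It is my own illustration, not code from the book: the function names (`soft_threshold`, `lasso_cd`), the made-up data, and the penalty value `lam=0.5` are all assumptions chosen for the example.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything in [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, sweeps=200, tol=1e-9):
    """Cyclic coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    r = list(y)  # residual y - Xb; equals y while b is all zeros
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(sweeps):
        biggest_change = 0.0
        for j in range(p):
            # Correlation of column j with the residual that ignores b[j].
            rho = sum(X[i][j] * r[i] for i in range(n)) / n + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    r[i] -= X[i][j] * delta
                b[j] = new_bj
            biggest_change = max(biggest_change, abs(delta))
        if biggest_change < tol:
            break
    return b

# Made-up data: 8 candidate predictors, but only two drive the response.
rng = random.Random(0)
n, p = 200, 8
true_b = [3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 0.0]
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(x * b for x, b in zip(row, true_b)) + rng.gauss(0, 0.1) for row in X]

fit = lasso_cd(X, y, lam=0.5)
support = [j for j, b in enumerate(fit) if b != 0.0]
print("predictors kept:", support)
print("coefficients:", [round(b, 2) for b in fit])
```

The soft-thresholding step is what produces exact zeros rather than merely small coefficients, and that is the smaller haystack. The price is that the surviving coefficients are shrunk toward zero, so a larger `lam` buys a simpler model at the cost of more bias.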

I first saw this in a tweet by Chris Diehl.

Sparse Machine Learning Methods for Understanding Large Text Corpora

Thursday, September 22nd, 2011

Sparse Machine Learning Methods for Understanding Large Text Corpora (pdf) by Laurent El Ghaoui, Guan-Cheng Li, Viet-An Duong, Vu Pham, Ashok Srivastava, and Kanishka Bhaduri. Status: Accepted for publication in Proc. Conference on Intelligent Data Understanding, 2011.


Sparse machine learning has recently emerged as powerful tool to obtain models of high-dimensional data with high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents.

I suppose it depends on your background (mine includes a law degree and a decade of practice) but when I read:

The ASRS data contains several of the crucial challenges involved under the general banner of “large-scale text data understanding”. First, its scale is huge, and growing rapidly, making the need for automated analyses of the processed reports more crucial than ever. Another issue is that the reports themselves are far from being syntactically correct, with lots of abbreviations, orthographic and grammatical errors, and other shortcuts. Thus we are not facing a corpora with well-structured language having clearly defined rules, as we would if we were to consider a corpus of laws or bills or any other well-redacted data set.

I thought I would fall out of my chair. I don’t think I have ever heard of a “corpus of laws or bills” being described as a “…well-redacted data set.”

There was a bill passed in the US Congress last year that, despite being acted on by both Houses and who knows how many production specialists, was passed without a name.

Apologies for the digression.

From the paper:

Our paper makes the claim that sparse learning methods can be very useful to the understanding large text databases. Of course, machine learning methods in general have already been successfully applied to text classification and clustering, as evidenced for example by [21]. We will show that sparsity is an important added property that is a crucial component in any tool aiming at providing interpretable statistical analysis, allowing in particular efficient multi-document summarization, comparison, and visualization of huge-scale text corpora.

You will need to read the paper for the details but I think it clearly demonstrates that sparse learning methods are useful for exploring large text databases. While it may be the case that your users have a view of their data, it is equally likely that you will be called upon to mine a text database and to originate a navigation overlay for it. That will require exploring the data and developing an understanding of it.
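As a rough illustration of the flavor of “comparative summarization of two corpora” using sparse classification, here is a toy sketch. It is not the authors' implementation: I am assuming plain L1-penalized logistic regression fitted by proximal gradient, with no intercept, and the four tiny “documents”, the function name `sparse_logistic`, and the penalty `lam=0.15` are all invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything in [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def sparse_logistic(docs, labels, lam, step=1.0, iters=500):
    """Proximal gradient for mean logistic loss + lam*||w||_1 (no intercept)."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    # Binary bag-of-words: 1.0 if the word occurs in the document.
    rows = [[1.0 if word in doc.split() else 0.0 for word in vocab] for doc in docs]
    n = len(rows)
    w = [0.0] * len(vocab)
    for _ in range(iters):
        grad = [0.0] * len(vocab)
        for x, label in zip(rows, labels):
            margin = label * sum(wj * xj for wj, xj in zip(w, x))
            pull = sigmoid(-margin)  # shrinks as the document gets classified well
            for j, xj in enumerate(x):
                grad[j] -= label * pull * xj / n
        # Gradient step on the smooth loss, then soft-threshold for the L1 penalty.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
    return dict(zip(vocab, w))

# Two invented corpora: label +1 for one set of reports, -1 for the other.
docs = ["runway pilot taxi", "runway tower", "engine pilot climb", "engine tower"]
labels = [1, 1, -1, -1]

weights = sparse_logistic(docs, labels, lam=0.15)
summary = sorted(word for word, wt in weights.items() if wt != 0.0)
print(summary)
```

On this toy data only “runway” and “engine” keep non-zero weights. The words shared by both corpora (“pilot”, “tower”) and the words that appear in a single document (“taxi”, “climb”) are driven to exactly zero. The short list of surviving terms is the summary, which is where the interpretability the paper talks about comes from.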

For all the projections of the need for data analysts and the technical skills they will require, analysts without insight and imagination will just be going through the motions.

(Applying sparse learning methods to new areas is an example of imagination.)