Sparse Machine Learning Methods for Understanding Large Text Corpora (pdf) by Laurent El Ghaoui, Guan-Cheng Li, Viet-An Duong, Vu Pham, Ashok Srivastava, and Kanishka Bhaduri. Status: Accepted for publication in Proc. Conference on Intelligent Data Understanding, 2011.
Sparse machine learning has recently emerged as powerful tool to obtain models of high-dimensional data with high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents.
I suppose it depends on your background (mine includes a law degree and a decade of practice) but when I read:
The ASRS data contains several of the crucial challenges involved under the general banner of “large-scale text data understanding”. First, its scale is huge, and growing rapidly, making the need for automated analyses of the processed reports more crucial than ever. Another issue is that the reports themselves are far from being syntactically correct, with lots of abbreviations, orthographic and grammatical errors, and other shortcuts. Thus we are not facing a corpora with well-structured language having clearly defined rules, as we would if we were to consider a corpus of laws or bills or any other well-redacted data set.
I thought I would fall out of my chair. I don’t think I have ever heard of a “corpus of laws or bills” being described as a “…well-redacted data set.”
Last year the US Congress passed a bill that, despite being acted on by both Houses and who knows how many production specialists, went through without a name.
Apologies for the digression.
From the paper:
Our paper makes the claim that sparse learning methods can be very useful to the understanding large text databases. Of course, machine learning methods in general have already been successfully applied to text classification and clustering, as evidenced for example by . We will show that sparsity is an important added property that is a crucial component in any tool aiming at providing interpretable statistical analysis, allowing in particular efficient multi-document summarization, comparison, and visualization of huge-scale text corpora.
You will need to read the paper for the details but I think it clearly demonstrates that sparse learning methods are useful for exploring large text databases. While it may be the case that your users have a view of their data, it is equally likely that you will be called upon to mine a text database and to originate a navigation overlay for it. That will require exploring the data and developing an understanding of it.
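To give a flavor of the "comparative summarization of two corpora" ingredient from the abstract, here is a minimal sketch of my own, not the authors' implementation. It assumes an L1-penalized logistic regression, fitted by plain proximal gradient descent, as a stand-in for the paper's "sparse classification". The function names, the parameter values, and the six toy "reports" are all invented for illustration and are not ASRS data.

```python
import math

def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def sparse_logistic_terms(docs_a, docs_b, lam=0.1, step=0.5, iters=500):
    """Fit an L1-penalized logistic regression that separates corpus A
    (label 1) from corpus B (label 0) on binary bag-of-words features.
    Returns (word, weight) pairs with nonzero weight, largest magnitude first."""
    docs = [set(d.lower().split()) for d in docs_a + docs_b]
    labels = [1.0] * len(docs_a) + [0.0] * len(docs_b)
    vocab = sorted(set().union(*docs))
    n = len(docs)
    w = {word: 0.0 for word in vocab}
    b = 0.0  # intercept, left unpenalized
    for _ in range(iters):
        grad_w = {word: 0.0 for word in vocab}
        grad_b = 0.0
        for words, y in zip(docs, labels):
            z = b + sum(w[t] for t in words)
            p = 1.0 / (1.0 + math.exp(-z))
            err = (p - y) / n  # gradient of the mean log-loss w.r.t. z
            grad_b += err
            for t in words:
                grad_w[t] += err
        b -= step * grad_b
        for t in vocab:
            # Gradient step on the smooth loss, then shrink toward zero (L1).
            w[t] = soft_threshold(w[t] - step * grad_w[t], step * lam)
    selected = [(t, w[t]) for t in vocab if w[t] != 0.0]
    return sorted(selected, key=lambda kv: -abs(kv[1]))

# Toy corpora, invented for this sketch (hypothetical, not real reports).
runway_reports = [
    "aircraft crossed the runway without clearance",
    "tower cleared the aircraft onto the runway",
    "the aircraft stopped short of the runway",
]
altitude_reports = [
    "the aircraft climbed above the assigned altitude",
    "controller gave the aircraft a new altitude",
    "the aircraft leveled at the wrong altitude",
]

terms = sparse_logistic_terms(runway_reports, altitude_reports)
for word, weight in terms:
    print(f"{word:10s} {weight:+.3f}")
```

On this toy input the penalty leaves only two terms with nonzero weight, "runway" (positive, pointing to the first corpus) and "altitude" (negative, pointing to the second). Words shared by both corpora, such as "the" and "aircraft", drop out. That short list of surviving terms is the "summary", and it is why sparsity buys interpretability. Raising `lam` shortens the list and lowering it lengthens the list.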
For all the projections of the need for data analysts and the technical skills they will require, analysts without insight and imagination will just be going through the motions.
(Applying sparse learning methods to new areas is an example of imagination.)