Corpus-Wide Association Studies by Mark Liberman.
From the post:
I’ve spent the past couple of days at GURT 2012, and one of the interesting talks that I’ve heard was Julian Brooke and Sali Tagliamonte, “Hunting the linguistic variable: using computational techniques for data exploration and analysis”. Their abstract (all that’s available of the work so far) explains that:
The selection of an appropriate linguistic variable is typically the first step of a variationist analysis whose ultimate goal is to identify and explain social patterns. In this work, we invert the usual approach, starting with the sociolinguistic metadata associated with a large scale socially stratified corpus, and then testing the utility of computational tools for finding good variables to study. In particular, we use the ‘information gain’ metric included in data mining software to automatically filter a huge set of potential variables, and then apply our own corpus reader software to facilitate further human inspection. Finally, we subject a small set of particularly interesting features to a more traditional variationist analysis.
This type of data-mining for interesting patterns is likely to become a trend in sociolinguistics, as it is in other areas of the social and behavioral sciences, and so it’s worth giving some thought to potential problems as well as opportunities.
If you think about it, the social and behavioral sciences are already being applied to the results of data mining user behavior. Perhaps you can “catch the wave” early in this cycle of research.
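To make the “information gain” step from the abstract concrete, here is a minimal sketch of how candidate variables might be ranked against one piece of speaker metadata. It is not Brooke and Tagliamonte's code, and the abstract does not name the data mining software they used. The function names, the feature names (`uses_like`, `uses_um`), and the four-speaker toy data are all invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels):
    """Score every feature in `rows` (a list of dicts) and sort best-first."""
    names = rows[0].keys()
    scores = {
        name: information_gain([row[name] for row in rows], labels)
        for name in names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: one row per speaker, label = age group.
speakers = [
    {"uses_like": True, "uses_um": True},
    {"uses_like": True, "uses_um": False},
    {"uses_like": False, "uses_um": True},
    {"uses_like": False, "uses_um": False},
]
age_groups = ["young", "young", "old", "old"]

ranking = rank_features(speakers, age_groups)
for name, score in ranking:
    print(f"{name}: {score:.3f} bits")
```

In this toy set, `uses_like` separates the two age groups perfectly and `uses_um` tells you nothing, so the first ranks above the second. On a real corpus the candidate list would run to thousands of features, which is why the abstract follows the automatic filter with human inspection. It is also where the “potential problems” Liberman mentions come in: with that many candidates, some will score well by chance.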