Archive for the ‘Feature Spaces’ Category

Classifying Shakespearean Drama with Sparse Feature Sets

Tuesday, October 14th, 2014

Classifying Shakespearean Drama with Sparse Feature Sets by Douglas Duhaime.

From the post:

In her fantastic series of lectures on early modern England, Emma Smith identifies an interesting feature that differentiates the tragedies and comedies of Elizabethan drama: “Tragedies tend to have more streamlined plots, or less plot—you know, fewer things happening. Comedies tend to enjoy a multiplication of characters, disguises, and trickeries. I mean, you could partly think about the way [tragedies tend to move] towards the isolation of a single figure on the stage, getting rid of other people, moving towards a kind of solitude, whereas comedies tend to end with a big scene at the end where everybody’s on stage” (6:02-6:37). 

The distinction Smith draws between tragedies and comedies is fairly intuitive: tragedies isolate the poor player that struts and frets his hour upon the stage and then is heard no more. Comedies, on the other hand, aggregate characters in order to facilitate comedic trickery and tidy marriage plots. While this discrepancy seemed promising, I couldn’t help but wonder whether computational analysis would bear out the hypothesis. Inspired by the recent proliferation of computer-assisted genre classifications of Shakespeare’s plays—many of which are founded upon high dimensional data sets like those generated by DocuScope—I was curious to know if paying attention to the number of characters on stage in Shakespearean drama could help provide additional feature sets with which to carry out this task.

A quick reminder that not all text analysis is concerned with 140 character strings. 😉

Do you prefer:

high dimensional

where every letter in “high dimensional” is a hyperlink with an unknown target or a fuller listing:

Allison, Sarah, and Ryan Heuser, Matthew Jockers, Franco Moretti, Michael Witmore. Quantitative Formalism: An Experiment

Jockers, Matthew. Machine-Classifying Novels and Plays by Genre

Hope, Jonathan and Michael Witmore. “The Hundredth Psalm to the Tune of ‘Green Sleeves’”: Digital Approaches Shakespeare’s Language of Genre

Hope, Jonathan. Shakespeare by the numbers: on the linguistic texture of the Late Plays

Hope, Jonathan and Michael Witmore. The Very Large Textual Object: A Prosthetic Reading of Shakespeare

Lenthe, Victor. Finding the Sherlock in Shakespeare: some ideas about prose genre and linguistic uniqueness

Stumpf, Mike. How Quickly Nature Falls Into Revolt: On Revisiting Shakespeare’s Genres

Stumpf, Mike. This Thing of Darkness (Part III)

Tootalian, Jacob A. Shakespeare, Without Measure: The Rhetorical Tendencies of Renaissance Dramatic Prose

Ullyot, Michael. Encoding Shakespeare

Witmore, Michael. A Genre Map of Shakespeare’s Plays from the First Folio (1623)

Witmore, Michael. Shakespeare Out of Place?

Witmore, Michael. Shakespeare Quarterly 61.3 Figures

Witmore, Michael. Visualizing English Print, 1530-1800, Genre Contents of the Corpus

Decompiling Shakespeare (Site is down. Was also down when the WayBack machine tried to archive the site in July of 2014)

I prefer the longer listing.

If you are interested in Shakespeare, Folger Digital Texts has free XML and PDF versions of his work.

I first saw this in a tweet by Gregory Piatetsky

Feature Selection with Scikit-Learn

Sunday, May 26th, 2013

Feature Selection with Scikit Learn by Sujit Pal.

From the post:

I am currently doing the Web Intelligence and Big Data course from Coursera, and one of the assignments was to predict a person’s ethnicity from a set of about 200,000 genetic markers (provided as boolean values). As you can see, a simple classification problem.

One of the optimization suggestions for the exercise was to prune the featureset. Prior to this, I had only a vague notion that one could do this by running correlations of each feature against the outcome, and choosing the most highly correlated ones. This seemed like a good opportunity to learn a bit about this, so I did some reading and digging within Scikit-Learn to find if they had something to do this (they did). I also decided to investigate how the accuracy of a classifier varies with the feature size. This post is a result of this effort.

The IR Book has a sub-chapter on Feature Selection. Three main approaches to Feature Selection are covered – Mutual Information based, Chi-square based and Frequency based. Scikit-Learn provides several methods to select features based on Chi-Squared and ANOVA F-values for classification. I learned about this from Matt Spitz’s passing reference to Chi-squared feature selection in Scikit-Learn in his Slugger ML talk at Pycon USA 2012.

In the code below, I compute the accuracies with various feature sizes for 9 different classifiers, using both the Chi-squared measure and the ANOVA F measures.

Sujit uses Scikit-Learn to investigate the accuracy of classifiers.

“Almost there….” (Computing Homology)

Friday, April 12th, 2013

We all remember the pilot in Star Wars that kept saying, “Almost there….” Jeremy Kun has us “almost there…” in his latest installment: Computing Homology.

To give you some encouragement, Jeremy concludes the post saying:

The reader may be curious as to why we didn’t come up with a more full-bodied representation of a simplicial complex and write an algorithm which accepts a simplicial complex and computes all of its homology groups. We’ll leave this direct approach as a (potentially long) exercise to the reader, because coming up in this series we are going to do one better. Instead of computing the homology groups of just one simplicial complex using by repeating one algorithm many times, we’re going to compute all the homology groups of a whole family of simplicial complexes in a single bound. This family of simplicial complexes will be constructed from a data set, and so, in grandiose words, we will compute the topological features of data.

If it sounds exciting, that’s because it is! We’ll be exploring a cutting-edge research field known as persistent homology, and we’ll see some of the applications of this theory to data analysis. (bold emphasis added)

Data analysts are needed at all levels.

Do you want to be a spreadsheet data analyst or something a bit harder to find?

Collaborative Filtering via Group-Structured Dictionary Learning

Wednesday, January 30th, 2013

Collaborative Filtering via Group-Structured Dictionary Learning by Zoltan Szabo, Barnabas Poczos , and Andras Lorincz.


Structured sparse coding and the related structured dictionary learning problems are novel research areas in machine learning. In this paper we present a new application of structured dictionary learning for collaborative filtering based recommender systems. Our extensive numerical experiments demonstrate that the presented method outperforms its state-of-the-art competitors and has several advantages over approaches that do not put structured constraints on the dictionary elements.

From the paper:

Novel advances on CF show that dictionary learning based approaches can be efficient for making predictions about users’ preferences [2]. The dictionary learning based approach assumes that (i) there is a latent, unstructured feature space (hidden representation/code) behind the users’ ratings, and (ii) a rating of an item is equal to the product of the item and the user’s feature.

Is a “preference” actually a form of subject identification?

I ask because the notion of a “real time” system is incompatible with users researching the proper canonical subject identifier and/or waiting for a response from an inter-departmental committee to agree on correct terminology.

Perhaps subject identification in some systems must be on the basis of “…latent, unstructured feature space[s]…” that are known (and disclosed) imperfectly at best.

Zoltán Szabó’s Home Page, numerous publications and the source code for this article.

The TV-tree — an index structure for high-dimensional data (1994)

Saturday, September 18th, 2010

The TV-tree — an index structure for high-dimensional data (1994) Authors: King-ip Lin , H. V. Jagadish , Christos Faloutsos Keywords:Spatial Index, Similarity Retrieval, Query by Context, R*-Tree, High-Dimensionality Feature Spaces.


We propose a file structure to index high-dimensionality data, typically, points in some feature space. The idea is to use only a few of the features, utilizing additional features whenever the additional discriminatory power is absolutely necessary. We present in detail the design of our tree structure and the associated algorithms that handle such `varying length’ feature vectors. Finally we report simulation results, comparing the proposed structure with the R -tree, which is one of the most successful methods for low-dimensionality spaces. The results illustrate the superiority of our method, with up to 80% savings in disk accesses.

The notion of “…utilizing additional features whenever the additional discriminatory power is absolutely necessary…” is an important one.

Compare to fixed simplistic discrimination and/or fixed complex, high-overhead, discrimination between subject representatives.

Either one represents a failure of imagination.