Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 26, 2013

Feature Selection with Scikit-Learn

Filed under: Classifier,Feature Spaces,Scikit-Learn — Patrick Durusau @ 3:25 pm

Feature Selection with Scikit Learn by Sujit Pal.

From the post:

I am currently doing the Web Intelligence and Big Data course from Coursera, and one of the assignments was to predict a person’s ethnicity from a set of about 200,000 genetic markers (provided as boolean values). As you can see, a simple classification problem.

One of the optimization suggestions for the exercise was to prune the featureset. Prior to this, I had only a vague notion that one could do this by running correlations of each feature against the outcome, and choosing the most highly correlated ones. This seemed like a good opportunity to learn a bit about this, so I did some reading and digging within Scikit-Learn to find if they had something to do this (they did). I also decided to investigate how the accuracy of a classifier varies with the feature size. This post is a result of this effort.

The IR Book has a sub-chapter on Feature Selection. Three main approaches to Feature Selection are covered – Mutual Information based, Chi-square based and Frequency based. Scikit-Learn provides several methods to select features based on Chi-Squared and ANOVA F-values for classification. I learned about this from Matt Spitz’s passing reference to Chi-squared feature selection in Scikit-Learn in his Slugger ML talk at Pycon USA 2012.

In the code below, I compute the accuracies with various feature sizes for 9 different classifiers, using both the Chi-squared measure and the ANOVA F measures.

Sujit uses Scikit-Learn to investigate the accuracy of classifiers.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress