Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 23, 2015

Using scikit-learn Pipelines and FeatureUnions

Filed under: Machine Learning,Scikit-Learn — Patrick Durusau @ 4:41 pm

Using scikit-learn Pipelines and FeatureUnions by Zac Stewart.

From the post:

Since I posted a postmortem of my entry to Kaggle's See Click Fix competition, I've meant to keep sharing things that I learn as I improve my machine learning skills. One that I've been meaning to share is scikit-learn's pipeline module. The following is a moderately detailed explanation and a few examples of how I use pipelining when I work on competitions.

The pipeline module of scikit-learn allows you to chain transformers and estimators together in such a way that you can use them as a single unit. This comes in very handy when you need to jump through a few hoops of data extraction, transformation, normalization, and finally train your model (or use it to generate predictions).

When I first started participating in Kaggle competitions, I would invariably get started with some code that looked similar to this:

train = read_file('data/train.tsv')
train_y = extract_targets(train)
train_essays = extract_essays(train)
train_tokens = get_tokens(train_essays)
train_features = extract_feactures(train)
classifier = MultinomialNB()

scores = []
train_idx, cv_idx in KFold():
  classifier.fit(train_features[train_idx], 
train_y
[train_idx]) scores.append(model.score(
train_features
[cv_idx], train_y[cv_idx])) print("Score: {}".format(np.mean(scores)))

Often, this would yield a pretty decent score for a first submission. To improve my ranking on the leaderboard, I would try extracting some more features from the data. Let's say in instead of text n-gram counts, I wanted tf–idf. In addition, I wanted to include overall essay length. I might as well throw in misspelling counts while I'm at it. Well, I can just tack those into the implementation of extract_features. I'd extract three matrices of features–one for each of those ideas and then concatenate them along axis 1. Easy.

Zac has quite a bit of practical advice for how to improve your use of scikit-learn. Just what you need to start a week in the Spring!

Enjoy!

I first saw this in a tweet by Vineet Vashishta.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress