Using scikit-learn Pipelines and FeatureUnions by Zac Stewart.
From the post:
Since I posted a postmortem of my entry to Kaggle's See Click Fix competition, I've meant to keep sharing things that I learn as I improve my machine learning skills. One that I've been meaning to share is scikit-learn's pipeline module. The following is a moderately detailed explanation and a few examples of how I use pipelining when I work on competitions.
The pipeline module of scikit-learn allows you to chain transformers and estimators together in such a way that you can use them as a single unit. This comes in very handy when you need to jump through a few hoops of data extraction, transformation, normalization, and finally train your model (or use it to generate predictions).
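A minimal sketch of what that chaining looks like, using made-up toy documents and labels (not from the post) — a vectorizer and a classifier fused into one object with a single `fit`/`predict` interface:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Chain a transformer (text -> token-count matrix) and an estimator
# (counts -> class predictions) into a single unit.
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])

# Toy data for illustration only.
docs = ["free money now", "meeting at noon", "win cash prizes", "lunch tomorrow"]
labels = [1, 0, 1, 0]  # 1 = spam-ish, 0 = not

pipeline.fit(docs, labels)           # fits vectorizer, then classifier
print(pipeline.predict(["cash money"]))
```

Because the whole chain behaves like one estimator, it can be passed as-is to cross-validation or grid search, which is what makes the pattern so useful in competitions.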
When I first started participating in Kaggle competitions, I would invariably get started with some code that looked similar to this:
```python
train = read_file('data/train.tsv')
train_y = extract_targets(train)
train_essays = extract_essays(train)
train_tokens = get_tokens(train_essays)
train_features = extract_features(train_tokens)

classifier = MultinomialNB()

scores = []
for train_idx, cv_idx in KFold():
    classifier.fit(train_features[train_idx], train_y[train_idx])
    scores.append(classifier.score(train_features[cv_idx], train_y[cv_idx]))

print("Score: {}".format(np.mean(scores)))
```

Often, this would yield a pretty decent score for a first submission. To improve my ranking on the leaderboard, I would try extracting some more features from the data. Let's say instead of text n-gram counts, I wanted tf–idf. In addition, I wanted to include overall essay length. I might as well throw in misspelling counts while I'm at it. Well, I can just tack those into the implementation of extract_features. I'd extract three matrices of features, one for each of those ideas, and then concatenate them along axis 1. Easy.…
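That "concatenate along axis 1" step is what scikit-learn's FeatureUnion automates. A sketch of the idea, assuming a hypothetical `EssayLength` transformer (the essay data and labels here are invented for illustration):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

class EssayLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: a single column with each essay's length."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array([[len(text)] for text in X], dtype=float)

# FeatureUnion runs each transformer on the same input and
# concatenates their outputs column-wise (axis 1).
features = FeatureUnion([
    ('tfidf', TfidfVectorizer()),   # tf-idf term features
    ('length', EssayLength()),      # overall essay length
])

pipeline = Pipeline([
    ('features', features),
    ('clf', MultinomialNB()),
])

# Toy data for illustration only.
essays = ["short one", "a somewhat longer essay about things",
          "tiny", "another fairly long essay here"]
labels = [0, 1, 0, 1]

pipeline.fit(essays, labels)
print(pipeline.predict(["a brand new longer essay text"]))
```

Adding a third feature block (say, misspelling counts) is then just one more entry in the FeatureUnion list, rather than another manual concatenation inside `extract_features`.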
Zac has quite a bit of practical advice for how to improve your use of scikit-learn. Just what you need to start a week in the Spring!
Enjoy!
I first saw this in a tweet by Vineet Vashishta.