Archive for the ‘Scikit-Learn’ Category


Monday, June 12th, 2017

FreeDiscovery: Open Source e-Discovery and Information Retrieval Engine

From the webpage:

FreeDiscovery is built on top of existing machine learning libraries (scikit-learn) and provides a REST API for information retrieval applications. It aims to benefit existing e-Discovery and information retrieval platforms with a focus on text categorization, semantic search, document clustering, duplicates detection and e-mail threading.

In addition, FreeDiscovery can be used as a Python package and exposes several estimators with a scikit-learn-compatible API.

Python 3.5+ required.

The homepage has command line examples, with a pointer to further examples.

The additional examples use a subset of the TREC 2009 legal collection. Cool!

I saw this in a tweet by Lynn Cherny today.


Hello World – Machine Learning Recipes #1

Saturday, April 16th, 2016

Hello World – Machine Learning Recipes #1 by Josh Gordon.

From the description:

Six lines of Python is all it takes to write your first machine learning program! In this episode, we’ll briefly introduce what machine learning is and why it’s important. Then, we’ll follow a recipe for supervised learning (a technique to create a classifier from examples) and code it up.
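The six-line pattern the episode builds looks roughly like this (a sketch with hypothetical fruit data, not necessarily Josh's exact example): train a decision tree on labeled examples, then predict a new one.

```python
from sklearn import tree

# Hypothetical training data: [weight in grams, texture: 0 = bumpy, 1 = smooth]
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1]  # 0 = apple, 1 = orange

# Supervised learning: create a classifier from examples
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
print(clf.predict([[160, 0]]))  # a heavy, bumpy fruit -> [1] (orange)
```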

The first in a promised series on machine learning using scikit-learn and TensorFlow.

The kind of quality video treatment you wish were available for intermediate and advanced topics.

Quite a treat! Pass onto anyone interested in machine learning.


scikit-learn 0.17b1 is out!

Friday, October 16th, 2015

scikit-learn 0.17b1 is out! by Olivier Grisel.

From the announcement:

The 0.17 beta release of scikit-learn has been uploaded to PyPI. As of now only the source tarball is available. I am waiting for the CI server to build the binary packages for the Windows and Mac OS X platforms. They should be online tonight or tomorrow morning.

Please test it as much as possible especially if you have a test suite for a project that has scikit-learn as a dependency.

If you find regressions from 0.16.1, please open issues on GitHub and put `[REGRESSION]` in the title of the issue.

Any bugfix will have to be merged to the master branch first, and then we will cherry-pick the fix into the 0.17.X branch that will be used to generate 0.17.0 final, probably in less than 2 weeks.

Just in time for the weekend! 😉

Comment early and often.


PyCon 2015 Scikit-learn Tutorial

Wednesday, April 8th, 2015

PyCon 2015 Scikit-learn Tutorial by Jake VanderPlas.


Machine learning is the branch of computer science concerned with the development of algorithms which can be trained by previously-seen data in order to make predictions about future data. It has become an important aspect of work in a variety of applications: from optimization of web searches, to financial forecasts, to studies of the nature of the Universe.

This tutorial will explore machine learning with a hands-on introduction to the scikit-learn package. Beginning from the broad categories of supervised and unsupervised learning problems, we will dive into the fundamental areas of classification, regression, clustering, and dimensionality reduction. In each section, we will introduce aspects of the Scikit-learn API and explore practical examples of some of the most popular and useful methods from the machine learning literature.

The strengths of scikit-learn lie in its uniform and well-documented interface, and its efficient implementations of a large number of the most important machine learning algorithms. Those present at this tutorial will gain a basic practical background in machine learning and the use of scikit-learn, and will be well poised to begin applying these tools in many areas, whether for work, for research, for Kaggle-style competitions, or for their own pet projects.

You can view the tutorial at: PyCon 2015 Scikit-Learn Tutorial Index.

Jake is presenting today (April 8, 2015), so this is very current news!


Scikit-Learn 0.16 release

Monday, April 6th, 2015

Scikit-Learn 0.16 is out!


BTW, improvements are already being listed for Scikit-Learn 0.17.

Using scikit-learn Pipelines and FeatureUnions

Monday, March 23rd, 2015

Using scikit-learn Pipelines and FeatureUnions by Zac Stewart.

From the post:

Since I posted a postmortem of my entry to Kaggle's See Click Fix competition, I've meant to keep sharing things that I learn as I improve my machine learning skills. One that I've been meaning to share is scikit-learn's pipeline module. The following is a moderately detailed explanation and a few examples of how I use pipelining when I work on competitions.

The pipeline module of scikit-learn allows you to chain transformers and estimators together in such a way that you can use them as a single unit. This comes in very handy when you need to jump through a few hoops of data extraction, transformation, normalization, and finally train your model (or use it to generate predictions).
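That chaining can be sketched minimally like this, with a hypothetical toy corpus standing in for real competition data (any transformer/estimator pair works the same way):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical miniature corpus standing in for real competition data
docs = ["a great essay", "terrible writing", "great structure", "terrible essay"]
labels = [1, 0, 1, 0]

# Transformers and a final estimator chained into a single unit:
# fit() runs the data through each step in order, and predict()
# reuses the fitted transformers automatically.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])
pipeline.fit(docs, labels)
print(pipeline.predict(["a great structure"]))
```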

When I first started participating in Kaggle competitions, I would invariably get started with some code that looked similar to this:

train = read_file('data/train.tsv')
train_y = extract_targets(train)
train_essays = extract_essays(train)
train_tokens = get_tokens(train_essays)
train_features = extract_features(train)
classifier = MultinomialNB()

scores = []
for train_idx, cv_idx in KFold(len(train), n_folds=5):
    model = classifier.fit(train_features[train_idx], train_y[train_idx])
    scores.append(model.score(train_features[cv_idx], train_y[cv_idx]))
print("Score: {}".format(np.mean(scores)))

Often, this would yield a pretty decent score for a first submission. To improve my ranking on the leaderboard, I would try extracting some more features from the data. Let's say instead of text n-gram counts, I wanted tf–idf. In addition, I wanted to include overall essay length. I might as well throw in misspelling counts while I'm at it. Well, I can just tack those into the implementation of extract_features. I'd extract three matrices of features, one for each of those ideas, and then concatenate them along axis 1. Easy.
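That manual concatenation along axis 1 is exactly what FeatureUnion automates. A sketch under the same assumptions (the length extractor here is a hypothetical stand-in for the extra features):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

class LengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: overall essay length as a single feature."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array([[len(doc)] for doc in X])

docs = ["short one", "a considerably longer essay with many words in it"]
labels = [0, 1]

# FeatureUnion fits each transformer on the same input and concatenates
# the resulting feature matrices along axis 1.
features = FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('length', LengthExtractor()),
])
model = Pipeline([('features', features), ('clf', MultinomialNB())])
model.fit(docs, labels)
print(features.transform(docs).shape)  # tf-idf columns plus one length column
```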

Zac has quite a bit of practical advice for how to improve your use of scikit-learn. Just what you need to start a week in the Spring!


I first saw this in a tweet by Vineet Vashishta.

Hands-on with machine learning

Saturday, March 7th, 2015

Hands-on with machine learning by Chase Davis.

From the webpage:

First of all, let me be clear about one thing: You’re not going to “learn” machine learning in 60 minutes.

Instead, the goal of this session is to give you some sense of how to approach one type of machine learning in practice, specifically supervised learning.

For this exercise, we’ll be training a simple classifier that learns how to categorize bills from the California Legislature based only on their titles. Along the way, we’ll focus on three steps critical to any supervised learning application: feature engineering, model building and evaluation.

To help us out, we’ll be using a Python library called scikit-learn, which is the easiest-to-understand machine learning library I’ve seen in any language.

That’s a lot to pack in, so this session is going to move fast, and I’m going to assume you have a strong working knowledge of Python. Don’t get caught up in the syntax. It’s more important to understand the process.

Since we only have time to hit the very basics, I’ve also included some additional points you might find useful under the “What we’re not covering” heading of each section below. There are also some resources at the bottom of this document that I hope will be helpful if you decide to learn more about this on your own.
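The session's three steps can be compressed into a sketch like this, with hypothetical bill titles standing in for the California Legislature data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical bill titles standing in for the California Legislature data
titles = [
    "An act relating to education funding",
    "An act relating to school curriculum",
    "An act relating to water rights",
    "An act relating to water conservation",
]
categories = ["education", "education", "water", "water"]

# Feature engineering (bag of words), model building, then prediction
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(titles, categories)
print(model.predict(["An act relating to school funding"]))
```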

A great starting place for journalists or anyone else who wants to understand basic machine learning.

I first saw this in a tweet by Hanna Wallach.

Scikit-learn 0.15 release

Thursday, July 17th, 2014

Scikit-learn 0.15 release by Gaël Varoquaux.

From the post:


Quality— Looking at the commit log, there has been a huge amount of work to fix minor annoying issues.

Speed— There has been a huge effort put in making many parts of scikit-learn faster. Little details all over the codebase. We do hope that you’ll find that your applications run faster. For instance, we find that the worst case speed of Ward clustering is 1.5 times faster in 0.15 than 0.14. K-means clustering is often 1.1 times faster. KNN, when used in brute-force mode, got faster by a factor of 2 or 3.

Random Forest and various tree methods— The random forest and various tree methods are much much faster, use parallel computing much better, and use less memory. For instance, the picture on the right shows the scikit-learn random forest running in parallel on a fat Amazon node, and nicely using all the CPUs with little RAM usage.

Hierarchical agglomerative clustering— Complete linkage and average linkage clustering have been added. The benefit of these approaches compared to the existing Ward clustering is that they can take an arbitrary distance matrix.

Robust linear models— Scikit-learn now includes RANSAC for robust linear regression.

HMMs are deprecated— We have been discussing removing HMMs for a long time, as they do not fit scikit-learn’s focus on predictive modeling. We have created a separate hmmlearn repository for the HMM code. It is looking for maintainers.

And much more— plenty of “minor things”, such as better support for sparse data, better support for multi-label data…

Get thee to Scikit-learn!
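The RANSAC addition mentioned in the release notes can be exercised like this (a minimal sketch with synthetic data and hypothetical outliers):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

X = np.arange(100, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0   # true line: y = 3x + 1
y[:10] = 500.0              # hypothetical gross outliers

# RANSAC repeatedly fits on random subsets and keeps the model
# with the largest consensus set of inliers.
ransac = RANSACRegressor(LinearRegression(), random_state=0)
ransac.fit(X, y)
print(ransac.estimator_.coef_)  # close to the true slope of 3.0
```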

Visualizing Philosophers And Scientists

Tuesday, July 1st, 2014

Visualizing Philosophers And Scientists By The Words They Used With Python and d3.js by Sahand Saba.

From the post:

This is a rather short post on a little fun project I did a couple of weekends ago. The purpose was mostly to demonstrate how easy it is to process and visualize large amounts of data using Python and d3.js.

With the goal of visualizing the words that were most associated with a given scientist or philosopher, I downloaded a variety of science and philosophy books that are in the public domain (Project Gutenberg, more specifically), and processed them using Python (scikit-learn and nltk), then used d3.js and d3.js cloud by Jason Davies to visualize the words most frequently used by the authors. To make it more interesting, only words that are somewhat unique to the author are displayed (i.e. if a word is used frequently by all authors then it is likely not that interesting and is dropped from the results). This can be easily achieved using the max_df parameter of the CountVectorizer class.
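The max_df trick can be sketched like this (hypothetical corpus; terms appearing in more than the given fraction of documents are dropped from the vocabulary):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: "the" and "of" appear in every document
docs = [
    "the apology of socrates",
    "the critique of pure reason",
    "the ethics of spinoza",
]

# max_df=0.9 drops terms that appear in more than 90% of documents,
# discarding words common to all authors, as described in the post.
vectorizer = CountVectorizer(max_df=0.9)
vectorizer.fit(docs)
print(sorted(vectorizer.vocabulary_))
```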

I pass by Copleston’s A History of Philosophy several times a day. It is a paperback edition from many years ago that I keep meaning to re-read.

At least for philosophers with enough surviving texts in machine readable format, perhaps Sahand’s post will provide the incentive to return to reading Copleston. A word cloud is one way to explore a text. Commentary, such as Copleston’s, is another.

What other tools would you use with philosophers and a commentary like Copleston?

I first saw this in a tweet by Christophe Viau.

Enough Machine Learning to…

Monday, May 12th, 2014

Enough Machine Learning to Make Hacker News Readable Again by Ned Jackson Lovely.

From the description:

It’s inevitable that online communities will change, and that we’ll remember the community with a fondness that likely doesn’t accurately reflect the former reality. We’ll explore how we can take a set of articles from an online community and winnow out the stuff we feel is unworthy. We’ll explore some of the machine learning tools that are just a “pip install” away, such as scikit-learn and nltk.

Ned recommends you start with the map I cover at: Machine Learning Cheat Sheet (for scikit-learn).

Great practice with scikit-learn. Following this as a general outline will develop your machine learning skills!

A Gentle Introduction to Scikit-Learn…

Thursday, April 17th, 2014

A Gentle Introduction to Scikit-Learn: A Python Machine Learning Library by Jason Brownlee.

From the post:

If you are a Python programmer or you are looking for a robust library you can use to bring machine learning into a production system then a library that you will want to seriously consider is scikit-learn.

In this post you will get an overview of the scikit-learn library and useful references of where you can learn more.

Nothing new if you are already using Scikit-Learn but a very nice introduction with additional resources to pass onto others.

Save yourself some time in gathering materials to spread the use of Scikit-Learn. Bookmark and forward today!

SciPy2013 Videos

Sunday, June 30th, 2013

SciPy2013 Videos

A really nice set of videos, including tutorials, from SciPy2013.

Due to the limitations of YouTube, the listing is a mess.

If I have time later this week I will try to produce a cleaned up listing.

In the meantime, enjoy!

Feature Selection with Scikit-Learn

Sunday, May 26th, 2013

Feature Selection with Scikit Learn by Sujit Pal.

From the post:

I am currently doing the Web Intelligence and Big Data course from Coursera, and one of the assignments was to predict a person’s ethnicity from a set of about 200,000 genetic markers (provided as boolean values). As you can see, a simple classification problem.

One of the optimization suggestions for the exercise was to prune the featureset. Prior to this, I had only a vague notion that one could do this by running correlations of each feature against the outcome, and choosing the most highly correlated ones. This seemed like a good opportunity to learn a bit about this, so I did some reading and digging within Scikit-Learn to find if they had something to do this (they did). I also decided to investigate how the accuracy of a classifier varies with the feature size. This post is a result of this effort.

The IR Book has a sub-chapter on Feature Selection. Three main approaches to Feature Selection are covered – Mutual Information based, Chi-square based and Frequency based. Scikit-Learn provides several methods to select features based on Chi-Squared and ANOVA F-values for classification. I learned about this from Matt Spitz’s passing reference to Chi-squared feature selection in Scikit-Learn in his Slugger ML talk at Pycon USA 2012.

In the code below, I compute the accuracies with various feature sizes for 9 different classifiers, using both the Chi-squared measure and the ANOVA F measures.

Sujit uses Scikit-Learn to investigate the accuracy of classifiers.
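The chi-squared selection Scikit-Learn provides can be sketched minimally like this, using synthetic boolean features as a stand-in for the genetic markers:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical stand-in for the genetic-marker data: boolean features
X, y = make_classification(n_samples=100, n_features=50, random_state=0)
X = (X > 0).astype(int)  # chi2 requires non-negative feature values

# Keep the 10 features with the highest chi-squared scores
selector = SelectKBest(chi2, k=10)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # 50 features pruned down to 10
```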

Machine Learning Cheat Sheet (for scikit-learn)

Saturday, January 26th, 2013

Machine Learning Cheat Sheet (for scikit-learn) by Andreas Mueller.

From the post:


BTW, scikit-learn is doing a user survey.

Take a few minutes to contribute your feedback.

Scikit-Learn 0.13 released!

Thursday, January 24th, 2013

Scikit-Learn 0.13 released! We want your feedback. by Andreas Mueller

From the post:

After a little delay, the team finished work on the 0.13 release of scikit-learn.

There is also a user survey that we launched in parallel with the release, to get some feedback from our users.

There is a list of changes and new features on the website.

Feedback (useful feedback) is a small price to pay for such a large amount of effort!

Download the new release and submit feedback.

On the next release you will be glad you did!

Scikit-learn 0.12 released

Monday, October 1st, 2012

Scikit-learn 0.12 released by Andreas Mueller.

From the post:

Last night I uploaded the new version 0.12 of scikit-learn to pypi. Also the updated website is up and running and development now starts towards 0.13.

The new release has some nifty new features (see whatsnew):

  • Multidimensional scaling
  • Multi-Output random forests (like these)
  • Multi-task Lasso
  • More loss functions for ensemble methods and SGD
  • Better text feature extraction

Even so, the majority of changes in this release are somewhat “under the hood”.

Vlad developed and set up a continuous performance benchmark for the main algorithms during his Google Summer of Code. I am sure this will help improve performance.

There already has been a lot of work in improving performance, by Vlad, Immanuel, Gilles and others for this release.

Just in case you haven’t been keeping up with Scikit-learn.
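The multi-output random forests in the list above take a two-dimensional target directly; a minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
# Two output columns predicted jointly by one forest
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 3]])

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, Y)
print(forest.predict(X[:2]).shape)  # one row per sample, one column per output
```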

Troll Detection with Scikit-Learn

Monday, October 1st, 2012

Troll Detection with Scikit-Learn by Andreas Mueller.

I had thought that troll detection was one of those “field guide” sort of things:

troll dolls

After reading Andreas’ post, apparently not. 😉

From the post:

Cross-post from Peekaboo, Andreas Mueller‘s computer vision and machine learning blog. This post documents his experience in the Impermium Detecting Insults in Social Commentary competition, but the rest of the blog is well worth a read, especially for those interested in computer vision, Python, scikit-learn, and scikit-image.

Recently I entered my first Kaggle competition – for those who don’t know it, Kaggle is a site running machine learning competitions. A data set and time frame are provided and the best submission gets a money prize, often something between $5,000 and $50,000.

I found the approach quite interesting and could definitely use a new laptop, so I entered Detecting Insults in Social Commentary.

My weapon of choice was Python with scikit-learn – for those who haven’t read my blog before: I am one of the core devs of the project and never shut up about it.

During the competition I was visiting Microsoft Research, so that is where most of my time and energy went, particularly toward the end of the competition, as it was also the end of my internship. And there was also the scikit-learn release in between. Maybe I can spend a bit more time on the next competition.