Practical machine learning tricks from the KDD 2011 best industry paper by David Andrzejewski.
From the post:
A machine learning research paper tends to present a newly proposed method or algorithm in relative isolation. Problem context, data preparation, and feature engineering are hopefully discussed to the extent required for reader understanding and scientific reproducibility, but are usually not the primary focus. Given the goals and constraints of the format, this can be seen as a reasonable trade-off: the authors opt to spend scarce "ink" on only the most essential (often abstract) ideas.
As a consequence, implementation details relevant to the use of the proposed technique in an actual production system are often not mentioned whatsoever. This aspect of machine learning is often left as "folk wisdom" to be picked up from colleagues, blog posts, discussion boards, snarky tweets, open-source libraries, or more often than not, first-hand experience.
Papers from conference "industry tracks" often deviate from this template, yielding valuable insights about what it takes to make machine learning effective in practice. This paper from Google on detecting "malicious" (ie, scam/spam) advertisements won best industry paper at KDD 2011 and is a particularly interesting example.
Detecting Adversarial Advertisements in the Wild
D. Sculley, Matthew Otey, Michael Pohl, Bridget Spitznagel,
John Hainsworth, Yunkai Zhou
At first glance, this might appear to be a "Hello-World" machine learning problem straight out of a textbook or tutorial: we simply train a Naive Bayes on a set of bad ads versus a set of good ones. However, this is apparently far from being the case – while Google is understandably shy about hard numbers, the paper mentions several issues which make this especially challenging and notes that this is a business-critical problem for Google.
The paper describes an impressive and pragmatic blend of different techniques and tricks. I've briefly described some of the highlights, but I would certainly encourage the interested reader to check out the original paper and presentation slides.
In addition to the original paper and slides, I would suggest having David’s comments at hand while you read the paper, along with access to a machine and an online library.
There is much here to repurpose to assist you and your users.