Python Multi-armed Bandits (and Beer!) by Eric Chiang.
From the post:
There are many ways to evaluate different strategies for solving different prediction tasks. In our last post, for example, we discussed calibration and discrimination, two measurements which assess the strength of a probabilistic prediction. Measurements like accuracy, error, and recall, among others, are useful when considering whether random forest “works better” than support vector machines on a problem set. Common sense tells us that knowing which analytical strategy “does the best” is important, as it will impact the quality of our decisions downstream. The trick, therefore, is in selecting the right measurement, a task which isn’t always obvious.
There are many prediction problems where choosing the right accuracy measurement is particularly difficult. For example, what’s the best way to know whether this version of your recommendation system is better than the prior version? You could – as was the case with the Netflix Prize – try to predict the number of stars a user gave to a movie and measure your accuracy. Another (simpler) way to vet your recommender strategy would be to roll it out to users and watch before-and-after behaviors.
So by the end of this blog post, you (the reader) will hopefully be helping me improve our beer recommender through your clicks and interactions.
The final application, which this blog post will explain, can be found at bandit.yhathq.com. The original post explaining beer recommenders can be found here.
…
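As the title says, the post uses multi-armed bandits, and in that setup each competing recommender variant is an “arm” and a user click is a reward. As a rough sketch of one common bandit strategy, epsilon-greedy (the function names, the 0.1 exploration rate, and the three-arm setup below are my own illustration, not code from the post):

    import random

    def choose_arm(counts, rewards, epsilon=0.1):
        """Explore a random arm with probability epsilon; otherwise exploit the best."""
        if random.random() < epsilon:
            return random.randrange(len(counts))  # explore: pick any arm at random
        means = [r / c if c else 0.0 for r, c in zip(rewards, counts)]
        return max(range(len(means)), key=means.__getitem__)  # exploit: best average so far

    def record(arm, reward, counts, rewards):
        """Record one observed reward (click = 1, no click = 0) for an arm."""
        counts[arm] += 1
        rewards[arm] += reward

    # three hypothetical recommender variants as arms
    counts, rewards = [0, 0, 0], [0, 0, 0]
    arm = choose_arm(counts, rewards)
    record(arm, reward=1, counts=counts, rewards=rewards)  # the user clicked

Over many interactions the click averages concentrate traffic on the best-performing variant while still occasionally sampling the others.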
I have a friend who programs in Python (as well as other languages) and who is, or is on the way to becoming, a professional beer taster.
Given a choice, I think I would prefer to become a professional beer drinker but each to their own. 😉
The discussion of distance measures between beers in this post is quite good. When reading it, think about beers (or other beverages) you have had and try to pick between Euclidean distance, distance correlation, and cosine similarity when describing how you would compare those beverages to one another.
What? That isn’t how you evaluate your choices between beverages?
Yet, those “measures” have proven to be effective (effective != 100%) at providing distances between individual choices. A short sketch of all three follows below.
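For concreteness, here is a minimal sketch of those three measures applied to two hypothetical beers described as feature vectors. The beers, features, and values are invented, and note that SciPy’s distance.correlation computes the Pearson-based correlation distance, which I am using as a stand-in for whichever correlation measure the post applies:

    import numpy as np
    from scipy.spatial.distance import correlation, cosine, euclidean

    # hypothetical feature vectors: ABV, IBU, average rating
    beer_a = np.array([5.2, 35.0, 4.1])
    beer_b = np.array([7.8, 60.0, 4.4])

    print(euclidean(beer_a, beer_b))    # straight-line distance in feature space
    print(cosine(beer_a, beer_b))       # 1 - cosine similarity: compares angle, not magnitude
    print(correlation(beer_a, beer_b))  # 1 - Pearson correlation of the two vectors

Each measure answers a slightly different question: Euclidean distance weighs every unit of every feature equally, while cosine similarity ignores overall magnitude, which can matter when users rate on different internal scales.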
The “mapping” between users’ unknown internal scales and the metric measures used in recommendation systems is derived from a population of users. The resulting scale may or may not be an exact fit for any individual user in the tested group.
The usefulness of any such scale depends on how closely the population it was derived from resembles the population where you want to use it. Not to mention how the answers were validated. (In some scenarios, users are reported to give the “expected” response rather than their actual choices.)