Using Lucene Similarity in Item-Item Recommenders by Sujit Pal.
From the post:
Last week, I implemented 4 (of 5) recommenders from the Programming Assignments of the Introduction to Recommender Systems course on Coursera, but using Apache Mahout and Scala instead of Lenskit and Java. This week, I implement an Item Item Collaborative Filtering Recommender that uses Lucene (more specifically, Lucene’s More Like This query) as the item similarity provider.
By default, Lucene stores document vectors keyed by terms, but can be configured to store term vectors by setting the field attribute TermVector.YES. In case of text documents, words (or terms) are the features which are used to compute similarity between documents. I am using the same dataset as last week, where movies (items) correspond to documents and movie tags correspond to the words. So we build a movie “document” by preprocessing the tags to form individual tokens and concatenating them into a tags field in the index.
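As a rough illustration of the idea in the paragraph above (this is not code from the quoted post, and it does not use Lucene), the sketch below treats each movie's tags as a small "document" and compares movies by cosine similarity over TF-IDF weights. The toy data, the function names, and the particular TF-IDF weighting are all assumptions made for the example; Lucene's More Like This query selects and scores terms in its own way.

```python
import math
from collections import Counter

# Hypothetical toy data: movie id -> raw tags (not the course dataset).
movie_tags = {
    "m1": ["Sci-Fi", "space", "Aliens"],
    "m2": ["sci-fi", "space", "robots"],
    "m3": ["romance", "Paris"],
}

def tokenize(tags):
    """Normalize raw tags into lowercase tokens, dropping empty ones."""
    return [t.strip().lower() for t in tags if t.strip()]

def build_vectors(tags_by_movie):
    """Turn each movie's tags into a sparse TF-IDF vector (token -> weight)."""
    docs = {m: Counter(tokenize(tags)) for m, tags in tags_by_movie.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for tf in docs.values():
        doc_freq.update(tf.keys())
    return {
        m: {t: count * (1.0 + math.log(n_docs / doc_freq[t]))
            for t, count in tf.items()}
        for m, tf in docs.items()
    }

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(movie, vectors, k=2):
    """Return up to k (movie, score) pairs, most similar first."""
    scores = [(other, cosine(vectors[movie], vectors[other]))
              for other in vectors if other != movie]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

vectors = build_vectors(movie_tags)
```

Calling most_similar("m1", vectors) ranks "m2" ahead of "m3" here, because the first two movies share tags and the third shares none.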
Three scenarios are covered. The first two are similar to the scenarios covered with the item-item collaborative filtering recommender from last week, where the user is on a movie page, and we need to (a) predict the rating a user would give a specified movie and (b) find movies similar to a given movie. The third scenario is recommending movies to a given user. We describe each algorithm briefly, and how Lucene fits in.
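To make scenarios (a) and the third one concrete, here is a minimal sketch of one common item-item formulation: predict a rating as the similarity-weighted average of the user's ratings on neighboring items, then recommend the unrated items with the highest predictions. Whether the quoted post uses exactly this weighting is an assumption, and the similarity function, the function names, and the toy numbers are hypothetical; in the post, the similarity scores would come from Lucene.

```python
def predict_rating(user_ratings, target, similarity, k=20):
    """Predict a rating for `target` as a similarity-weighted average of
    the user's ratings on the k most similar items they have rated."""
    neighbors = [(similarity(target, item), rating)
                 for item, rating in user_ratings.items() if item != target]
    # Keep only positively similar items, most similar first.
    neighbors = sorted((p for p in neighbors if p[0] > 0),
                       key=lambda p: p[0], reverse=True)[:k]
    total = sum(s for s, _ in neighbors)
    if total == 0:
        return None  # no usable neighbors, so no prediction
    return sum(s * r for s, r in neighbors) / total

def recommend(user_ratings, all_items, similarity, n=10):
    """Rank items the user has not rated by predicted rating."""
    scored = []
    for item in all_items:
        if item in user_ratings:
            continue
        predicted = predict_rating(user_ratings, item, similarity)
        if predicted is not None:
            scored.append((item, predicted))
    return sorted(scored, key=lambda p: p[1], reverse=True)[:n]

# Hypothetical similarity scores standing in for an item similarity provider.
toy_sims = {
    frozenset(["m1", "m2"]): 0.8,
    frozenset(["m1", "m3"]): 0.2,
    frozenset(["m4", "m3"]): 0.5,
}

def toy_similarity(a, b):
    """Symmetric lookup; unknown pairs are treated as dissimilar."""
    return toy_sims.get(frozenset([a, b]), 0.0)

ratings = {"m2": 5.0, "m3": 1.0}
prediction = predict_rating(ratings, "m1", toy_similarity)
top = recommend(ratings, ["m1", "m2", "m3", "m4"], toy_similarity)
```

With these toy numbers the prediction for "m1" is pulled toward the 5.0 rating on "m2", since "m2" is the closer neighbor.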
I’m curious how easy or difficult it would be to repurpose similarity algorithms to detect common choices in avatar characteristics, acquisitions, interactions with others, goals, etc.
My thinking is that while obvious repetitions (gender, age, names, etc.) are easy enough to avoid, there are other, more subtle characteristics of interaction with others that would be far harder to be aware of, much less to mask effectively.
It would require a lot of data on interaction but I assume that isn’t all that difficult to whistle up on any of the major systems.
If you have any pointers to that sort of research, forward them along. I will be posting a collection of pointers and will credit anyone who wants to be credited.