Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 11, 2012

Item based similarity with GraphChi

Filed under: GraphChi,GraphLab,Similarity — Patrick Durusau @ 10:15 am

Item based similarity with GraphChi by Danny Bickson.

From the post:

Item based collaborative filtering is one of the most popular collaborative filtering methods used in more than 70% of the companies I am talking to. Following my mega collaborator Justin Yan‘s advice, I have started to implement some item based similarity methods in GraphChi.

Item based methods compare all pairs of items together, for computing similarity metric between each pair of items. This task becomes very quickly computation intensive. For example, Netflix data has around 17K movies. If we want to compute all pairs of movies to find the most similar ones, we need to compare around 290M pairs!

If we use a symmetric similarity measure, the distance between movie A and B, is similar to the distance between movie B and A. Thus for the Netflix example we have around 145M pairs to compute. To reduce the work furthermore, we only compare movies which where watched together by at least X users, for example X=5. Otherwise, those movies are not considered similar.

When the dataset is big, it is not possible to load it fully into memory at a single machine. That is where GraphChi comes in. My preliminary implementation of the item similarity computes similarity between all pairs without fully reading the dataset into memory. The idea is to load a chunk of the items (called pivots) into memory, and then stream over the rest of the items by comparing the pivots to the rest of the items.

I need to check on Danny’s blog and the GraphChi/GraphLab webpages more often!

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress