Archive for the ‘Recommendation’ Category

How are recommendation engines built?

Sunday, June 14th, 2015

How are recommendation engines built?

From the post:

The success of Amazon and Netflix has made recommendation systems not only common but also extremely popular. For many people, the recommendation system seems to be one of the easiest applications to understand; and a majority of us use them daily.

Haven’t you ever marveled at the ingenuity of a website offering the HDMI cable that goes with a television? Never been tempted by the latest trendy book about vampires? Been irritated by suggestions for diapers or baby powder though your child has been potty-trained for 3 months? Been annoyed to see flat screen TVs pop up on your browser every year with the approach of summer? The answer is, at least to me: “Yes, I have.”

But before cursing, every user should be aware of the difficulty of building an effective recommendation system! Below are some elements on how these systems are built (and ideas for how you can build your own).

A high-level view of some of the strategies that underlie recommendation engines. This won’t help you with the nuts-n-bolts of building a recommendation engine but can serve as a brief introduction.

Recommendation engines could be used with topic maps either to annoy users with guesses as to what they would like to see next or, perhaps more usefully, in a topic map authoring context: to alert an author to closely similar material already in the topic map.

I first saw this in a tweet by Christophe Lalanne.

Recommending music on Spotify with deep learning

Friday, April 17th, 2015

Recommending music on Spotify with deep learning by Sander Dieleman.

From the post:

This summer, I’m interning at Spotify in New York City, where I’m working on content-based music recommendation using convolutional neural networks. In this post, I’ll explain my approach and show some preliminary results.


This is going to be a long post, so here’s an overview of the different sections. If you want to skip ahead, just click the section title to go there.

If you are interested in the details of deep learning and recommendation for music, you have arrived at the right place!

Walking through Sander’s post will take some time but it will repay your efforts handsomely.

Not to mention Spotify having the potential to broaden your musical horizons!

I first saw this in a tweet by Mica McPeeters.

An Inside Look at the Components of a Recommendation Engine

Thursday, April 16th, 2015

An Inside Look at the Components of a Recommendation Engine by Carol McDonald.

From the post:

Recommendation engines help narrow your choices to those that best meet your particular needs. In this post, we’re going to take a closer look at how all the different components of a recommendation engine work together. We’re going to use collaborative filtering on movie ratings data to recommend movies. The key components are a collaborative filtering algorithm in Apache Mahout to build and train a machine learning model, and search technology from Elasticsearch to simplify deployment of the recommender.

There are two reasons to read this post:

First, you really don’t know how recommendation engines work. Well, better late than never.

Second, you want an example of how to write an excellent explanation of recommendation engines, hopefully to replicate it for other software.

This is an example of an excellent explanation of recommendation engines but whether you can replicate it for other software remains to be seen. 😉

Still, reading excellent explanations is a first step towards authoring excellent explanations.

Good luck!

Most HR Data Is Bad Data

Thursday, February 19th, 2015

Most HR Data Is Bad Data by Marcus Buckingham.

“Bad data” can come in any number of forms and Marcus Buckingham focuses on one of the most pernicious: Data that is flawed at its inception. Data that doesn’t measure what it purports to measure. Performance evaluation data.

From the post:

Over the last fifteen years a significant body of research has demonstrated that each of us is a disturbingly unreliable rater of other people’s performance. The effect that ruins our ability to rate others has a name: the Idiosyncratic Rater Effect, which tells us that my rating of you on a quality such as “potential” is driven not by who you are, but instead by my own idiosyncrasies—how I define “potential,” how much of it I think I have, how tough a rater I usually am. This effect is resilient — no amount of training seems able to lessen it. And it is large — on average, 61% of my rating of you is a reflection of me.

In other words, when I rate you, on anything, my rating reveals to the world far more about me than it does about you. In the world of psychometrics this effect has been well documented. The first large study was published in 1998 in Personnel Psychology; there was a second study published in the Journal of Applied Psychology in 2000; and a third confirmatory analysis appeared in 2010, again in Personnel Psychology. In each of the separate studies, the approach was the same: first ask peers, direct reports, and bosses to rate managers on a number of different performance competencies; and then examine the ratings (more than half a million of them across the three studies) to see what explained why the managers received the ratings they did. They found that more than half of the variation in a manager’s ratings could be explained by the unique rating patterns of the individual doing the rating— in the first study it was 71%, the second 58%, the third 55%.

You have to follow the Idiosyncratic Rater Effect link to find the references Buckingham cites so I have repeated them (with links and abstracts) below:

Trait, Rater and Level Effects in 360-Degree Performance Ratings by Michael K. Mount, et al., Personnel Psychology, 1998, 51, 557-576.


Method and trait effects in multitrait-multirater (MTMR) data were examined in a sample of 2,350 managers who participated in a developmental feedback program. Managers rated their own performance and were also rated by two subordinates, two peers, and two bosses. The primary purpose of the study was to determine whether method effects are associated with the level of the rater (boss, peer, subordinate, self) or with each individual rater, or both. Previous research which has tacitly assumed that method effects are associated with the level of the rater has included only one rater from each level; consequently, method effects due to the rater’s level may have been confounded with those due to the individual rater. Based on confirmatory factor analysis, the present results revealed that of the five models tested, the best fit was the 10-factor model which hypothesized 7 method factors (one for each individual rater) and 3 trait factors. These results suggest that method variance in MTMR data is more strongly associated with individual raters than with the rater’s level. Implications for research and practice pertaining to multirater feedback programs are discussed.

Understanding the Latent Structure of Job Performance Ratings, by Michael K. Mount, Steven E. Scullen, Maynard Goff, Journal of Applied Psychology, 2000, Vol. 85, No. 6, 956-970 (I looked but apparently the APA hasn’t gotten the word about access to abstracts online, etc.)

Rater Source Effects are Alive and Well After All by Brian Hoffman, et al., Personnel Psychology, 2010, 63, 119-151.


Recent research has questioned the importance of rater perspective effects on multisource performance ratings (MSPRs). Although making a valuable contribution, we hypothesize that this research has obscured evidence for systematic rater source effects as a result of misspecified models of the structure of multisource performance ratings and inappropriate analytic methods. Accordingly, this study provides a reexamination of the impact of rater source on multisource performance ratings by presenting a set of confirmatory factor analyses of two large samples of multisource performance rating data in which source effects are modeled in the form of second-order factors. Hierarchical confirmatory factor analysis of both samples revealed that the structure of multisource performance ratings can be characterized by general performance, dimensional performance, idiosyncratic rater, and source factors, and that source factors explain (much) more variance in multisource performance ratings whereas general performance explains (much) less variance than was previously believed. These results reinforce the value of collecting performance data from raters occupying different organizational levels and have important implications for research and practice.

For students: Can you think of other sources that validate the Idiosyncratic Rater Effect?

What about algorithms that make recommendations based on user ratings of movies? Isn’t the premise of recommendations that the ratings tell us more about the rater than about the movie? So we can make the “right” recommendation for a person very similar to the rater?

I don’t know that it means anything but a search with a popular search engine turns up only 258 “hits” for “Idiosyncratic Rater Effect.” On the other hand, “recommendation system” turns up 424,000 “hits” and that sounds low to me considering the literature on recommendation.

Bottom line on data quality is that widespread use of data is no guarantee of quality.

What ratings reflect is useful in one context (recommendation) and pernicious in another (employment ratings).

I first saw this in a tweet by Neil Saunders.

TMR: A Semantic Recommender System using Topic Maps on the Items’ Descriptions

Wednesday, January 21st, 2015

TMR: A Semantic Recommender System using Topic Maps on the Items’ Descriptions by Angel L. Garrido and Sergio Ilarri.


Recommendation systems have become increasingly popular these days. Their utility has been proved to filter and to suggest items archived at web sites to the users. Even though recommendation systems have been developed for the past two decades, existing recommenders are still inadequate to achieve their objectives and must be enhanced to generate appealing personalized recommendations effectively. In this paper we present TMR, a context-independent tool based on topic maps that works with item’s descriptions and reviews to provide suitable recommendations to users. TMR takes advantage of lexical and semantic resources to infer users’ preferences and thus the recommender is not restricted by the syntactic constraints imposed on some existing recommenders. We have verified the correctness of TMR using a popular benchmark dataset.

One of the more exciting aspects of this paper is the building of topic maps from free texts that are then used in the recommendation process.

I haven’t seen the generated topic maps (yet) but suspect that editing an existing topic map is far easier than creating one ab initio.

A Latent Source Model for Online Collaborative Filtering

Wednesday, December 10th, 2014

A Latent Source Model for Online Collaborative Filtering by Guy Bresler, George H. Chen, and Devavrat Shah.


Despite the prevalence of collaborative filtering in recommendation systems, there has been little theoretical development on why and how well it works, especially in the “online” setting, where items are recommended to users over time. We address this theoretical gap by introducing a model for online recommendation systems, cast item recommendation under the model as a learning problem, and analyze the performance of a cosine-similarity collaborative filtering method. In our model, each of n users either likes or dislikes each of m items. We assume there to be k types of users, and all the users of a given type share a common string of probabilities determining the chance of liking each item. At each time step, we recommend an item to each user, where a key distinction from related bandit literature is that once a user consumes an item (e.g., watches a movie), then that item cannot be recommended to the same user again. The goal is to maximize the number of likable items recommended to users over time. Our main result establishes that after nearly log(km) initial learning time steps, a simple collaborative filtering algorithm achieves essentially optimal performance without knowing k. The algorithm has an exploitation step that uses cosine similarity and two types of exploration steps, one to explore the space of items (standard in the literature) and the other to explore similarity between users (novel to this work).

The similarity between users makes me wonder if merging results from a topic map could or should be returned on the basis of a similarity of users, on the assumption that at some point of similarity distinct users share views about subject identity.
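To make the cosine-similarity step from the abstract a little more concrete, here is a minimal sketch. It is not the authors' algorithm (which also includes two kinds of exploration steps); it only shows a similarity-weighted vote over items the user has not yet consumed. The encoding (+1 like, -1 dislike, 0 not yet consumed), the user names, and the function names are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend(ratings, user):
    """Return the unconsumed item with the highest similarity-weighted vote."""
    target = ratings[user]
    scores = {}
    for item, seen in enumerate(target):
        if seen != 0:
            continue  # a consumed item cannot be recommended again
        scores[item] = sum(
            cosine(target, other) * other[item]
            for name, other in ratings.items()
            if name != user
        )
    return max(scores, key=scores.get) if scores else None

# Hypothetical data: +1 = liked, -1 = disliked, 0 = not yet consumed.
ratings = {
    "ann": [1, -1, 1, 0, 0],
    "bob": [1, -1, 1, 1, -1],
    "cat": [-1, 1, -1, -1, 1],
}
best_for_ann = recommend(ratings, "ann")
nothing_left_for_bob = recommend(ratings, "bob")
```

Here "ann" agrees with "bob" and disagrees with "cat", so both neighbors end up pushing the same unconsumed item toward her, one by liking it and the other by disliking it.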

Python Multi-armed Bandits (and Beer!)

Thursday, November 20th, 2014

Python Multi-armed Bandits (and Beer!) by Eric Chiang.

From the post:

There are many ways to evaluate different strategies for solving different prediction tasks. In our last post, for example, we discussed calibration and discrimination, two measurements which assess the strength of a probabilistic prediction. Measurements like accuracy, error, and recall among others are useful when considering whether random forest “works better” than support vector machines on a problem set. Common sense tells us that knowing which analytical strategy “does the best” is important, as it will impact the quality of our decisions downstream. The trick, therefore, is in selecting the right measurement, a task which isn’t always obvious.

There are many prediction problems where choosing the right accuracy measurement is particularly difficult. For example, what’s the best way to know whether this version of your recommendation system is better than the prior version? You could – as was the case with the Netflix Prize – try to predict the number of stars a user gave to a movie and measure your accuracy. Another (simpler) way to vet your recommender strategy would be to roll it out to users and watch before and after behaviors.

So by the end of this blog post, you (the reader) will hopefully be helping me improve our beer recommender through your clicks and interactions.

The final application which this blog will explain can be found at The original post explaining beer recommenders can be found here.

I have a friend who programs in Python (as well as other languages) and who is, or is on the way to becoming, a professional beer taster.

Given a choice, I think I would prefer to become a professional beer drinker but each to their own. 😉

The discussion of measures of distances between beers in this post is quite good. When reading it, think about beers (or other beverages) you have had and try to pick between Euclidean distance, distance correlation, and cosine similarity in discussing how you compare those beverages to each other.

What? That isn’t how you evaluate your choices between beverages?

Yet, those “measures” have proven to be effective (effective != 100%) at providing distances between individual choices.

The “mapping” between the unknown internal scale of users and the metric measures used in recommendation systems is derived from a population of users. The resulting scale may or may not be an exact fit for any user in the tested group.

The usefulness of any such scale depends on the similarity of the population over which it was derived and the population where you want to use it. Not to mention how you validated the answers. (Users are reported to give the “expected” response as opposed to their actual choices in some scenarios.)
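For a feel of how two of the measures mentioned above can disagree, here is a small sketch comparing Euclidean distance and cosine similarity. The beer names and the three-number flavor profiles are invented for illustration and are not taken from Eric's post; distance correlation is left out to keep the example short.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical profiles: (bitterness, sweetness, maltiness).
beers = {
    "pale_ale": (6.0, 2.0, 3.0),
    "double_ipa": (12.0, 4.0, 6.0),  # same proportions, twice the intensity
    "stout": (3.0, 5.0, 8.0),
}

pale, ipa, stout = beers["pale_ale"], beers["double_ipa"], beers["stout"]

# Euclidean distance puts the stout closer to the pale ale than the double IPA is.
d_ipa = euclidean(pale, ipa)
d_stout = euclidean(pale, stout)

# Cosine similarity ignores intensity and treats the two hoppy beers as the same shape.
c_ipa = cosine_similarity(pale, ipa)
c_stout = cosine_similarity(pale, stout)
```

Which answer is "right" depends on whether intensity matters to the drinker, which is exactly the mapping problem between an internal scale and a metric.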

Twitter open sourced a recommendation algorithm for massive datasets

Wednesday, September 24th, 2014

Twitter open sourced a recommendation algorithm for massive datasets by Derrick Harris.

From the post:

Late last month, Twitter open sourced an algorithm that’s designed to ease the computational burden on systems trying to recommend content — contacts, articles, products, whatever — across seemingly endless sets of possibilities. Called DIMSUM, short for Dimension Independent Matrix Square using MapReduce (rolls off the tongue, no?), the algorithm trims the list of potential combinations to a reasonable number, so other recommendation algorithms can run in a reasonable amount of time.

Reza Zadeh, the former Twitter data scientist and current Stanford consulting professor who helped create the algorithm, describes it in terms of the famous handshake problem. Two people in a room? One handshake; no problem. Ten people in a room? Forty-five handshakes; still doable. However, he explained, “The number of handshakes goes up quadratically … That makes the problem very difficult when x is a million.”

Twitter claims 271 million active users.

DIMSUM works primarily in two different areas: (1) matching promoted ads with the right users, and (2) suggesting similar people to follow after users follow someone. Running through all the possible combinations would take days even on a large cluster of machines, Zadeh said, but sampling the user base using DIMSUM takes significantly less time and significantly fewer machines.
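The handshake arithmetic in the quote is just the number of unordered pairs among n items, n(n-1)/2. A quick sketch (the function name is mine) shows why brute-force comparison stops being practical:

```python
def pair_count(n):
    """Number of unordered pairs (handshakes) among n participants."""
    return n * (n - 1) // 2

for n in (2, 10, 1_000_000):
    print(n, pair_count(n))
# 2 -> 1, 10 -> 45, 1_000_000 -> 499_999_500_000
```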

The “similarity” of two or more people or bits of content is a variation on the merging rules of the TMDM.

In recommendation language, two or more topics are “similar” if they have:

  • at least one equal string in their [subject identifiers] properties,
  • at least one equal string in their [item identifiers] properties,
  • at least one equal string in their [subject locators] properties,
  • an equal string in the [subject identifiers] property of the one topic item and the [item identifiers] property of the other, or
  • the same information item in their [reified] properties.

TMDM 5.3.5 Properties

The TMDM says “equal” and not “similar” but the point being that you can arbitrarily decide on how “similar” two or more topics must be in order to trigger merging.
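A minimal sketch of that idea, assuming each topic is reduced to a set of strings: one function mimics the all-or-nothing test (any shared identifier), the other merges only when a similarity score reaches a threshold you choose. Jaccard similarity, the function names, and the sample data are my assumptions for illustration; none of this comes from the TMDM itself.

```python
def shares_identifier(ids_a, ids_b):
    """All-or-nothing test: at least one equal string in both identifier sets."""
    return bool(ids_a & ids_b)

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_enough(features_a, features_b, threshold):
    """Graded test: merge only if the similarity reaches the chosen threshold."""
    return jaccard(features_a, features_b) >= threshold

# Hypothetical topics described by sets of feature strings.
topic_a = {"ale", "beer", "hops"}
topic_b = {"beer", "hops", "lager"}

strict = shares_identifier(topic_a, topic_b)              # one shared string is enough
loose_merge = similar_enough(topic_a, topic_b, 0.5)       # Jaccard is 2/4 = 0.5
tight_merge = similar_enough(topic_a, topic_b, 0.75)
```

Moving the threshold is the "arbitrary decision" in question: the same pair of topics merges at 0.5 and stays apart at 0.75.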

That realization opens up the entire realm of “similarity” and “recommendation” algorithms and techniques for application to topic maps.

Which brings us back to the algorithm just open sourced by Twitter.

With DIMSUM, you don’t have to do a brute force topic by topic comparison for merging purposes. Some topics will not meet a merging “threshold” and not be considered by merging routines.

Of course, with the TMDM, merging being either true or false, you may be stuck with brute force. Suggestions?

But if you have other similarity measures, you may be able to profit from DIMSUM.

BTW, I would not follow #dimsum on Twitter because it is apparently a type of dumpling. 😉

Update: All-pairs similarity via DIMSUM. DIMSUM has been implemented in Spark!

Realtime Personalization/Recommendation

Friday, May 30th, 2014

Realtime personalization and recommendation with stream mining by Mikio L. Braun.

From the post:

Last Tuesday, I gave a talk at this year’s Berlin Buzzwords conference on using stream mining algorithms to efficiently store information extracted from user behavior to perform personalization and recommendation effectively already using a single computer, which is of course key behind streamdrill.

If you’ve been following my talks, you’ll probably recognize a lot of stuff I’ve talked about before, but what is new in this talk is that I tried to take the next step from simply talking about Heavy Hitters and Count-Min Sketches to using these data structures as an approximate storage for all kinds of analytics related data like counts, profiles, or even sparse matrices, as they occur in recommendation algorithms.

I think reformulating our approach as basically an efficient approximate data structure also helped to steer the discussion away from comparing streamdrill to other big data frameworks (“Can’t you just do that in Storm?” — “define ‘just’”). As I said in the talk, the question is not whether you can do it in Big Data Framework X, because you probably could. I have started to look at it from the other direction: we did not use any Big Data framework and were still able to achieve some serious performance numbers.
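For readers who have not met a Count-Min Sketch, here is a minimal sketch of the data structure. It is not streamdrill's implementation; the class name, the table sizes, and the use of SHA-256 as the hash are assumptions picked for readability. The structure never undercounts, but hash collisions can make it overcount.

```python
import hashlib

class CountMinSketch:
    """Approximate counter: fixed memory, estimates are never too low."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One column per row, derived from a row-salted hash of the key.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # The smallest cell is the one least inflated by collisions.
        return min(self.table[row][col] for row, col in self._cells(key))

# Hypothetical usage: count (user, item) interactions.
sketch = CountMinSketch()
for _ in range(3):
    sketch.add("user42:item7")
only_key_estimate = sketch.estimate("user42:item7")

sketch.add("user99:item7", 10)
later_estimate = sketch.estimate("user42:item7")
```

Counts keyed by pairs such as user and item are one way to read the "sparse matrices" remark: each cell of the matrix becomes a key in the sketch.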

Slides and video are available at this page.

Free Recommendation Engine!

Friday, April 11th, 2014

Giving Away Our Recommendation Engine for Free by Doug Daniels.

From the post:

What’s better than a recommendation engine that’s free? A recommendation engine that is both awesome and free.

Today, we’re announcing General Availability for the Mortar Recommendation Engine. Designed by Mortar’s engineers and top data science advisors, it produces personalized recommendations at scale for companies like MTV, Comedy Central, StubHub, and the Associated Press. Today, we’re giving it away for free, and it is awesome.


But before the FOSS folks get all weepy eyed, let’s remember that in order to make use of a recommendation engine, you need:

  • Data, lots of data
  • Understanding of the data
  • Processing of the data
  • Debugging your recommendations
  • Someone to make recommendations to
  • Someone to pay you for your recommendations

And those are just the points that came to mind while writing this post.

You can learn a lot from the Mortar Recommendation Engine but it’s not a threat to Mortar’s core business.

Any more than Oracle handing out shrink wrap copies of Oracle 36DD would impact their licensing and consulting business.

When you want to wield big iron, you need professional grade training and supplies.

Merge Mahout item based recommendations…

Saturday, March 8th, 2014

Merge Mahout item based recommendations results from different algorithms

From the post:

Apache Mahout is a machine learning library that leverages the power of Hadoop to implement machine learning through the MapReduce paradigm. One of the implemented algorithms is collaborative filtering, the most successful recommendation technique to date. The basic idea behind collaborative filtering is to analyze the actions or opinions of users to recommend items similar to the one the user is interacting with.

Similarity isn’t restricted to a particular measure or metric.

How similar is enough to be considered the same?

That is a question topic map designers must answer on a case by case basis.

Using Lucene Similarity in Item-Item Recommenders

Saturday, December 14th, 2013

Using Lucene Similarity in Item-Item Recommenders by Sujit Pal.

From the post:

Last week, I implemented 4 (of 5) recommenders from the Programming Assignments of the Introduction to Recommender Systems course on Coursera, but using Apache Mahout and Scala instead of Lenskit and Java. This week, I implement an Item Item Collaborative Filtering Recommender that uses Lucene (more specifically, Lucene’s More Like This query) as the item similarity provider.

By default, Lucene stores document vectors keyed by terms, but can be configured to store term vectors by setting the field attribute TermVector.YES. In case of text documents, words (or terms) are the features which are used to compute similarity between documents. I am using the same dataset as last week, where movies (items) correspond to documents and movie tags correspond to the words. So we build a movie “document” by preprocessing the tags to form individual tokens and concatenating them into a tags field in the index.

Three scenarios are covered. The first two are similar to the scenarios covered with the item-item collaborative filtering recommender from last week, where the user is on a movie page, and we need to (a) predict the rating a user would give a specified movie and (b) find movies similar to a given movie. The third scenario is recommending movies to a given user. We describe each algorithm briefly, and how Lucene fits in.

I’m curious how easy/difficult it would be to re-purpose similarity algorithms to detect common choices in avatar characteristics, acquisitions, interaction with others, goals, etc.?

Thinking that while obvious repetitions are easy enough to avoid, gender, age, names, etc., there are other, more subtle characteristics of interaction with others that would be far harder to be aware of. Much less to mask effectively.

It would require a lot of data on interaction but I assume that isn’t all that difficult to whistle up on any of the major systems.

If you have any pointers to that sort of research, forward them along. I will be posting a collection of pointers and will credit anyone who wants to be credited.

Recommender Systems Course from GroupLens

Saturday, December 7th, 2013

Recommender Systems Course from GroupLens by Danny Bickson.

From the post:

I got the following course link from my colleague Tim Muss. The GroupLens research group (Univ. of Minnesota) have released a Coursera course about recommender systems. Joseph Konstan and Michael Ekstrand are lecturing. Any reader of my blog who has an elephant’s memory will recall that I wrote about the Lenskit project two years ago, when I interviewed Michael Ekstrand.

Would you agree that recommendation involves subject recognition?

At a minimum recognition of the subject to be recommended and the subject of a particular user’s preference.

I ask because the key to topic map “merging” isn’t ontological correctness but “correctness” in the eyes of a particular user.

What other standard would I use?

Using AWS to Build a Graph-based…

Friday, November 22nd, 2013

Using AWS to Build a Graph-based Product Recommendation System by Andre Fatala and Renato Pedigoni.

From the description:

Magazine Luiza, one of the largest retail chains in Brazil, developed an in-house product recommendation system, built on top of a large knowledge Graph. AWS resources like Amazon EC2, Amazon SQS, Amazon ElastiCache and others made it possible for them to scale from a very small dataset to a huge Cassandra cluster. By improving their big data processing algorithms on their in-house solution built on AWS, they improved their conversion rates on revenue by more than 25 percent compared to market solutions they had used in the past.

Not a lot of technical details but a good success story to repeat if you are pushing graph-based services.

I first saw this in a tweet by Marko A. Rodriguez.


Wednesday, October 16th, 2013

LIBMF: A Matrix-factorization Library for Recommender Systems by Machine Learning Group at National Taiwan University.

From the webpage:

LIBMF is an open source tool for approximating an incomplete matrix using the product of two matrices in a latent space. Matrix factorization is commonly used in collaborative filtering. Main features of LIBMF include

  • In addition to the latent user and item features, we add user bias, item bias, and average terms for better performance.
  • LIBMF can be parallelized in a multi-core machine. To make our package more efficient, we use SSE instructions to accelerate the vector product operations.

    For a data set of 250M ratings, LIBMF takes less than eight minutes to converge to a reasonable level.

  • Download

    The current release (Version 1.0, Sept 2013) of LIBMF can be obtained by downloading the zip file or tar.gz file.

    Please read the COPYRIGHT notice before using LIBMF.


    The algorithms of LIBMF are described in the following paper.

    Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A Fast Parallel SGD for Matrix Factorization in Shared Memory Systems. Proceedings of ACM Recommender Systems 2013.

    See README in the package for the practical use.

Being curious about what “practical use” would be in the README, ;-), I discovered a demo data set and basic instructions for use.

For the details of application for recommendations, see the paper.
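As a rough picture of what "the product of two matrices in a latent space" plus "user bias, item bias, and average terms" means, here is a toy, single-threaded stochastic gradient descent sketch. It is not LIBMF's parallel algorithm or its API; the function names, hyperparameters, and the tiny ratings list are all assumptions for illustration.

```python
import math
import random

def train_mf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Fit r(u, i) ~ mu + bu[u] + bi[i] + dot(P[u], Q[i]) with plain SGD."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global average term
    bu = [0.0] * n_users                               # user biases
    bi = [0.0] * n_items                               # item biases
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = mu + bu[u] + bi[i] + sum(p * q for p, q in zip(P[u], Q[i]))
            err = r - pred
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, bu, bi, P, Q

def predict(model, u, i):
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + sum(p * q for p, q in zip(P[u], Q[i]))

def rmse(model, ratings):
    return math.sqrt(
        sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings) / len(ratings)
    )

# Hypothetical (user, item, rating) triples; most of the 4x4 matrix is unobserved.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 1),
]

untrained = train_mf(ratings, 4, 4, epochs=0)
trained = train_mf(ratings, 4, 4, epochs=200)
before, after = rmse(untrained, ratings), rmse(trained, ratings)
```

The trained model can then fill in the unobserved cells, for example `predict(trained, 0, 3)`, which is the "approximating an incomplete matrix" part.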

Quepid [Topic Map Tuning?]

Tuesday, October 8th, 2013

Measure and Improve Search Quality with Quepid by Doug Turnbull.

From the post:

Let’s face it, returning good search results means making money. To this end, we’re often hired to tune search to ensure that search results are as close as possible to the intent of a user’s search query. Matching users’ intent to results, what we call “relevancy,” is what gets us up in the morning. It’s what drives us to think hard about the dark mysteries of tuning Solr or machine-learning topics such as recommendation-based product search.

While we can do amazing feats of wizardry to make individual improvements, it’s impossible with today’s tools to do much more than prove that one problem has been solved. Search engines rank results based on a single set of rules. This single set of rules is in charge of how all searches are ranked. It’s very likely that even as we solve one problem by modifying those rules, we create another problem — or dozens of them, perhaps far more devastating than the original problem we solved.

Quepid is our instant search quality testing product. Born out of our years of experience tuning search, Quepid has become our go-to tool for relevancy problems. Built around the idea of Test Driven Relevancy, Quepid allows the search developer to collaborate with product and content experts to

  1. Identify, store, and execute important queries
  2. Provide statistics/rankings that measure the quality of a search query
  3. Tune search relevancy
  4. Immediately visualize the impact of tuning on queries
  5. Rinse & Repeat Instantly

The result is a tool that empowers search developers to experiment with the impact of changes across the search experience and prove to their bosses that nothing broke. Confident that data will prove or disprove their ideas instantly, developers are even freer to experiment more than they might ever have before.
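The store-queries, score, tune, compare cycle in the list above can be sketched in a few lines. This is not Quepid's API; the stored cases, the two stand-in rankers, and the choice of precision at k as the quality statistic are assumptions for illustration.

```python
def precision_at_k(results, relevant, k=3):
    """Fraction of the top k results that were judged relevant."""
    top = results[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)

def score_ranker(search, cases, k=3):
    """Run every stored query through a ranker and score each one."""
    return {
        query: precision_at_k(search(query), relevant, k)
        for query, relevant in cases.items()
    }

# Hypothetical stored queries with the documents experts judged relevant.
cases = {
    "hdmi cable": {"d1", "d2"},
    "vampire novel": {"d7"},
}

# Stand-ins for the search engine before and after a rule change.
old_results = {"hdmi cable": ["d1", "d9", "d4"], "vampire novel": ["d7", "d8", "d3"]}
new_results = {"hdmi cable": ["d1", "d2", "d9"], "vampire novel": ["d8", "d3", "d4"]}

old_scores = score_ranker(old_results.get, cases)
new_scores = score_ranker(new_results.get, cases)

# Queries that got worse after the change, i.e. what "broke".
regressions = [q for q in cases if new_scores[q] < old_scores[q]]
```

In this toy run the tuning that helped "hdmi cable" quietly hurt "vampire novel", which is the fix-one-problem, create-another situation the post describes.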

Any thoughts on automating a similar cycle to test the adding of subjects to a topic map?

Or adding subject identifiers that would trigger additional merging?

Or just reporting the merging over and above what was already present?

Search-Aware Product Recommendation in Solr (Users vs. Experts?)

Tuesday, October 8th, 2013

Search-Aware Product Recommendation in Solr by John Berryman.

From the post:

Building upon earlier work with semantic search, OpenSource Connections is excited to unveil exciting new possibilities with Solr-based product recommendation. With this technology, it is now possible to serve user-specific, search-aware product recommendations directly from Solr.

In this post, we will review a simple Search-Aware Recommendation using an online grocery service as an example of e-commerce product recommendation. In this example I have built up a basic keyword search over the product catalog. We’ve also added two fields to Solr: purchasedByTheseUsers and recommendToTheseUsers. Both fields contain lists of userIds. Recall that each document in the index corresponds to a product. Thus the purchasedByTheseUsers field literally lists all of the users who have purchased said product. The next field, recommendToTheseUsers, is the special sauce. This field lists all users who might want to purchase the corresponding product. We have extracted this field using a process called collaborative filtering, which is described in my previous post, Semantic Search With Solr And Python Numpy. With collaborative filtering, we make product recommendation by mathematically identifying similar users (based on products purchased) and then providing recommendations based upon the items that these users have purchased.

Now that the background has been established, let’s look at the results. Here we search for 3 different products using two different, randomly-selected users who we will refer to as Wendy and Dave. For each product: We first perform a raw search to gather a base understanding about how the search performs against user queries. We then search for the intersection of these search results and the products recommended to Wendy. Finally we also search for the intersection of these search results and the products recommended to Dave.
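The intersection described above can be pictured with ordinary Python sets. The field name recommendToTheseUsers comes from John's post; the tiny catalog, the naive keyword matcher, and the function names are assumptions standing in for a real Solr index and query.

```python
# Hypothetical product documents; each lists the users it is recommended to.
catalog = [
    {"id": "p1", "name": "organic whole milk", "recommendToTheseUsers": {"wendy"}},
    {"id": "p2", "name": "chocolate milk", "recommendToTheseUsers": {"dave"}},
    {"id": "p3", "name": "almond milk", "recommendToTheseUsers": {"wendy", "dave"}},
    {"id": "p4", "name": "sourdough bread", "recommendToTheseUsers": {"wendy"}},
]

def keyword_search(docs, query):
    """Raw search: keep documents whose name contains every query term."""
    terms = query.lower().split()
    return [d for d in docs if all(t in d["name"].split() for t in terms)]

def search_for_user(docs, query, user):
    """Search-aware recommendation: raw hits intersected with the user's list."""
    return [
        d for d in keyword_search(docs, query)
        if user in d["recommendToTheseUsers"]
    ]

raw_ids = [d["id"] for d in keyword_search(catalog, "milk")]
wendy_ids = [d["id"] for d in search_for_user(catalog, "milk", "wendy")]
dave_ids = [d["id"] for d in search_for_user(catalog, "milk", "dave")]
```

The same query returns a different subset for each user, and nothing a user was not going to see in the raw results.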

BTW, don’t miss the invitation to be an alpha tester for Solr Search-Aware Product Recommendation at the end of John’s post.

Reading John’s post, it occurred to me that, as an alternative to mining other users’ choices, you could have an expert develop the recommendations.

Much like we use experts to develop library classification systems.

But we don’t, do we?

Isn’t that interesting?

I suspect we don’t use experts for product recommendations because we know that shopping choices depend on a similarity between consumers.

We may not know what the precise nature of the similarity may be, but it is sufficient that we can establish its existence in the aggregate and sell more products based upon it.

Shouldn’t the same be true for finding information or data?

If similar (in some possibly unknown way) consumers of information find information in similar ways, why don’t we organize information based on similar patterns of finding?

How an “expert” finds information may be more “precise” or “accurate,” but if a user doesn’t follow that path, the user doesn’t find the information.

A great path that doesn’t help users find information is like having a great road with sidewalks, a bike path, cross-walks and good signage that goes nowhere.

How do you incorporate user paths in your topic map application?

Friend Recommendations using MapReduce

Tuesday, July 9th, 2013

Friend Recommendations using MapReduce by John Berryman.

From the post:

So Jonathan, one of our interns this summer, asked an interesting question today about MapReduce. He said, “Let’s say you download the entire data set of who’s following who from Twitter. Can you use MapReduce to make recommendations about who any particular individual should follow?” And as Jonathan’s mentor this summer, and as one of the OpenSource Connections MapReduce experts I dutifully said, “uuuhhhhh…”

And then in a stroke of genius … I found a way to stall for time. “Well, young Padawan,” I said to Jonathan, “first you must more precisely define your problem… and only then will the answer be revealed to you.” And then darn it if he didn’t ask me what I meant! Left with no viable alternatives, I squeezed my brain real hard, and this is what came out:

This is a post to work through carefully while waiting for the second post to drop!

Particularly the custom partitioning, grouping and sorting in MapReduce.

Notable presentations at Technion TCE conference 2013: RevMiner & Boom

Sunday, June 2nd, 2013

Notable presentations at Technion TCE conference 2013: RevMiner & Boom by Danny Bickson.

Danny has uncovered two papers to start your week: (RevMiner) (Twitter data mining)

Danny also describes Boom, for which I found this YouTube video:

See Danny’s post for more comments, etc.

What You Don’t See Makes A Difference

Tuesday, May 14th, 2013

Social and Content Hybrid Image Recommender System for Mobile Social Networks by Faustino Sanchez, Marta Barrilero, Silvia Uribe, Federico Alvarez, Agustin Tena, Jose Manuel Menendez.

Recommender System for Sport Videos Based on User Audiovisual Consumption by Sanchez, F. ; Alduan, M. ; Alvarez, F. ; Menendez, J.M. ; Baez, O.

A pair of papers I discovered at: New Model to Recommend Media Content According to Your Preferences, which summarizes the work as:

The traditional recommender system usually use: semantic techniques which result in products defined by themes, similar tags to the user interests, algorithms that use collective intelligence of a large set of user, in a way that this traditional system recommends themes that suit other people with similar preferences.

From this knowledge state, an applied model of multimedia content that goes beyond this paradigm has been developed, and it incorporates other features of whose influence, the user is not always aware and because of that reason has not been used so far in these types of systems.

Therefore, researchers at the UPM have analyzed in depth the audiovisual features that can be influential for users and they proved that some of these features that determine aesthetic trends and usually go unnoticed can be decisive when defining the user tastes.

For example, researchers proved that in a movie, the relative information to the narrative rhythm (shot length, scenes and sequences), the movements (camera or frame content) or the image nature (brightness, color, texture, information quantity) is relevant when cataloguing the preferences of each piece of information. Analogously to the movies, the researchers have analyzed images using a subset of descriptors considered in the case of video.

In order to verify this model, researchers used a database of 70,000 users and a million of reviews in a set of 200 movies whose features were previously extracted.

These descriptors, once they are standardized, processed and generated adequate statistical data, allow researchers to formally characterize the contents and to find the influence degree on each user as well as their preference conditions.

This makes me curious about how to exploit similar “unseen / unnoticed” factors that influence subject identification.

Both from a quality control perspective and for the design of topic map authoring/consumption interfaces.

Our senses, as Scrooge points out: A slight disorder of the stomach makes them cheats.

Now we know they may be cheating and we are unaware of it.

Beer Mapper

Thursday, May 2nd, 2013

Beer Mapper: An experimental app to find the right beer for you by Nathan Yau.


Nathan reviews an app that, with a data set of 10,000 beers, attempts to suggest similar beers based on how you score beers.

A clever app but I am betting on Lars Marius besting it more often than not!

WTF: [footnote 1]

Monday, April 8th, 2013

WTF: The Who to Follow Service at Twitter by Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh.


WTF (“Who to Follow”) is Twitter’s user recommendation service, which is responsible for creating millions of connections daily between users based on shared interests, common connections, and other related factors. This paper provides an architectural overview and shares lessons we learned in building and running the service over the past few years. Particularly noteworthy was our design decision to process the entire Twitter graph in memory on a single server, which significantly reduced architectural complexity and allowed us to develop and deploy the service in only a few months. At the core of our architecture is Cassovary, an open-source in-memory graph processing engine we built from scratch for WTF. Besides powering Twitter’s user recommendations, Cassovary is also used for search, discovery, promoted products, and other services as well. We describe and evaluate a few graph recommendation algorithms implemented in Cassovary, including a novel approach based on a combination of random walks and SALSA. Looking into the future, we revisit the design of our architecture and comment on its limitations, which are presently being addressed in a second-generation system under development.
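The abstract mentions recommendation algorithms built on random walks. As a rough illustration of the random-walk idea only (this is not the paper's algorithm, not SALSA, and not Cassovary code), here is a minimal personalized PageRank by power iteration over a small in-memory follow graph. The graph, the restart probability alpha and the function names are all assumptions made for the sketch.

```python
def personalized_pagerank(follows, source, alpha=0.15, iterations=50):
    """Score users by random walks that keep restarting at `source`.

    follows: dict mapping each user to the list of users they follow.
    alpha:   probability of jumping back to `source` at every step.
    """
    nodes = set(follows) | {v for targets in follows.values() for v in targets}
    rank = {n: 0.0 for n in nodes}
    rank[source] = 1.0
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for node, mass in rank.items():
            out = follows.get(node, [])
            if out:
                share = (1 - alpha) * mass / len(out)
                for target in out:
                    nxt[target] += share
                nxt[source] += alpha * mass
            else:
                # Dangling user (follows nobody): the walk restarts.
                nxt[source] += mass
        rank = nxt
    return rank

def who_to_follow(follows, source, k=3):
    """Top-k users the source does not already follow, by walk score."""
    scores = personalized_pagerank(follows, source)
    known = set(follows.get(source, [])) | {source}
    candidates = [(s, n) for n, s in scores.items() if n not in known and s > 0]
    return [n for s, n in sorted(candidates, key=lambda t: (-t[0], t[1]))[:k]]

follows = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan", "eve"],
    "dan": [],
    "eve": [],
}
suggestions = who_to_follow(follows, "ann")
```

In this toy graph dan ranks ahead of eve for ann, because both of the accounts ann follows lead to dan while only one leads to eve.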

You know it is going to be an amusing paper when footnote 1 reads:

The confusion with the more conventional expansion of the acronym is intentional and the butt of many internal jokes. Also, it has not escaped our attention that the name of the service is actually ungrammatical; the pronoun should properly be in the objective case, as in “whom to follow”.


Algorithmic recommendations may miss the mark for an end user.

On the other hand, what about an authoring interface that supplies recommendations of associations and other subjects?

A paper definitely worth a slow read!

I first saw this at: WTF: The Who to Follow Service at Twitter (Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh).


Monday, March 11th, 2013


Reco4j is an open source project that aims at developing a recommendation framework based on graph data sources. We choose graph databases for several reasons. They are NoSQL databases that are “schemaless”. This means that it is possible to extend the basic data structure with intermediate information, i.e. similarity value between item and so on. Moreover, since every information are expressed with some properties, nodes and relations, the recommendation process can be customized to work on every graph.
Indeed Reco4j can be used on every graph where “user” and “item” are represented by nodes and the preferences are modelled as relationship between them.

The current implementation leverages on Neo4j as first example of graph database integrated in our framework.

The main features of Reco4j are:

  1. Performance, leveraging on the graph database and storing information in it for future retrieving it produce fast recommendations also after a system restart;
  2. Use of Network structure, integrating the simple recommendation algorithms with (social) network analisys;
  3. General purpose, it can be used with preexisting databases;
  4. Customizability, editing the properties file the recommender framework can be adapted to the current graph structure and use several types of the recommendation algorithms;
  5. Ready for Cloud, leveraging on the graph database cloud features the recommendation process can be splitted on several nodes.

Just in case you don’t like the recommendations you get from Amazon. 😉

BTW, “splitted” is an archaic past tense form of split. (According to Merriam-Webster.)

Say rather “…the recommendation process can be split onto several nodes.”

Graph Based Recommendations using “How-To” Guides Dataset

Sunday, March 3rd, 2013

Graph Based Recommendations using “How-To” Guides Dataset by Marcel Caraciolo.

From the post:

In this post I’d like to introduce another approach for recommender engines using graph concepts to recommend novel and interesting items. I will build a graph-based how-to tutorials recommender engine using the data available on the website SnapGuide (By the way I am a huge fan and user of this tutorials website), the graph database Neo4J and the graph traversal language Gremlin.

What is SnapGuide ?

Snapguide is a web service for anyone who wants to create and share step-by-step “how to guides”. It is available on the web and IOS app. There you can find several tutorials with easy visual instructions for a wide array of topics including cooking, gardening, crafts, projects, fashion tips and more. It is free and anyone is invited to submit guides in order to share their passions and expertise with the community. I have extracted from their website for only research purposes the corpus of tutorials likes. Several users may like the tutorial and this signal can be quite useful to recommend similar tutorials based on what other users liked. Unfortunately I can’t provide the dataset for download but the code you can follow below for your own data set.
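The like signal described above amounts to a two-hop walk over a graph: from a guide to the users who liked it, then on to the other guides those users liked. Marcel's post does this with Neo4J and Gremlin; the plain-Python sketch below only mimics the traversal on made-up (user, guide) pairs, and the function name and data are hypothetical.

```python
from collections import Counter, defaultdict

def similar_guides(likes, guide, k=3):
    """Rank other guides by how many users liked both them and `guide`.

    likes: iterable of (user, guide) pairs.
    """
    guides_by_user = defaultdict(set)
    users_by_guide = defaultdict(set)
    for user, liked in likes:
        guides_by_user[user].add(liked)
        users_by_guide[liked].add(user)

    counts = Counter()
    for user in users_by_guide[guide]:       # hop 1: guide -> users who liked it
        for other in guides_by_user[user]:   # hop 2: user -> other liked guides
            if other != guide:
                counts[other] += 1
    # Highest co-like count first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

likes = [
    ("u1", "bread"), ("u1", "pizza"),
    ("u2", "bread"), ("u2", "pizza"), ("u2", "knots"),
    ("u3", "knots"),
]
ranked = similar_guides(likes, "bread")
```

Raw co-like counts favor popular guides; a real system would probably normalize them, which is left out here to keep the traversal visible.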

An excellent tutorial that walks you through the creation of graph based recommendations, from acquiring the data to posting queries to it.

The SnapGuide site looks like another opportunity for topic map related tutorial material.

SVDFeature: A Toolkit for Feature-based Collaborative Filtering

Thursday, January 17th, 2013

SVDFeature: A Toolkit for Feature-based Collaborative Filtering – implementation by Igor Carron.

From the post:

SVDFeature: A Toolkit for Feature-based Collaborative Filtering by Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, Yong Yu. The abstract reads:

In this paper we introduce SVDFeature, a machine learning toolkit for feature-based collaborative filtering. SVDFeature is designed to efficiently solve the feature-based matrix factorization. The feature-based setting allows us to build factorization models incorporating side information such as temporal dynamics, neighborhood relationship, and hierarchical information. The toolkit is capable of both rate prediction and collaborative ranking, and is carefully designed for efficient training on large-scale data set. Using this toolkit, we built solutions to win KDD Cup for two consecutive years.
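To make "feature-based matrix factorization" a little more concrete, here is a minimal prediction function. The model form is my assumption of the general shape such models take (a global mean, linear bias terms over global, user and item features, plus a dot product of feature-weighted latent factors); check the paper for the toolkit's exact formulation. None of the names below come from the SVDFeature code.

```python
def predict(mu, g, u, i, b_g, b_u, b_i, P, Q):
    """Assumed feature-based matrix factorization score.

    g, u, i:       sparse feature vectors {index: value} for global, user, item.
    b_g, b_u, b_i: bias weights, one per feature.
    P, Q:          latent factor vectors, one per user feature / item feature.
    """
    def weighted_bias(features, weights):
        return sum(value * weights[j] for j, value in features.items())

    def combine(features, factors, dim):
        # Feature-weighted sum of latent factor vectors.
        vec = [0.0] * dim
        for j, value in features.items():
            for d in range(dim):
                vec[d] += value * factors[j][d]
        return vec

    dim = len(P[0])
    p = combine(u, P, dim)
    q = combine(i, Q, dim)
    bias = weighted_bias(g, b_g) + weighted_bias(u, b_u) + weighted_bias(i, b_i)
    return mu + bias + sum(a * b for a, b in zip(p, q))

# With one-hot user and item features this reduces to ordinary biased
# matrix factorization: mu + user bias + item bias + p . q
score = predict(
    mu=3.0,
    g={}, u={0: 1.0}, i={1: 1.0},
    b_g=[], b_u=[0.5], b_i=[0.0, -0.25],
    P=[[1.0, 2.0]],
    Q=[[0.0, 0.0], [0.5, 0.25]],
)
```

Side information such as time or neighborhood would enter as extra non-zero entries in g, u or i, which is what makes the setting "feature-based".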

The wiki for the project and attendant code is here.

Can’t argue with two KDD cups in as many years!

Licensed under Apache 2.0.


Sunday, January 6th, 2013


From the webpage:

Reco4j is an open source project that aims at developing a recommendation framework based on graph data sources. We choose graph databases for several reasons. They are NoSQL databases that are “schemaless”. This means that it is possible to extend the basic data structure with intermediate information, i.e. similarity value between item and so on. Moreover, since every information are expressed with some properties, nodes and relations, the recommendation process can be customized to work on every graph.

Indeed Reco4j can be used on every graph where “user” and “item” are represented by nodes and the preferences are modelled as relationship between them.

The current implementation leverages on Neo4j as first example of graph database integrated in our framework.

The main features of Reco4j are:

  1. Performance, leveraging on the graph database and storing information in it for future retrieving it produce fast recommendations also after a system restart;
  2. Use of Network structure, integrating the simple recommendation algorithms with (social) network analisys;
  3. General purpose, it can be used with preexisting databases;
  4. Customizability, editing the properties file the recommender framework can be adapted to the current graph structure and use several types of the recommendation algorithms;
  5. Ready for Cloud, leveraging on the graph database cloud features the recommendation process can be splitted on several nodes.

The current version has two different projects:

  • reco4j-core: this project contains the base structure, the interface and the recommendation engine;
  • reco4j-neo4j: this project contains the neo4j implementation of the framework.

The “similarity value” comment caught my eye.

How much similarity between two or more items do you need, to have the same item, for some particular purpose?
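One way to play with that question: compute a similarity value for every pair of items and keep only the pairs above a threshold, much as the Reco4j description talks about storing a similarity value between items. The sketch below uses Jaccard similarity over the sets of users who prefer each item. The measure, the threshold and the data are my assumptions, not anything taken from Reco4j.

```python
from itertools import combinations

def jaccard(a, b):
    """Shared users divided by all users of either item."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_edges(preferences, threshold=0.5):
    """Return {(item_a, item_b): similarity} for pairs at or above threshold.

    preferences: dict mapping each item to the set of users who prefer it.
    """
    edges = {}
    for a, b in combinations(sorted(preferences), 2):
        value = jaccard(preferences[a], preferences[b])
        if value >= threshold:
            edges[(a, b)] = value
    return edges

preferences = {
    "x": {1, 2, 3},
    "y": {1, 2, 3},
    "z": {3, 4},
}
strict = similarity_edges(preferences, threshold=0.5)
loose = similarity_edges(preferences, threshold=0.2)
```

With the strict threshold only x and y are linked (they have identical audiences); lowering it to 0.2 also links z to both. Where you set the threshold is exactly the "same item for some particular purpose" decision.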

I first saw this in a tweet by Peter Neubauer.

Atepassar Recommendations [social network recommender]

Sunday, November 4th, 2012

Atepassar Recommendations: Recommending friends with MapReduce and Python by Marcel Caraciolo.

From the post:

In this post I will present one of the techniques used at Atépassar, a Brazilian social network that helps students around Brazil to pass the exams for a civil job: our recommender system.

(graphic omitted)

I will describe some of the data models that we use and discuss our approach to algorithmic innovation that combines offline machine learning with online testing. For this task we use distributed computing since we deal with over 140 thousand users. MapReduce is a powerful technique and we use it by writing Python code with the framework MrJob. I recommend you to read further about it at my last post here.

One of our recommender techniques is the simple ‘people you might know’ recommender algorithm. Indeed, there are several components behind the algorithm since at Atépassar, users can follow other people as well as be followed by other people. In this post I will talk about the basic idea of the algorithm which can be derived for those other components. The idea of the algorithm is that if person A and person B do not know each other but they have a lot of mutual friends, then the system should recommend that they connect with each other.
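The mutual-friends idea fits the map/reduce pattern naturally. The sketch below imitates the map, shuffle and reduce phases in plain Python rather than using MrJob, and it assumes symmetric friendships, whereas Atépassar has directed follows. The function names and the sample graph are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def map_person(person, friends):
    """Emit (pair, value) records for one person's friend list."""
    for friend in friends:
        # None marks a pair that is already connected.
        yield tuple(sorted((person, friend))), None
    for a, b in combinations(sorted(friends), 2):
        # a and b share `person` as a mutual friend.
        yield (a, b), person

def reduce_pair(pair, values):
    """Count mutual friends, dropping pairs that already know each other."""
    if any(v is None for v in values):
        return None
    return pair, len(values)

def recommend_friends(graph):
    shuffled = defaultdict(list)  # stands in for the shuffle phase
    for person, friends in graph.items():
        for key, value in map_person(person, friends):
            shuffled[key].append(value)
    results = []
    for pair, values in shuffled.items():
        reduced = reduce_pair(pair, values)
        if reduced is not None:
            results.append(reduced)
    # Most mutual friends first, ties broken by pair name.
    return sorted(results, key=lambda r: (-r[1], r[0]))

graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c"],
}
suggestions = recommend_friends(graph)
```

Here a and d are not connected but share b and c, so the pair comes out with a count of two; the same holds for b and c.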

Is there a presumption in social recommendation programs that there are no duplicate people in the network? Using different names? If two people have exactly the same friends, is there some chance they could be the same person?

How many “same” friends would you require? 20? 30? 50? Some other number?

Curious because determining personal identity, and the identity of the people behind two or more entries, may be a matter of pattern matching.

BTW, this is an interesting-looking blog. You may want to browse older entries or even subscribe.

Twitter Recommendations by @alpa

Wednesday, October 10th, 2012

Twitter Recommendations by @alpa by Marti Hearst.

From the post:

Alpa Jain has great experience teaching from her time as a graduate student at Columbia University, and it shows in the clarity of her descriptions of SVD and other recommendation algorithms in today’s lecture:

Would you incorporate recommendation algorithms into a topic map authoring solution?

How to Build a Recommendation Engine

Monday, September 24th, 2012

How to Build a Recommendation Engine by John F. McGowan.

From the post:

This article shows how to build a simple recommendation engine using GNU Octave, a high-level interpreted language, primarily intended for numerical computations, that is mostly compatible with MATLAB. A recommendation engine is a program that recommends items such as books and movies for customers, typically of a web site such as Amazon or Netflix, to purchase. Recommendation engines frequently use statistical and mathematical methods to estimate what items a customer would like to buy or would benefit from purchasing.

From a purely business point of view, one would like to maximize the profit from a customer, discounted for time (a dollar today is worth more than a dollar next year), over the duration that the customer is a customer of the business. In a long term relationship with a customer, this probably means that the customer needs to be happy with most purchases and most recommendations.

Recommendation engines are “hot” right now. There are many attempts to apply advanced statistics and mathematics to predict what customers will buy, what purchases will make customers happy and buy again, and what purchases deliver the most value to customers. Data scientists are trying to apply a range of methods with fancy technical names such as principal component analysis (PCA), neural networks, and support vector machines (SVM) — amongst others — to predicting successful purchases and personalizing recommendations for individual customers based on their stated preferences, purchasing history, demographics and other factors.

This article presents a simple recommendation engine using Pearson’s product moment correlation coefficient, also known as the linear correlation coefficient. The engine uses the correlation coefficient to identify customers with similar purchasing patterns, and presumably tastes, and recommends items purchased by one customer to the other similar customer who has not purchased those items.
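The article works in GNU Octave; the same idea can be sketched in Python. This is not John's code: the purchase table, the customer names and the choice to use only the single most similar customer are assumptions made to keep the example short.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant row carries no correlation information
    return cov / (spread_x * spread_y)

def recommend_items(purchases, customer, items):
    """Suggest items bought by the most similar customer but not by `customer`.

    purchases: dict customer -> list of 0/1 flags aligned with `items`.
    """
    target = purchases[customer]
    best, best_r = None, 0.0
    for other, row in purchases.items():
        if other == customer:
            continue
        r = pearson(target, row)
        if r > best_r:
            best, best_r = other, r
    if best is None:
        return []  # nobody is positively correlated with this customer
    return [item for item, mine, theirs in zip(items, target, purchases[best])
            if theirs and not mine]

items = ["book", "film", "album", "game"]
purchases = {
    "ann": [1, 1, 0, 0],
    "bob": [1, 1, 1, 0],
    "cat": [0, 0, 1, 1],
}
suggested = recommend_items(purchases, "ann", items)
```

Ann's purchases correlate positively with bob's and negatively with cat's, so the suggestion for ann is the one item bob bought that she did not.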

Probably not the recommendation engine you will use for commercial deployment.

But, it will give you a good start on understanding the principles of recommendation engines.

My interest in recommendations isn’t so much to identify the subjects of recommendation, which are topics in their own rights, as in probing the basis for subject identification by multiple users.

That is, there is some identification that underlies a choice of some book or movie over another. It may not be possible to identify the components of that identification, but we do have the aftermath of that identification.

Rather than collapsing dimensions, I am thinking we should expand the dimensions around choices to see if any patterns emerge.

I first saw this at DZone.

Nonparametric Techniques – Webinar [Think Movie Ratings]

Saturday, September 15th, 2012

Overview of Nonparametric Techniques with Elaine Eisenbeisz.

Date: October 3, 2012

Time: 3pm Eastern Time UTC -4 (2pm Central, 1pm Mountain, 12pm Pacific)

From the description:

A distribution of data which is not normal does not mean it is abnormal. There are many data analysis techniques which do not require the assumption of normality.

This webinar will provide information on when it is best to use nonparametric alternatives and provides information on suggested tests to use in lieu of:

  • Independent samples and paired t-tests
  • Analysis of variance techniques
  • Pearson’s Product Moment Correlation
  • Repeated measures designs

A description of nonparametric techniques for use with count data and contingency tables will also be provided.

Movie ratings, a ranked population, are appropriate for nonparametric methods.

You just thought you didn’t know anything about nonparametric methods. 😉

Applicable to all ranked populations (can you say recommendation?).
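As a small taste of a rank-based method, here is Spearman's rank correlation, a common nonparametric stand-in for Pearson's coefficient: rank both sets of ratings, then correlate the ranks. The webinar description does not say which tests it will cover, so treat the choice of Spearman and the sample star ratings as my assumptions.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation, used here on ranks rather than raw values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Two viewers' star ratings for the same four movies.
viewer_a = [5, 3, 4, 1]
viewer_b = [4, 2, 5, 1]
rho = spearman(viewer_a, viewer_b)
```

Because only the ordering matters, any monotone rescaling of the ratings (say, squaring every score) leaves the result unchanged, which is the appeal for ranked data like movie ratings.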

While you wait for the webinar, try some of the references from Wikipedia: Nonparametric Statistics.