Archive for the ‘Feature Vectors’ Category

SVDFeature: A Toolkit for Feature-based Collaborative Filtering

Thursday, January 17th, 2013

SVDFeature: A Toolkit for Feature-based Collaborative Filtering – implementation by Igor Carron.

From the post:

SVDFeature: A Toolkit for Feature-based Collaborative Filtering by Tianqi ChenWeinan Zhang,  Qiuxia LuKailong Chen Zhao Zheng, Yong Yu. The abstract reads:

In this paper we introduce SVDFeature, a machine learning toolkit for feature-based collaborative filtering. SVDFeature is designed to efficiently solve the feature-based matrix factorization. The feature-based setting allows us to build factorization models incorporating side information such as temporal dynamics, neighborhood relationship, and hierarchical information. The toolkit is capable of both rate prediction and collaborative ranking, and is carefully designed for efficient training on large-scale data set. Using this toolkit, we built solutions to win KDD Cup for two consecutive years.

The wiki for the project and attendant code is here.

Can’t argue with two KDD cups in as many years!

Licensed under Apache 2.0.

Fusion and inference from multiple data sources in a commensurate space

Friday, June 29th, 2012

Fusion and inference from multiple data sources in a commensurate space by Zhiliang Ma, David J. Marchette and Carey E. Priebe. (Ma, Z., Marchette, D. J. and Priebe, C. E. (2012), Fusion and inference from multiple data sources in a commensurate space. Statistical Analy Data Mining, 5: 187–193. doi: 10.1002/sam.11142)


Given objects measured under multiple conditions—for example, indoor lighting versus outdoor lighting for face recognition, multiple language translation for document matching, etc.—the challenging task is to perform data fusion and utilize all the available information for inferential purposes. We consider two exploitation tasks: (i) how to determine whether a set of feature vectors represent a single object measured under different conditions; and (ii) how to create a classifier based on training data from one condition in order to classify objects measured under other conditions. The key to both problems is to transform data from multiple conditions into one commensurate space, where the (transformed) feature vectors are comparable and would be treated as if they were collected under the same condition. Toward this end, we studied Procrustes analysis and developed a new approach, which uses the interpoint dissimilarities for each condition. We impute the dissimilarities between measurements of different conditions to create one omnibus dissimilarity matrix, which is then embedded into Euclidean space. We illustrate our methodology on English and French documents collected from Wikipedia, demonstrating superior performance compared to that obtained via standard Procrustes transformation.

An early example of identity issues in topic maps from Steve Newcomb made this paper resonate for me. Steve used the example that his home has a set of geographic coordinates, a street address and a set of directions to arrive at his home, all of which identify the same subjects. All the things that can be said using one identifier can be gathered up with statements using the other identifiers.

While I still have reservations about the use of Euclidean space when dealing with non-Euclidean semantics, one has to admit that it is possible to derive some value from it.

I had to file an ILL for a print copy of the article. More to follow when it arrives.