Archive for the ‘Nearest Neighbor’ Category

PySparNN [nearest neighbors in sparse, high-dimensional spaces (like text documents)]

Thursday, April 7th, 2016

PySparNN

From the post:

Approximate Nearest Neighbor Search for Sparse Data in Python! This library is well suited to finding nearest neighbors in sparse, high dimensional spaces (like text documents).

Out of the box, PySparNN supports Cosine Distance (i.e. 1 – cosine_similarity).

PySparNN benefits:

  • Designed to be efficient on sparse data (memory & CPU).
  • Implemented leveraging existing python libraries (scipy & numpy).
  • Easily extended with other metrics: Manhattan, Euclidean, Jaccard, etc.
  • Work in progress – Min, Max distance thresholds can be set at query time (not index time). Example: return the k closest items on the interval [0.8, 0.9] from a query point.

If your data is NOT SPARSE – please consider annoy. Annoy uses a similar-ish method and I am a big fan of it. As of this writing, annoy performs ~8x faster on their introductory example.
General rule of thumb – annoy performs better if you can get your data to fit into memory (as a dense vector).

The most comparable library to PySparNN is scikit-learn’s LSHForest module. As of this writing, PySparNN is ~1.5x faster on the 20newsgroups dataset. A more robust benchmarking on sparse data is desired. Here is the comparison.
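To see the task PySparNN targets, here is a minimal sketch of exact cosine-distance nearest-neighbor search over sparse text vectors, using only scipy/scikit-learn primitives. This is the brute-force baseline that libraries like PySparNN approximate, not PySparNN’s own API; the toy documents and query are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

docs = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stock prices fell sharply today",
    "the market rallied on earnings news",
]

# TF-IDF gives a sparse, high-dimensional representation (a CSR matrix).
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Brute-force search with cosine distance works directly on sparse input.
nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute").fit(X)
dist, idx = nn.kneighbors(vec.transform(["cats and dogs in the yard"]))
print(idx[0], dist[0])  # indices and cosine distances of the two nearest docs
```

The brute-force baseline is exact but scans every document per query; approximate indexes trade a little recall for sublinear query time, which is the whole point of PySparNN and annoy.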

I included the text snippet in the title because the name PySparNN isn’t self-explanatory, at least not at first glance.

I looked for a good explanation of nearest neighbors and found this lecture by Patrick Winston (MIT OpenCourseWare):

The lecture has a number of gems, including the observation that:

Town and Country readers tend to be social parasites.

Observations on text and nearest neighbors, time marks 17:30 – 24:17.

You should make an effort to watch the entire video. You will gain a broader appreciation of the sheer power of nearest neighbor analysis and, as a bonus, some valuable insights on why going without sleep is a very bad idea.

I first saw this in a tweet by Lynn Cherny.

Nearest Neighbor Search in Google Correlate

Tuesday, October 15th, 2013

Nearest Neighbor Search in Google Correlate by Dan Vanderkam, Rob Schonberger, Henry Rowley, Sanjiv Kumar.

Abstract:

This paper presents the algorithms which power Google Correlate, a tool which finds web search terms whose popularity over time best matches a user-provided time series. Correlate was developed to generalize the query-based modeling techniques pioneered by Google Flu Trends and make them available to end users.

Correlate searches across millions of candidate query time series to find the best matches, returning results in less than 200 milliseconds. Its feature set and requirements present unique challenges for Approximate Nearest Neighbor (ANN) search techniques. In this paper, we present Asymmetric Hashing (AH), the technique used by Correlate, and show how it can be adapted to the specific needs of the product.

We then develop experiments to test the throughput and recall of Asymmetric Hashing as compared to a brute-force search. For “full” search vectors, we achieve a 10x speedup over brute force search while maintaining 97% recall. For search vectors which contain holdout periods, we achieve a 4x speedup over brute force search, also with 97% recall. (I followed the paragraphing in the PDF for the abstract.)
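The core idea behind asymmetric hashing can be sketched in a few lines: database vectors are quantized chunk by chunk against small per-chunk codebooks, while the query stays exact; at query time one distance table per chunk is computed against the codebook, and each database vector’s approximate distance is just a sum of table lookups. This is a generic product-quantization-style sketch, not the paper’s actual implementation; the sizes and the random codebook (standing in for trained k-means centroids) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy database: n vectors of dimension d, split into n_chunks chunks,
# each quantized against a codebook of k centroids.
n, d, n_chunks, k = 1000, 32, 4, 16
chunk = d // n_chunks
X = rng.standard_normal((n, d))

# Per-chunk codebooks: sampled database rows stand in for k-means centroids.
codebooks = [X[rng.choice(n, k, replace=False), c * chunk:(c + 1) * chunk]
             for c in range(n_chunks)]

# Encode: each database vector becomes n_chunks small centroid ids.
codes = np.empty((n, n_chunks), dtype=np.int64)
for c, cb in enumerate(codebooks):
    Xc = X[:, c * chunk:(c + 1) * chunk]
    d2 = ((Xc[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (n, k)
    codes[:, c] = d2.argmin(1)

def asymmetric_search(q, topk=5):
    """Exact query vs quantized database: one lookup table per chunk."""
    tables = []
    for c, cb in enumerate(codebooks):
        qc = q[c * chunk:(c + 1) * chunk]
        tables.append(((cb - qc) ** 2).sum(1))  # (k,) distances to centroids
    # Approximate squared distance = sum of per-chunk table lookups.
    approx = sum(tables[c][codes[:, c]] for c in range(n_chunks))
    return approx.argsort()[:topk], approx

q = rng.standard_normal(d)
idx, dists = asymmetric_search(q)
print(idx)
```

The asymmetry is the key trick: only the database side is lossy, so the query loses no precision, and per-query work shrinks from n full d-dimensional distance computations to n_chunks small tables plus n table-lookup sums.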

Ten-x speedups on Google Flu Trends size data sets are non-trivial.

If you are playing in that size sand box, this is a must read.

Nearest Neighbor Search: the Old, the New, and the Impossible

Monday, October 10th, 2011

Nearest Neighbor Search: the Old, the New, and the Impossible, the MIT PhD thesis of Alexandr Andoni.

To be honest, it is the discovery of gems like this one that keep me prowling journals, pre-publication sites, homepages, etc.

Alexandr walks the reader through a very complete review of the literature on nearest neighbor search, all the while laying a foundation for the significant progress he has made.

Not for the faint of heart but it promises to be well worth the effort.