From the post:
Approximate Nearest Neighbor Search for Sparse Data in Python! This library is well suited to finding nearest neighbors in sparse, high dimensional spaces (like text documents).
Out of the box, PySparNN supports Cosine Distance (i.e. 1 – cosine_similarity).
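A minimal sketch of that metric using scipy and numpy (the libraries PySparNN builds on), applied brute-force to a toy sparse corpus. This is purely illustrative – it is not PySparNN's API, just the distance it computes:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Toy corpus: 200 sparse "documents" over 1000 features.
docs = sparse_random(200, 1000, density=0.02, format='csr', random_state=0)
query = docs[42]

# Cosine distance = 1 - cosine similarity, computed against every row.
row_norms = np.sqrt(np.asarray(docs.multiply(docs).sum(axis=1)).ravel())
query_norm = np.sqrt(query.multiply(query).sum())
sims = np.asarray((docs @ query.T).todense()).ravel() / (row_norms * query_norm + 1e-12)
dists = 1.0 - sims

nearest = np.argsort(dists)[:5]
print(nearest)  # a document is its own nearest neighbor, so 42 comes first
```

PySparNN avoids this brute-force scan by clustering the index, but the distance being minimized is the same.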
- Designed to be efficient on sparse data (memory & CPU).
- Implemented leveraging existing Python libraries (scipy & numpy).
- Easily extended with other metrics: Manhattan, Euclidean, Jaccard, etc.
- Work in progress – min/max distance thresholds can be set at query time (not index time). Example: return the k closest items within the interval [0.8, 0.9] of a query point.
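The thresholding idea in that last bullet could look like the following sketch, where `search_in_range` is a hypothetical helper (not PySparNN's actual API) that filters a list of (distance, item) candidates at query time:

```python
def search_in_range(candidates, k, min_dist=0.0, max_dist=1.0):
    # Hypothetical helper: keep candidates whose distance lies in
    # [min_dist, max_dist], then return the k closest of those.
    in_range = [c for c in candidates if min_dist <= c[0] <= max_dist]
    return sorted(in_range)[:k]

candidates = [(0.05, 'near'), (0.82, 'b'), (0.88, 'c'), (0.95, 'far')]
print(search_in_range(candidates, k=2, min_dist=0.8, max_dist=0.9))
# -> [(0.82, 'b'), (0.88, 'c')]
```

The point of doing this at query time rather than index time is that the same index can serve queries with different distance windows.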
If your data is NOT SPARSE – please consider annoy. Annoy uses a similar-ish method and I am a big fan of it. As of this writing, annoy performs ~8x faster on their introductory example.
General rule of thumb – annoy performs better if you can get your data to fit into memory (as a dense vector).
The most comparable library to PySparNN is scikit-learn’s LSHForest module. As of this writing, PySparNN is ~1.5x faster on the 20newsgroups dataset. More robust benchmarking on sparse data is still needed. Here is the comparison.
I included the text snippet in the title because the name PySparNN isn’t self-explanatory, at least not at first glance.
I looked for a good explanation of nearest neighbors and encountered this lecture by Patrick Winston (MIT OpenCourseWare):
The lecture has a number of gems, including the observation that:
Town and Country readers tend to be social parasites.
Observations on text and nearest neighbors run from time marks 17:30 – 24:17.
You should make an effort to watch the entire video. You will come away with a broader appreciation for the sheer power of nearest neighbor analysis and, as a bonus, some valuable insights into why going without sleep is a very bad idea.
I first saw this in a tweet by Lynn Cherny.