Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 15, 2013

Nearest Neighbor Search in Google Correlate

Filed under: Algorithms,Google Correlate,Nearest Neighbor — Patrick Durusau @ 6:41 pm

Nearest Neighbor Search in Google Correlate by Dan Vanderkam, Rob Schonberger, Henry Rowley, Sanjiv Kumar.

Abstract:

This paper presents the algorithms which power Google Correlate, a tool which finds web search terms whose popularity over time best matches a user-provided time series. Correlate was developed to generalize the query-based modeling techniques pioneered by Google Flu Trends and make them available to end users.

Correlate searches across millions of candidate query time series to find the best matches, returning results in less than 200 milliseconds. Its feature set and requirements present unique challenges for Approximate Nearest Neighbor (ANN) search techniques. In this paper, we present Asymmetric Hashing (AH), the technique used by Correlate, and show how it can be adapted to the specific needs of the product.

We then develop experiments to test the throughput and recall of Asymmetric Hashing as compared to a brute-force search. For “full” search vectors, we achieve a 10x speedup over brute force search while maintaining 97% recall. For search vectors which contain holdout periods, we achieve a 4x speedup over brute force search, also with 97% recall. (I followed the paragraphing in the PDF for the abstract.)

Ten-x speedups on Google Flu Trends size data sets are non-trivial.

If you are playing in that size sand box, this is a must read.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress