Lucene’s FuzzyQuery is 100 times faster in 4.0
I first saw this post mentioned in a tweet by Lars Marius Garshol.
From the post:
There are many exciting improvements in Lucene’s eventual 4.0 (trunk) release, but the awesome speedup to FuzzyQuery really stands out, not only from its incredible gains but also because of the amazing behind-the-scenes story of how it all came to be.
FuzzyQuery matches terms “close” to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.
The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).
FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other “close” terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.
Prior to 4.0, FuzzyQuery took the simple yet horribly costly brute force approach: it visits every single unique term in the index, computes the edit distance for it, and accepts the term (and its documents) if the edit distance is low enough.
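For readers who want to see what the query itself looks like, here is a minimal sketch of constructing a FuzzyQuery in Lucene 4.x, matching the mcandless~1 example above. The field name "author" and the class name are placeholders of mine, not taken from the post:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.Query;

    public class FuzzyExample {
        public static void main(String[] args) {
            // Base term in a hypothetical "author" field.
            Term base = new Term("author", "mcandless");

            // Accept terms within edit distance 1 of the base term,
            // i.e. the same matches as the query-parser form "mcandless~1".
            Query oneEdit = new FuzzyQuery(base, 1);

            // With no explicit distance, FuzzyQuery defaults to 2 edits.
            Query defaultEdits = new FuzzyQuery(base);

            System.out.println(oneEdit);
            System.out.println(defaultEdits);
        }
    }

The 100x speedup in 4.0 comes entirely from how Lucene enumerates the candidate terms behind such a query, not from any change to this API.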
The story is a good one and demonstrates the need for topic maps in computer science.
The authors used “Googling” to find an implementation by Jean-Philippe Barrette-LaPierre of the algorithm, described in a paper by Klaus Schulz and Stoyan Mihov, that enabled this increase in performance.
That’s one way to do it, but it leaves to chance whether other researchers will find the same implementation.
Moreover, once that connection has been made, associating the implementation with the algorithm and its paper, it should be preserved for subsequent searchers.
Along with pointers to the implementation of this algorithm in Lucene, to other implementations, and even to other accounts by the same authors, such as their 2004 publication in Computational Linguistics, Fast Approximate Search in Large Dictionaries.
Sounds like a topic map to me. The question is: how do we make ad hoc authoring of a topic map practical?
Suggestions?