Document Frequency Limited MultiTermQuerys
From the post:
If you’ve ever looked at user generated data such as tweets, forum comments or even SMS text messages, you’ll have noticed that there are many variations in the spelling of words. In some cases they are intentional, such as omissions of vowels to reduce message length; in other cases they are unintentional typos and spelling mistakes.
Querying this kind of data by matching only the traditional spelling of a word can lead to many valid results being missed. One way to include matches on variations of a word is to use Lucene’s MultiTermQuerys such as FuzzyQuery or WildcardQuery. For example, to find matches for the word “hotel” and all its variations, you might use the queries “hotel~” and “h*t*l”. Unfortunately, depending on how many variations there are, the queries could end up matching tens or even hundreds of terms, which will impact your performance. You might be willing to accept this performance degradation to capture all the variations, or you might want to only query those terms which are common in your index, dropping the infrequent variations and giving your users maximum results with little impact on performance.
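The trade-off described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Lucene's API: the `DOC_FREQ` dictionary, its counts, and the helper names `expand_wildcard` and `most_frequent` are all made up for the example. It expands the wildcard “h*t*l” against a toy term dictionary and then keeps only the terms that appear in the most documents.

```python
from fnmatch import fnmatchcase

# Hypothetical term dictionary: term -> number of documents containing it.
# The counts are invented purely for illustration.
DOC_FREQ = {
    "hotel": 950,
    "hotl": 40,
    "hotal": 12,
    "hostel": 300,
    "hatel": 3,
    "htl": 25,
    "motel": 200,
}


def expand_wildcard(pattern, doc_freq):
    """Return every indexed term that matches a glob-style pattern."""
    return [term for term in doc_freq if fnmatchcase(term, pattern)]


def most_frequent(terms, doc_freq, max_terms):
    """Keep only the max_terms terms with the highest document frequency."""
    # Sort by descending frequency; break ties alphabetically for stable output.
    ranked = sorted(terms, key=lambda t: (-doc_freq[t], t))
    return ranked[:max_terms]


# Every term the wildcard would match in the toy index.
matches = expand_wildcard("h*t*l", DOC_FREQ)

# Only the three most common of those terms would actually be queried.
top = most_frequent(matches, DOC_FREQ, max_terms=3)

print(matches)
print(top)
```

Here the wildcard matches six of the seven terms, but only the three most common ones would be searched, so rare variants such as “hatel” are dropped in exchange for a smaller, cheaper query.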
Let’s explore how you can focus your MultiTermQuerys on the most common terms in your index.
Not to give too much away, but you will learn how to tune a fuzzy match of terms (to account for misspellings, for example).
This is a very good site and blog for search issues.