Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 19, 2012

Document Frequency Limited MultiTermQuerys

Filed under: Lucene,Query Expansion,Searching — Patrick Durusau @ 6:55 pm

Document Frequency Limited MultiTermQuerys

From the post:

If you’ve ever looked at user generated data such as tweets, forum comments or even SMS text messages, you’ll have noticed that there are many variations in the spelling of words.  In some cases they are intentional, such as omissions of vowels to reduce message length; in other cases they are unintentional typos and spelling mistakes.

Querying this kind of data by matching only the traditional spelling of a word can lead to many valid results being missed.  One way to include matches on variations of a word is to use Lucene’s MultiTermQuerys such as FuzzyQuery or WildcardQuery.  For example, to find matches for the word “hotel” and all its variations, you might use the queries “hotel~” and “h*t*l”.  Unfortunately, depending on how many variations there are, the queries could end up matching 10s or even 100s of terms, which will impact your performance.

You might be willing to accept this performance degradation to capture all the variations, or you might want to only query those terms which are common in your index, dropping the infrequent variations and giving your users maximum results with little impact on performance.

Let’s explore how you can focus your MultiTermQuerys on the most common terms in your index.

Not to give too much away, but you will learn how to tune a fuzzy match of terms. (To account for misspellings, for example.)
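
To make the idea concrete, here is a rough sketch in plain Python. It is not Lucene code and not the approach from the linked post: the term dictionary, its counts, and the function names are all made up for illustration. It expands the wildcard “h*t*l” against a toy index and then keeps only the terms that appear in the most documents.

```python
from fnmatch import fnmatchcase

# Hypothetical toy index: term -> number of documents containing it.
# These counts are invented purely for illustration.
DOC_FREQ = {
    "hotel": 950,
    "hotl": 120,
    "htl": 75,
    "hotell": 8,
    "hostel": 300,
    "hatchel": 1,
    "motel": 400,
}


def expand_wildcard(pattern, doc_freq):
    """Return every indexed term that matches a wildcard pattern like 'h*t*l'."""
    return [term for term in doc_freq if fnmatchcase(term, pattern)]


def limit_by_doc_freq(terms, doc_freq, max_terms):
    """Keep the max_terms most frequent terms (ties broken alphabetically)."""
    ranked = sorted(terms, key=lambda t: (-doc_freq[t], t))
    return ranked[:max_terms]


# Expand the pattern, then drop the rare variations.
candidates = expand_wildcard("h*t*l", DOC_FREQ)
top_terms = limit_by_doc_freq(candidates, DOC_FREQ, max_terms=3)
print(top_terms)
```

Note that the wildcard also pulls in “hostel”, which is a different word rather than a misspelling. A frequency cut-off keeps the query small, but it does not by itself tell a variant from a near neighbor.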

This is a very good site and blog for search issues.
