Archive for the ‘Stemming’ Category

Levenshtein distance in C++ and code profiling in R

Monday, March 26th, 2012

Levenshtein distance in C++ and code profiling in R by Dzidorius Martinaitis.

From the post:

At work, the client requested, if existing search engine could accept singular and plural forms equally, e. g. “partner” and “partners” would lead to the same result.

The first option – stemming. In that case, search engine would use root of a word, e. g. “partn”. However, stemming has many weaknesses: two different words might have same root, a user can misspell the root of the word, except English and few others languages it is not that trivial to implement stemming.

Levenshtein distance comes as the second option. The algorithm is simple – you have two words and you calculate the difference between them. You can insert, delete or replace any character, but it will cost you. Let’s imagine, an user enters “Levenstin distances” into search engine and expects to find revalent information. However, he just made 2 errors by misspeling the author’s name and he used plural form of “distance”. If search engine accepts 3 errors – the user will get relevant information.

The challenge comes, when you have a dictionary of terms (e. g. more that 1 mil.) and you want to get similar terms based on Levenshtein distance. You can visit every entry in the dictionary (very costly) or you can push dictionary into the trie. Do you need a proof for the cost? There we go:

Appreciate the comparison of approaches based on data but wondering why “professional” stemming, like you find in Solr was not investigated? Will post a comment asking and report back.

You are likely to encounter this sort of issue in almost all topic map authoring activities.