Archive for the ‘Lexical Analyzer’ Category

Levenshtein distance in C++ and code profiling in R

Monday, March 26th, 2012

Levenshtein distance in C++ and code profiling in R by Dzidorius Martinaitis.

From the post:

At work, the client requested, if existing search engine could accept singular and plural forms equally, e. g. “partner” and “partners” would lead to the same result.

The first option – stemming. In that case, search engine would use root of a word, e. g. “partn”. However, stemming has many weaknesses: two different words might have same root, a user can misspell the root of the word, except English and few others languages it is not that trivial to implement stemming.

Levenshtein distance comes as the second option. The algorithm is simple – you have two words and you calculate the difference between them. You can insert, delete or replace any character, but it will cost you. Let’s imagine, an user enters “Levenstin distances” into search engine and expects to find revalent information. However, he just made 2 errors by misspeling the author’s name and he used plural form of “distance”. If search engine accepts 3 errors – the user will get relevant information.

The challenge comes, when you have a dictionary of terms (e. g. more that 1 mil.) and you want to get similar terms based on Levenshtein distance. You can visit every entry in the dictionary (very costly) or you can push dictionary into the trie. Do you need a proof for the cost? There we go:

Appreciate the comparison of approaches based on data but wondering why “professional” stemming, like you find in Solr was not investigated? Will post a comment asking and report back.

You are likely to encounter this sort of issue in almost all topic map authoring activities.

Natural

Saturday, December 24th, 2011

Natural

From the webpage:

“Natural” is a general natural language facility for nodejs. Tokenizing, stemming, classification, phonetics, tf-idf, WordNet, and some inflection are currently supported.

It’s still in the early stages, and am very interested in bug reports, contributions and the like.

Note that many algorithms from Rob Ellis’s node-nltools are being merged in to this project and will be maintained here going forward.

At the moment most algorithms are English-specific but long-term some diversity is in order.

Aside from this README the only current documentation is here on my blog.

Just in case you are looking for natural language processing capabilities with nodejs.

Lexical Analysis with Flex

Tuesday, December 6th, 2011

Lexical Analysis with Flex

From the introduction:

flex is a tool for generating scanners. A scanner is a program which recognizes lexical patterns in text. The flex program reads the given input files, or its standard input if no file names are given, for a description of a scanner to generate. The description is in the form of pairs of regular expressions and C code, called rules. flex generates as output a C source file, lex.yy.c by default, which defines a routine yylex(). This file can be compiled and linked with the flex runtime library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code.

For when you have serious scanning tasks.

Modifying a Lucene Snowball Stemmer

Tuesday, August 9th, 2011

Modifying a Lucene Snowball Stemmer

From the post:

This post is written for advanced users. If you do not know what SVN (Subversion) is or if you’re not ready to get your hands dirty, there might be something more interesting to read on Wikipedia. As usual. This is an introduction to how to get a Lucene development environment running, a Solr environment and lastly, to create your own Snowball stemmer. Read on if that seems interesting. The receipe for regenerating the Snowball stemmer (I’ll get back to that…) assumes that you’re running Linux. Please leave a comment if you’ve generated the stemmer class under another operating system.

When indexing data in Lucene (a fulltext document search library) and Solr (which uses Lucene), you may provide a stemmer (a piece of code responsible for “normalizing” words to their common form (horses => horse, indexing => index, etc)) to give your users better and more relevant results when they search. The default stemmer in Lucene and Solr uses a library named Snowball which was created to do just this kind of thing. Snowball uses a small definition language of its own to generate parsers that other applications can embed to provide proper stemming.

Really advanced users will want to check out Snowball’s namesake, SNOBOL, a powerful pattern matching language that was invented in 1962. (That’s not a typo, 1962.) See SNOBOL4.org for more information and resources.

The post outlines how to change the default stemmer for Lucene and Solr to improve its stemming of Norwegian words. Useful in case you want to write/improve a stemmer for another language.