Modifying a Lucene Snowball Stemmer
From the post:
This post is written for advanced users. If you do not know what SVN (Subversion) is or if you’re not ready to get your hands dirty, there might be something more interesting to read on Wikipedia. As usual. This is an introduction to how to get a Lucene development environment running, a Solr environment and lastly, to create your own Snowball stemmer. Read on if that seems interesting. The receipe for regenerating the Snowball stemmer (I’ll get back to that…) assumes that you’re running Linux. Please leave a comment if you’ve generated the stemmer class under another operating system.
When indexing data in Lucene (a fulltext document search library) and Solr (which uses Lucene), you may provide a stemmer (a piece of code responsible for “normalizing” words to their common form (horses => horse, indexing => index, etc)) to give your users better and more relevant results when they search. The default stemmer in Lucene and Solr uses a library named Snowball which was created to do just this kind of thing. Snowball uses a small definition language of its own to generate parsers that other applications can embed to provide proper stemming.
Really advanced users will want to check out Snowball’s namesake, SNOBOL, a powerful pattern matching language that was invented in 1962. (That’s not a typo, 1962.) See SNOBOL4.org for more information and resources.
The post outlines how to change the default stemmer for Lucene and Solr to improve its stemming of Norwegian words. Useful in case you want to write/improve a stemmer for another language.