Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 17, 2013

Spelling isn’t a subject…

Filed under: Lucene,Solr — Patrick Durusau @ 8:39 pm

Have you seen Alec Baldwin’s teacher commercial?

A student suggests spelling as a subject and Alec responds: “Spelling isn’t a subject, spell-check, that’s a program, right?”

In “Spellchecking in Trovit” by Xavier Sanchez Loro, you will find that spell-check is more than a “program.”

That is especially true in a multi-language environment, where the goal isn’t just correct spelling but delivery of relevant information to users.

From the post:

This post aims to explain the implementation and use case for spellchecking in the Trovit search engine that we will be presenting at the Lucene/Solr Revolution EU 2013 [1]. Trovit [2] is a classified ads search engine supporting several different sites, one for each country and vertical. Our search engine supports multiple indexes in multiple languages, each with several millions of indexed ads. Those indexes are segmented in several different sites depending on the type of ads (homes, cars, rentals, products, jobs and deals). We have developed a multi-language spellchecking system using SOLR [3] and Lucene [4] in order to help our users to better find the desired ads and to avoid the dreaded 0 results as much as possible (obviously, whilst still reporting back relevant information to the user). As such, our goal is not pure orthographic correction, but also to suggest correct searches for a certain site.

Our approach: Contextual Spellchecking

One key element in the spellchecking process is choosing the right dictionary, one with a relevant vocabulary for the type of information included in each site. Our approach is specializing the dictionaries based on user’s search context. Our search contexts are composed of country (with a default language) and vertical (determining the type of ads and vocabulary). Each site’s document corpus has a limited vocabulary, reduced to the type of information, language and terms included in each site’s ads. Using a more generalized approach is not suitable for our needs, since a unique vocabulary for each language (regardless of the vertical) is not as precise as specialized vocabularies for each language and vertical. We have observed drastic differences in the type of terms included in the indexes and the semantics of each vertical. Terms that are relevant in one context are meaningless in another one (e.g. “chalet” is not a relevant word in cars vertical, but is a highly relevant word for homes vertical). As such, Trovit’s spellchecking implementation exhibits very different vocabularies for each site, even when supporting the same language.

I like the emphasis on “contextual” spellchecking.

Sounds a lot like “contextual” subject recognition.

Yes?

Walking through this post in detail is an excellent exercise!
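
As a start on that exercise, here is a minimal sketch of the “contextual” idea in Python. It is not Trovit’s implementation, which is built on Solr and Lucene. The vocabularies, the suggest function and the similarity cutoff are all my own assumptions, and the standard library’s difflib stands in for a real spellchecker. The only point it tries to show is that the dictionary is chosen by search context (country plus vertical) before any correction is attempted.

```python
from difflib import get_close_matches

# Hypothetical per-site vocabularies keyed by (country, vertical).
# In a real system these would be derived from each site's indexed ads;
# the words below are made up for illustration.
VOCABULARIES = {
    ("es", "homes"): ["chalet", "piso", "garaje", "terraza"],
    ("es", "cars"): ["diesel", "turbo", "garaje", "cabrio"],
}


def suggest(term, country, vertical, max_suggestions=3, cutoff=0.7):
    """Suggest corrections for term using only the vocabulary of its context.

    The cutoff is an assumed similarity threshold (0 to 1) passed to difflib.
    """
    vocabulary = VOCABULARIES.get((country, vertical), [])
    term = term.lower()
    if term in vocabulary:
        # Already a known word in this context, nothing to correct.
        return [term]
    # Otherwise rank the context's own words by string similarity.
    return get_close_matches(term, vocabulary, n=max_suggestions, cutoff=cutoff)


if __name__ == "__main__":
    print(suggest("chalte", "es", "homes"))
    print(suggest("chalte", "es", "cars"))
```

With these toy vocabularies, the misspelling “chalte” is corrected to “chalet” on the homes site and gets no suggestion at all on the cars site, which is the behavior the quoted passage describes: a term that is relevant in one context is meaningless in another.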
