Solution for multi-term synonyms in Lucene/Solr using the Auto Phrasing TokenFilter by Ted Sullivan.
From the post:
In a previous blog post, I introduced the AutoPhrasingTokenFilter. This filter is designed to recognize noun-phrases that represent a single entity or ‘thing’. In this post, I show how the use of this filter, combined with a Synonym Filter configured to take advantage of auto phrasing, can help to solve an ongoing problem in Lucene/Solr – how to deal with multi-term synonyms.
The problem with multi-term synonyms in Lucene/Solr is well documented (see Jack Krupansky’s proposal, John Berryman’s excellent summary and Nolan Lawson’s query parser solution). Basically, what it boils down to is a problem with parallel term positions in the synonym-expanded token list – based on the way that the Lucene indexer ingests the analyzed token stream. The indexer pays attention to a token’s start position but does not attend to its position length. This causes multi-term tokens to overlap subsequent terms in the token stream rather than maintaining a strictly parallel relation (in terms of both start and end positions) with their synonymous terms. Therefore, rather than getting a clean ‘state-graph’, we get a pattern called “sausagination” that does not accurately reflect the 1-1 mapping of terms to synonymous terms within the flow of the text (see the blog post by Mike McCandless on this issue). This problem disappears if all of the synonym pairs are single tokens.
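To make the flattening concrete, here is a minimal sketch against the classic Lucene analysis API of roughly that era (SynonymFilter was later deprecated in favor of SynonymGraphFilter, and constructor details vary across versions; the synonym pair and sample text are my own illustrations, not from Ted’s post). It maps the two-term phrase “sea biscuit” to the single term “seabiscuit” and prints each token’s position increment and position length, the second of which the indexer discards:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.CharsRefBuilder;

public class SausaginationDemo {
  public static void main(String[] args) throws Exception {
    // Map the multi-term phrase "sea biscuit" to the single term "seabiscuit",
    // deduplicating entries and keeping the original terms alongside the synonym.
    SynonymMap.Builder builder = new SynonymMap.Builder(true);
    CharsRef input = SynonymMap.Builder.join(
        new String[] {"sea", "biscuit"}, new CharsRefBuilder());
    builder.add(input, new CharsRef("seabiscuit"), true);

    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("sea biscuit races"));
    TokenStream stream = new SynonymFilter(tokenizer, builder.build(), true);

    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posInc = stream.addAttribute(PositionIncrementAttribute.class);
    PositionLengthAttribute posLen = stream.addAttribute(PositionLengthAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
      // "seabiscuit" is emitted spanning two positions (posLen=2), but the
      // indexer records only posInc. The span is lost at index time, so
      // "seabiscuit" ends up overlapping just "sea": the sausage.
      System.out.printf("%-12s posInc=%d posLen=%d%n",
          term, posInc.getPositionIncrement(), posLen.getPositionLength());
    }
    stream.end();
    stream.close();
  }
}
```

A phrase query for “seabiscuit races” then fails against the flattened index, because as stored, “seabiscuit” is followed by “biscuit”, not “races”.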
The multi-term synonym problem was described in a Lucene JIRA ticket (LUCENE-1622) which is still marked as “Unresolved”:
Posts like this one tempt me to sign off Twitter and read the Lucene/Solr ticket feeds instead. Seriously.
Ted proposes a workaround to the multi-term synonym problem using the AutoPhrasingTokenFilter. Equally important is his conclusion:
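The mechanics of the workaround are easiest to see in a small sketch. The AutoPhrasingTokenFilter itself lives in Ted’s project and I have not reproduced its configuration here; instead I hand-simulate its output by feeding in text where “seat belts” has already been collapsed to the single token “seat_belts” (the synonym pair and sample text are my own illustrations). Once every phrase is one token, the synonym entries are one-to-one, every token has position length 1, and the sausage cannot form:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
import org.apache.lucene.util.CharsRef;

public class AutoPhraseSynonymDemo {
  public static void main(String[] args) throws Exception {
    // After auto-phrasing, "seat belts" is one token ("seat_belts"), so the
    // synonym entry maps a single token to a single token: no spans to lose.
    SynonymMap.Builder builder = new SynonymMap.Builder(true);
    builder.add(new CharsRef("seat_belts"), new CharsRef("safety_restraints"), true);

    // Simulated auto-phrased text; a real chain would place the
    // AutoPhrasingTokenFilter between the tokenizer and the synonym filter.
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("wear seat_belts while driving"));
    TokenStream stream = new SynonymFilter(tokenizer, builder.build(), true);

    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posInc = stream.addAttribute(PositionIncrementAttribute.class);
    PositionLengthAttribute posLen = stream.addAttribute(PositionLengthAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
      // "safety_restraints" stacks cleanly on "seat_belts": posInc=0, posLen=1,
      // a strictly parallel relation the indexer preserves.
      System.out.printf("%-18s posInc=%d posLen=%d%n",
          term, posInc.getPositionIncrement(), posLen.getPositionLength());
    }
    stream.end();
    stream.close();
  }
}
```

Note that for this to work at query time, the same auto-phrasing must be applied to the query as well, which Ted’s post addresses.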
The AutoPhrasingTokenFilter can be an important tool in solving one of the more difficult problems with Lucene/Solr search – how to deal with multi-term synonyms. Simultaneously, we can address another serious problem that all search engines have – their focus on single tokens and the ambiguities that are present at that level. By shifting the focus more towards phrases that should be treated as semantic entities or units of language (i.e. “things”), the search engine is better able to return results based on ‘what’ the user is looking for rather than documents containing words that match the query. We are moving from searching with a “bag of words” to searching a “bag of things”.
Or more precisely:
…their focus on single tokens and the ambiguities that are present at that level. By shifting the focus more towards phrases that should be treated as semantic entities or units of language (i.e. “things”)…
Ambiguity at the token level remains, even if particular phrases can be treated as semantic entities.
Rather than Ted’s “bag of things,” may I suggest indexing “bags of properties”? Here anything from the lowliest token up to a higher semantic unit can be indexed as a bag of properties.
Imagine indexing these properties* for a single token:
- string: value
- pubYear: value
- author: value
- journal: value
- keywords: value
Would that suffice to distinguish a term as used in a medical journal from the same term in Vanity Fair?
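As a purely illustrative sketch of that speculation (the token “stent”, the field values, and the index layout are hypothetical, not anything proposed in the post), each occurrence could be indexed as a small Lucene document that carries the token together with its contextual properties:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class BagOfPropertiesDemo {
  public static void main(String[] args) throws Exception {
    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("props-index")),
        new IndexWriterConfig(new StandardAnalyzer()))) {

      // One "bag": the token plus contextual properties. Most of these could
      // be defaulted from the source document rather than supplied by hand.
      Document bag = new Document();
      bag.add(new TextField("string", "stent", Store.YES));
      bag.add(new StringField("pubYear", "2014", Store.YES));
      bag.add(new StringField("author", "Doe, J.", Store.YES));
      bag.add(new StringField("journal", "Journal of Cardiology", Store.YES));
      bag.add(new TextField("keywords", "angioplasty coronary", Store.YES));
      writer.addDocument(bag);
    }
    // A search constrained by journal or pubYear then lands on the medical
    // sense of the term rather than a fashion-magazine mention.
  }
}
```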
Ambiguity is predicated upon a lack of information.
That suggests a potential cure.
*(I’m not suggesting that all of those properties, or even most of them, would literally appear in a bag. Most, if not all, could be defaulted from an indexed source.)
I first saw this in a tweet by SolrLucene.