Why is Multi-term synonym mapping so hard in Solr? by John Berryman.
From the post:
There is a very common need for multi-term synonyms. We’ve actually run across several use cases among our recent clients. Consider the following examples:
- Ecommerce: If a customer searches for “weed whacker”, but the more canonical name is “string trimmer”, then you need synonyms, otherwise you’re going to lose a sale.
- Law: Consider a layperson attempting to find a section of legal code pertaining to their “truck”. If the law only talks about “motor vehicles”, then, without synonyms, this individual will go away uninformed.
- Medicine: When a doctor is looking up recent publications on “heart attack”, synonyms make sure that he also finds documents that happen to only mention “myocardial infarction”.
One would hope that working with synonyms should be as simple as tossing a set of synonyms into the synonyms.txt file and just having Solr “do the right thing.”™ And when we’re talking about simple, single-term synonyms (e.g. TV = televisions), synonyms really are just that straight forward. Unfortunately, especially as you get into more complex uses of synonyms, such as multi-term synonyms, there are several gotchas. Sometimes, there are workarounds. And sometimes, for now at least, you’ll just have to make do what you can currently achieve using Solr! In this post we’ll provide a quick intro to synonyms in Solr, we’ll walk through some of the pain points, and then we’ll propose possible resolutions.
John does a great review of basic synonym mapping in Solr as a prelude to illustrating the difficulty with multi-term synonyms.
His example case is the mapping:
spider man ==> spiderman
“Obvious” solutions fail, but John does conclude with a pointer to one solution to the issue.
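To see why the obvious approach can fail, here is a toy sketch in Python. It is not Solr code. It assumes the commonly described cause: the query parser splits the query on whitespace before the analysis chain runs, so the synonym step never sees “spider” and “man” side by side. The table `SYNONYMS` and the function names are mine, invented for illustration.

```python
# Toy model, not Solr code: a synonym table keyed by token tuples.
SYNONYMS = {("spider", "man"): ["spiderman"]}


def apply_synonyms(tokens):
    """Replace any run of tokens found in SYNONYMS, longest match first."""
    out, i = [], 0
    max_len = max(len(key) for key in SYNONYMS)
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in SYNONYMS:
                out.extend(SYNONYMS[key])
                i += n
                break
        else:
            # No synonym starts at this position; keep the token as-is.
            out.append(tokens[i])
            i += 1
    return out


def analyze_whole(text):
    # The whole text reaches the synonym step at once.
    return apply_synonyms(text.lower().split())


def analyze_pre_split(text):
    # Assumed query-time behavior: each whitespace-separated chunk
    # is analyzed on its own, so multi-term keys can never match.
    out = []
    for chunk in text.lower().split():
        out.extend(apply_synonyms([chunk]))
    return out


print(analyze_whole("spider man comics"))      # ['spiderman', 'comics']
print(analyze_pre_split("spider man comics"))  # ['spider', 'man', 'comics']
```

The same synonym table gives two different answers depending only on when the text is split, which is the heart of the difficulty.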
Recommended for a deeper understanding of Solr’s handling of synonymy.
While reading John’s post, it occurred to me to check Wikipedia’s disambiguation page for the term “spider.” The entries break down by category as follows:
- Comics – 17
- Other publications – 5
- Culinary – 3
- Film and television – 10
- Games and sports – 10
- Land vehicles – 4
- Mathematics – 1
- Music – 16
- People – 7
- Technology – 14
- Other uses – 7
I count eighty-eight (88) distinct “spiders” (counting spider as “an air-breathing eight-legged animal”, of which there are 44,032 species identified as of June 23, 2013).
John suggests a parsing solution for the multi-term synonym problem in Solr, but however “spider” is parsed, there remains ambiguity.
An 88-fold ambiguity (at minimum).
At least for Solr and other search engines.
Not so much for us as human readers.
A human reader is not limited to “spider” in deciding which of 88 possible spiders is the correct one and/or the appropriate synonyms to use.
Each “spider” is seen in a “context” and a human reader will attribute (perhaps not consciously) characteristics to a particular “spider” in order to identify it.
If we record characteristics for each “spider,” then distinguishing and matching spiders to synonyms (also with characteristics) becomes a task of:
- Deciding which characteristic(s) to require for identification/synonymy.
- Fashioning rules for identification/synonymy.
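A minimal sketch of those two tasks in Python, under assumptions of my own: each subject is a plain dictionary of recorded characteristics, and the rule is “two subjects are the same when they agree on every required characteristic.” The records, the key names, and the functions `same_subject` and `synonyms_for` are hypothetical, not taken from any library.

```python
# Hypothetical records: each subject carries a name plus recorded characteristics.
SUBJECTS = [
    {"name": "spider", "order": "Araneae"},
    {"name": "Araneae", "order": "Araneae"},
    {"name": "spider", "category": "Land vehicles"},
]


def same_subject(a, b, required):
    """Rule: a and b identify the same subject when both record every
    required characteristic and the recorded values agree."""
    return all(k in a and k in b and a[k] == b[k] for k in required)


def synonyms_for(subject, candidates, required):
    """Names of candidates that match `subject` under the rule,
    excluding the subject's own name."""
    return sorted({
        c["name"]
        for c in candidates
        if same_subject(subject, c, required) and c["name"] != subject["name"]
    })


arachnid = SUBJECTS[0]
print(synonyms_for(arachnid, SUBJECTS, required=["order"]))  # ['Araneae']
```

The land-vehicle “spider” shares a name with the arachnid but records no order, so the rule keeps the two apart. Choosing a different `required` list yields a different notion of synonymy, which is the point of making the first task an explicit decision.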
Much can be said about those two tasks, but for now I will leave you with a practical example of their application.
Assume that you are indexing some portion of web space and you encounter The World Spider Catalog, Version 14.0.
We know that every instance of “spider” (136) at that site has the characteristic of order Araneae. How you associate that characteristic with every instance of “spider”, or with other names from the spider database, is an implementation issue.
However, knowing “order Araneae” allows us to reliably distinguish all the instances of “spider” at this resource from other instances of “spider” that lack that characteristic.
Just as importantly, we only have to perform that task once, rather than relying upon our users to perform it over and over again.
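One way to picture “perform that task once” is to attach the characteristic when a document is indexed, so that every later query can use it. The Python sketch below is only an illustration under stated assumptions: the host names are placeholders, and `SOURCE_CHARACTERISTICS`, `index_document`, and `search` are hypothetical helpers, not part of Solr or any other search engine.

```python
# Hypothetical host name standing in for the catalog site; the
# characteristic is recorded once per source, not once per query.
SOURCE_CHARACTERISTICS = {
    "spider-catalog.example": {"order": "Araneae"},
}


def index_document(host, text):
    """Build an index entry, copying in whatever is known about its source."""
    doc = {"host": host, "text": text}
    doc.update(SOURCE_CHARACTERISTICS.get(host, {}))
    return doc


def search(index, term, **required):
    """Match on surface text, then keep only documents whose recorded
    characteristics equal every required key/value pair."""
    term = term.lower()
    return [
        d for d in index
        if term in d["text"].lower()
        and all(d.get(k) == v for k, v in required.items())
    ]


index = [
    index_document("spider-catalog.example", "Spider families of the world"),
    index_document("cars.example", "Restoring a classic Spider convertible"),
]

print(len(search(index, "spider")))                   # 2
print(len(search(index, "spider", order="Araneae")))  # 1
```

A plain text match returns both documents. Adding the recorded characteristic returns only the arachnid one, with no extra effort from the person searching.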
The weakness of current indexing is that it harvests only the surface text and not the rich semantic soil in which it grows.