Embedding Concepts in text for smarter searching with Solr4

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 11, 2013

Embedding Concepts in text for smarter searching with Solr4

Filed under: Concept Detection,Indexing,Searching,Solr — Patrick Durusau @ 7:08 pm

Embedding Concepts in text for smarter searching with Solr4 by Sujit Pal.

From the post:

Storing the concept map for a document in a payload field works well for queries that can treat the document as a bag of concepts. However, if you want to consider the concept’s position(s) in the document, then you are out of luck. For queries that resolve to multiple concepts, it makes sense to rank documents with these concepts close together higher than those which had these concepts far apart, or even drop them from the results altogether.

We handle this requirement by analyzing each document against our medical taxonomy, and annotating recognized words and phrases with the appropriate concept ID before it is sent to the index. At index time, a custom token filter similar to the SynonymTokenFilter (described in the LIA2 Book) places the concept ID at the start position of the recognized word or phrase. Resolved multi-word phrases are retained as single tokens – for example, the phrase “breast cancer” becomes “breast0cancer”. This allows us to rewrite queries such as “breast cancer radiotherapy”~5 as “2790981 2791965″~5.

One obvious advantage is that synonymy is implicitly supported with the rewrite. Medical literature is rich with synonyms and acronyms – for example, “breast cancer” can be variously called “breast neoplasm”, “breast CA”, etc. Once we rewrite the query, 2790981 will match against a document annotation that is identical for each of these various synonyms.

Another advantage is the increase of precision since we are dealing with concepts rather than groups of words. For example, “radiotherapy for breast cancer patients” would not match our query since “breast cancer patient” is a different concept than “breast cancer” and we choose the longest subsequence to annotate.

Yet another advantage of this approach is that it can support mixed queries. Assume that a query can only be partially resolved to concepts. You can still issue the partially resolved query against the index, and it would pick up the records where the pattern of concept IDs and words appear.

Finally, since this is just a (slightly rewritten) Solr query, all the features of standard Lucene/Solr proximity searches are available to you.

In this post, I describe the search side components that I built to support this approach. It involves a custom TokenFilter and a custom Analyzer that wraps it, along with a few lines of configuration code. The code is in Scala and targets Solr 4.3.0.

So if Solr4 can make documents smarter, can the same be said about topics?

Recalling that “document” for Solr is defined by your indexing, not some arbitrary byte count.

As we are indexing topics we could add information to topics to make merging more robust.

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 11, 2013

Embedding Concepts in text for smarter searching with Solr4

No Comments