Similarity in Elasticsearch

Similarity in Elasticsearch by Konrad G. Beiske.

From the post:

A similarity model is a set of abstractions and metrics to define to what extent things are similar. That’s quite a general definition. In this article I will only consider textual similarity. In this context, the uses of similarity models can be divided into two categories: classification of documents, with a finite set of categories where the categories are known; and information retrieval where the problem can be defined as ‘find the the most relevant documents to a given query’. In this article I will look into the latter category.

Elasticsearch provides the following similarity models: default, bm25, drf and ib. I have limited the scope of this article to default and bm25. The divergence from randomness and information based similarities may feature in a future article.

Konrad goes on to talk about the default similarity model in Elasticsearch, Tf/idf and BM25 (aka Okapi BM25), a probabilistic model.

He also points the reader to: The Probabilistic Relevance Framework: BM25 and Beyond for further details on BM25.

A good post if you want to learn more about tuning similarity in Elasticsearch.

BTW, documentation on similarity module for 0.90.

While the build-in similarity models offer a lot of mileage no doubt, I am more intrigued by the potential for creating a custom similarity model.

As you know, some people think English words are just English words. Search engines tend to ignore time, social class, context of use, etc., in returning all the “instances” of an English word.

That is to say the similarity model for one domain or period could be quite different from the similarity model for another.

Domain or period specific similarity models would be difficult to construct and certainly less general.

Given the choice, of being easy, general and less accurate versus being harder, less general and more accurate, which would you choose?

Does your answer change if you are a consumer looking for the best results or a developer trying to sell “good enough” results?

Comments are closed.