Practical Relevance Ranking for 11 Million Books, Part 1 by Tom Burton-West.
From the post:
This is the first in a series of posts about our work towards practical relevance ranking for the 11 million books in the HathiTrust full-text search application.
Relevance is a complex concept which reflects aspects of a query, a document, and the user as well as contextual factors. Relevance involves many factors such as the user’s preferences, the user’s task, the user’s stage in their information-seeking, the user’s domain knowledge, the user’s intent, and the context of a particular search.
While many different kinds of relevance have been discussed in the literature, topical relevance is the one most often used in testing relevance ranking algorithms. Topical relevance is a measure of “aboutness”, and attempts to measure how much a document is about the topic of a user’s query.
At its core, relevance ranking depends on an algorithm that uses term statistics, such as the number of times a query term appears in a document, to provide a topical relevance score. Other ranking features that try to take into account more complex aspects of relevance are built on top of this basic ranking algorithm.
In many types of search, such as e-commerce or searching for news, factors other than the topical relevance (based on the words in the document) are important. For example, a search engine for e-commerce might have facets such as price, color, size, availability, and other attributes, that are of equal importance to how well the user’s query terms match the text of a document describing a product. In news retrieval, recency and the location of the user might be factored into the relevance ranking algorithm. (footnotes omitted)
…
Great post that discusses the impact of the length of a document on its relevancy ranking by Lucene/Solr. That impact is well known, but how to carry the findings of relevancy studies on short documents over to long documents (books) isn’t known.
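To make the document length effect concrete, here is a rough sketch of my own (not code from the post). It assumes the term-frequency and length-normalization shapes of Lucene's classic TF-IDF similarity as I understand them: the square root of the term count, multiplied by one over the square root of the field length. It leaves out idf, boosts, query normalization, and the lossy encoding of norms, and newer Lucene/Solr releases default to BM25 instead. The function names and the sample counts are made up for illustration.

```python
import math


def tf(freq):
    """Term-frequency factor: grows with the square root of the raw count."""
    return math.sqrt(freq)


def length_norm(num_terms):
    """Length factor: shrinks as the field gets longer."""
    return 1.0 / math.sqrt(num_terms)


def toy_score(freq, num_terms):
    """Simplified single-term score: tf * length norm (idf and boosts omitted)."""
    return tf(freq) * length_norm(num_terms)


# Hypothetical counts: a short article that uses the query term 4 times in
# 100 words, and a book that uses it 400 times in 1,000,000 words.
short_doc = toy_score(freq=4, num_terms=100)
book = toy_score(freq=400, num_terms=1_000_000)

print(f"short document: {short_doc:.3f}")
print(f"book:           {book:.3f}")
```

Under these assumptions the book scores about a tenth of what the short document scores, even though it mentions the term a hundred times more often. A book with a whole chapter on a topic can lose to a one-page document that mentions the topic in passing, which is the kind of behavior that makes ranking long documents hard.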
I am looking forward to Part 2, which will cover the relationship between relevancy and document length.