The documentation for the Lucene scoring API makes for very interesting reading.
In more ways than one.
It is important for anyone who wants to understand how Lucene scores documents, which will influence the usefulness of searches in your particular domain.
But I think it is also important because it emphasizes that the scoring is for documents and not subjects.
Scoring documents is a very useful thing, because it (hopefully) puts the ones most relevant to a search at or near the top of the search results.
But isn’t that similar to the last mile problem with high speed internet delivery?
That is, it is one thing to get high speed internet service to the local switching office. It is quite another to get it to each home; hence the last mile problem.
An indexing solution like Lucene can, maybe, get you to the right document for a search, but that leaves you to go the last mile in terms of finding the subject of interest in the article.
And, just as importantly, relating that subject to other information about the same subject.
True enough, I was doing that very thing with print indexes and hard copy long before the arrival of widespread full text indexes and on-demand versions of texts.
It seems like a terrible waste of time and resources for everyone interested in a particular subject to have to dig information out of documents, only for that cycle to be repeated every time someone looks up that subject and finds a particular document.
We all keep running the last semantic mile.
The question is what would motivate us to shorten that to, say, the last 1/2 semantic mile, or less?