Sherlock’s Last Case by Joe Armstrong.
Joe states the Sherlock problem as: given one X and millions of Yi's, which Yi is "nearer" to X?
For some measure of "nearer," or, as we prefer, similarity.
One solution is given in Programming Erlang: Software for a Concurrent World, 2nd ed., 2013, by Joe Armstrong.
Joe describes two possibly better solutions in this lecture.
A great lecture, even though it omits a fundamental weakness of TF-IDF.
From the Wikipedia entry:
Suppose we have a set of English text documents and wish to determine which document is most relevant to the query “the brown cow”. A simple way to start out is by eliminating documents that do not contain all three words “the”, “brown”, and “cow”, but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document and sum them all together; the number of times a term occurs in a document is called its term frequency.
However, because the term “the” is so common, this will tend to incorrectly emphasize documents which happen to use the word “the” more frequently, without giving enough weight to the more meaningful terms “brown” and “cow”. The term “the” is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less common words “brown” and “cow”. Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
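To make the Wikipedia description concrete, here is a minimal TF-IDF sketch in Python. It is my illustration, not the Erlang solution from Joe's book; the tokenize and tf_idf_vectors names and the toy documents are invented for the example:

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def tf_idf_vectors(docs):
    """Return one {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency: raw counts within this document
        vectors.append({
            # tf * idf, with idf = log(N / df): "the" appears in every
            # document, so log(N / N) = 0 wipes out its weight.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

docs = ["the brown cow", "the brown dog", "the black cat"]
for vec in tf_idf_vectors(docs):
    print(vec)
```

Running it shows "the" getting weight 0 in every document, while "cow" carries the most weight in the first document, which is exactly the re-weighting the quote describes.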
For example, for the query "the brown cow," TF-IDF would not find a document containing "the brown heifer."
TF-IDF does not account for relationships between terms, such as synonymy or polysemy.
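A few lines show the failure concretely: matching is on literal terms, so "heifer" never contributes to a query containing "cow". (The documents here are hypothetical.)

```python
# TF-IDF (or any bag-of-words match) only scores terms that literally
# co-occur, so a semantically close term contributes nothing.
query = set("the brown cow".split())
doc = set("the brown heifer".split())

print(query & doc)   # {'the', 'brown'} -- "cow" is simply absent
print("cow" in doc)  # False: "heifer" never matches "cow"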
Juan Ramos says as much in describing the limitations of TF-IDF in Using TF-IDF to Determine Word Relevance in Document Queries:
Despite its strength, TF-IDF has its limitations. In terms of synonyms, notice that TF-IDF does not make the jump to the relationship between words. Going back to (Berger & Lafferty, 1999), if the user wanted to find information about, say, the word ‘priest’, TF-IDF would not consider documents that might be relevant to the query but instead use the word ‘reverend’. In our own experiment, TF-IDF could not equate the word ‘drug’ with its plural ‘drugs’, categorizing each instead as separate words and slightly decreasing the word’s w_d value. For large document collections, this could present an escalating problem.
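Ramos's 'drug'/'drugs' point is easy to reproduce: without normalization the two forms are separate vocabulary entries, splitting the weight between them. The one-character suffix strip below is a crude stand-in for a real stemmer such as Porter's:

```python
from collections import Counter

tokens = "drug drugs drug".split()
print(Counter(tokens))  # Counter({'drug': 2, 'drugs': 1}): two separate terms

def naive_stem(token):
    # Crude stand-in for a real stemmer: strip a trailing "s".
    return token[:-1] if token.endswith("s") else token

print(Counter(naive_stem(t) for t in tokens))  # Counter({'drug': 3}): merged
```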
Ramos cites Information Retrieval as Statistical Translation by Adam Berger and John Lafferty to support his comments on synonymy and polysemy.
Berger and Lafferty treat synonymy and polysemy, the issues that TF-IDF misses, as statistical translation problems:
Ultimately, document retrieval systems must be sophisticated enough to handle polysemy and synonymy: to know, for instance, that pontiff and pope are related terms. The field of statistical translation concerns itself with how to mine large text databases to automatically discover such semantic relations. Brown et al. [3, 4] showed, for instance, how a system can learn to associate French terms with their English translations, given only a collection of bilingual French-English sentences. We shall demonstrate how, in a similar fashion, an IR system can, from a collection of documents, automatically learn which terms are related, and exploit these relations to better find and rank the documents it returns to the user.
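As a rough sketch of the idea (not the paper's actual model, which learns its translation probabilities from data), a query term q can be "generated" by any document term w with probability t(q | w), so a document about the "pontiff" still scores for a query on "pope". The translation table below is hand-written for the example:

```python
from collections import Counter

# Hand-written translation probabilities t(query_term | doc_term);
# a real system would learn these from document/query data instead.
T = {
    ("pope", "pope"): 0.8, ("pope", "pontiff"): 0.2,
    ("pontiff", "pontiff"): 0.8, ("pontiff", "pope"): 0.2,
}

def score(query_terms, doc_terms):
    """p(query | document) under the toy translation model:
    for each query term, sum t(q | w) * p(w | d) over document terms."""
    counts = Counter(doc_terms)
    total = len(doc_terms)
    p = 1.0
    for q in query_terms:
        p *= sum(T.get((q, w), 1.0 if q == w else 0.0) * count / total
                 for w, count in counts.items())
    return p

doc = "the pontiff spoke in rome".split()
print(score(["pope"], doc))  # 0.04 > 0: "pontiff" stands in for "pope"
```

An exact-match scorer would give this document zero for "pope"; here the learned (or, in this toy, hand-specified) relation between the two terms does the work.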
Merging powered by the results of statistical translation?
The Berger and Lafferty paper is more than a decade old, so I will be running the research forward.