Searching for implication, that is, “p implies q,” I got:
- “q whenever p” – 44,200 “hits” (0.7%)
- “p is sufficient for q” – 385,000 “hits” (6%)
- “p implies q” – 506,000 “hits” (8%)
- “if p, then q” – 2,189,000 “hits” (36%)
- “q if p” – 2,920,000 “hits” (48%)
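The percentages are each phrasing's share of the combined hit count. Here is a minimal sketch of that arithmetic in Python. The `hits` dictionary and the `shares` function are names made up for illustration, and the sketch assumes the five hit counts can simply be added, ignoring documents that match more than one phrasing.

```python
# Hit counts as reported in the list above.
hits = {
    "q whenever p": 44_200,
    "p is sufficient for q": 385_000,
    "p implies q": 506_000,
    "if p, then q": 2_189_000,
    "q if p": 2_920_000,
}


def shares(counts):
    """Return each phrasing's percentage of the combined hit count."""
    total = sum(counts.values())
    return {phrase: 100 * n / total for phrase, n in counts.items()}


if __name__ == "__main__":
    for phrase, pct in shares(hits).items():
        print(f"{phrase!r}: {pct:.1f}%")
```

Adding a sixth phrasing to `hits` makes the total larger, so every existing share drops.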
What if the search were for a “smoking gun” document during legal discovery? Or for the latest treatment for a patient dying in the ER? Or for engineering literature to avoid what could be a fatal flaw in a part that will go into hundreds of airplanes? Hmmm, a query that returns 0.7% of the results doesn’t look all that attractive.
It isn’t possible to know what percentage of relevant documents your query returned for a document set of any size. Your query might be the 48% query, but it could also be the 0.7% query.
To make matters worse, the 0.7% figure could be an overestimate. It assumes that those five queries together return *all* the relevant documents. Any relevant document that uses some other phrasing makes the total larger and every query’s share smaller.
The problem is that different users identify the same subjects in different ways, or use the same identifiers for different subjects. Matters get worse as more users produce documents that need to be searched.
Available options include:
- Create new identifiers and ignore previous ones
- Create new identifiers and map previous ones
- Map identifiers people already use
This blog will explore all three and why I prefer the last one.
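To make the last option concrete, here is a minimal sketch in Python of mapping identifiers people already use onto a shared subject. Everything in it is an assumption for illustration: the `SUBJECT_OF` table, the subject key `"implication"`, and the `expand_query` function are hypothetical names, not part of any search engine or mapping standard.

```python
# Hypothetical mapping: each phrasing people already use points at one subject.
SUBJECT_OF = {
    "q whenever p": "implication",
    "p is sufficient for q": "implication",
    "p implies q": "implication",
    "if p, then q": "implication",
    "q if p": "implication",
}


def expand_query(phrase, subject_of=SUBJECT_OF):
    """Return every known phrasing for the subject that `phrase` identifies."""
    subject = subject_of.get(phrase)
    if subject is None:
        # Unmapped phrasing: fall back to the literal query.
        return [phrase]
    return sorted(p for p, s in subject_of.items() if s == subject)


if __name__ == "__main__":
    # A user who types only the least common phrasing still gets all five.
    for phrasing in expand_query("q whenever p"):
        print(phrasing)
```

With a mapping like this, the user who happens to type the 0.7% phrasing is searching with all five. Nobody has to learn a new identifier, and a newly discovered phrasing is just one more entry in the table.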