11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts, by Dave Orr, Amar Subramanya, Evgeniy Gabrilovich, and Michael Ringgaard.
From the post:
When you type in a search query — perhaps Plato — are you interested in the string of letters you typed? Or the concept or entity represented by that string? But knowing that the string represents something real and meaningful only gets you so far in computational linguistics or information retrieval — you have to know what the string actually refers to. The Knowledge Graph and Freebase are databases of things, not strings, and references to them let you operate in the realm of concepts and entities rather than strings and n-grams.
We’ve previously released data to help with disambiguation and recently awarded $1.2M in research grants to work on related problems. Today we’re taking another step: releasing data consisting of nearly 800 million documents automatically annotated with over 11 billion references to Freebase entities.
These Freebase Annotations of the ClueWeb Corpora (FACC) consist of ClueWeb09 FACC and ClueWeb12 FACC. 11 billion phrases that refer to concepts and entities in Freebase were automatically labeled with their unique identifiers (Freebase MIDs). …
(…)
Based on review of a sample of documents, we believe the precision is about 80-85%, and recall, which is inherently difficult to measure in situations like this, is in the range of 70-85%….
(…)
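To make the release concrete: each annotation ties a span of text in a document to a Freebase MID (machine identifier). The record below is a minimal sketch of what one such annotation carries; the field names, document ID, offsets, and MID are all invented for illustration and are not the actual FACC file format.

```python
from dataclasses import dataclass

@dataclass
class EntityAnnotation:
    """One automatic entity reference in a web document.

    Hypothetical layout for illustration only; consult the FACC
    documentation for the actual on-disk format.
    """
    doc_id: str        # ClueWeb document containing the phrase
    begin: int         # offset where the phrase starts
    end: int           # offset where the phrase ends
    phrase: str        # the surface string, e.g. "Plato"
    mid: str           # Freebase MID the phrase was resolved to
    confidence: float  # annotator's confidence that the link is correct

# Example: the string "Plato" linked to an entity rather than kept
# as a bare n-gram. The document ID and MID are made up.
ann = EntityAnnotation(
    doc_id="clueweb09-en0000-00-00000",
    begin=412,
    end=417,
    phrase="Plato",
    mid="/m/0xxxxx",  # placeholder MID, not Plato's real identifier
    confidence=0.92,
)
print(f"{ann.phrase!r} -> {ann.mid} (confidence {ann.confidence})")
```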
Evaluate those precision and recall numbers by asking:
Your GPS gives you relevant directions on average eight (8) times out of ten (10), and it finds relevant locations on average seven (7) times out of ten (10). (Wikipedia on Precision and Recall)
Is that a good GPS?
A useful data set, but still a continuation of the approach of guessing what authors meant when they authored their documents.
What if, by some as yet unknown technique, precision rises to nine (9) out of ten (10) and recall rises to nine (9) out of ten (10) as well?
The GPS question becomes:
Your GPS gives you relevant directions on average nine (9) times out of ten (10), and it finds relevant locations on average nine (9) times out of ten (10).
Is that a good GPS?
Not that any automated technique has shown that level of performance.
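To put numbers on the two GPS questions, here is a minimal sketch that folds each precision/recall pair into an F1 score (their harmonic mean), the usual single-figure summary. The scenario labels are mine, and I use the low end of the reported ranges.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The two GPS scenarios from above, as (precision, recall) pairs.
scenarios = {
    "FACC as released (8/10, 7/10)": (0.80, 0.70),
    "hypothetical improvement (9/10, 9/10)": (0.90, 0.90),
}

for label, (p, r) in scenarios.items():
    print(f"{label}: precision={p:.0%} recall={r:.0%} F1={f1_score(p, r):.0%}")
```

Even the improved scenario still mislabels roughly one reference in ten and misses roughly one in ten.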
Rather than focusing on annotating documents after they are authored, why not enable authors to declare their semantics?
Author-declared semantics would reduce the cost and uncertainty of post-authoring semantic solutions.
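What might author-declared semantics look like? One possibility, sketched below, is an authoring convention in which the writer attaches the entity identifier to the phrase at writing time, so no downstream disambiguation (and no precision/recall penalty) is needed. The bracketed markup convention and the MID are invented for illustration.

```python
import re

# Hypothetical convention: the author writes "[phrase](mid:/m/...)"
# to declare which entity a phrase denotes.
DECLARED = re.compile(r"\[(?P<phrase>[^\]]+)\]\(mid:(?P<mid>/m/\w+)\)")

def declared_entities(text: str) -> list[tuple[str, str]]:
    """Extract the (phrase, MID) pairs an author declared explicitly.

    Because the author chose each identifier at writing time, there
    is no disambiguation step left to get wrong.
    """
    return [(m["phrase"], m["mid"]) for m in DECLARED.finditer(text)]

# The MID is a placeholder, not Plato's real Freebase identifier.
source = "A search for [Plato](mid:/m/0xxxxx) should find the philosopher."
print(declared_entities(source))  # [('Plato', '/m/0xxxxx')]
```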
I first saw this in a tweet by Nicolas Torzec.