Google Research Releases Wikilinks Corpus With 40M Mentions And 3M Entities by Frederic Lardinois.
From the post:
Google Research just launched its Wikilinks corpus, a massive new data set for developers and researchers that could make it easier to add smart disambiguation and cross-referencing to their applications. The data could, for example, make it easier to find out if two web sites are talking about the same person or concept, Google says. In total, the corpus features 40 million disambiguated mentions found within 10 million web pages. This, Google notes, makes it “over 100 times bigger than the next largest corpus,” which features fewer than 100,000 mentions.
For Google, of course, disambiguation is something that is a core feature of the Knowledge Graph project, which allows you to tell Google whether you are looking for links related to the planet, car or chemical element when you search for ‘mercury,’ for example. It takes a large corpus like this one and the ability to understand what each web page is really about to make this happen.
Details follow on how to create this data set.
Very cool!
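To make concrete what a corpus of disambiguated mentions enables, here is a minimal sketch in Python. The tab-separated record layout (page URL, mention text, Wikipedia URL) and all of the URLs are assumptions for illustration only, not the actual Wikilinks distribution format; the point is simply that once mentions resolve to Wikipedia entities, you can ask whether two pages are talking about the same thing even when their surface wording differs.

```python
from collections import defaultdict

# Hypothetical records in the spirit of the corpus: (page_url, mention_text, wikipedia_url).
# The real Wikilinks distribution format may differ; this is illustrative only.
SAMPLE_RECORDS = [
    ("http://example.com/astronomy", "Mercury", "http://en.wikipedia.org/wiki/Mercury_(planet)"),
    ("http://example.com/cars", "Mercury", "http://en.wikipedia.org/wiki/Mercury_(automobile)"),
    ("http://example.org/orbits", "the innermost planet", "http://en.wikipedia.org/wiki/Mercury_(planet)"),
]

def entities_by_page(records):
    """Group the disambiguated entities mentioned on each page."""
    pages = defaultdict(set)
    for page_url, mention, wiki_url in records:
        pages[page_url].add(wiki_url)
    return pages

def share_an_entity(pages, url_a, url_b):
    """Do two pages mention at least one common disambiguated entity?"""
    return bool(pages.get(url_a, set()) & pages.get(url_b, set()))

if __name__ == "__main__":
    pages = entities_by_page(SAMPLE_RECORDS)
    # The astronomy page and the orbits page both resolve to Mercury the planet,
    # even though their surface mentions ("Mercury" vs. "the innermost planet") differ.
    print(share_an_entity(pages, "http://example.com/astronomy", "http://example.org/orbits"))  # True
    # The cars page resolves to Mercury the automobile, so there is no shared entity.
    print(share_an_entity(pages, "http://example.com/astronomy", "http://example.com/cars"))    # False
```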
The only caution is that your entities, the ones specific to your enterprise, are unlikely to appear, even among 40M mentions.
But the Wikilinks Corpus + your entities? Now that is something with immediate ROI for your enterprise.
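As a rough sketch of that combination, the snippet below merges an enterprise-specific entity list into a mention dictionary built from corpus-style records. Every record, identifier, and the "acme:" namespace here is hypothetical; the idea is only that the public corpus supplies broad coverage while your own list supplies the entities it will never contain.

```python
from collections import defaultdict

def build_mention_dictionary(records):
    """Map each mention string to the set of entities it has referred to."""
    mentions = defaultdict(set)
    for _page_url, mention, entity in records:
        mentions[mention.lower()].add(entity)
    return mentions

# Hypothetical corpus-style records: (page URL, mention text, entity identifier).
corpus_records = [
    ("http://example.com/a", "Mercury", "wikipedia:Mercury_(planet)"),
    ("http://example.com/b", "Mercury", "wikipedia:Mercury_(element)"),
]

# Hypothetical enterprise entities that no public corpus would list.
enterprise_records = [
    ("internal://crm/123", "Project Mercury", "acme:project-mercury-2013"),
    ("internal://wiki/45", "Mercury", "acme:mercury-build-server"),
]

# One dictionary, drawing candidate entities from both sources.
mentions = build_mention_dictionary(corpus_records + enterprise_records)
print(sorted(mentions["mercury"]))
# ['acme:mercury-build-server', 'wikipedia:Mercury_(element)', 'wikipedia:Mercury_(planet)']
```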