Free Language Lessons for Computers by Dave Orr.
From the post:
50,000 relations from Wikipedia. 100,000 feature vectors from YouTube videos. 1.8 million historical infoboxes. 40 million entities derived from webpages. 11 billion Freebase entities in 800 million web documents. 350 billion words’ worth from books analyzed for syntax.
These are all datasets that we’ve shared with researchers around the world over the last year from Google Research.
A great summary of the major data drops by Google Research over the past year, in many cases with pointers to additional information on the datasets.
One that I have seen before and that strikes me as particularly relevant to topic maps is:
Dictionaries for linking Text, Entities, and Ideas
What is it: We created a large database of 175 million strings paired with 7.5 million concepts, annotated with counts, all mined from Wikipedia. The concepts in this case are Wikipedia articles, and the strings are anchor text spans that link to the concepts in question.
Where can I find it: http://nlp.stanford.edu/pubs/crosswikis-data.tar.bz2
I want to know more: A description of the data, several examples, and ideas for how to use it can be found in a blog post or in the associated paper.
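To get a feel for the data, here is a minimal sketch of loading and querying such a string-to-concept dictionary in Python. It assumes a simplified tab-separated layout of string, concept, and count; the actual archive's format may well differ, so treat this as an illustration of the idea rather than a reader for the distributed files.

```python
from collections import defaultdict

def load_dictionary(path):
    """Load a string -> [(concept, count)] mapping.

    Assumes a simplified tab-separated layout: string<TAB>concept<TAB>count.
    The real crosswikis-data archive may be laid out differently.
    """
    dictionary = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue  # skip malformed lines
            string, concept, count = parts
            dictionary[string].append((concept, int(count)))
    return dictionary

def best_concept(dictionary, string):
    """Return the most frequently linked concept for a string, if any."""
    candidates = dictionary.get(string)
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])[0]
```

With something like that in hand, looking up an anchor string would return whichever Wikipedia article editors most often linked it to.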
For most purposes, you would need far less than the full set of 7.5 million concepts. Imagine the relevant concepts for a domain being automatically “tagged” as you compose prose about it.
Certainly less error-prone than marking concepts by hand!
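Here is a rough sketch of that “tagging as you compose” idea: restrict lookups to the concepts relevant to your domain and greedily match word spans in the prose against the dictionary. The function name, the greedy longest-match strategy, and the dictionary shape are my own illustration (building on the loader sketch above), not anything prescribed by the dataset.

```python
def tag_prose(text, dictionary, domain_concepts, max_span=4):
    """Greedily tag word spans with the most frequent in-domain concept.

    `dictionary` maps strings to lists of (concept, count) pairs, as in the
    loader sketch above; `domain_concepts` is the set of concepts relevant
    to the domain at hand. Scans left to right, preferring the longest
    matching span up to `max_span` words. Returns (span, concept) pairs.
    """
    words = text.split()
    tags = []
    i = 0
    while i < len(words):
        matched = False
        for length in range(min(max_span, len(words) - i), 0, -1):
            span = " ".join(words[i:i + length])
            candidates = [
                (concept, count)
                for concept, count in dictionary.get(span, [])
                if concept in domain_concepts
            ]
            if candidates:
                concept = max(candidates, key=lambda pair: pair[1])[0]
                tags.append((span, concept))
                i += length
                matched = True
                break
        if not matched:
            i += 1
    return tags
```

Even a crude matcher like this, backed by link counts from millions of Wikipedia editors, is a plausible starting point for suggesting concepts as you write, with a human confirming or rejecting the suggestions.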