From Max Lin’s blog, Ian Soboroff posted:
Two new collections being released from TREC today:
The first is the long-awaited Tweets2011 collection. This is 16 million tweets sampled by Twitter for use in the TREC 2011 microblog track. We distribute the tweet identifiers and a crawler, and you download the actual tweets using the crawler. http://trec.nist.gov/data/tweets/
The second is TRC2, a collection of 1.8 million news articles from Thomson Reuters used in the TREC 2010 blog track. http://trec.nist.gov/data/reuters/reuters.html
Both collections are available under extremely permissive usage agreements that limit their use to research and forbid redistribution, but otherwise are very open as data usage agreements go.
It may just be my memory, but I don’t recall seeing any topic map research that used the older Reuters data set (the new one is too recent). Has anyone else seen any?
Anyway, more large data sets for your research pleasure.