From the post:
The Ancestry.com Forum Dataset was created with the cooperation of Ancestry.com in an effort to promote research on information retrieval, language technologies, and social network analysis. It contains a full snapshot of the Ancestry.com online forum, boards.ancestry.com, from July 2010. This message board is large, with over 22 million messages, over 3.5 million authors, and active participation for over ten years.
In addition to the document collection, queries from Ancestry.com’s query log and pairwise preference relevance judgements for a message thread retrieval task using this online forum are distributed. This webpage describes the dataset, gives instructions for obtaining the dataset, and describes the supplemental data to use for thread search information retrieval experiments. Further details of the dataset can be found in the tech report describing the collection.
Contact: Jonathan Elsas.
Document Collection
The Ancestry.com Online Forum document collection is a full snapshot of the online forum, boards.ancestry.com, from July 2010.
Number of Messages: 22,054,728
Number of Threads: 9,040,958
Number of Sub-forums: 165,358
Number of Unique Authors: 3,775,670
Message Date Range: December 1995 – July 2010
Size: 5 GB (compressed)
The documents distributed in the collection are in the TRECTEXT SGML format, similar to other collections used at the Text REtrieval Conference.
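For readers who have not worked with TRECTEXT before, here is a minimal Python sketch of pulling documents out of that kind of file. It assumes the conventional layout used by TREC-style collections, where each document is wrapped in <DOC> tags with a <DOCNO> identifier and a <TEXT> body. The sample records, the identifiers, and the function name iter_trectext are made up for illustration; the actual Ancestry.com files may carry additional fields, so check the tech report before relying on this.

```python
import re
from typing import Iterator, Tuple

# Patterns for the conventional TRECTEXT layout (assumed, not taken
# from the Ancestry.com documentation).
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def iter_trectext(data: str) -> Iterator[Tuple[str, str]]:
    """Yield (docno, text) pairs from a TRECTEXT-style string."""
    for match in DOC_RE.finditer(data):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        text = TEXT_RE.search(body)
        if docno is None:
            # Skip records without an identifier.
            continue
        yield docno.group(1).strip(), (text.group(1).strip() if text else "")


# Hypothetical sample data, invented for this example.
SAMPLE = """<DOC>
<DOCNO> example-0001 </DOCNO>
<TEXT>
Looking for relatives of the Smith family.
</TEXT>
</DOC>
<DOC>
<DOCNO> example-0002 </DOCNO>
<TEXT>
Re: Smith family
</TEXT>
</DOC>
"""

docs = list(iter_trectext(SAMPLE))
for docno, text in docs:
    print(docno, "|", text)
```

A regular expression is used here rather than an XML parser because TREC-style SGML is often not well-formed XML (unescaped ampersands, no single root element). For a 5 GB compressed collection you would want to stream the files rather than load each one into a string.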
As you will read, creating a dataset for use as a test set is a non-trivial project.
Curious, what questions would you ask of such a dataset? Or perhaps better, what tools would you use to ask those questions and why?
Grant Ingersoll mentioned this collection in email on the openrelevance-dev@apache.org mailing list.