Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 9, 2011

Ancestry.com Forum Dataset

Filed under: Dataset — Patrick Durusau @ 6:40 pm

From the post:

The Ancestry.com Forum Dataset was created with the cooperation of Ancestry.com in an effort to promote research on information retrieval, language technologies, and social network analysis. It contains a full snapshot of the Ancestry.com online forum, boards.ancestry.com, from July 2010. This message board is large, with over 22 million messages, over 3.5 million authors, and active participation for over ten years.

In addition to the document collection, queries from Ancestry.com’s query log and pairwise preference relevance judgements for a message thread retrieval task using this online forum are distributed.

This webpage describes the dataset, gives instructions for obtaining the dataset, and describes the supplemental data to use for thread search information retrieval experiments. Further details of the dataset can be found in the tech report describing the collection.

Contact: Jonathan Elsas.


Document Collection

The Ancestry.com Online Forum document collection is a full snapshot of the online forum, boards.ancestry.com from July 2010.

Number of Messages: 22,054,728
Number of Threads: 9,040,958
Number of Sub-forums: 165,358
Number of Unique Authors: 3,775,670
Message Date Range: December 1995 – July 2010
Size: 5 GB (compressed)

The documents distributed in the collection are in the TRECTEXT SGML format, similar to other collections used at the Text REtrieval Conference.
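For readers who have not handled TREC-style collections, here is a minimal sketch of reading TRECTEXT in Python. It assumes the conventional layout used by many TREC collections, with each document wrapped in <DOC>, an identifier in <DOCNO>, and the body in <TEXT>. The exact fields in the Ancestry.com files are not given in the post, so the sample records, the identifiers, and the names TrecDoc and iter_trec_docs are all hypothetical.

```python
import re
from typing import Iterator, NamedTuple


class TrecDoc(NamedTuple):
    """One parsed document (hypothetical container for this sketch)."""
    docno: str
    text: str


# TRECTEXT is SGML-like and not guaranteed to be well-formed XML,
# so this sketch uses regular expressions rather than an XML parser.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL | re.IGNORECASE)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL | re.IGNORECASE)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL | re.IGNORECASE)


def iter_trec_docs(raw: str) -> Iterator[TrecDoc]:
    """Yield one TrecDoc per <DOC>...</DOC> block found in raw."""
    for match in DOC_RE.finditer(raw):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip blocks without an identifier
        text = TEXT_RE.search(body)
        yield TrecDoc(
            docno=docno.group(1).strip(),
            text=text.group(1).strip() if text else "",
        )


# Invented sample records, assuming the conventional TRECTEXT layout.
SAMPLE = """
<DOC>
<DOCNO> example-0001 </DOCNO>
<TEXT>
Looking for descendants of the Smith family of Ohio.
</TEXT>
</DOC>
<DOC>
<DOCNO> example-0002 </DOCNO>
<TEXT>
Re: Smith family. My grandmother was a Smith from Dayton.
</TEXT>
</DOC>
"""

docs = list(iter_trec_docs(SAMPLE))
for doc in docs:
    print(doc.docno, "|", doc.text)
```

The sketch takes a whole string in memory. With 5 GB of compressed data you would more likely decompress and read file by file, or hand the files to an indexing tool that already understands TREC formats.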

As you will read, creating a dataset for use as a test set is a non-trivial project.

Curious, what questions would you ask of such a dataset? Or perhaps better, what tools would you use to ask those questions and why?

Grant Ingersoll mentioned this collection in an email to the openrelevance-dev@apache.org mailing list.
