"First, you need to Get the Data" is a post by Mathew Hurst about a site for asking questions about data sets (and getting answers).
Here are a couple of the questions, just to give you an idea of the site:
- How can I compile a log of Wikipedia articles by date of creation?
- Are there any indexes of available data sets?
There are useful answers to both of those questions.
Before you start building a data set, this is one site to check first.
A listing of sites to check for existing data sets would make a useful chapter in a book on topic maps.