Think Big Challenge 2014 [Census Data – Anonymized]

The Think Big Challenge 2014 closed October 19, 2014, but the data sets for that challenge remain available.

From the data download page:

This subdirectory contains a small extract of the data set (1,000 records). There are two data sets provided:

A complete set of records from after the year 1820. The full data set is available for download from Amazon S3 as a 127MB gzip file.

A sample of records pre-1820 for use in the data science “Learning of Common Ancestors” challenge. This can be downloaded as a 4MB gzip file.

The records have been pre-processed:

The contest data set includes both publicly available records (e.g., census data) and user-contributed submissions. To preserve user privacy, all surnames present in the data have been obscured with a hash function. The hash is constructed such that all occurrences of the same string will result in the same hash code.
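To make the "same string, same hash code" property concrete, here is a minimal sketch. The contest's actual hash function is not described above, so the salted SHA-256 construction, the SALT value, the obscure() helper, and the sample records are all assumptions made up for illustration.

```python
import hashlib
from collections import defaultdict

# ASSUMPTION: a made-up secret salt; the real contest hash is not documented here.
SALT = b"hypothetical-secret"

def obscure(surname: str) -> str:
    """Deterministically map a surname to a short hash code (illustrative only)."""
    digest = hashlib.sha256(SALT + surname.encode("utf-8")).hexdigest()
    return digest[:12]

# Made-up records: (given name, surname).
records = [("John", "Smith"), ("Mary", "Smith"), ("Ann", "Jones")]

# Replace each surname with its hash code.
anonymized = [(given, obscure(surname)) for given, surname in records]

# Because the hash is deterministic, people who share a surname still
# share a code, so records can be grouped without seeing the surname.
families = defaultdict(list)
for given, code in anonymized:
    families[code].append(given)
```

The same determinism that keeps family groupings usable is what makes the surnames linkable across records, which is the starting point for the exercise that follows.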

Reader exercise: You can find multiple ancestors of yours in these records with different surnames and compare those against the hash function results. How many will you need to reverse the hash function and recover all the surnames? Use other ancestors of yours to check your results.
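One way to approach the exercise, sketched under strong assumptions: collect known (surname, hash code) pairs from ancestors you can identify, test which candidate hash function reproduces every pair, and then hash a list of surnames to build a reverse lookup. Nothing above says the contest used an unsalted standard hash, so the CANDIDATES table, the helper names, and the demo data are hypothetical. If the real function is salted or custom, no candidate will match.

```python
import hashlib

# ASSUMPTION: guesses at what the hash might be; none is confirmed by the source.
CANDIDATES = {
    "md5": lambda s: hashlib.md5(s.encode("utf-8")).hexdigest(),
    "sha1": lambda s: hashlib.sha1(s.encode("utf-8")).hexdigest(),
    "sha256": lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest(),
}

def consistent_functions(known_pairs):
    """Names of candidate functions that reproduce every known (surname, hash) pair."""
    return [
        name
        for name, fn in CANDIDATES.items()
        if all(fn(surname) == hashed for surname, hashed in known_pairs)
    ]

def build_lookup(fn, surnames):
    """Hash every surname in a word list to get a hash -> surname table."""
    return {fn(surname): surname for surname in surnames}

# Made-up demo: pretend the data was hashed with unsalted SHA-1.
known_pairs = [(s, CANDIDATES["sha1"](s)) for s in ("Smith", "Jones")]
matches = consistent_functions(known_pairs)

# With a matching function in hand, a surname list reverses the hashes.
lookup = build_lookup(CANDIDATES[matches[0]], ["Smith", "Jones", "Taylor"])
```

Holding back a few known ancestors from known_pairs and checking them against lookup afterwards is the verification step the exercise asks for.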

Take a look at the original contest tasks for inspiration. What other online records would you want to merge with these? Thinking local newspapers? What about law reporters?


I first saw this mentioned on Danny Bickson’s blog as: Interesting dataset from

Update: I meant to mention Risks of Not Understanding a One-Way Function by Bruce Schneier, to get you started on the deanonymization task. Apologies for the omission.

If you are interested in cryptography issues, following Bruce Schneier’s blog should be on your regular reading list.

