By now you realize how useless relevancy at the “document” level can be, considering documents can be ten, twenty, hundreds or even thousands of pages long.
Highly relevant “hits” are great, but are you going to read every page of every document?
The main report on the Iraq War, in two volumes, The U.S. Army in the Iraq War – Volume 1: Invasion – Insurgency – Civil War, 2003-2006 and The U.S. Army in the Iraq War – Volume 2: Surge and Withdrawal, 2007-2011, runs to more than 1,400 pages.
Along with the report, nearly 30,000 unclassified documents used in the writing of the report are also available.
Other than being timely, the advantage for data miners is that the report, while a bit long, is readable and you know in advance that the ~30,000 documents are relevant to it. Without peeking at the footnotes (that’s cheating), which documents go with which pages of the report? You can check your answers against the footnotes afterwards.
For bonus points, which pages of the ~30,000 documents go with which pages of the report? The authors cited particular pages, not entire documents the way some search engines do.
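To make the bonus question concrete, here is a minimal sketch of one way to try page-to-page matching: treat every page (report or source document) as its own tiny "document," weight terms with TF-IDF, and rank source pages by cosine similarity to each report page. Everything here is an assumption for illustration: the function names, the `(document, page)` tuple ids, and the idea that you already have plain text per page. I haven't run it against the actual collection, and TF-IDF is just a convenient baseline, not a recommendation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters/digits; deliberately crude.
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(pages):
    """pages: dict of page id -> text. Returns page id -> unit-length {term: weight}."""
    counts = {pid: Counter(tokenize(text)) for pid, text in pages.items()}
    df = Counter()
    for c in counts.values():
        df.update(c.keys())  # document frequency, counted per page
    n = len(pages)
    vectors = {}
    for pid, c in counts.items():
        # Smoothed IDF so terms that appear on every page still get a small weight.
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[pid] = {t: w / norm for t, w in vec.items()} if norm else {}
    return vectors

def cosine(a, b):
    # Vectors are already unit length, so the dot product is the cosine.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def match_pages(report_pages, source_pages, top_k=3):
    """Rank source pages for each report page. Ids are hypothetical (name, page) tuples."""
    combined = {("report", k): v for k, v in report_pages.items()}
    combined.update({("source", k): v for k, v in source_pages.items()})
    vecs = tfidf_vectors(combined)
    results = {}
    for rid in report_pages:
        scored = [(sid, cosine(vecs[("report", rid)], vecs[("source", sid)]))
                  for sid in source_pages]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results[rid] = scored[:top_k]
    return results

def hit_rate(results, footnotes):
    """Share of report pages whose top-k list contains at least one footnoted source page.

    footnotes: dict of report page id -> set of cited source page ids (the answer key).
    """
    checked = [rid for rid in results if footnotes.get(rid)]
    if not checked:
        return 0.0
    hits = sum(1 for rid in checked
               if any(sid in footnotes[rid] for sid, _ in results[rid]))
    return hits / len(checked)
```

Call `match_pages` with two dictionaries of page text, then pass its output and a footnote-derived answer key to `hit_rate` to see how often the cited page shows up in the top few candidates. At the scale of the full collection you would want an inverted index rather than this all-pairs loop.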
And no, I haven’t loaded these documents but hope to this weekend.
PS: The Army War College Publications office has a remarkable range of very high-quality publications.