Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 6, 2016

Chilcot Report – Collected PDFs, Converted to Text

Filed under: Chilcot Report (Iraq),Government — Patrick Durusau @ 3:19 pm

I didn’t see a bulk download option for the chapters of the Chilcot Report on The Iraq Inquiry Report page, so I have collected those files and bundled them up for download as Iraq-Inquiry-Report-All-Volumes.tar.gz.

I wrote about Apache PDFBox recently, so I also converted all of those files to text and have bundled them up as Iraq-Inquiry-Report-Text-Conversion.tar.gz.

Some observations on the text files:

  • Numbered paragraphs have the format: digit(one or more)-period-space
  • Footnotes have the format: digit(one or more)-space-text
  • Page numbers have the format: digit(one or more)-space, with no following text
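As a starting point for further processing, the three observations above can be turned into regular expressions. What follows is only a minimal sketch in Python: the patterns, the function name classify_line, and the labels it returns are my own assumptions, not something taken from the report or from PDFBox, and they have not been checked against every volume.

```python
import re

# Assumed patterns, derived only from the three observations above.
NUMBERED_PARA = re.compile(r"^(\d+)\. (.*)$")   # digits, period, space
PAGE_NUMBER = re.compile(r"^(\d+)\s*$")         # digits, optional space, nothing else
FOOTNOTE = re.compile(r"^(\d+) (\S.*)$")        # digits, space, text


def classify_line(line):
    """Return (kind, number, text) for one line of the converted text.

    kind is one of "paragraph", "page", "footnote" or "text"
    (hypothetical labels chosen for this sketch).
    """
    stripped = line.rstrip("\n")

    match = NUMBERED_PARA.match(stripped)
    if match:
        return ("paragraph", int(match.group(1)), match.group(2))

    # Test page numbers before footnotes: both start with digits and a space.
    match = PAGE_NUMBER.match(stripped)
    if match:
        return ("page", int(match.group(1)), "")

    match = FOOTNOTE.match(stripped)
    if match:
        return ("footnote", int(match.group(1)), match.group(2))

    return ("text", None, stripped)
```

Expect false positives: a wrapped body line that happens to begin with a number (a year, say) will look like a footnote to this sketch, so the footnote pattern in particular probably needs more context, such as position on the page.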

Suggestions on other processing steps?
