I didn’t see a bulk download option for the chapters of the Chilcot Report on the Iraq Inquiry Report page, so I collected the files and bundled them up for download as Iraq-Inquiry-Report-All-Volumes.tar.gz.
I wrote about Apache PDFBox recently, so I also converted all of those files to text and bundled them up as Iraq-Inquiry-Report-Text-Conversion.tar.gz.
Some observations on the text files:
- Numbered paragraphs: digit(1 or more)-period-space
- Footnotes: digit(1 or more)-space-text
- Page numbers: digit(1 or more)-space-no following text
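As a starting point, the three patterns above can be sketched as regular expressions in Python. This is only a rough sketch: the function name `classify_line` and the label strings are my own hypothetical choices, and the sample lines in the usage note are invented rather than taken from the report. Note that the footnote pattern will also catch any body line that happens to begin with a number followed by a space (a year, for example), so expect some false positives.

```python
import re

# Patterns follow the observations above. Names and labels are illustrative only.
PAGE_NUMBER = re.compile(r"^(\d+)\s*$")            # digits, optional space, nothing else
NUMBERED_PARAGRAPH = re.compile(r"^(\d+)\. (.*)$")  # digits, period, space, then text
FOOTNOTE = re.compile(r"^(\d+) (\S.*)$")            # digits, space, then text


def classify_line(line):
    """Return (label, number, text) for a single line of converted text."""
    line = line.rstrip("\r\n")

    # Check page numbers first: they are the most specific pattern.
    match = PAGE_NUMBER.match(line)
    if match:
        return ("page_number", int(match.group(1)), "")

    match = NUMBERED_PARAGRAPH.match(line)
    if match:
        return ("numbered_paragraph", int(match.group(1)), match.group(2))

    match = FOOTNOTE.match(line)
    if match:
        return ("footnote", int(match.group(1)), match.group(2))

    # Anything else is treated as ordinary running text.
    return ("text", None, line)


if __name__ == "__main__":
    # Invented sample lines, not quotations from the report.
    samples = [
        "12. Some paragraph text.",
        "3 Some footnote text.",
        "47 ",
        "A continuation line.",
    ]
    for sample in samples:
        print(classify_line(sample))
```

Running each converted file through something like this would make it easy to strip page numbers, or to separate footnotes from the numbered paragraphs before indexing.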
Suggestions on other processing steps?