I forgot to mention in Copyright Troll Hunting – 92,398 Possibles -> 146 Possibles that, while using LibreOffice, I deleted a large number of columns that were either N/A only or not relevant for troll-mining.zip.
After removal of the “no last name” records, these fields had N/A for all records, except as noted:
- L – Implementation Date
- M – Effective Date
- N – Related RINs
- O – Document SubType (Comment(s))
- P – Subject
- Q – Abstract
- R – Status (Posted, except for 2)
- S – Source Citation
- T – OMB Approval Number
- U – FR Citation
- V – Federal Register Number (8 exceptions)
- W – Start End Page (8 exceptions)
- X – Special Instructions
- Y – Legacy ID
- Z – Post Mark Date
- AA – File Type (1 docx)
- AB – Number of Pages
- AC – Paper Width
- AD – Paper Length
- AE – Exhibit Type
- AF – Exhibit Location
- AG – Document Field_1
- AH – Document Field_2
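The N/A check behind the list above could be scripted instead of eyeballed in LibreOffice. What follows is only a minimal sketch, assuming the export is saved as CSV with a header row; the function name `non_na_counts`, the choice to treat empty cells like N/A, and the three-column sample are all made up for illustration and are not part of the Regulations.gov export.

```python
import csv
import io


def non_na_counts(lines, na_value="N/A"):
    """Count, per column, how many records hold something other than na_value.

    `lines` is any iterable of CSV text lines (an open file works).
    Empty cells are treated like na_value; that is an assumption.
    """
    reader = csv.DictReader(lines)
    counts = {name: 0 for name in (reader.fieldnames or [])}
    for record in reader:
        for name in counts:
            value = (record.get(name) or "").strip()
            if value and value != na_value:
                counts[name] += 1
    return counts


# Made-up three-column sample, not real export data.
sample = io.StringIO(
    "Document ID,Legacy ID,Federal Register Number\n"
    "doc-1,N/A,N/A\n"
    "doc-2,N/A,fr-example\n"
)
counts = non_na_counts(sample)

# Columns with a count of zero are N/A for every record.
droppable = [name for name, n in counts.items() if n == 0]
print(counts)
print(droppable)
```

A count of zero marks a column that is N/A throughout, and a small count corresponds to the "exceptions" noted in the list.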
From the state of the records, one suspects the “bulking up” is NOT an artifact of the export but reflects how each record is actually stored.
One way to test that theory would be a query on the noise fields via the API for Regulations.gov.
The documentation for the API is out of date: the Field References documentation lacks the Document Detail field (column AI), which contains the URL to access the comment.
The closest thing I could find was:
fileFormats Formats of the document, included as URLs to download from the API
How easy/hard it will be to download attachments isn’t clear.
BTW, the comment pages themselves are seriously puffed up. Take https://www.regulations.gov/document?D=COLC-2015-0013-52236.
Saved to disk: 148.6 KB.
Content of the comment: 2.5 KB.
The content of the comment is 1.6% of the delivered webpage.
It must have taken serious effort to achieve a 98.4% noise to 1.6% signal ratio.
How transparent is data when you have to mine for the 1.6% that is actual content?
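For anyone who wants to repeat the measurement on other comment pages, the arithmetic is a single division. A minimal sketch, using the rounded sizes quoted above; the function name `signal_share` is hypothetical. With these rounded inputs the share works out to about 1.68%, which truncates to the 1.6% quoted above.

```python
def signal_share(content_kb, page_kb):
    """Return the comment content as a percentage of the delivered page."""
    if page_kb <= 0:
        raise ValueError("page size must be positive")
    return 100.0 * content_kb / page_kb


# Sizes as quoted in the post (rounded to one decimal place).
share = signal_share(2.5, 148.6)
noise = 100.0 - share
print(f"signal {share:.2f}%, noise {noise:.2f}%")
```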