What’s Up With Data Padding? (Regulations.gov)

I forgot to mention in Copyright Troll Hunting – 92,398 Possibles -> 146 Possibles that while using LibreOffice, I deleted a large number of either N/A only or columns not relevant for troll-mining.zip.

Except as otherwise noted, after removal of “no last name,” these fields had N/A for all records except as noted:

  1. L – Implementation Date
  2. M – Effective Date
  3. N – Related RINs
  4. O – Document SubType (Comment(s))
  5. P – Subject
  6. Q – Abstract
  7. R – Status – (Posted, except for 2)
  8. S – Source Citation
  9. T – OMB Approval Number
  10. U – FR Citation
  11. V – Federal Register Number (8 exceptions)
  12. W – Start End Page (8 exceptions)
  13. X – Special Instructions
  14. Y – Legacy ID
  15. Z – Post Mark Date
  16. AA – File Type (1 docx)
  17. AB – Number of Pages
  18. AC – Paper Width
  19. AD – Paper Length
  20. AE – Exhibit Type
  21. AF – Exhibit Location
  22. AG – Document Field_1
  23. AH – Document Field_2

Regulations.gov, not the Copyright Office, is responsible for the collection and management of comments, including the bulked up export of comments.

From the state of the records, one suspects the “bulking up” is NOT an artifact of the export but represents the storage of each record.

One way to test that theory would be a query on the noise fields via the API for Regulations.gov.

The documentation for the API is out-dated, the Field References documentation lacks the Document Detail (field AI), which contains the URL to access the comment.

The closest thing I could find was:

fileFormats Formats of the document, included as URLs to download from the API

How easy/hard it will be to download attachments isn’t clear.

BTW, the comment pages themselves are seriously puffed up. Take https://www.regulations.gov/document?D=COLC-2015-0013-52236.

Saved to disk: 148.6 KB.

Content of the comment: 2.5 KB.

The content of the comment is 1.6 % of the delivered webpage.

It must have taken serious effort to achieve a 98.4% noise to 1.6% signal ratio.

How transparent is data when you have to mine for the 1.6% that is actual content?

Comments are closed.