Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 5, 2017

Tabula: Extracting A Hit (sorry) Security List From PDF Report

Filed under: Cybersecurity,Extraction,Government,PDF,Security — Patrick Durusau @ 11:44 am

Benchmarking U.S. Government Websites by Daniel Castro, Galia Nurko, and Alan McQuinn, provides a quick assessment of 468 of the most popular federal websites for “…page-load speed, mobile friendliness, security, and accessibility.”

Unfortunately, it has an ugly table layout:

Double column listings with the same headers?

There are 476 results on Stackoverflow this morning for extracting tables from PDF.

However, I need a cup of coffee, maybe two cups of coffee answer to extracting data from these tables.

Enter Tabula.

If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful it is — there’s no easy way to copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface. Tabula works on Mac, Windows and Linux.

Tabula is download, extract, start and point your web browser to http://localhost:8080 (or http://127.0.0.1:8080), load your PDF file, select the table, export the content, easy to use.

I tried selecting the columns separately (one page at a time) but then used table recognition and selected the entirety of Table 6 (security evaluation). I don’t think it made any difference in the errors I was seeing in the result (dropping first letter of site domains, but check with your data.)

Warning: For some unknown reason, possibly a defect in the PDF and/or Tabula, the leading character from the second domain field was dropped on some entries. Not all, not consistently, but it was dropped. Not to mention missing the last line of entries on a couple of pages. Proofing is required!

Not to mention there were other recognition issues

Capture wasn’t perfect due to underlying differences in the PDF:

cancer.gov,100,901,fdic.gov,100,"3,284"
weather.gov,100,904,blm.gov,100,"3,307"
transportation.gov,,,100,,,"3,340",,,ecreation.gov,,,100,,,"9,012",
"regulations.gov1003,390data.gov1009,103",,,,,,,,,,,,,,,,
nga.gov,,,100,,,"3,462",,,irstgov.gov,,,100,,,"9,112",
"nrel.gov1003,623nationalservice.gov1009,127",,,,,,,,,,,,,,,,
hrsa.gov,,,100,,,"3,635",,,topbullying.gov,,,100,,,"9,285",
"consumerfinance.gov1004,144section508.gov1009,391",,,,,,,,,,,,,,,,

With proofing, we are way beyond two cups of coffee but once proofed, I tossed it into Calc and produced a single column CSV file: 2017-Benchmarking-US-Government-Websites-Security-Table-6.csv.

Enjoy!

PS: I discovered a LibreOffice Calc “gotcha” in this exercise. If you select a column for the top and attempt to paste it under an existing column (same or different spreadsheet), you get the error message: “There is not enough room on the sheet to insert here.”

When you select a column from the top, it copies all the blank cells in that column so there truly isn’t sufficient space to paste it under another column. Tip: Always copy columns in Calc from the bottom of the column up.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress