Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 14, 2015

Manipulate PDFs with Python

Filed under: Ferguson,PDF,Python — Patrick Durusau @ 5:16 pm

Manipulate PDFs with Python by Tim Arnold.

From the overview:

PDF documents are beautiful things, but that beauty is often only skin deep. Inside, they might have any number of structures that are difficult to understand and exasperating to get at. The PDF reference specification (ISO 32000-1) provides rules, but it is programmers who follow them, and they, like all programmers, are a creative bunch.

That means that in the end, a beautiful PDF document is really meant to be read and its internals are not to be messed with. Well, we are programmers too, and we are a creative bunch, so we will see how we can get at those internals.

Still, the best advice if you have to extract or add information to a PDF is: don’t do it. Well, don’t do it if there is any way you can get access to the information further upstream. If you want to scrape that spreadsheet data in a PDF, see if you can get access to it before it became part of the PDF. Chances are, now that it is inside the PDF, it is just a bunch of lines and numbers with no connection to its former structure of cells, formats, and headings.

If you cannot get access to the information further upstream, this tutorial will show you some of the ways you can get inside the PDF using Python. (emphasis in the original)

Definitely a collect-the-software-and-experiment type of post!
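
To make the "bunch of lines and numbers" warning concrete, here is a minimal sketch in standard-library Python. It is not taken from Tim's tutorial. The content stream is hand-written, hypothetical data: real PDF streams are usually compressed and use a wider mix of text operators, so the regular expression below is an illustration, not a general parser. The names `CONTENT_STREAM`, `positioned_strings`, and `rebuild_rows` are my own.

```python
import re

# Hypothetical, uncompressed PDF content stream (hand-written for illustration).
# A former 2x2 "spreadsheet" survives only as strings placed at x/y coordinates:
# "1 0 0 1 x y Tm" sets the text position and "(text) Tj" draws the string.
CONTENT_STREAM = b"""
BT
/F1 12 Tf
1 0 0 1 72 700 Tm (Item) Tj
1 0 0 1 200 700 Tm (Price) Tj
1 0 0 1 72 680 Tm (Widget) Tj
1 0 0 1 200 680 Tm (9.99) Tj
ET
"""

# Matches only the exact operator pattern used above (an assumption of this sketch).
TEXT_OP = re.compile(rb"1 0 0 1 (\d+) (\d+) Tm \(([^)]*)\) Tj")


def positioned_strings(stream):
    """Return (x, y, text) tuples for each string drawn in the stream."""
    return [
        (int(x), int(y), text.decode("latin-1"))
        for x, y, text in TEXT_OP.findall(stream)
    ]


def rebuild_rows(items):
    """Guess at table rows: group strings by y, order rows top-down, cells left-to-right."""
    rows = {}
    for x, y, text in items:
        rows.setdefault(y, []).append((x, text))
    # PDF y coordinates grow upward, so the largest y is the top row.
    return [[text for _, text in sorted(rows[y])] for y in sorted(rows, reverse=True)]


if __name__ == "__main__":
    for row in rebuild_rows(positioned_strings(CONTENT_STREAM)):
        print(row)
```

Nothing in the stream says "this is a header" or "these two strings share a row." The row grouping is a guess based on matching y coordinates, which is exactly why getting the data upstream is the better option when you can.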

Is there a collection of “nasty” PDFs on the web? I am thinking that would be a useful thing to have for testing tools such as the ones listed in this post, not to mention for getting experience with extracting information from them. Suggestions?

I first saw this in a tweet by Christophe Lalanne.
