Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 31, 2013

Freeing Information From Its PDF Chains

Filed under: node-js,PDF — Patrick Durusau @ 3:58 pm

Parsing PDFs at Scale with Node.js, PDF.js, and Lunr.js by Gary Sieling.

From the post:

Much information is trapped inside PDFs, and if you want to analyze it you’ll need a tool that extracts the text contents. If you’re processing many PDFs (XX millions), this takes time but parallelizes naturally. I’ve only seen this done on the JVM, and decided to do a proof of concept with new Javascript tools. This runs Node.js as a backend and uses PDF.js, from Mozilla Labs, to parse PDFs. A full-text index is also built, the beginning of a larger ingestion process.
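To make the indexing step concrete, here is a minimal sketch of a full-text index in plain Node.js. It is not Sieling's code and it uses neither PDF.js nor Lunr.js: it assumes the text has already been extracted from each PDF, and the names `tokenize`, `buildIndex`, `search` and the sample file names are hypothetical, chosen only for illustration.

```javascript
// Illustrative stand-in for the indexing step; text extraction is assumed done.
// `docs` maps a (hypothetical) file name to the text pulled out of that PDF.

// Lowercase the text and split it into alphanumeric terms.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Build an inverted index: term -> Set of file names containing that term.
function buildIndex(docs) {
  const index = new Map();
  for (const [name, text] of Object.entries(docs)) {
    for (const term of tokenize(text)) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(name);
    }
  }
  return index;
}

// Return the sorted file names that contain every term in the query (AND search).
function search(index, query) {
  const terms = tokenize(query);
  if (terms.length === 0) return [];
  let result = null;
  for (const term of terms) {
    const hits = index.get(term) || new Set();
    result = result === null
      ? new Set(hits)
      : new Set([...result].filter((name) => hits.has(name)));
  }
  return [...result].sort();
}

// Made-up sample data standing in for extracted PDF text.
const docs = {
  "order.pdf": "Executive order on open data",
  "report.pdf": "Annual report: data trapped in PDF files",
};
const index = buildIndex(docs);
console.log(search(index, "data"));
console.log(search(index, "open data"));
```

Because each document contributes to the index independently, per-file extraction can be farmed out to separate workers and the results merged afterwards, which is the sense in which the job "parallelizes naturally."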

I like the phrase “[m]uch information is trapped inside PDFs….”

Despite window-dressing executive orders, information is going to continue to be trapped inside PDFs.

What information do you want to free from its PDF chains?

I first saw this at DZone.
