Parsing PDFs at Scale with Node.js, PDF.js, and Lunr.js by Gary Sieling.
From the post:
Much information is trapped inside PDFs, and if you want to analyze it you’ll need a tool that extracts the text contents. If you’re processing many PDFs (XX millions), this takes time but parallelizes naturally. I’ve only seen this done on the JVM, and decided to do a proof of concept with new Javascript tools. This runs Node.js as a backend and uses PDF.js, from Mozilla Labs, to parse PDFs. A full-text index is also built, the beginning of a larger ingestion process.
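To make the approach concrete, here is a minimal sketch of what that pipeline can look like: extracting text from a PDF with the pdfjs-dist npm package and feeding it into a Lunr.js index. This is not Sieling's code; the package names, import path, and the sample file and query term are assumptions for illustration, and the pdfjs-dist module path varies by version.

```js
// Sketch: extract text from one PDF with pdfjs-dist, then index it with lunr.
// Package names and the legacy build path are assumptions and may differ by version.
const fs = require('fs');
const pdfjsLib = require('pdfjs-dist/legacy/build/pdf.js');
const lunr = require('lunr');

// Read a PDF from disk and concatenate the text of every page.
async function extractText(path) {
  const data = new Uint8Array(fs.readFileSync(path));
  const doc = await pdfjsLib.getDocument({ data }).promise;
  const pages = [];
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i);
    const content = await page.getTextContent();
    pages.push(content.items.map((item) => item.str).join(' '));
  }
  return pages.join('\n');
}

async function main() {
  // 'sample.pdf' and the query term are hypothetical placeholders.
  const text = await extractText('sample.pdf');

  // Build a Lunr full-text index over the extracted text.
  const idx = lunr(function () {
    this.ref('id');
    this.field('body');
    this.add({ id: 'sample.pdf', body: text });
  });

  console.log(idx.search('contract'));
}

main().catch(console.error);
```

Because each PDF is processed independently, a batch of documents like this parallelizes naturally across worker processes, which is the point of the original post.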
I like the phrase “[m]uch information is trapped inside PDFs….”
Despite window-dressing executive orders, information is going to continue to be trapped inside PDFs.
What information do you want to free from its PDF chains?
I first saw this at DZone.