Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

April 23, 2019

Best OCR Tools – Side by Side

Filed under: Government,Government Data,OCR — Patrick Durusau @ 8:34 pm

Our Search for the Best OCR Tool, and What We Found by Ted Han and Amanda Hickman.

From the post:

We selected several documents—two easy to read reports, a receipt, an historical document, a legal filing with a lot of redaction, a filled in disclosure form, and a water damaged page—to run through the OCR engines we are most interested in. We tested three free and open source options (Calamari, OCRopus and Tesseract) as well as one desktop app (Adobe Acrobat Pro) and three cloud services (Abbyy Cloud, Google Cloud Vision, and Microsoft Azure Computer Vision).

All the scripts we used, as well as the complete output from each OCR engine, are available on GitHub. You can use the scripts to check our work, or to run your own documents against any of the clients we tested.

The quality of results varied between applications, but there wasn’t a stand out winner. Most of the tools handled a clean document just fine. None got perfect results on trickier documents, but most were good enough to make text significantly more comprehensible. In most cases if you need a complete, accurate transcription you’ll have to do additional review and correction.

Since government offices are loath to release searchable versions of important documents (think Mueller report), reasonable use of those documents requires OCR tools.

Han and Hickman enable you to compare OCR engines on your documents, an important step before deciding on which engine best meets your needs.
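If you have a hand-corrected transcription of one of your pages, one common way to put a number on such a comparison is the character error rate: the edit distance between the OCR output and the transcription, divided by the length of the transcription. What follows is a minimal sketch of that idea, not the method Han and Hickman used. The function names, engine labels and sample strings are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, ocr_text: str) -> float:
    """Edit distance normalized by the length of the trusted transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, ocr_text) / len(reference)


def rank_engines(reference: str, outputs: dict) -> list:
    """Return (engine, error rate) pairs, lowest error rate first.
    `outputs` maps a hypothetical engine label to the text it produced."""
    scored = [(name, char_error_rate(reference, text))
              for name, text in outputs.items()]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Made-up strings for illustration only, not real engine output.
    truth = "Report of the Special Counsel"
    samples = {
        "engine_a": "Report of the Special Counsel",
        "engine_b": "Rep0rt of the Specia1 Counsel",
    }
    for name, rate in rank_engines(truth, samples):
        print(f"{name}: {rate:.3f}")
```

A lower rate means fewer corrections for you to make by hand. A single page is a small sample, so score several pages of the kind you actually handle before choosing an engine.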

Should you find yourself in a hacker forum, no doubt by accident, do mention agencies which force OCR of their document releases. That unnecessary burden on readers and reporters should not go unrewarded.
