Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 23, 2019

Best OCR Tools – Side by Side

Filed under: Government,Government Data,OCR — Patrick Durusau @ 8:34 pm

Our Search for the Best OCR Tool, and What We Found by Ted Han and Amanda Hickman.

From the post:

We selected several documents—two easy to read reports, a receipt, an historical document, a legal filing with a lot of redaction, a filled in disclosure form, and a water damaged page—to run through the OCR engines we are most interested in. We tested three free and open source options (Calamari, OCRopus and Tesseract) as well as one desktop app (Adobe Acrobat Pro) and three cloud services (Abbyy Cloud, Google Cloud Vision, and Microsoft Azure Computer Vision).

All the scripts we used, as well as the complete output from each OCR engine, are available on GitHub. You can use the scripts to check our work, or to run your own documents against any of the clients we tested.

The quality of results varied between applications, but there wasn’t a stand out winner. Most of the tools handled a clean document just fine. None got perfect results on trickier documents, but most were good enough to make text significantly more comprehensible. In most cases if you need a complete, accurate transcription you’ll have to do additional review and correction.

Since government offices are loath to release searchable versions of important documents (think Mueller report), reasonable use of those documents requires OCR tools.

Han and Hickman enable you to compare OCR engines on your own documents, an important step before deciding which engine best meets your needs.
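Their scripts on GitHub cover each of the clients they tested. As a quick local check before trying the cloud services, here is a minimal sketch using the rOpenSci tesseract package (covered in a later post on this page); the file names, the confidence check, and the error-rate calculation are my additions, not from their post:

    # Run one engine over a test page and score it against a
    # hand-corrected reference transcription (hypothetical files).
    library(tesseract)

    eng    <- tesseract("eng")
    page   <- "disclosure_form.png"
    result <- ocr(page, engine = eng)

    # Word-level confidences give a quick first look at output quality
    words <- ocr_data(page, engine = eng)
    summary(words$confidence)

    # Character error rate against the reference transcription
    reference <- paste(readLines("disclosure_form_reference.txt"), collapse = "\n")
    cer <- as.numeric(adist(result, reference)) / nchar(reference)
    cat(sprintf("Character error rate: %.1f%%\n", cer * 100))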

Should you find yourself in a hacker forum, no doubt by accident, do mention agencies which force OCR of their document releases. That unnecessary burden on readers and reporters should not go unrewarded.

November 17, 2016

The new Tesseract package: High Quality OCR in R

Filed under: OCR,R — Patrick Durusau @ 1:38 pm

The new Tesseract package: High Quality OCR in R by Jeroen Ooms.

From the post:

Optical character recognition (OCR) is the process of extracting written or typed text from images such as photos and scanned documents into machine-encoded text. The new rOpenSci package tesseract brings one of the best open-source OCR engines to R. This enables researchers or journalists, for example, to search and analyze vast numbers of documents that are only available in printed form.

People looking to extract text and metadata from pdf files in R should try our pdftools package.

Reading too quickly, at first I thought I had missed a new version of Tesseract (tesseract-ocr on GitHub), an OCR program that I use on a semi-regular basis.

Reading a little more slowly, ;-), I discovered that Ooms is describing a new package for R that uses Tesseract for OCR.

This is great news, but be aware that Tesseract (whether called from the R package or standalone) can generate a large amount of output in a fairly short period of time.
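For reference, a minimal sketch of the package's core calls as described in Ooms' announcement (the file names are hypothetical):

    # OCR a single image to plain text with the rOpenSci tesseract package
    library(tesseract)
    text <- ocr("scanned_page.png")
    cat(text)

    # For PDFs that already carry a text layer, pdftools skips OCR entirely
    library(pdftools)
    pages <- pdf_text("report.pdf")   # one character string per page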

One of the stumbling blocks of OCR is the labor intensive process of cleaning up the inevitable mistakes.

Depending on how critical accuracy is (for searching, for example), you may choose to verify and clean only the quotes you intend to use in other publications.

Best to make those decisions up front and not be faced with a mountain of output that isn’t useful unless and until it has been corrected.
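One way to triage that mountain, assuming the same tesseract package: flag only the low-confidence words for manual review instead of proofreading everything. The threshold and file name below are my assumptions:

    # Flag low-confidence words for manual cleanup (hypothetical file,
    # assumed cutoff of 80)
    library(tesseract)
    words   <- ocr_data("scanned_page.png")
    suspect <- words[words$confidence < 80, c("word", "confidence")]
    print(suspect)   # verify these before quoting the text elsewhere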

May 22, 2015

Project Naptha

Filed under: Image Processing,OCR — Patrick Durusau @ 10:18 am

Project Naptha – highlight, copy, and translate text from any image by Kevin Kwok.

From the webpage:

Project Naptha automatically applies state-of-the-art computer vision algorithms on every image you see while browsing the web. The result is a seamless and intuitive experience, where you can highlight as well as copy and paste and even edit and translate the text formerly trapped within an image.

The homepage has examples of Project Naptha being used on comics, scans, photos, diagrams, Internet memes, and screenshots, along with sneak peeks at beta features such as translation, erasing text (from images), and changing text. (You can select multiple regions with the Shift key.)

This should be especially useful for journalists, bloggers, researchers, basically anyone who spends a lot of time looking for content on the Web.

If the project needs a slogan, I would suggest:

Naptha Frees Information From Image Prisons!

March 18, 2015

Wandora tutorial – OCR extractor and Alchemy API Entity extractor

Filed under: Entity Resolution,OCR,Topic Map Software,Topic Maps,Wandora — Patrick Durusau @ 1:47 pm

From the description:

Video reviews the OCR (Optical Character Recognition) extractor and the Alchemy API Entity extractor of the Wandora application. First, the OCR extractor is used to recognize text in PNG images. Next, the Alchemy API Entity extractor is used to recognize entities in the text. Wandora is an open source tool for people who collect and process information, especially networked knowledge and knowledge about WWW resources. For more information see http://wandora.org.

A great demo of some of Wandora's many options! (Wandora has more options than a Swiss Army knife.)

It is an impressive demonstration.

If you aren’t familiar with Wandora, take a close look at it: http://wandora.org.

March 18, 2014

eMOP Early Modern OCR Project

Filed under: OCR,Text Mining — Patrick Durusau @ 9:06 pm

eMOP Early Modern OCR Project

From the webpage:

The Early Modern OCR Project is an effort, on the one hand, to make access to texts more transparent and, on the other, to preserve a literary cultural heritage. The printing process in the hand-press period (roughly 1475-1800), while systematized to a certain extent, nonetheless produced texts with fluctuating baselines, mixed fonts, and varied concentrations of ink (among many other variables). Combining these factors with the poor quality of the images in which many of these books have been preserved (in EEBO and, to a lesser extent, ECCO), creates a problem for Optical Character Recognition (OCR) software that is trying to translate the images of these pages into archiveable, mineable texts. By using innovative applications of OCR technology and crowd-sourced corrections, eMOP will solve this OCR problem.

I first saw this project at: Automatic bulk OCR and full-text search for digital collections using Tesseract and Solr by Chris Adams.

I find it exciting because of the progress the project is making on texts from 1475-1800. That progress matters for texts of the period itself, but I am also hoping the techniques can be adapted to older materials.

Say, older by several thousand years.

Despite pretensions to the contrary, “web scale” is not very much compared to the data feeds of modern colliders, telescopes, gene sequencers, and the like, or to the vast store of historical texts that remain offline. To say nothing of the need for secondary analysis of those texts.

Every text that becomes available enriches a semantic tapestry that only humans can enjoy.
