Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 18, 2014

Automatic bulk OCR and full-text search…

Filed under: ElasticSearch,Search Engines,Solr,Topic Maps — Patrick Durusau @ 8:48 pm

Automatic bulk OCR and full-text search for digital collections using Tesseract and Solr by Chris Adams.

From the post:

Digitizing printed material has become an industrial process for large collections. Modern scanning equipment makes it easy to process millions of pages and concerted engineering effort has even produced options at the high-end for fragile rare items while innovative open-source projects like Project Gado make continual progress reducing the cost of reliable, batch scanning to fit almost any organization’s budget.

Such efficiencies are great for our goals of preserving history and making it available but they start making painfully obvious the degree to which digitization capacity outstrips our ability to create metadata. This is a big problem because most of the ways we find information involve searching for text and a large TIFF file is effectively invisible to a full-text search engine. The classic library solution to this challenge has been cataloging but the required labor is well beyond most budgets and runs into philosophical challenges when users want to search on something which wasn’t considered noteworthy at the time an item was cataloged.

In the spirit of finding the simplest thing that could possibly work I’ve been experimenting with a completely automated approach to perform OCR on new items and offering combined full-text search over both the available metadata and OCR text, as can be seen in this example:
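The example itself is not reproduced here. As a stand-in, here is a minimal sketch of the kind of pipeline the excerpt describes: OCR an image with the Tesseract command line, merge the text with whatever metadata exists, and send the result to Solr. It is not Chris's code. The field names (`id`, `title`, `ocr_text`) and the update URL passed to `index_docs` are assumptions and would have to match a real Solr core and its schema.

```python
import json
import subprocess
import urllib.request


def ocr_image(image_path):
    """Run the Tesseract CLI on one image and return the recognized text."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing it to a file.
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def build_solr_doc(item_id, metadata, ocr_text):
    """Combine catalog metadata and OCR text into one document.

    Field names are hypothetical; they must match the target schema.
    """
    doc = dict(metadata)
    doc["id"] = item_id
    # Collapse runs of whitespace and line breaks left over from OCR.
    doc["ocr_text"] = " ".join(ocr_text.split())
    return doc


def index_docs(docs, update_url):
    """POST a list of documents as JSON to a Solr update handler."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With Tesseract installed and a Solr core running, something like `index_docs([build_solr_doc("item-1", {"title": "Letter"}, ocr_image("page.tif"))], update_url)` would index one page. Because metadata and OCR text land in the same document, a single query can search both.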

If this weren’t impressive enough, Chris has a number of research ideas, including:

the idea for a generic web application which would display hOCR with the corresponding images for correction with all of the data stored somewhere like Github for full change tracking and review. It seems like something along those lines would be particularly valuable as a public service to avoid the expense of everyone reinventing large parts of this process customized for their particular workflow.
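To give a sense of what such an application would work with: hOCR is HTML in which each recognized word carries its bounding box in a `title` attribute, which is what makes it possible to line text up with the page image. The following is a rough sketch, using only the Python standard library, of pulling words and boxes out of hOCR markup. The class and function names are made up for illustration, and it handles only `ocrx_word` spans with a `bbox` property.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from hOCR word elements."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bounding box of the word being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(n) for n in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def hocr_words(hocr):
    """Return a list of (word, (x0, y0, x1, y1)) tuples from hOCR markup."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

A correction interface could draw each box over the scanned image and let a reader edit the word inside it. Storing the edited hOCR as plain text files is what would make the Github-style change tracking possible.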

More grist for a topic map mill!

PS: Should you ever come across a treasure trove of not widely available documents, please replicate them to as many public repositories as possible.

In leak situations, traditional news outlets protect people who knew they were playing in the street. Why they merit more protection than the average person is a mystery to me. Let’s protect the average people first and the players last.
