From the webpage:
The Early Modern OCR Project is an effort, on the one hand, to make access to texts more transparent and, on the other, to preserve a literary cultural heritage. The printing process in the hand-press period (roughly 1475-1800), while systematized to a certain extent, nonetheless produced texts with fluctuating baselines, mixed fonts, and varied concentrations of ink (among many other variables). Combining these factors with the poor quality of the images in which many of these books have been preserved (in EEBO and, to a lesser extent, ECCO), creates a problem for Optical Character Recognition (OCR) software that is trying to translate the images of these pages into archiveable, mineable texts. By using innovative applications of OCR technology and crowd-sourced corrections, eMOP will solve this OCR problem.
I first saw this project in "Automatic bulk OCR and full-text search for digital collections using Tesseract and Solr" by Chris Adams.
I find it exciting because of the progress the project is making on texts from 1475 to 1800. That matters for the texts of that period, for sure, but I also hope those techniques can be adapted to older materials.
Say, older by several thousand years.
Despite pretensions to the contrary, “web scale” is not very much when compared with the data feeds from modern science (colliders, telescopes, gene sequencing, etc.), or with the vast store of historical texts that remain offline. To say nothing of the need for secondary analysis of those texts.
Every text that becomes available enriches a semantic tapestry that only humans can enjoy.