Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 19, 2011

TCP Text Creation Partnership

Filed under: Concept Drift,Dataset,Language — Patrick Durusau @ 7:51 pm

From the “mission” page:

The Text Creation Partnership’s primary objective is to produce standardized, digitally-encoded editions of early print books. This process involves a labor-intensive combination of manual keyboard entry (from digital images of the books’ original pages), the addition of digital markup (conforming to guidelines set by a text encoding standard-setting body known as the TEI), and editorial review.

The chief sources of the TCP’s digital images are database products marketed by commercial publishers. These include Proquest’s Early English Books Online (EEBO), Gale’s Eighteenth Century Collections Online (ECCO), and Readex’s Evans Early American Imprints. Idiosyncrasies in early modern typography make these collections very difficult to convert into searchable, machine-readable text using common scanning techniques (i.e., Optical Character Recognition). Through the TCP, commercial publishers and over 150 different libraries have come together to fund the conversion of these cultural heritage materials into enduring, digitally dynamic editions.

To date, the EEBO-TCP project has converted over 25,000 books. ECCO- and EVANS-TCP have converted another 7,000+ books. A second phase of EEBO-TCP production aims to convert the remaining 44,000 unique monograph titles in the EEBO corpus by 2015, and all of the TCP texts are scheduled to enter the public domain by 2020.

Several thousand titles from the 18th century collection are already available to the general public.

I mention this as a source of texts for testing search software against semantic drift, the sort of drift that occurs in any living language. To say nothing of how our interpretations shift for languages that have no native speakers left to defend them.
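One rough way to probe for drift, offered as a sketch rather than a recipe: compare the words that surround a term in an early sample with those that surround it in a later one. The function names, the window size, and the two sample sentences below are all invented for illustration; none of them come from the TCP texts or from any particular search package.

```python
from collections import Counter


def context_counts(tokens, target, window=2):
    """Count the words found within `window` tokens of each use of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Collect the neighbours, skipping the target occurrence itself.
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def context_overlap(early_tokens, late_tokens, target, window=2):
    """Jaccard overlap of the context vocabularies of `target` in two samples.

    Returns None when `target` has no context words in either sample.
    """
    early = set(context_counts(early_tokens, target, window))
    late = set(context_counts(late_tokens, target, window))
    if not early and not late:
        return None
    return len(early & late) / len(early | late)


# Invented sample sentences (assumptions, not quotations from any corpus).
early_sample = "a nice distinction is a precise and subtle distinction".split()
late_sample = "we had a nice day and a nice meal".split()

score = context_overlap(early_sample, late_sample, "nice")
print(f"context overlap for 'nice': {score:.3f}")
```

A low score only flags a term as worth a closer look. Function words dominate windows this small, and spelling variation in early print will depress the overlap for reasons that have nothing to do with meaning.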
