Abbot MorphAdorner collaboration
From the webpage:
The Center for Digital Research in the Humanities at the University of Nebraska and Northwestern University’s Academic and Research Technologies are pleased to announce the first fruits of a collaboration between the Abbot and EEBO-MorphAdorner projects: the release of some 2,000 18th century texts from the TCP-ECCO collections in a TEI-P5 format and with linguistic annotation. More texts will follow shortly, subject to the access restrictions that will govern the use of TCP texts for the remainder of this decade.
The Text Creation Partnership (TCP) collection currently consists of about 50,000 fully transcribed SGML texts from the first three centuries of English print culture. The collection will grow to approximately 75,000 volumes and will contain at least one copy of every book published before 1700 as well as substantial samples of 18th century texts published in the British Isles or North America. The ECCO-TCP texts are already in the public domain. The other texts will follow them between 2014 and 2015. The Evans texts will be released in June 2014, followed by a release of some 25,000 EEBO texts in 2015.
It is a major goal of the Abbot and EEBO MorphAdorner collaboration to turn the TCP texts into the foundation for a “Book of English,” defined as
- a large, growing, collaboratively curated, and public domain corpus
- of written English since its earliest modern form
- with full bibliographical detail
- and light but consistent structural and linguistic annotation
Texts in the annotated TCP corpus will exist in more than one format so as to facilitate different uses to which they are likely to be put. In a first step, Abbot transforms the SGML source text into a TEI P5 XML format. Abbot, a software program designed by Brian Pytlik Zillig and Stephen Ramsay, can read arbitrary XML files and convert them into other XML formats or a shared format. Abbot generates its own set of conversion routines at runtime by reading an XML schema file and programmatically effecting the desired transformations. It is an excellent tool for creating an environment in which texts originating in separate projects can acquire a higher degree of interoperability. A prototype of Abbot was used in the MONK project to harmonize texts from several collections, including the TCP, Chadwyck-Healey’s Nineteenth-Century Fiction, the Wright Archive of American novels 1851-1875, and Documenting the American South.
This first transformation maintains all the typographical data recorded in the original SGML transcription, including long ‘s’, printer’s abbreviations, superscripts etc. In a second step MorphAdorner tokenizes this file. MorphAdorner was developed by Philip R. Burns. It is a multi-purpose suite of NLP tools with special features for the tokenization, analysis, and annotation of historical corpora. The tokenization uses algorithms and heuristics specific to the practices of Early Modern print culture, wraps every word token in a <w> element with a unique ID, and explicitly marks sentence boundaries.
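To make the tokenization step concrete, here is a minimal Python sketch of wrapping word tokens in <w> elements with unique IDs and marking sentence boundaries. It is not MorphAdorner's code and does not use its Early Modern heuristics: the function name `tokenize_to_w`, the `<s>` and `<pc>` wrapper elements, the ID scheme, and the naive regular expressions are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET


def tokenize_to_w(text, prefix="w"):
    """Wrap each word in a <w> element with a unique xml:id, grouped by <s>.

    Hypothetical helper: sentence splitting and tokenization here are naive
    stand-ins for the period-specific heuristics described in the text.
    """
    root = ET.Element("p")
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    counter = 0
    for sentence in sentences:
        s_el = ET.SubElement(root, "s")  # explicit sentence boundary
        # A token is either a run of word characters or one punctuation mark.
        for token in re.findall(r"\w+|[^\w\s]", sentence):
            counter += 1
            tag = "w" if re.match(r"\w", token) else "pc"
            el = ET.SubElement(s_el, tag, {"xml:id": f"{prefix}{counter}"})
            el.text = token
    return root


if __name__ == "__main__":
    tree = tokenize_to_w("Thus it was. So be it!")
    print(ET.tostring(tree, encoding="unicode"))
```

Giving punctuation its own element and ID is a design choice in this sketch, so that every token in the running text stays addressable.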
In the next step (conceptually different but merged in practice with the previous), some typographical features are removed from the tokenized text, but all such changes are recorded in a change log and may therefore be reversed. The changes aim at making it easier to manipulate the corpus with software tools that presuppose modern printing practices. They involve such things as replacing long ‘s’ with plain ‘s’, or resolving unambiguous printer’s abbreviations and superscripts.
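The idea of a reversible change log can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual log format: the `(id, form)` token tuples, the dictionary-based log entries, and the function names `normalize` and `revert` are hypothetical, and only the long ‘s’ replacement is shown.

```python
def normalize(tokens):
    """Replace long s with plain s and record every change.

    `tokens` is a list of (token_id, form) pairs; the return value is the
    normalized token list plus a change log (hypothetical format).
    """
    normalized, log = [], []
    for token_id, form in tokens:
        new_form = form.replace("ſ", "s")
        if new_form != form:
            # Keep the original spelling so the change can be undone later.
            log.append({"id": token_id, "original": form, "normalized": new_form})
        normalized.append((token_id, new_form))
    return normalized, log


def revert(tokens, log):
    """Restore original spellings by looking up each token ID in the log."""
    originals = {entry["id"]: entry["original"] for entry in log}
    return [(token_id, originals.get(token_id, form)) for token_id, form in tokens]


if __name__ == "__main__":
    source = [("w1", "ſuch"), ("w2", "a"), ("w3", "paſſion")]
    modern, changes = normalize(source)
    print(modern)
    print(changes)
    print(revert(modern, changes) == source)
```

Keying the log on token IDs is what makes the reversal straightforward, which is one reason unique IDs on every token matter.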
Talk about researching across language as it changes!
This is way cool!
Lots of opportunities for topic map-based applications.
For more information: