Archive for the ‘Text Encoding Initiative (TEI)’ Category

TEI XML -> HTML w/ XQuery [+ CSV -> XML]

Thursday, May 5th, 2016

Convert TEI XML to HTML with XQuery and BaseX by Adam Steffanick.

From the post:

We converted a document from the Text Encoding Initiative’s (TEI) Extensible Markup Language (XML) scheme to HTML with XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This guide covers the basics of how to convert a document from TEI XML to HTML while retaining element attributes with XQuery and BaseX.

I’ve created a GitHub repository of sample TEI XML files to convert from TEI XML to HTML. This guide references a GitHub gist of XQuery code and HTML output to illustrate each step of the TEI XML to HTML conversion process.

The post only treats six (6) TEI elements, but the methods presented could be extended to a larger set of TEI elements.
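Adam works in XQuery, but the core move — mapping TEI elements to HTML elements while retaining their attributes — can be sketched in any language. Here is a minimal Python illustration using the standard library’s xml.etree; the TEI fragment and the three-element mapping are my own invention for illustration, not taken from Adam’s post.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical element mapping; Adam's post handles six TEI elements,
# this sketch maps just three and falls back to <div> for the rest.
TAG_MAP = {"p": "p", "head": "h1", "hi": "span"}

def tei_to_html(elem):
    """Recursively convert a TEI element tree into an HTML element tree,
    retaining TEI attributes (e.g. rend) as data-* attributes."""
    tag = elem.tag.replace(TEI_NS, "")
    html = ET.Element(TAG_MAP.get(tag, "div"))
    for name, value in elem.attrib.items():
        html.set("data-" + name.split("}")[-1], value)  # keep attributes
    html.text = elem.text
    for child in elem:
        sub = tei_to_html(child)
        sub.tail = child.tail
        html.append(sub)
    return html

tei = ET.fromstring(
    '<div xmlns="http://www.tei-c.org/ns/1.0">'
    '<head>Title</head><p>Some <hi rend="italic">emphasized</hi> text.</p></div>'
)
print(ET.tostring(tei_to_html(tei), encoding="unicode"))
```

The same recursive walk is what the XQuery typeswitch pattern in the post expresses; only the syntax differs.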

TEI P5 has 563 elements, which may appear in many valid combinations. It also defines 256 attributes, distributed among those 563 elements.

Consider using XQuery as a quality assurance (QA) tool to ensure that encoded texts conform to your project’s definition of expected text encoding.
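The QA idea is simple: walk every element in an encoded text and flag anything outside your project’s approved profile. A small Python sketch of that check (the approved-element list here is invented for illustration, not a real project profile):

```python
import xml.etree.ElementTree as ET

# Hypothetical project profile: the only TEI elements this project allows.
ALLOWED = {"TEI", "teiHeader", "fileDesc", "titleStmt", "title",
           "text", "body", "div", "head", "p", "hi"}

def check_elements(xml_string):
    """Return the set of element names that fall outside the profile."""
    tree = ET.fromstring(xml_string)
    used = {elem.tag.split("}")[-1] for elem in tree.iter()}
    return used - ALLOWED

sample = ('<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
          '<p>Fine.</p><table><row/></table></body></text></TEI>')
print(sorted(check_elements(sample)))  # table and row are off-profile
```

An XQuery version would be a one-line distinct-values query over `//*`, run across the whole corpus at once.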

While I was at Adam’s site I encountered: Convert CSV to XML with XQuery and BaseX, which you should bookmark for future reference.

Catchwords

Wednesday, February 17th, 2016

Johan Oosterman tweets:

Look, the next page begins with these words! Very helpful man makes clear how catchwords work. HuntingtonHM1048

HuntingtonHM1048-detail

HuntingtonHM1048

Catchwords were originally used to keep pages in order for binding. You won’t encounter them in post-19th-century materials, but they remain interesting from a markup perspective.

The catchword, in this case accompanied by a graphic, appears on the page, and the next page does begin with those words. Do you capture the catchword? Its graphic? The relationship between the catchword and the opening words of the next page? What if there is an error?

The Preservation of Favoured Traces [Multiple Editions of Darwin]

Thursday, December 10th, 2015

The Preservation of Favoured Traces

From the webpage:

Charles Darwin first published On the Origin of Species in 1859, and continued revising it for several years. As a result, his final work reads as a composite, containing more than a decade’s worth of shifting approaches to his theory of evolution. In fact, it wasn’t until his fifth edition that he introduced the concept of “survival of the fittest,” a phrase that actually came from philosopher Herbert Spencer. By color-coding each word of Darwin’s final text by the edition in which it first appeared, our latest book and poster of his work trace his thoughts and revisions, demonstrating how scientific theories undergo adaptation before their widespread acceptance.

The original interactive version was built in tandem with exploratory and teaching tools, enabling users to see changes at both the macro level, and word-by-word. The printed poster allows you to see the patterns where edits and additions were made and—for those with good vision—you can read all 190,000 words on one page. For those interested in curling up and reading at a more reasonable type size, we’ve also created a book.

The poster and book are available for purchase below. All proceeds are donated to charity.

For textual history fans this is an impressive visualization of the various editions of On the Origin of Species.

To help students get away from the notion of texts as static creations, and to gain some experience with markup, consider choosing a well-known work with multiple editions that is available in TEI.

Then have the students write XQuery expressions to transform a chapter of such a work into a later (or earlier) edition.

Depending on the quality of the work, that could be a means of contributing to the number of TEI encoded texts and your students would gain experience with both TEI and XQuery.
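The visualization’s core move — labeling each word of the final text with the edition in which it first appeared — can be sketched without any markup at all. Assuming two plain-text editions (the sentences below are invented, not Darwin’s actual text), Python’s difflib does the word-level alignment:

```python
import difflib

# Two invented "editions" of the same sentence, for illustration only.
first_edition = "natural selection preserves favoured races".split()
fifth_edition = ("natural selection or the survival of the fittest "
                 "preserves favoured races").split()

def tag_by_edition(old, new):
    """Label each word of the newer text with the edition it first appeared in."""
    tagged = []
    matcher = difflib.SequenceMatcher(a=old, b=new)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        edition = "1st" if op == "equal" else "5th"
        tagged.extend((word, edition) for word in new[j1:j2])
    return tagged

for word, edition in tag_by_edition(first_edition, fifth_edition):
    print(f"{word} [{edition}]")
```

Running the alignment across all six editions, word by word, is essentially what the color-coded poster does; a classroom XQuery version would emit TEI `<app>`/`<rdg>` elements instead of printed labels.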

„To See or Not to See“…

Saturday, October 17th, 2015

„To See or Not to See“ – an Interactive Tool for the Visualization and Analysis of Shakespeare Plays by Thomas Wilhelm, Manuel Burghardt, and Christian Wolff.

Abstract:

In this article we present a web-based tool for the visualization and analysis of quantitative characteristics of Shakespeare plays. We use resources from the Folger Digital Texts Library as input data for our tool. The Folger Shakespeare texts are annotated with structural markup from the Text Encoding Initiative (TEI). Our tool interactively visualizes which character says what and how much at a particular point in time, allowing customized interpretations of Shakespeare plays on the basis of quantitative aspects, without having to care about technical hurdles such as markup or programming languages.

I found the remarkable web tool described in this paper at: http://www.thomaswilhelm.eu/shakespeare/output/hamlet.html.

You can easily change plays (menu, top left) but note that “download source” refers to the processed plays themselves, not the XSL/T code that transformed the TEI markup. I think all the display code is JavaScript/CSS so you can scrape that from the webpage. I am more interested in the XSL/T applied to the original markup.

In the paper the authors say that plays may have over “5000 lines of code” for their transformation with XSL/T.

I am very curious whether translating the XSLT code into XQuery would reduce the amount of code required.

I recently re-wrote the XSLT code for the W3C Bibliography Generator, limited to Recommendations, and the XQuery code was far shorter than the XSLT used by the W3C.

Look for a post on the XQuery I wrote for the W3C bibliography on Monday, 19 October 2015.

If you decide to cite this article:

Wilhelm, T., Burghardt, M. & Wolff, C. (2013). “To See or Not to See” – An Interactive Tool for the Visualization and Analysis of Shakespeare Plays. In Franken-Wendelstorf, R., Lindinger, E. & Sieck J. (eds): Kultur und Informatik – Visual Worlds & Interactive Spaces, Berlin (pp. 175-185). Glückstadt: Verlag Werner Hülsbusch.

Two of the resources mentioned in the article:

Folger Digital Texts Library

Text Encoding Initiative (TEI)

Early English Books Online – Good News and Bad News

Friday, January 2nd, 2015

Early English Books Online

The very good news is that 25,000 volumes from the Early English Books Online collection have been made available to the public!

From the webpage:

The EEBO corpus consists of the works represented in the English Short Title Catalogue I and II (based on the Pollard & Redgrave and Wing short title catalogs), as well as the Thomason Tracts and the Early English Books Tract Supplement. Together these trace the history of English thought from the first book printed in English in 1475 through to 1700. The content covers literature, philosophy, politics, religion, geography, science and all other areas of human endeavor. The assembled collection of more than 125,000 volumes is a mainstay for understanding the development of Western culture in general and the Anglo-American world in particular. The STC collections have perhaps been most widely used by scholars of English, linguistics, and history, but these resources also include core texts in religious studies, art, women’s studies, history of science, law, and music.

Even better news from Sebastian Rahtz (Chief Data Architect, IT Services, University of Oxford):

The University of Oxford is now making this collection, together with Gale Cengage’s Eighteenth Century Collections Online (ECCO), and Readex’s Evans Early American Imprints, available in various formats (TEI P5 XML, HTML and ePub) initially via the University of Oxford Text Archive at http://www.ota.ox.ac.uk/tcp/, and offering the source XML for community collaborative editing via Github. For the convenience of UK universities who subscribe to JISC Historic Books, a link to page images is also provided. We hope that the XML will serve as the base for enhancements and corrections.

This catalogue also lists EEBO Phase 2 texts, but the HTML and ePub versions of these can only be accessed by members of the University of Oxford.

[Technical note]
Those interested in working on the TEI P5 XML versions of the texts can check them out of Github, via https://github.com/textcreationpartnership/, where each of the texts is in its own repository (eg https://github.com/textcreationpartnership/A00021). There is a CSV file listing all the texts at https://raw.githubusercontent.com/textcreationpartnership/Texts/master/TCP.csv, and a simple Linux/OSX shell script to clone all 32853 unrestricted repositories at https://raw.githubusercontent.com/textcreationpartnership/Texts/master/cloneall.sh
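The per-text repositories make selective harvesting easy: read the CSV, filter to the unrestricted texts, and clone only what you need. A Python sketch of that workflow (the CSV column names below are an assumption for illustration; check the real TCP.csv header before relying on them):

```python
import csv
import io

# A stand-in for the first lines of TCP.csv; the real file's column
# layout may differ -- these headers are assumed for illustration.
sample_csv = "TCP,Status\nA00021,Free\nA00025,Restricted\nA00034,Free\n"

def clone_commands(csv_text):
    """Build git-clone commands for the unrestricted TCP repositories."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return ["git clone https://github.com/textcreationpartnership/%s.git" % r["TCP"]
            for r in rows if r["Status"] == "Free"]

for cmd in clone_commands(sample_csv):
    print(cmd)
```

Piping the output to a shell reproduces what the project’s cloneall.sh script does for all 32,853 unrestricted repositories.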

Now for the BAD NEWS:

An additional 45,000 books:

Currently, EEBO-TCP Phase II texts are available to authorized users at partner libraries. Once the project is done, the corpus will be available for sale exclusively through ProQuest for five years. Then, the texts will be released freely to the public.

Can you guess why the public is barred from what are obviously public domain texts?

Because our funding is limited, we aim to key as many different works as possible, in the language in which our staff has the most expertise.

Academic projects are supposed to fund themselves and be self-sustaining. When anyone asks about the sustainability of an academic project, ask them when the last time was that their country’s military was “self-sustaining.” The U.S. has spent $2.6 trillion on a “war on terrorism” and has nothing to show for it other than dead and injured military personnel, perverted budgetary policies, and loss of privacy on a worldwide scale.

It is hard to imagine what sort of lifetime access for everyone on Earth could be secured for less than $1 trillion. No more special pricing and contracts for countries A to Zed. Eliminate all that paperwork for publishers; for access, all anyone needs is a connection to the Internet. The publishers would have a guaranteed income stream, with less overhead from sales personnel, administrative staff, etc. And people would have access (whether used or not) to educate themselves, to make new discoveries, etc.

My proposal does not involve payments to large military contractors or subversion of legitimate governments or imposition of American values on other cultures. Leaving those drawbacks to one side, what do you think about it otherwise?

The Shelley-Godwin Archive

Tuesday, November 5th, 2013

The Shelley-Godwin Archive

From the homepage:

The Shelley-Godwin Archive will provide the digitized manuscripts of Percy Bysshe Shelley, Mary Wollstonecraft Shelley, William Godwin, and Mary Wollstonecraft, bringing together online for the first time ever the widely dispersed handwritten legacy of this uniquely gifted family of writers. The result of a partnership between the New York Public Library and the Maryland Institute for Technology in the Humanities, in cooperation with Oxford’s Bodleian Library, the S-GA also includes key contributions from the Huntington Library, the British Library, and the Houghton Library. In total, these partner libraries contain over 90% of all known relevant manuscripts.

In case you don’t recognize the names: Mary Shelley wrote Frankenstein; or, The Modern Prometheus; William Godwin was a philosopher and early modern (unfortunately theoretical) anarchist; Percy Bysshe Shelley, an English Romantic poet; Mary Wollstonecraft, a writer and feminist. Quite a group for their time, or even now.

From the About page on Technological Infrastructure:

The technical infrastructure of the Shelley-Godwin Archive builds on linked data principles and emerging standards such as the Shared Canvas data model and the Text Encoding Initiative’s Genetic Editions vocabulary. It is designed to support a participatory platform where scholars, students, and the general public will be able to engage in the curation and annotation of the Archive’s contents.

The Archive’s transcriptions and software applications and libraries are currently published on GitHub, a popular commercial host for projects that use the Git version control system.

  • TEI transcriptions and other data
  • Shared Canvas viewer and search service
  • Shared Canvas manifest generation

All content and code in these repositories is available under open licenses (the Apache License, Version 2.0 and the Creative Commons Attribution license). Please see the licensing information in each individual repository for additional details.

Shared Canvas and Linked Open Data

Shared Canvas is a new data model designed to facilitate the description and presentation of physical artifacts—usually textual—in the emerging linked open data ecosystem. The model is based on the concept of annotation, which it uses both to associate media files with an abstract canvas representing an artifact, and to enable anyone on the web to describe, discuss, and reuse suitably licensed archival materials and digital facsimile editions. By allowing visitors to create connections to secondary scholarship, social media, or even scenes in movies, projects built on Shared Canvas attempt to break down the walls that have traditionally enclosed digital archives and editions.

Linked open data or content is published and licensed so that “anyone is free to use, reuse, and redistribute it—subject only, at most, to the requirement to attribute and/or share-alike,” (from http://opendefinition.org/) with the additional requirement that when an entity such as a person, a place, or thing that has a recognizable identity is referenced in the data, the reference is made using a well-known identifier—called a universal resource identifier, or “URI”—that can be shared between projects. Together, the linking and openness allow conformant sets of data to be combined into new data sets that work together, allowing anyone to publish their own data as an augmentation of an existing published data set without requiring extensive reformulation of the information before it can be used by anyone else.

The Shared Canvas data model was developed within the context of the study of medieval manuscripts to provide a way for all of the representations of a manuscript to co-exist in an openly addressable and shareable form. A relatively well-known example of this is the tenth-century Archimedes Palimpsest. Each of the pages in the palimpsest was imaged using a number of different wavelengths of light to bring out different characteristics of the parchment and ink. For example, some inks are visible under one set of wavelengths while other inks are visible under a different set. Because the original writing and the newer writing in the palimpsest used different inks, the images made using different wavelengths allow the scholar to see each ink without having to consciously ignore the other ink. In some cases, the ink has faded so much that it is no longer visible to the naked eye. The Shared Canvas data model brings together all of these different images of a single page by considering each image to be an annotation about the page instead of a surrogate for the page. The Shared Canvas website has a viewer that demonstrates how the imaging wavelengths can be selected for a page.

One important bit, at least for topic maps, is the view of the Shared Canvas data model that:

each image [is considered] to be an annotation about the page instead of a surrogate for the page.

If I tried to say that or even re-say it, it would be much more obscure. 😉

Whether “annotation about” versus “surrogate for” will catch on beyond manuscript studies is hard to say.

Not the way it is usually said in topic maps but if other terminology is better understood, why not?
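The “annotation about, not surrogate for” idea can be made concrete with a toy data model: the canvas stands for the physical page, and every image merely annotates it, so any number of imagings coexist without one claiming to *be* the page. (The property names below are invented for illustration, not the actual Shared Canvas RDF vocabulary.)

```python
# The canvas represents the physical page itself.
canvas = {"id": "canvas/page-57r", "label": "Palimpsest page, f.57r"}

# Each image is an annotation *about* the canvas, never a surrogate for it.
annotations = [
    {"motivation": "painting", "target": canvas["id"],
     "body": "images/57r-visible.tif", "wavelength": "visible"},
    {"motivation": "painting", "target": canvas["id"],
     "body": "images/57r-uv.tif", "wavelength": "ultraviolet"},
]

def images_for(canvas_id, wavelength=None):
    """All images annotating a canvas, optionally filtered by wavelength."""
    return [a["body"] for a in annotations
            if a["target"] == canvas_id
            and (wavelength is None or a["wavelength"] == wavelength)]

print(images_for("canvas/page-57r"))                  # both imagings coexist
print(images_for("canvas/page-57r", "ultraviolet"))   # select one view
```

Because the annotations are separate from the canvas, third parties can add their own (secondary scholarship, social media, film scenes) without touching the archive’s data — the openness the Archive’s About page describes.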

The Music Encoding Conference 2013

Saturday, November 10th, 2012

The Music Encoding Conference 2013

22-24 May, 2013
Mainz Academy for Literature and Sciences, Mainz, Germany

Important dates:
31 December 2012: Deadline for abstract submissions
31 January 2013: Notification of acceptance/rejection of submissions
22-24 May 2013: Conference
31 July 2013: Deadline for submission of full papers for conference proceedings
December 2013: Publication of conference proceedings

From the email announcement of the conference:

You are cordially invited to participate in the Music Encoding Conference 2013 – Concepts, Methods, Editions, to be held 22-24 May, 2013, at the Mainz Academy for Literature and Sciences in Mainz, Germany.

Music encoding is now a prominent feature of various areas in musicology and music librarianship. The encoding of symbolic music data provides a foundation for a wide range of scholarship, and over the last several years, has garnered a great deal of attention in the digital humanities. This conference intends to provide an overview of the current state of data modeling, generation, and use, and aims to introduce new perspectives on topics in the fields of traditional and computational musicology, music librarianship, and scholarly editing, as well as in the broader area of digital humanities.

As the conference has a dual focus on music encoding and scholarly editing in the context of the digital humanities, the Program Committee is also happy to announce keynote lectures by Frans Wiering (Universiteit Utrecht) and Daniel Pitti (University of Virginia), both distinguished scholars in their respective fields of musicology and markup technologies in the digital humanities.

Proposals for papers, posters, panel discussions, and pre-conference workshops are encouraged. Prospective topics for submissions include:

  • theoretical and practical aspects of music, music notation models, and scholarly editing
  • rendering of symbolic music data in audio and graphical forms
  • relationships between symbolic music data, encoded text, and facsimile images
  • capture, interchange, and re-purposing of music data and metadata
  • ontologies, authority files, and linked data in music encoding
  • additional topics relevant to music encoding and music editing

I know Daniel Pitti from the TEI (Text Encoding Initiative). His presence assures me this will be a great conference for markup, modeling and music enthusiasts.

I can recognize music because it comes in those little plastic boxes. 😉 If you want to talk about the markup/encoding/mapping side, ping me.

NEH Institute Working With Text In a Digital Age

Saturday, September 1st, 2012

NEH Institute Working With Text In a Digital Age

From the webpage:

The goal of this demo/sample code is to provide a platform which institute participants can use to complete an exercise to create a miniature digital edition. We will use these editions as concrete examples for discussion of decisions and issues to consider when creating digital editions from TEI XML, annotations and other related resources.

Some specific items for consideration and discussion through this exercise :

  • Creating identifiers for your texts.
  • Establishing markup guidelines and best practices.
  • Use of inline annotations versus standoff markup.
  • Dealing with overlapping hierarchies.
  • OAC (Open Annotation Collaboration)
  • Leveraging annotation tools.
  • Applying Linked Data concepts.
  • Distribution formats: optimizing for display vs for enabling data reuse.
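The inline-versus-standoff question in the list above is easy to make concrete: the same annotation can live inside the text stream or in a separate structure that points at character offsets. A small Python illustration (the offset scheme is my own sketch, not the OAC model):

```python
text = "Call me Ishmael."

# Inline markup embeds the annotation in the text stream:
inline = "Call me <persName>Ishmael</persName>."

# Standoff markup leaves the text untouched and records offsets instead:
standoff = [{"start": 8, "end": 15, "tag": "persName"}]

def apply_standoff(source, annotations):
    """Render standoff annotations as inline tags (non-overlapping case)."""
    out, last = [], 0
    for a in sorted(annotations, key=lambda a: a["start"]):
        out.append(source[last:a["start"]])
        out.append("<%s>%s</%s>" % (a["tag"],
                                    source[a["start"]:a["end"]],
                                    a["tag"]))
        last = a["end"]
    out.append(source[last:])
    return "".join(out)

print(apply_standoff(text, standoff))
```

Standoff’s payoff shows up with the next bullet, overlapping hierarchies: several independent annotation sets can target the same character range, which a single inline XML tree cannot represent.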

Excellent resource!

Offers a way to learn/test digital edition skills.

You can use it as a template to produce similar materials with texts of greater interest to you.

The act of encoding asks what subjects you are going to recognize, and under what conditions. Good practice for topic map construction.

Not to mention that historical editions of a text have made similar, possibly differing decisions on the same text.

Topic maps are a natural way to present such choices on their own merits, as well as being able to compare and contrast those choices.

I first saw this at The banquet of the digital scholars.

The banquet of the digital scholars

Saturday, September 1st, 2012

The banquet of the digital scholars

The actual workshop title: Humanities Hackathon on editing Athenaeus and on the Reinvention of the Edition in a Digital Space


September 30, 2012 Registration Deadline

October 10-12, 2012
Universität Leipzig (ULEI) & Deutsches Archäologisches Institut (DAI) Berlin

Abstract:

The University of Leipzig will host a hackathon that addresses two basic tasks. On the one hand, we will focus upon the challenges of creating a digital edition for the Greek author Athenaeus, whose work cites more than a thousand earlier sources and is one of the major sources for lost works of Greek poetry and prose. At the same time, we use the case Athenaeus to develop our understanding of how to organize a truly born-digital edition, one that not only includes machine actionable citations and variant readings but also collations of multiple print editions, metrical analyses, named entity identification, linguistic features such as morphology, syntax, word sense, and co-reference analysis, and alignment between the Greek original and one or more later translations.

After some details:

Overview:
The Deipnosophists (Δειπνοσοφισταί, or “Banquet of the Sophists”) by Athenaeus of Naucratis is a 3rd century AD fictitious account of several banquet conversations on food, literature, and arts held in Rome by twenty-two learned men. This complex and fascinating work is not only an erudite and literary encyclopedia of a myriad of curiosities about classical antiquity, but also an invaluable collection of quotations and text re-uses of ancient authors, ranging from Homer to tragic and comic poets and lost historians. Since the large majority of the works cited by Athenaeus is nowadays lost, this compilation is a sort of reference tool for every scholar of Greek theater, poetry, historiography, botany, zoology, and many other topics.

Athenaeus’ work is a mine of thousands of quotations, but we still lack a comprehensive survey of its sources. The aim of this “humanities hackathon” is to provide a case study for drawing a spectrum of quoting habits of classical authors and their attitude to text reuse. Athenaeus, in fact, shapes a library of forgotten authors, which goes beyond the limits of a physical building and becomes an intellectual space of human knowledge. By doing so, he is both a witness of the Hellenistic bibliographical methods and a forerunner of the modern concept of hypertext, where sequential reading is substituted by hierarchical and logical connections among words and fragments of texts. Quantity, variety, and precision of Athenaeus’ citations make the Deipnosophists an excellent training ground for the development of a digital system of reference linking for primary sources. Athenaeus’ standard citation includes (a) the name of the author with additional information like ethnic origin and literary category, (b) the title of the work, and (c) the book number (e.g., Deipn. 2.71b). He often remembers the amount of papyrus scrolls of huge works (e.g., 6.229d-e; 6.249a), while distinguishing various editions of the same comedy (e.g., 1.29a; 4.171c; 6.247c; 7.299b; 9.367f) and different titles of the same work (e.g., 1.4e).

He also adds biographical information to identify homonymous authors and classify them according to literary genres, intellectual disciplines and schools (e.g., 1.13b; 6.234f; 9.387b). He provides chronological and historical indications to date authors (e.g., 10.453c; 13.599c), and he often copies the first lines of a work following a method that probably goes back to the Pinakes of Callimachus (e.g., 1.4e; 3.85f; 8.342d; 5.209f; 13.573f-574a).

Last but not least, the study of Athenaeus’ “citation system” is also a great methodological contribution to the domain of “fragmentary literature”, since one of the main concerns of this field is the relation between the fragment (quotation) and its context of transmission. Having this goal in mind, the textual analysis of the Deipnosophists will make possible to enumerate a series of recurring patterns, which include a wide typology of textual reproductions and linguistic features helpful to identify and classify hidden quotations of lost authors.

The 21st century has “big data” in the form of sensor streams and Twitter feeds, but “complex data” in the humanities pre-dates “big data” by a considerable margin.

If you are interested in being challenged by complexity and not simply the size of your data, take a closer look at this project.

Greek is a little late to be of interest to me but there are older texts that could benefit from a similar treatment.

BTW, while you are thinking about this project/text, consider how you would merge prior scholarship, digital and otherwise, with what originates here and what follows it in the decades to come.

TEI Boilerplate

Tuesday, April 24th, 2012

TEI Boilerplate

If you don’t know it, the TEI (Text Encoding Initiative), is one of the oldest digital humanities projects dedicated to fashioning encoding solutions for non-digital texts. The Encoding Guidelines, as they are known, were designed to capture the complexities of pre-digital texts.

If you doubt the complexities of pre-digital texts, consider the following image of a cover page from the Leningrad Codex:

Leningrad Codex Image

Or, consider this page from the Mikraot Gedolot:

Mikraot Gedolot Image

There are more complex pages, such as the mss. of Charles Peirce (Peirce Logic Notebook, Charles Sanders Peirce Papers MS Am 1632 (339). Houghton Library, Harvard University, Cambridge, Mass.):

Peirce Logic Notebook, Charles Sanders Peirce Papers MS AM 1632 (339)

And those are just a few random examples. Encoding pre-digital texts is a complex and rewarding field of study.

Not that “born digital” texts need concede anything to “pre-digital” texts. When you think about our capacity to capture versions, multiple authors, sources, interpretations of readers, discussions and the like, the wealth of material that can be associated with any one text becomes quite complex.

Consider for example the Harry Potter book series that spawned websites, discussion lists, interviews with the author, films and other resources. Not quite like the interpretative history of the Bible but enough to make an interesting problem.

Anything that can encode that range of texts is of necessity quite complex itself and therein lies the rub. You work very hard at document analysis, using or extending the TEI Guidelines to encode your text, now what?

You can:

  1. Show the XML text to family and friends. Always a big hit at parties. 😉
  2. Use your tame XSLT wizard to create a custom conversion of the XML text so normal people will want to see and use it.
  3. Use the TEI Boilerplate project for a stock delivery of the XML text so normal people (like your encoders and funders) will want to see and use it.

From the webpage:

TEI Boilerplate is a lightweight solution for publishing styled TEI (Text Encoding Initiative) P5 content directly in modern browsers. With TEI Boilerplate, TEI XML files can be served directly to the web without server-side processing or translation to HTML. Our TEI Boilerplate Demo illustrates many TEI features rendered by TEI Boilerplate.

Browser Compatibility

TEI Boilerplate requires a robust, modern browser to do its work. It is compatible with current versions of Firefox, Chrome, Safari, and Internet Explorer (IE 9). If you have problems with TEI Boilerplate with a modern browser, please let us know by filing a bug report at https://sourceforge.net/p/teiboilerplate/tickets/.

Many thanks to John Walsh, Grant Simpson, and Saeed Moaddeli, all from Indiana University for this wonderful addition to the TEI toolbox!

PS: If you have disposable funds and aren’t planning on mining asteroids, please consider donating to the TEI (Text Encoding Initiative). Even asteroid miners need to know Earth history, a history written in texts.