Archive for the ‘Texts’ Category

eTRAP (electronic Text Reuse Acquisition Project) [Motif Identities]

Wednesday, November 8th, 2017

eTRAP (electronic Text Reuse Acquisition Project)

From the webpage:

As the name suggests, this interdisciplinary team studies the linguistic and literary phenomenon that is text reuse with a particular focus on historical languages. More specifically, we look at how ancient authors copied, alluded to, paraphrased and translated each other as they spread their knowledge in writing. This early career research group seeks to provide a basic understanding of the historical text reuse methodology (it being distinct from plagiarism), and so to study what defines text reuse, why some people reuse information, how text is reused and how this practice has changed over history. We’ll be investigating text reuse on big data or, in other words, datasets that, owing to their size, cannot be manually processed.

While primarily geared towards research, the team also organises events and seminars with the aim of learning more about the activities conducted by our scholarly communities, to broaden our network of collaborations and to simply come together to share our experiences and knowledge. Our Activities page lists our events and we provide project updates via the News section.

Should you have any comments, queries or suggestions, feel free to contact us!

A bit more specifically, consider Digital Breadcrumbs of the Brothers Grimm, which is described in part as:

Described as “a great monument to European literature” (David and David, 1964, p. 180), Jacob and Wilhelm Grimm’s masterpiece Kinder- und Hausmärchen has captured adult and child imagination for over 200 years. International cinema, literature and folklore have borrowed and adapted the brothers’ fairy tales in multifarious ways, inspiring themes and characters in numerous cultures and languages.

Despite being responsible for their mainstream circulation, the brothers were not the minds behind all fairy tales. Indeed, Jacob and Wilhelm themselves collected and adapted their stories from earlier written and oral traditions, some of them dating back to as far as the seventh century BC, and made numerous changes to their own collection (ibid., p. 183) producing seven distinct editions between 1812 and 1857.

The same tale often appears in different forms and versions across cultures and time, making it an interesting case-study for textual and cross-lingual comparisons. Is it possible to compare the Grimm brothers’ Snow White and the Seven Dwarves to Pushkin’s Tale of the Dead Princess and the Seven Knights? Can we compare the Grimm brothers’ version of Cinderella to Charles Perrault’s Cinderella? In order to do so it is crucial to find those elements that both tales have in common. Essentially, one must find those measurable primitives that, if present in a high number – and in a similar manner – in both texts, make the stories comparable. We identify these primitives as the motifs of a tale. Prince’s Dictionary of Narratology describes motifs as “…minimal thematic unit[s]”, which can be recorded and have been recorded in the Thompson Motif-Index. Hans-Jörg Uther, who expanded the Aarne-Thompson classification system (AT number system) in 2004, defined a motif as:

“…a broad definition that enables it to be used as a basis for literary and ethnological research. It is a narrative unit, and as such is subject to a dynamic that determines with which other motifs it can be combined. Thus motifs constitute the basic building blocks of narratives.” (Uther, 2004)

From a topic maps perspective, what do you “see” in a tale that supports your identification of one or more motifs?

Or for that matter, how do you search across multiple identifications of motifs to discover commonalities between identifications by different readers?

It’s all well and good to tally which motifs were identified by particular readers, but clues as to why they differ require more detail (read: subjects).
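To make that concrete, a toy sketch of my own (the motif IDs and readers are invented for illustration, not drawn from eTRAP’s data): treat each reader’s identifications as a set of Thompson motif numbers and measure where they agree and diverge.

```python
# Toy comparison of motif identifications by different readers.
# The motif IDs and readers are invented for illustration.

def jaccard(a, b):
    """Set overlap: 1.0 = identical identifications, 0.0 = disjoint."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

reader_1 = {"D1364.4.1", "S31", "Z65.1"}
reader_2 = {"D1364.4.1", "S31", "F451"}

shared = reader_1 & reader_2     # motifs both readers saw
disputed = reader_1 ^ reader_2   # where the identifications diverge
score = jaccard(reader_1, reader_2)
```

The tally is the easy part; the interesting subjects live in the `disputed` set, where you want to know why the readers differ.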

Unlike the International Consortium of Investigative Journalists (ICIJ), publisher of the Panama Papers and the Paradise Papers, the eTRAP data is available on Github.

There are only three stories in the data repository as of today: Snow White, Puss in Boots, and The Fisherman and His Wife.

It’s more than just overlap: Text As Graph

Wednesday, August 2nd, 2017

It’s more than just overlap: Text As Graph – Refining our notion of what text really is—this time for sure! by Ronald Haentjens Dekker and David J. Birnbaum.


The XML tree paradigm has several well-known limitations for document modeling and processing. Some of these have received a lot of attention (especially overlap), and some have received less (e.g., discontinuity, simultaneity, transposition, white space as crypto-overlap). Many of these have work-arounds, also well known, but—as is implicit in the term “work-around”—these work-arounds have disadvantages. Because they get the job done, however, and because XML has a large user community with diverse levels of technological expertise, it is difficult to overcome inertia and move to a technology that might offer a more comprehensive fit with the full range of document structures with which researchers need to interact both intellectually and programmatically. A high-level analysis of why XML has the limitations it has can enable us to explore how an alternative model of Text as Graph (TAG) might address these types of structures and tasks in a more natural and idiomatic way than is available within an XML paradigm.

Hyperedges, texts and XML, what more could you need? 😉

This paper merits a deep read and testing by everyone interested in serious text modeling.
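To see why a graph model dissolves the overlap problem, here is a toy model of my own (not the authors’ actual TAG data structures): the text is a token sequence, and each markup layer is a hyperedge, i.e. an arbitrary set of token positions, so overlapping and discontinuous ranges need no workaround.

```python
# Toy text-as-graph: tokens are nodes; each annotation is a hyperedge,
# an arbitrary set of token positions. Overlap and discontinuity,
# awkward in an XML tree, are unremarkable here.

tokens = ["Just", "the", "place", "for", "a", "Snark", "!"]

edges = {
    "verse-line":  {0, 1, 2, 3, 4, 5, 6},
    "noun-phrase": {4, 5},
    "speech":      {0, 1, 2},   # overlaps other layers freely
    "emphasis":    {0, 5},      # discontinuous, no milestones required
}

def layers_at(i):
    """Every annotation layer covering token i."""
    return sorted(name for name, pos in edges.items() if i in pos)
```

Because no layer is privileged as “the” tree, adding a transposition or a simultaneous reading is just another hyperedge.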

You can’t read the text, but here is a hypergraph visualization of an excerpt from Lewis Carroll’s “The Hunting of the Snark”:

The New Testament, the Hebrew Bible, to say nothing of the Rabbinic commentaries on the Hebrew Bible and centuries of commentary on other texts could profit from this approach.

Put your text to the test and share how to advance this technique!

Open Islamicate Texts Initiative (OpenITI)

Tuesday, July 11th, 2017

Open Islamicate Texts Initiative (OpenITI)

From the description (Annotation) of the project:

Books are grouped by author. All authors are grouped into 25 AH periods, based on the year of their death. These repositories are the main working loci—if any modifications are to be added or made to texts or metadata, all has to be done in files in these folders.

There are three types of text repositories:

  • RAWrabicaXXXXXX repositories include raw texts as they were collected from various open-access online repositories and libraries. These texts are in their initial (raw) format and require reformatting and further integration into OpenITI. The overall current number of text files is over 40,000; slightly over 7,000 have been integrated into OpenITI.
  • XXXXAH are the main working folders that include integrated texts (all coming from collections included into RAWrabicaXXXXXX repositories).
  • i.xxxxx repositories are instantiations of the OpenITI corpus adapted for specific forms of analysis. At the moment, these include the following instantiations (in progress):
    • i.cex with all texts split mechanically into 300 word units, converted into cex format.
    • i.mech with all texts split mechanically into 300 word units.
    • i.logic with all texts split into logical units (chapters, sections, etc.); only tagged texts are included here (~130 texts at the moment).
    • i.passim_new_mech with all texts split mechanically into 300 word units, converted for use with the new passim (JSON).
    • [not created yet] i.passim_new_mech_cluster with all texts split mechanically into 900 word units (3 milestones) with 300 word overlap; converted for use with the new passim (JSON).
    • i.passim_old_mech with all texts split mechanically into 300 word units, converted for use with the old passim (XML, gzipped).
    • i.stylo includes all texts from OpenITI (duplicates excluded) that are renamed and slightly reformatted (Arabic orthography is simplified) for use with the stylo R package.
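The mechanical splits are simple to reproduce. A sketch of my own (not OpenITI’s actual tooling) covering the plain 300-word units and the planned 900-word / 300-word-overlap variant:

```python
def chunk(words, size=300, overlap=0):
    """Split a token list into fixed-size units; overlap > 0 slides the window."""
    step = size - overlap
    return [words[i:i + size] for i in range(0, len(words), step)
            if words[i:i + size]]

words = ["w%d" % i for i in range(1000)]
mech = chunk(words, size=300)                  # i.mech-style units
cluster = chunk(words, size=900, overlap=300)  # sketched i.passim_new_mech_cluster
```

The overlap matters for reuse detection: a borrowed passage that straddles a unit boundary in the plain split still falls wholly inside some window in the overlapped split.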

A project/site to join to hone your Arabic NLP and reading skills.


Music Encoding Initiative

Wednesday, May 24th, 2017

Music Encoding Initiative

From the homepage:

The Music Encoding Initiative (MEI) is an open-source effort to define a system for encoding musical documents in a machine-readable structure. MEI brings together specialists from various music research communities, including technologists, librarians, historians, and theorists in a common effort to define best practices for representing a broad range of musical documents and structures. The results of these discussions are formalized in the MEI schema, a core set of rules for recording physical and intellectual characteristics of music notation documents expressed as an eXtensible Markup Language (XML) schema. It is complemented by the MEI Guidelines, which provide detailed explanations of the components of the MEI model and best practices suggestions.

MEI is hosted by the Akademie der Wissenschaften und der Literatur, Mainz. The Mainz Academy coordinates basic research in musicology through long-term editorial projects, including complete-works editions of composers from Brahms to Weber. Each of these (currently 15) projects has a duration of at least 15 years, and some (like the Haydn, Händel and Gluck editions) have been running since the 1950s. The Academy is therefore one of the most prominent institutions in the field of scholarly music editing. Several Academy projects are using MEI already (cf. projects), and the Academy’s interest in MEI is a clear recommendation to use standards like MEI and TEI in such projects.

This website provides a Gentle Introduction to MEI, introductory training material, and information on projects and tools that utilize MEI. The latest MEI news, including information about additional opportunities for learning about MEI, is displayed on this page.

If you want to become an active MEI member, you’re invited to read more about the community and then join us on the MEI-L mailing list.

Any project that cites and relies upon Standard Music Description Language (SMDL) merits a mention on my blog!

If you are interested in encoding music or just complex encoding challenges in general, MEI merits your attention.

Nuremberg Trial Verdicts [70th Anniversary]

Saturday, October 1st, 2016

Nuremberg Trial Verdicts by Jenny Gesley.

From the post:

Seventy years ago – on October 1, 1946 – the Nuremberg trial, one of the most prominent trials of the last century, concluded when the International Military Tribunal (IMT) issued the verdicts for the main war criminals of the Second World War. The IMT sentenced twelve of the defendants to death, seven to terms of imprisonment ranging from ten years to life, and acquitted three.

The IMT was established on August 8, 1945 by the United Kingdom (UK), the United States of America, the French Republic, and the Union of Soviet Socialist Republics (U.S.S.R.) for the trial of war criminals whose offenses had no particular geographical location. The defendants were indicted for (1) crimes against peace, (2) war crimes, (3) crimes against humanity, and of (4) a common plan or conspiracy to commit those aforementioned crimes. The trial began on November 20, 1945 and a total of 403 open sessions were held. The prosecution called thirty-three witnesses, whereas the defense questioned sixty-one witnesses, in addition to 143 witnesses who gave evidence for the defense by means of written answers to interrogatories. The hearing of evidence and the closing statements were concluded on August 31, 1946.

The individuals named as defendants in the trial were Hermann Wilhelm Göring, Rudolf Hess, Joachim von Ribbentrop, Robert Ley, Wilhelm Keitel, Ernst Kaltenbrunner, Alfred Rosenberg, Hans Frank, Wilhelm Frick, Julius Streicher, Walter Funk, Hjalmar Schacht, Karl Dönitz, Erich Raeder, Baldur von Schirach, Fritz Sauckel, Alfred Jodl, Martin Bormann, Franz von Papen, Arthur Seyss-Inquart, Albert Speer, Constantin von Neurath, Hans Fritzsche, and Gustav Krupp von Bohlen und Halbach. All individual defendants appeared before the IMT, except for Robert Ley, who committed suicide in prison on October 25, 1945; Gustav Krupp von Bohlen und Halbach, who was seriously ill; and Martin Bormann, who was not in custody and whom the IMT decided to try in absentia. Pleas of “not guilty” were entered by all the defendants.

The trial record is spread over forty-two volumes, “The Blue Series,” Trial of the Major War Criminals before the International Military Tribunal Nuremberg, 14 November 1945 – 1 October 1946.

All forty-two volumes are available in PDF format and should prove a more difficult indexing, mining, modeling, and searching challenge than Twitter feeds.

Imagine instead of “text” similarity, these volumes were mined for “deed” similarity. Similarity to deeds being performed now. By present day agents.

Instead of seldom visited dusty volumes in the library stacks, “The Blue Series” could develop a sharp bite.

bAbI – Facebook Datasets For Automatic Text Understanding And Reasoning

Sunday, February 21st, 2016

The bAbI project

Four papers and datasets on text understanding and reasoning from Facebook.

Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin and Tomas Mikolov. Towards AI Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv:1502.05698.

Felix Hill, Antoine Bordes, Sumit Chopra and Jason Weston. The Goldilocks Principle: Reading Children’s Books with Explicit Memory Representations. arXiv:1511.02301.

Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, Jason Weston. Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems. arXiv:1511.06931.

Antoine Bordes, Nicolas Usunier, Sumit Chopra and Jason Weston. Simple Question answering with Memory Networks. arXiv:1506.02075.



Wednesday, February 17th, 2016

Johan Oosterman tweets:

Look, the next page begins with these words! Very helpful man makes clear how catchwords work. HuntingtonHM1048



Catchwords were originally used to keep pages in order for binding. You won’t encounter them in post-19th-century materials, but they remain interesting from a markup perspective.

The catchword, in this case accompanied by a graphic, appears on the page, and the next page does begin with those words. Do you capture the catchword? Its graphic? The relationship between the catchword and the opening words of the next page? What if there is an error?
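Whatever encoding you choose, one payoff of capturing catchwords explicitly is that the error case becomes mechanically checkable. A sketch with invented page data:

```python
# Sketch: verify each page's catchword against the opening words of
# the next page. The page data is invented for illustration.

pages = [
    {"text": "... and at the foot of the page stands the word", "catchword": "Very"},
    {"text": "Very helpful man makes clear how catchwords work", "catchword": None},
]

def catchword_errors(pages):
    """Indices of pages whose catchword the next page fails to repeat."""
    return [i for i in range(len(pages) - 1)
            if pages[i]["catchword"]
            and not pages[i + 1]["text"].startswith(pages[i]["catchword"])]
```

A non-empty result is itself interesting: it may flag a transcription slip, or a genuine scribal or binding error worth recording.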

Detecting Text Reuse in Nineteenth-Century Legal Documents:…

Thursday, March 12th, 2015

Detecting Text Reuse in Nineteenth-Century Legal Documents: Methods and Preliminary Results by Lincoln Mullen.

From the post:

How can you track changes in the law of nearly every state in the United States over the course of half a century? How can you figure out which states borrowed laws from one another, and how can you visualize the connections among the legal system as a whole?

Kellen Funk, a historian of American law, is writing a dissertation on how codes of civil procedure spread across the United States in the second half of the nineteenth century. He and I have been collaborating on the digital part of this project, which involves identifying and visualizing the borrowings between these codes. The problem of text reuse is a common one in digital history/humanities projects. In this post I want to describe our methods and lay out some of our preliminary results. To get a fuller picture of this project, you should read the four posts that Kellen has written about his project:

Quite a remarkable project with many aspects that will be relevant to other projects.

Lincoln doesn’t use the term, but this would be called textual criticism if it were applied to the New Testament. Of course, here Lincoln and Kellen have the original source documents and the dates of their adoption. New Testament scholars have copies of copies in no particular order and no undisputed evidence of the original text.
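For readers who want the flavor of the method: text-reuse detection of this kind is typically done by comparing overlapping word n-grams (“shingles”). The sketch below is mine; the n-gram size, the similarity measure, and the sample sentences are assumptions for illustration, not necessarily the parameters or texts Mullen and Funk used.

```python
# N-gram shingling for text reuse: two passages that share many
# 5-word sequences probably share an ancestor.

def shingles(text, n=5):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(a, b, n=5):
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Invented code-of-procedure language, for illustration only:
ny = "the plaintiff may demand judgment for the relief to which he supposes himself entitled"
ca = "the plaintiff may demand judgment for the relief to which he claims to be entitled"
```

On real codes you would shingle every section of every state’s code and flag section pairs whose similarity crosses a threshold as candidate borrowings.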

Did I mention that all the source code for this project is on Github?

Confessions of an Information Materialist

Wednesday, January 14th, 2015

Confessions of an Information Materialist by Aaron Kirschenfeld.

There aren’t many people in the world who can tempt me into reading UCC (Uniform Commercial Code) comments (again), but it appears that Aaron is one of them, at least this time.

Aaron was extolling the usefulness of categories for organization, and the organization of information in particular, and invokes “Official Comment 4a to UCC 9-102 by the ALI & NCCUSL.” (ALI = American Law Institute, NCCUSL = National Conference of Commissioners on Uniform State Laws. Seriously, that’s really their name.)

I will quote part of it so you can get the flavor of what Aaron is praising:

The classes of goods are mutually exclusive. For example, the same property cannot simultaneously be both equipment and inventory. In borderline cases — a physician’s car or a farmer’s truck that might be either consumer goods or equipment — the principal use to which the property is put is determinative. Goods can fall into different classes at different times. For example, a radio may be inventory in the hands of a dealer and consumer goods in the hands of a consumer. As under former Article 9, goods are “equipment” if they do not fall into another category.

The definition of “consumer goods” follows former Section 9-109. The classification turns on whether the debtor uses or bought the goods for use “primarily for personal, family, or household purposes.”

Goods are inventory if they are leased by a lessor or held by a person for sale or lease. The revised definition of “inventory” makes clear that the term includes goods leased by the debtor to others as well as goods held for lease. (The same result should have obtained under the former definition.) Goods to be furnished or furnished under a service contract, raw materials, and work in process also are inventory. Implicit in the definition is the criterion that the sales or leases are or will be in the ordinary course of business. For example, machinery used in manufacturing is equipment, not inventory, even though it is the policy of the debtor to sell machinery when it becomes obsolete or worn. Inventory also includes goods that are consumed in a business (e.g., fuel used in operations). In general, goods used in a business are equipment if they are fixed assets or have, as identifiable units, a relatively long period of use, but are inventory, even though not held for sale or lease, if they are used up or consumed in a short period of time in producing a product or providing a service.

Aaron’s reaction to this comment:

The UCC comment hits me two ways. First, it shows how inexorably linked law and the organization of information really are. The profession seeks to explain or justify what is what, what belongs to who, how much of it, and so on. The comment also shows how the logical process of categorizing involves deductive, inductive, and analogical reasoning. With the UCC specifically, practice came before formal classification, and seeks, much like a foreign-language textbook, to explain a living thing by reducing it to categories of words and phrases — nouns, verbs and their tenses, and adjectives (really, the meat of descriptive vocabulary), among others. What are goods and the subordinate types of goods? Comment 4a to 9-102 will tell you!

All of what Aaron says about Comment 4a to UCC 9-102 is true, if you grant the UCC the right to put the terms of discussion beyond the pale of being questioned.

Take for example:

The classes of goods are mutually exclusive. For example, the same property cannot simultaneously be both equipment and inventory.

Ontology friends would find nothing remarkable about classes of goods being mutually exclusive. Or with the example of not being both equipment and inventory at the same time.

The catch is that the UCC isn’t defining these terms in a vacuum. These definitions apply to UCC Article 9, which governs rights in secured transactions. Put simply, cases where a creditor has the legal right to take your car, boat, house, equipment, etc.

By defining these terms, the UCC (actually the state legislature that adopts the UCC), has put these terms, their definitions and their relationships to other statutes, beyond the pale of discussion. They are the fundamental underpinning of any discussion, including discussions of how to modify them.

It is very difficult to lose an argument if you have already defined the terms upon which the argument can be conducted.

Most notions of property and the language used to describe it are deeply embedded in both constitutions and the law, such as the UCC. The question of whether “property” should mean the same thing to an ordinary citizen and a quasi-immortal corporation doesn’t come up. And under the terms of the UCC, it is unlikely to ever come up.

We need legal language for a vast number of reasons but we need to realize that the users of legal language have an agenda of their own and that their language can conceal questions that some of us would rather discuss.

Early English Books Online – Good News and Bad News

Friday, January 2nd, 2015

Early English Books Online

The very good news is that 25,000 volumes from the Early English Books Online collection have been made available to the public!

From the webpage:

The EEBO corpus consists of the works represented in the English Short Title Catalogue I and II (based on the Pollard & Redgrave and Wing short title catalogs), as well as the Thomason Tracts and the Early English Books Tract Supplement. Together these trace the history of English thought from the first book printed in English in 1475 through to 1700. The content covers literature, philosophy, politics, religion, geography, science and all other areas of human endeavor. The assembled collection of more than 125,000 volumes is a mainstay for understanding the development of Western culture in general and the Anglo-American world in particular. The STC collections have perhaps been most widely used by scholars of English, linguistics, and history, but these resources also include core texts in religious studies, art, women’s studies, history of science, law, and music.

Even better news from Sebastian Rahtz (Chief Data Architect, IT Services, University of Oxford):

The University of Oxford is now making this collection, together with Gale Cengage’s Eighteenth Century Collections Online (ECCO), and Readex’s Evans Early American Imprints, available in various formats (TEI P5 XML, HTML and ePub) initially via the University of Oxford Text Archive at, and offering the source XML for community collaborative editing via Github. For the convenience of UK universities who subscribe to JISC Historic Books, a link to page images is also provided. We hope that the XML will serve as the base for enhancements and corrections.

This catalogue also lists EEBO Phase 2 texts, but the HTML and ePub versions of these can only be accessed by members of the University of Oxford.

[Technical note]
Those interested in working on the TEI P5 XML versions of the texts can check them out of Github, via, where each of the texts is in its own repository (eg There is a CSV file listing all the texts at, and a simple Linux/OSX shell script to clone all 32853 unrestricted repositories at

Now for the BAD NEWS:

An additional 45,000 books:

Currently, EEBO-TCP Phase II texts are available to authorized users at partner libraries. Once the project is done, the corpus will be available for sale exclusively through ProQuest for five years. Then, the texts will be released freely to the public.

Can you guess why the public is barred from what are obviously public domain texts?

Because our funding is limited, we aim to key as many different works as possible, in the language in which our staff has the most expertise.

Academic projects are supposed to fund themselves and be self-sustaining. When anyone asks about the sustainability of an academic project, ask them when your country’s military was last “self-sustaining.” The U.S. has spent $2.6 trillion on a “war on terrorism” and has nothing to show for it other than dead and injured military personnel, perversion of budgetary policies, and loss of privacy on a worldwide scale.

Imagine what sort of lifetime access for everyone on Earth could be secured for less than $1 trillion. No more special pricing and contracts depending on whether you are in countries A to Zed. Eliminate all that paperwork for publishers; for access, all you need is a connection to the Internet. The publishers would have a guaranteed income stream with less overhead for sales personnel, administrative staff, etc. And people would have access (whether used or not) to educate themselves, to make new discoveries, etc.

My proposal does not involve payments to large military contractors or subversion of legitimate governments or imposition of American values on other cultures. Leaving those drawbacks to one side, what do you think about it otherwise?

Science fiction fanzines to be digitized as part of major UI initiative

Wednesday, November 19th, 2014

Science fiction fanzines to be digitized as part of major UI initiative by Kristi Bontrager.

From the post:

The University of Iowa Libraries has announced a major digitization initiative, in partnership with the UI Office of the Vice President for Research and Economic Development. 10,000 science fiction fanzines will be digitized from the James L. “Rusty” Hevelin Collection, representing the entire history of science fiction as a popular genre and providing the content for a database that documents the development of science fiction fandom.

Hevelin was a fan and a collector for most of his life. He bought pulp magazines from newsstands as a boy in the 1930s, and by the early 1940s began attending some of the first organized science fiction conventions. He remained an active collector, fanzine creator, book dealer, and fan until his death in 2011. Hevelin’s collection came to the UI Libraries in 2012, contributing significantly to the UI Libraries’ reputation as a major international center for science fiction and fandom studies.

Interesting content for many of us but an even more interesting work flow model for the content:

Once digitized, the fanzines will be incorporated into the UI Libraries’ DIY History interface, where a select number of interested fans (up to 30) will be provided with secure access to transcribe, annotate, and index the contents of the fanzines. This group will be modeled on an Amateur Press Association (APA) structure, a fanzine distribution system developed in the early days of the medium that required contributions of content from members in order to qualify for, and maintain, membership in the organization. The transcription will enable the UI Libraries to construct a full-text searchable fanzine resource, with links to authors, editors, and topics, while protecting privacy and copyright by limiting access to the full set of page images.

The similarity between the Amateur Press Association (APA) structure and modern open-source projects is interesting. I checked the APA’s homepage; they have a more traditional membership fee now.

The Hevelin Collection homepage.

The American Yawp [Free History Textbook]

Tuesday, September 9th, 2014

The American Yawp [Free History Textbook], Editors: Joseph Locke, University of Houston-Victoria and Ben Wright, Abraham Baldwin Agricultural College.

From the about page:

In an increasingly digital world in which pedagogical trends are de-emphasizing rote learning and professors are increasingly turning toward active-learning exercises, scholars are fleeing traditional textbooks. Yet for those that still yearn for the safe tether of a synthetic text, as either narrative backbone or occasional reference material, The American Yawp offers a free and online, collaboratively built, open American history textbook designed for college-level history courses. Unchecked by profit motives or business models, and free from for-profit educational organizations, The American Yawp is by scholars, for scholars. All contributors—experienced college-level instructors—volunteer their expertise to help democratize the American past for twenty-first century classrooms.

The American Yawp constructs a coherent and accessible narrative from all the best of recent historical scholarship. Without losing sight of politics and power, it incorporates transnational perspectives, integrates diverse voices, recovers narratives of resistance, and explores the complex process of cultural creation. It looks for America in crowded slave cabins, bustling markets, congested tenements, and marbled halls. It navigates between maternity wards, prisons, streets, bars, and boardrooms. Whitman’s America, like ours, cut across the narrow boundaries that strangle many narratives. Balancing academic rigor with popular readability, The American Yawp offers a multi-layered, democratic alternative to the American past.

In “beta” now, but worth your time to read, comment on, and possibly contribute to. I skimmed ahead to a couple of events that I remember quite clearly, and I can’t say the text (yet) captures the tone of the time.

For example, the Chicago Police Riot in 1968 gets a bare two paragraphs in Chapter 27, The Sixties. In the same chapter, 1967, the long hot summer when the cities burned, was over in a sentence.

I am sure the author(s) of that chapter were trying to keep the text to some reasonable length and avoid the death by details I encountered in my college American history textbook so many years ago.

Still, given the wealth of materials online, written, audio, video, expanding the text and creating exploding sub-themes (topic maps anyone?) on particular subjects would vastly enhance this project.

PS: If you want a small flavor of what could be incorporated via hyperlinks, see: and the documents, such as FBI documents, at that site.

Working Drafts available in EPUB3

Saturday, March 22nd, 2014

Working Drafts available in EPUB3 by Ivan Herman.

From the post:

As reported elsewhere, the Digital Publishing Interest Group has published its first two public Working Drafts. Beyond the content of those documents, the publication has another aspect worth mentioning. For the first time, “alternate” versions of the two documents have been published, alongside the canonical HTML versions, in EPUB3 format. Because EPUB3 is based on the Open Web Platform, it is a much more faithful alternative to the original content than, for example, a PDF version (which has also been used, time to time, as alternate versions of W3C documents). The EPUB3 versions (of the “Requirements for Latin Text Layout and Pagination“ and the “Annotation Use Cases” Drafts, both linked from the respective documents’ front matter) can be used, for example, for off-line reading, relying on different EPUB readers, available either as standalone applications or as browser extensions.

(The EPUB3 versions were produced using a Python program, also available on github.)
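If you want to roll your own, the one EPUB-specific packaging rule is easy to get wrong: an .epub is an ordinary ZIP, except that its first entry must be an uncompressed file named mimetype containing application/epub+zip. A minimal sketch of that rule (the container.xml and OPF package document a real EPUB also needs are omitted):

```python
import zipfile

def start_epub(path):
    """Open a new EPUB archive with the mandatory first entry in place."""
    z = zipfile.ZipFile(path, "w")
    # ZipInfo defaults to ZIP_STORED (uncompressed), which is required here
    z.writestr(zipfile.ZipInfo("mimetype"), "application/epub+zip")
    return z

with start_epub("demo.epub") as z:
    # Content documents may be compressed normally
    z.writestr("OEBPS/content.xhtml",
               "<html><body><p>Hello.</p></body></html>",
               compress_type=zipfile.ZIP_DEFLATED)
```

Reading systems sniff those first bytes to identify the file, which is why the mimetype entry must come first and stay uncompressed.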

Interesting work but also a reminder that digital formats will continue to evolve as long as they are used.

How well will your metadata transfer to a new system or application?

Or are you suffering from vendor lock?

Full-Text-Indexing (FTS) in Neo4j 2.0

Wednesday, March 19th, 2014

Full-Text-Indexing (FTS) in Neo4j 2.0 by Michael Hunger.

From the post:

With Neo4j 2.0 we got automatic schema indexes based on labels and properties for exact lookups of nodes on property values.

Fulltext and other indexes (spatial, range) are on the roadmap but not addressed yet.

For fulltext indexes you still have to use legacy indexes.

As you probably don’t want to add nodes to an index manually, the existing “auto-index” mechanism should be a good fit.

To use that automatic index you have to configure the auto-index upfront to be a fulltext index and then secondly enable it in your settings.
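Roughly, the setup looks like this (a sketch only; the property keys indexed here are illustrative, and details varied across 2.x releases, so check Michael's post for the authoritative steps). In `conf/neo4j.properties`, enable node auto-indexing for the properties you want covered:

```
# conf/neo4j.properties -- enable the legacy node auto-index
node_auto_indexing=true
node_keys_indexable=title,body
```

Then, before any nodes are created, switch the auto-index to fulltext, e.g. in the Neo4j shell with `index --set-config node_auto_index type fulltext`, and query it with the legacy `START` syntax: `START n=node:node_auto_index("title:graph*") RETURN n`.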

Great coverage of full-text indexing in Neo4j 2.0.

Looking forward to spatial indexing. In the most common use case, think of it as locating assets on the ground relative to other actors. In real time.

Words as Tags?

Saturday, March 15th, 2014

Wordcounts are amazing. by Ted Underwood.

From the post:

People new to text mining are often disillusioned when they figure out how it’s actually done — which is still, in large part, by counting words. They’re willing to believe that computers have developed some clever strategy for finding patterns in language — but think “surely it’s something better than that?“

Uneasiness with mere word-counting remains strong even in researchers familiar with statistical methods, and makes us search restlessly for something better than “words” on which to apply them. Maybe if we stemmed words to make them more like concepts? Or parsed sentences? In my case, this impulse made me spend a lot of time mining two- and three-word phrases. Nothing wrong with any of that. These are all good ideas, but they may not be quite as essential as we imagine.

Working with text is like working with a video where every element of every frame has already been tagged, not only with nouns but with attributes and actions. If we actually had those tags on an actual video collection, I think we’d recognize it as an enormously valuable archive. The opportunities for statistical analysis are obvious! We have trouble recognizing the same opportunities when they present themselves in text, because we take the strengths of text for granted and only notice what gets lost in the analysis. So we ignore all those free tags on every page and ask ourselves, “How will we know which tags are connected? And how will we know which clauses are subjunctive?”

What a delightful insight!

When we say text is “unstructured,” what we really mean is that something as dumb as a computer sees no structure in the text.

A human reader, even a 5- or 6-year-old, sees lots of structure in a text, and meaning too.

Rather than trying to “teach” computers to read, perhaps we should use computers to facilitate reading by those who already can.
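Underwood's point is easy to make concrete; a minimal sketch in Python (the sample text is paraphrased from his post, not his actual data or code):

```python
from collections import Counter

text = ("people new to text mining are often disillusioned "
        "when they figure out how it is actually done "
        "which is still in large part by counting words")
words = text.split()

# Unigram counts: the "mere word-counting" Underwood defends.
unigrams = Counter(words)

# Two-word phrases: the "surely something better" impulse.
bigrams = Counter(zip(words, words[1:]))

print(unigrams.most_common(2))
print(bigrams.most_common(2))
```

Everything from topic modeling to authorship attribution is built on tables of counts not much fancier than these.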


I first saw this in a tweet by Matthew Brook O’Donnell.


Atom: A hackable text editor for the 21st Century

Wednesday, February 26th, 2014

Atom: A hackable text editor for the 21st Century.

From the webpage:

At GitHub, we’re building the text editor we’ve always wanted. A tool you can customize to do anything, but also use productively on the first day without ever touching a config file. Atom is modern, approachable, and hackable to the core. We can’t wait to see what you build with it.

I can’t imagine anyone improving on Emacs but we might learn something from watching people try. 😉

Some other links of interest:

  • Twitter: @AtomEditor

Rendered Prose Diffs (GitHub)

Saturday, February 15th, 2014

Rendered Prose Diffs

From the post:

Today we are making it easier to review and collaborate on prose documents. Commits and pull requests including prose files now feature source and rendered views.

Given the success of GitHub with managing code collaboration, this expansion into prose collaboration is a welcome one.

I like the “rendered” feature. Imagine a topic map that shows the impact of a proposed topic on the underlying map, prior to submission.

That could have some interesting possibilities for interactive proofing while authoring.

The Shelley-Godwin Archive

Tuesday, November 5th, 2013

The Shelley-Godwin Archive

From the homepage:

The Shelley-Godwin Archive will provide the digitized manuscripts of Percy Bysshe Shelley, Mary Wollstonecraft Shelley, William Godwin, and Mary Wollstonecraft, bringing together online for the first time ever the widely dispersed handwritten legacy of this uniquely gifted family of writers. The result of a partnership between the New York Public Library and the Maryland Institute for Technology in the Humanities, in cooperation with Oxford’s Bodleian Library, the S-GA also includes key contributions from the Huntington Library, the British Library, and the Houghton Library. In total, these partner libraries contain over 90% of all known relevant manuscripts.

In case you don’t recognize the names: Mary Shelley wrote Frankenstein; or, The Modern Prometheus; William Godwin was a philosopher and early modern (unfortunately theoretical) anarchist; Percy Bysshe Shelley, an English Romantic poet; Mary Wollstonecraft, a writer and feminist. Quite a group for their time, or even now.

From the About page on Technological Infrastructure:

The technical infrastructure of the Shelley-Godwin Archive builds on linked data principles and emerging standards such as the Shared Canvas data model and the Text Encoding Initiative’s Genetic Editions vocabulary. It is designed to support a participatory platform where scholars, students, and the general public will be able to engage in the curation and annotation of the Archive’s contents.

The Archive’s transcriptions and software applications and libraries are currently published on GitHub, a popular commercial host for projects that use the Git version control system.

  • TEI transcriptions and other data
  • Shared Canvas viewer and search service
  • Shared Canvas manifest generation

All content and code in these repositories is available under open licenses (the Apache License, Version 2.0 and the Creative Commons Attribution license). Please see the licensing information in each individual repository for additional details.

Shared Canvas and Linked Open Data

Shared Canvas is a new data model designed to facilitate the description and presentation of physical artifacts—usually textual—in the emerging linked open data ecosystem. The model is based on the concept of annotation, which it uses both to associate media files with an abstract canvas representing an artifact, and to enable anyone on the web to describe, discuss, and reuse suitably licensed archival materials and digital facsimile editions. By allowing visitors to create connections to secondary scholarship, social media, or even scenes in movies, projects built on Shared Canvas attempt to break down the walls that have traditionally enclosed digital archives and editions.

Linked open data or content is published and licensed so that “anyone is free to use, reuse, and redistribute it—subject only, at most, to the requirement to attribute and/or share-alike,” with the additional requirement that when an entity such as a person, a place, or thing that has a recognizable identity is referenced in the data, the reference is made using a well-known identifier—called a universal resource identifier, or “URI”—that can be shared between projects. Together, the linking and openness allow conformant sets of data to be combined into new data sets that work together, allowing anyone to publish their own data as an augmentation of an existing published data set without requiring extensive reformulation of the information before it can be used by anyone else.

The Shared Canvas data model was developed within the context of the study of medieval manuscripts to provide a way for all of the representations of a manuscript to co-exist in an openly addressable and shareable form. A relatively well-known example of this is the tenth-century Archimedes Palimpsest. Each of the pages in the palimpsest was imaged using a number of different wavelengths of light to bring out different characteristics of the parchment and ink. For example, some inks are visible under one set of wavelengths while other inks are visible under a different set. Because the original writing and the newer writing in the palimpsest used different inks, the images made using different wavelengths allow the scholar to see each ink without having to consciously ignore the other ink. In some cases, the ink has faded so much that it is no longer visible to the naked eye. The Shared Canvas data model brings together all of these different images of a single page by considering each image to be an annotation about the page instead of a surrogate for the page. The Shared Canvas website has a viewer that demonstrates how the imaging wavelengths can be selected for a page.
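The annotation idea is easy to sketch (a loose illustration in the Open Annotation/JSON-LD style Shared Canvas builds on; the URIs and labels here are invented, not taken from the actual S-GA manifests):

```json
{
  "@type": "oa:Annotation",
  "motivation": "sc:painting",
  "body": {
    "@id": "http://example.org/images/page1-ultraviolet.jpg",
    "@type": "dctypes:Image",
    "label": "Page 1, ultraviolet imaging"
  },
  "target": "http://example.org/canvas/page1"
}
```

Each imaging pass becomes a separate annotation targeting the same abstract canvas, rather than a competing surrogate for the page itself.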

One important bit, at least for topic maps, is the view of the Shared Canvas data model that:

each image [is considered] to be an annotation about the page instead of a surrogate for the page.

If I tried to say that or even re-say it, it would be much more obscure. 😉

Whether “annotation about” versus “surrogate for” will catch on beyond manuscript studies is hard to say.

Not the way it is usually said in topic maps but if other terminology is better understood, why not?

A Guide to Documentary Editing

Tuesday, July 16th, 2013

A Guide to Documentary Editing by Mary-Jo Kline and Susan Holbrook Perdue.

From the introduction:

Don’t be embarrassed if you aren’t quite sure what we mean by “documentary editing.” When the first edition of this Guide appeared in 1987, the author found that her local bookstore on the Upper West Side of Manhattan had shelved a copy in the “Movies and Film” section. When she pointed out the error and explained what the book was about, the store manager asked perplexedly, “Where the heck should we shelve it?”

Thus we offer no apologies for providing a brief introduction that explains what documentary editing is and how it came to be.

If this scholarly specialty had appeared overnight in the last decade, we could spare our readers the “history” as well as the definition of documentary editing. Unfortunately, this lively and productive area of scholarly endeavor evolved over more than a half century, and it would be difficult for a newcomer to understand many of the books and articles to which we’ll refer without some understanding of the intellectual debates and technological innovations that generated these discussions. We hope that our readers will find a brief account of these developments entertaining as well as instructive.

We also owe our readers a warning about a peculiar trait of documentary editors that creates a special challenge for students of the craft: practitioners have typically neglected to furnish the public with careful expositions of the principles and practices by which they pursue their goals. Indeed, it was editors’ failure to write about editing that made the first edition of this Guide necessary in the 1980s. It’s hard to overemphasize the impact of modern American scholarly editing in the third quarter of the twentieth century: volumes of novels, letters, diaries, statesmen’s papers, political pamphlets, and philosophical and scientific treatises were published in editions that claimed to be scholarly, with texts established and verified according to the standards of the academic community. Yet the field of scholarly editing grew so quickly that many of its principles were left implicit in the texts or annotation of the volumes themselves.


Even for materials under revision control, explicit principles of documentary editing will someday play a role in future editions of those texts, in part because texts do not stand alone, apart from their social context.

Abbie Hoffman’s introduction to Steal This Book:

We cannot survive without learning to fight and that is the lesson in the second section. FIGHT! separates revolutionaries from outlaws. The purpose of part two is not to fuck the system, but destroy it. The weapons are carefully chosen. They are “home-made,” in that they are designed for use in our unique electronic jungle. Here the uptown reviewer will find ample proof of our “violent” nature. But again, the dictionary of law fails us. Murder in a uniform is heroic, in a costume it is a crime. False advertisements win awards, forgers end up in jail. Inflated prices guarantee large profits while shoplifters are punished. Politicians conspire to create police riots and the victims are convicted in the courts. Students are gunned down and then indicted by suburban grand juries as the trouble-makers. A modern, highly mechanized army travels 9,000 miles to commit genocide against a small nation of great vision and then accuses its people of aggression. Slumlords allow rats to maim children and then complain of violence in the streets. Everything is topsy-turvy. If we internalize the language and imagery of the pigs, we will forever be fucked. Let me illustrate the point. Amerika was built on the slaughter of a people. That is its history. For years we watched movie after movie that demonstrated the white man’s benevolence. Jimmy Stewart, the epitome of fairness, puts his arm around Cochise and tells how the Indians and the whites can live in peace if only both sides will be reasonable, responsible and rational (the three R’s imperialists always teach the “natives”). “You will find good grazing land on the other side of the mountain,” drawls the public relations man. “Take your people and go in peace.” Cochise as well as millions of youngsters in the balcony of learning, were being dealt off the bottom of the deck. The Indians should have offed Jimmy Stewart in every picture and we should have cheered ourselves hoarse. 
Until we understand the nature of institutional violence and how it manipulates values and mores to maintain the power of the few, we will forever be imprisoned in the caves of ignorance. When we conclude that bank robbers rather than bankers should be the trustees of the universities, then we begin to think clearly. When we see the Army Mathematics Research and Development Center and the Bank of Amerika as cesspools of violence, filling the minds of our young with hatred, turning one against another, then we begin to think revolutionary.

Be clever using section two; clever as a snake. Dig the spirit of the struggle. Don’t get hung up on a sacrifice trip. Revolution is not about suicide, it is about life. With your fingers probe the holiness of your body and see that it was meant to live. Your body is just one in a mass of cuddly humanity. Become an internationalist and learn to respect all life. Make war on machines, and in particular the sterile machines of corporate death and the robots that guard them. The duty of a revolutionary is to make love and that means staying alive and free. That doesn’t allow for cop-outs. Smoking dope and hanging up Che’s picture is no more a commitment than drinking milk and collecting postage stamps. A revolution in consciousness is an empty high without a revolution in the distribution of power. We are not interested in the greening of Amerika except for the grass that will cover its grave.

would require a lot of annotation to explain to an audience that meekly submits to public gropings in airport security lines, widespread government surveillance and wars that benefit only contractors.

Both the Guide to Documentary Editing and Steal This Book are highly recommended.

Voyeur Tools: See Through Your Texts

Thursday, February 28th, 2013

Voyeur Tools: See Through Your Texts

From the website:

Voyeur is a web-based text analysis environment. It is designed to be user-friendly, flexible and powerful. Voyeur is part of a collaborative project to develop and theorize text analysis tools and text analysis rhetoric. This section of the web site provides information and documentation for users and developers of Voyeur.

What you can do with Voyeur:

  • use texts in a variety of formats including plain text, HTML, XML, PDF, RTF and MS Word
  • use texts from different locations, including URLs and uploaded files
  • perform lexical analysis including the study of frequency and distribution data
  • export data into other tools (as XML, tab separated values, etc.)
  • embed live tools into remote web sites that can accompany or complement your own content

One of the tools used in the Lincoln Logarithms project.

MapReduce Algorithms

Tuesday, February 5th, 2013

MapReduce Algorithms by Bill Bejeck.

Bill is writing a series of posts on implementing the algorithms given in pseudo-code in: Data-Intensive Text Processing with MapReduce.

  1. Working Through Data-Intensive Text Processing with MapReduce
  2. Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II
  3. Calculating A Co-Occurrence Matrix with Hadoop
  4. MapReduce Algorithms – Order Inversion
  5. Secondary Sorting
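The local aggregation pattern from the first two posts can be sketched without Hadoop at all; the idea is simply that the mapper combines counts in memory before emitting anything (a toy sketch of the technique, not Bill's actual code):

```python
from collections import defaultdict

def mapper_with_combining(lines):
    """In-mapper combining: aggregate word counts locally,
    emitting one (word, partial_count) pair per distinct word
    instead of one (word, 1) pair per occurrence."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return sorted(counts.items())

def reducer(pairs):
    """Sum the partial counts emitted by all mappers."""
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

# Two "splits", as if processed by two separate mappers.
emitted = (mapper_with_combining(["a rose is a rose"]) +
           mapper_with_combining(["a rose by any other name"]))
print(reducer(emitted))
```

Fewer emitted pairs means less data shuffled across the network, which is the whole point of the pattern.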

Another resource to try with your Hadoop Sandbox install!

I first saw this at Alex Popescu’s 3 MapReduce and Hadoop Links: Secondary Sorting, Hadoop-Based Letterpress, and Hadoop Vaidya.

Text as Data:…

Sunday, February 3rd, 2013

Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts by Justin Grimmer and Brandon M. Stewart.


Politics and political conflict often occur in the written and spoken word. Scholars have long recognized this, but the massive costs of analyzing even moderately sized collections of texts have hindered their use in political science research. Here lies the promise of automated text analysis: it substantially reduces the costs of analyzing large collections of text. We provide a guide to this exciting new area of research and show how, in many instances, the methods have already obtained part of their promise. But there are pitfalls to using automated methods—they are no substitute for careful thought and close reading and require extensive and problem-specific validation. We survey a wide range of new methods, provide guidance on how to validate the output of the models, and clarify misconceptions and errors in the literature. To conclude, we argue that for automated text methods to become a standard tool for political scientists, methodologists must contribute new methods and new methods of validation.

As a former political science major, I had to stop to read this article.

A wide-ranging survey of an “exciting new area of research,” though I remember content/text analysis as an undergraduate, north of forty years ago now.

True, some of the measures are new, along with better visualization techniques.

On the other hand, many of the problems of textual analysis now were the problems in textual analysis then (and before).

Highly recommended as a survey of current techniques.

A history of the “problems” of textual analysis and their resistance to various techniques will have to await another day.

Comment Visualization

Saturday, February 2nd, 2013

New from Juice Labs: A visualization tool for exploring text data by Zach Gemignani.

From the post:

Today we are pleased to release another free tool on Juice Labs. The Comment visualization is the perfect way to explore qualitative data like text survey responses, tweets, or product reviews. A few of the fun features:

  • Color comments based on a selected value
  • Filter comments using an interactive distribution chart at the top
  • Highlight the most interesting comments by selecting the flags in the upper right
  • Show the author and other contextual information about a comment

[skipping the lamest Wikipedia edits example]

Like our other free visualization tools in Juice Labs, the Comments visualization is designed for ease of use and sharing. Just drop in your own data, choose what fields you want to show as text and as values, and the visualization will immediately reflect your choices. The save button gives you a link that includes your data and settings.

Apparently the interface starts with the lamest Wikipedia edit data.

To change that, you have to scroll down to the Data window and hover over “Learn how.”

I have reformatted the how-to content here:

Put any comma delimited data in this box. The first row needs to contain the column names. Then, give us some hints on how to use your data.

[Pre-set column names]

[*] Use this column as the question.

[a] Use this column as the author.

[cby] Use this column to color the comments. Should be a metric. By default, the comments will be sorted in ascending order.

[-] Sort the comments in descending order of the metric value. Can only be used with [cby]

[c] Use this column as a context.

Tip: you can combine the hints like: [c-cby]
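So, for example, a hypothetical dataset might look like this (whether the hints are appended to the column names exactly this way is my assumption; check the tool's “Learn how” text against your own data):

```
comment [*],author [a],score [cby]
"Love the new layout",Alice,9
"Checkout flow is confusing",Bob,3
```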

Could be an interesting tool for quick and dirty exploration of textual content.

silenc: Removing the silent letters from a body of text

Sunday, January 20th, 2013

silenc: Removing the silent letters from a body of text by Nathan Yau.

From the post:

During a two-week visualization course, Momo Miyazaki, Manas Karambelkar, and Kenneth Aleksander Robertsen imagined what a body of text would be without the silent letters in silenc.

Nathan suggests it isn’t fancy on the analysis side but the views are interesting.

True enough that removing silent letters (once mapped) isn’t difficult, but the results of the technique may be more than just visually interesting.
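Once you have the mapping, the substitution step really is trivial; a sketch (the mapping here is a tiny, purely illustrative stand-in, where a real version would be derived from a pronouncing dictionary such as CMUdict):

```python
# Tiny illustrative map of words to silent-letters-removed forms.
SILENCED = {"knight": "nite", "island": "iland", "through": "thru"}

def silence(text):
    """Replace each mapped word with its 'silenced' form."""
    return " ".join(SILENCED.get(w, w) for w in text.lower().split())

print(silence("The knight rode through the island"))
```

The hard part of the project is building the mapping, not applying it.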

Usage patterns of words with silent letters would be an interesting question.

Or extending the technique to remove all adjectives from a text (that would shorten ad copy).

“Seeing” text or data from a different or unexpected perspective can lead to new insights. Some useful, some less so.

But it is the job of analysis to sort them out.

TSD 2013: 16th International Conference on Text, Speech and Dialogue

Wednesday, December 19th, 2012

TSD 2013: 16th International Conference on Text, Speech and Dialogue

Important Dates:

When Sep 1, 2013 – Sep 5, 2013
Where Plzen (Pilsen), Czech Republic
Submission Deadline Mar 31, 2013
Notification Due May 12, 2013
Final Version Due Jun 9, 2013

Subjects for submissions:

  • Speech Recognition
    —multilingual, continuous, emotional speech, handicapped speaker, out-of-vocabulary words, alternative way of feature extraction, new models for acoustic and language modelling,
  • Corpora and Language Resources
    —monolingual, multilingual, text, and spoken corpora, large web corpora, disambiguation, specialized lexicons, dictionaries,
  • Speech and Spoken Language Generation
    —multilingual, high fidelity speech synthesis, computer singing,
  • Tagging, Classification and Parsing of Text and Speech
    —multilingual processing, sentiment analysis, credibility analysis, automatic text labeling, summarization, authorship attribution,
  • Semantic Processing of Text and Speech
    —information extraction, information retrieval, data mining, semantic web, knowledge representation, inference, ontologies, sense disambiguation, plagiarism detection,
  • Integrating Applications of Text and Speech Processing
    —machine translation, natural language understanding, question-answering strategies, assistive technologies,
  • Automatic Dialogue Systems
    —self-learning, multilingual, question-answering systems, dialogue strategies, prosody in dialogues,
  • Multimodal Techniques and Modelling
    —video processing, facial animation, visual speech synthesis, user modelling, emotion and personality modelling.

It was at TSD 2012 that I found Ruslan Mitkov’s presentation: Coreference Resolution: to What Extent Does it Help NLP Applications? So, highly recommended!

Are Texts Unstructured Data? [Text Series]

Monday, November 5th, 2012

I ask because there is a pejorative tinge to “unstructured” when applied to texts. As though texts lack structure and can be improved by various schemes and designs.

Before reaching other aspects of such claims, I wanted to test the notion that texts are “unstructured.” If that is the case, then the Gettysburg Address as written:

Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation, so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.

Should be equivalent to the Gettysburg Address Scrambled (via the 3by3by3 Text Scrambler):

not are or freedom—and great a consecrated gave by as devotion they created in portion great and fought forth lives that are should we have that is nor died our to not a struggled the testing here, equal.

Now nation, little this to all civil highly we in these remaining say who we work perish nation, endure. engaged brave resolve that for here. the The of shall Four remember take here we fitting we liberty, forget did to It living poor that above and have honored ground. consecrate, place that is they the for have nation, the nation, for so ago what here, score not us—that it that us dead for who to we those from resting here can fathers this.

But, the that here It war. a is far never do years not what of proper new this brought and shall last earth. conceived their be of nation to detract. men dedicated here battle-field dedicated men, and in come are altogether the cause of great can which whether nobly living, this to which dead, rather to birth that who a field, will advanced. add so proposition people, long The the they power new It any us long increased and on we met or sense, for might hallow dedicate these far devotion—that to not note, in a the We dead before on full government dedicated we have from that war, the conceived God, a thus measure can gave of be people, that here We continent so a it, world people, vain—that dedicate, under can shall live. but larger task have dedicated, unfinished final our rather, can seven

It may just be me but I don’t get the same semantics from the second version as the first.


My premise going forward is that texts are structured.