Archive for the ‘Literature’ Category

The Marshall Index: A Guide to Negro Periodical Literature, 1940-1948

Tuesday, May 2nd, 2017

The Marshall Index: A Guide to Negro Periodical Literature, 1940-1948 by Albert P. Marshall, revised edition, Danky and Newman, 2002. Posted by ProQuest as a guide to their literature collections.

From the introduction:


For researchers today, one of the rewarding aspects of Marshall’s Guide, and an important one, is the number of obscure, little-collected, and discontinued African-American serials that he includes. Who today is familiar, for example, with Pulse, Service, New Vistas, Negro Traveler, Informer, Whetstone, Sphinx, Ivy Leaf, or Oracle? Until the large and comprehensive bibliography of black periodicals collected and edited by James P. Danky and Maureen Hady of the State Historical Society of Wisconsin and published by Harvard University Press is widely disseminated, few will even know the existence of many of these rare sources.

Superseded in some sense, but only in a sense, by African-American Newspapers and Periodicals: A National Bibliography by James P. Danky.

The Marshall Index will always remain the first index of Black periodical literature and reflect the choices and judgments of its author.

Pass this along to your librarian friends and anyone interested in Black literature.

Black Womxn Authors, Library of Congress and MarcXML (Part 2)

Thursday, April 20th, 2017

(After writing this post I got a message from Clifford Anderson on a completely different way to approach the Marc to XML problem. A very neat way. But, I thought the directions on installing MarcEdit on Ubuntu 16.04 would be helpful anyway. More on Clifford’s suggestion to follow.)

If you’re just joining, read Black Womxn Authors, Library of Congress and MarcXML (Part 1) for the background on why this flurry of installation is at all meaningful!

The goal is to get a working copy of MarcEdit installed on my Ubuntu 16.04 machine.

MarcEdit Linux Installation Instructions reads in part:

Installation Steps:

  1. Download the MarcEdit app bundle. This file has been zipped to reduce the download size. http://marcedit.reeset.net/software/marcedit.bin.zip
  2. Unzip the file and open the MarcEdit folder. Find the Install.txt file and read it.
  3. Ensure that you have the Mono framework installed. What is Mono? Mono is an open source implementation of Microsoft’s .NET framework. The best way to describe it is that .NET is very Java-like; it’s a common runtime that can work across any platform in which the framework has been installed. There are a number of ways to get the Mono framework — for MarcEdit’s purposes, it is recommended that you download and install the official package available from the Mono Project’s website. You can find the Mac OSX download here: http://www.go-mono.com/mono-downloads/download.html
  4. Run MarcEdit via the command-line using mono MarcEdit.exe from within the MarcEdit directory.

Well, sort of. 😉

First, you need to go to the Mono Project Download page. From there, under Xamarin packages, follow Debian, Ubuntu, and derivatives.

There is a package for Ubuntu 16.10, but it’s Mono 4.2.1. By installing the Xamarin packages, I am running Mono 4.7.0. Your call but as a matter of habit, I run the latest compatible packages.

Updating your package lists for Debian, Ubuntu, and derivatives:

Add the Mono Project GPG signing key and the package repository to your system (if you don’t use sudo, be sure to switch to root):

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 3FA7E0328081BFF6A14DA29AA6A19B38D3D831EF

echo "deb http://download.mono-project.com/repo/debian wheezy main" | sudo tee /etc/apt/sources.list.d/mono-xamarin.list

And for Ubuntu 16.10:

echo "deb http://download.mono-project.com/repo/debian wheezy-apache24-compat main" | sudo tee -a /etc/apt/sources.list.d/mono-xamarin.list

Now run:

sudo apt-get update

The Usage section suggests:

The package mono-devel should be installed to compile code.

The package mono-complete should be installed to install everything – this should cover most cases of “assembly not found” errors.

The package referenceassemblies-pcl should be installed for PCL compilation support – this will resolve most cases of “Framework not installed: .NETPortable” errors during software compilation.

The package ca-certificates-mono should be installed to get SSL certificates for HTTPS connections. Install this package if you run into trouble making HTTPS connections.

The package mono-xsp4 should be installed for running ASP.NET applications.

Find and select mono-complete first. Most decent package managers will show dependencies that will be installed. Add any of these that were missed.

Do follow the hints here to verify that Mono is working correctly.
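
One way to run that check from a terminal is to compare the installed version against the Mono 3.4+ minimum that MarcEdit’s linux_install.txt (quoted below) asks for. This is a minimal sketch under stated assumptions: version_ge is a made-up helper name, it relies on the -V (version sort) option of GNU sort, and it assumes the first line of the mono --version output contains “version X.Y.Z”.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Hypothetical helper; depends on "sort -V" (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v mono >/dev/null 2>&1; then
    # Assumption: the first line looks like "Mono JIT compiler version 4.7.0 (...)".
    installed=$(mono --version | sed -n '1s/.*version \([0-9][0-9.]*\).*/\1/p')
    if version_ge "$installed" 3.4; then
        echo "Mono $installed meets the 3.4+ requirement"
    else
        echo "Mono $installed is older than 3.4 (or could not be parsed)"
    fi
else
    echo "mono is not on the PATH yet"
fi
```

If the version string cannot be parsed, the sketch reports the install as too old rather than guessing.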

Are We There Yet?

Not quite. It was at this point that I unpacked http://marcedit.reeset.net/software/marcedit.bin.zip and discovered there is no “Install.txt file.” Rather there is a linux_install.txt, which reads:

a) Ensure that the dependencies have been installed
1) Dependency list:
i) MONO 3.4+ (Runtime plus the System.Windows.Forms library [these are sometimes separate])
ii) YAZ 5 + YAZ 5 develop Libraries + YAZ++ ZOOM bindings
iii) ZLIBC libraries
iV) libxml2/libxslt libraries
b) Unzip marcedit.zip
c) On first run:
a) mono MarcEdit.exe
b) Preferences tab will open, click on other, and set the following two values:
i) Temp path: /tmp/
ii) MONO path: [to your full mono path]

** For Z39.50 Support
d) Yaz.Sharp.dll.config — ensure that the dllmap points to the correct version of the shared libyaz object.
e) main_icon.bmp can be used for a desktop icon

Oops! Without unzipping marcedit.zip, you won’t see the dependencies:

ii) YAZ 5 + YAZ 5 develop Libraries + YAZ++ ZOOM bindings
iii) ZLIBC libraries
iV) libxml2/libxslt libraries

The YAZ site has a readme file for Ubuntu, but here is the very abbreviated version:


wget http://ftp.indexdata.dk/debian/indexdata.asc
sudo apt-key add indexdata.asc

echo "deb http://ftp.indexdata.dk/ubuntu xenial main" | sudo tee -a /etc/apt/sources.list
echo "deb-src http://ftp.indexdata.dk/ubuntu xenial main" | sudo tee -a /etc/apt/sources.list

(That sequence only works for Ubuntu xenial. See the readme file for other versions.)
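
The two repository lines differ only in the release codename, so they can be generated rather than typed. A minimal sketch: indexdata_sources is a hypothetical helper name, and whether Index Data publishes packages for any codename other than xenial is an assumption to check against the readme. The helper only prints the lines and writes nothing to the system.

```shell
# Print the Index Data "deb" and "deb-src" lines for one Ubuntu codename.
# Hypothetical helper; only xenial is confirmed by the commands above.
indexdata_sources() {
    printf 'deb http://ftp.indexdata.dk/ubuntu %s main\n' "$1"
    printf 'deb-src http://ftp.indexdata.dk/ubuntu %s main\n' "$1"
}

# lsb_release -cs prints the codename of the running release on Ubuntu;
# fall back to xenial where that command is missing.
release=$(lsb_release -cs 2>/dev/null || echo xenial)
indexdata_sources "$release"
```

Piping the output to sudo tee /etc/apt/sources.list.d/indexdata.list (an illustrative file name) would keep the entries in their own file instead of appending to /etc/apt/sources.list.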

Of course:

sudo apt-get update

As of today, you are looking for yaz 5.21.0-1 and libyaz5-dev 5.21.0-1.

Check for and/or install ZLIBC and libxml2/libxslt libraries.

Personal taste but I reboot at this point to make sure all the libraries re-load to the correct versions, etc. Should work without rebooting but that’s up to you.

Fire it up with

mono MarcEdit.exe

Choose Locations (not Other) and confirm that “Set Temporary Path:” is /tmp/. For MONO Path, enter the location of mono (try which mono and input the result), then select OK.
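
For the MONO Path value, something along these lines prints the full path to paste into the dialog. It is only a sketch: tool_path is a hypothetical helper name, and command -v is used here as the portable counterpart of which.

```shell
# tool_path NAME: print where NAME is found on the PATH, or fail with a message.
# Hypothetical helper built on the POSIX "command -v" lookup.
tool_path() {
    command -v "$1" 2>/dev/null || {
        echo "$1 not found on PATH" >&2
        return 1
    }
}

if mono_path=$(tool_path mono); then
    echo "MONO Path: $mono_path"
else
    echo "Install Mono before setting the MONO Path preference"
fi
```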

I did the install on Sunday evening, and after all this the software announces on loading that it has been upgraded! Yes, while I was installing all the dependencies, a new and improved version of MarcEdit was posted.

The XML extraction is a piece of cake so I am working on the XQuery on the resulting MarcXML records for part 3.
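
Since YAZ is already installed as a MarcEdit dependency, its yaz-marcdump utility offers one command-line route from binary MARC to MarcXML. This is a hedged sketch and not necessarily the route taken in this series: marc_to_marcxml is a hypothetical wrapper, the file names are placeholders, and the -f and -t character set options of yaz-marcdump may be needed depending on how the records were saved.

```shell
# marc_to_marcxml INPUT OUTPUT: convert binary MARC records to MarcXML.
# Hypothetical wrapper around yaz-marcdump, which ships with YAZ.
marc_to_marcxml() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
    command -v yaz-marcdump >/dev/null 2>&1 || {
        echo "yaz-marcdump not found" >&2
        return 3
    }
    # -i names the input format, -o the output format.
    yaz-marcdump -i marc -o marcxml "$1" > "$2"
}

# Placeholder file names; substitute the file saved from the catalog:
# marc_to_marcxml pageant-press.mrc pageant-press.xml
```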

Black Womxn Authors, Library of Congress and MarcXML (Part 1)

Monday, April 17th, 2017

This adventure started innocently enough with the 2017 Womxn of Color Reading Challenge by Der Vang. As an “older” White male Southerner working in technology, I don’t encounter works by womxn of color unless it is intentional.

The first book, “A book that became a movie,” was easy. I read the deeply moving Beloved by Toni Morrison. I recommend reading a non-critical edition before you read a critical one. Let Morrison speak for herself before you read others offering their views on the story.

The second book, “A book that came out the year you were born,” has proven to be more difficult. Far more difficult. You see, I think Der Vang was assuming a reading audience younger than I am, for which womxn of color authors would not be difficult to find. That hasn’t proven to be the case for me.

I searched the usual places but likely collections did not denote an author’s gender or race. The Atlanta-Fulton Public Library reference service came riding to the rescue after I had exhausted my talents with this message:

‘Attached is a “List of Books Published by Negro Writers in 1954 and Late 1953” (pp. 10-12) by Blyden Jackson, IN “The Blithe Newcomers: Resume of Negro Literature in 1954: Part I,” Phylon v.16, no.1 (1st Quarter 1955): 5-12, which has been annotated with classifications (Biography) or subjects (Poetry). Thirteen are written by women; however, just two are fiction. The brief article preceding the list does not mention the books by the women novelists–Elsie Jordan (Strange Sinner) or Elizabeth West Wallace (Scandal at Daybreak). No Part II has been identified. And AARL does not own these two. Searching AARL holdings in Classic Catalog by year yields seventeen by women but no fiction. Most are biographies. Two is better than none but not exactly a list.

A Celebration of Women Writers – African American Writers (http://digital.library.upenn.edu/women/_generate/AFRICAN%20AMERICAN.html) seems to have numerous [More Information] links which would possibly allow the requestor to determine the 1954 novelists among them.’
(emphasis in original)

Using those two authors/titles as leads, I found in the Library of Congress online catalog:

https://lccn.loc.gov/54007603
Jordan, Elsie. Strange sinner / Elsie Jordan. 1st ed. New York : Pageant, c1954.
172 p. ; 21 cm.
PZ4.J818 St

https://lccn.loc.gov/54012342
Wallace, Elizabeth West. [from old catalog] Scandal at daybreak. [1st ed.] New York, Pageant Press [1954]
167 p. 21 cm.
PZ4.W187 Sc

Checking elsewhere, both titles are out of print, although I did see one (1) copy of Elsie Jordan’s Strange Sinner for $100. I think I have located a university with a digital scan but will have to report back on that later.

Since both Jordan and Wallace published with Pageant Press the same year, I reasoned that other womxn of color may have also published with them and that could lead me to more accessible works.

Experienced librarians are no doubt already grinning because if you search for “Pageant Press” with the Library of Congress online catalog, you get 961 “hits,” displayed 25 “hits” at a time. Yes, you can set the page to return 100 “hits” at a time, but not while you have sort by date of publication selected. 🙁

That is, you can display 100 “hits” per page in no particular order, or you can display the “hits” in date of publication order, but only 25 “hits” at a time. (Or at least that was my experience, please correct me if that’s wrong.)

But, with the 100 “hits” per page, you can “save as,” but only as Marc records, Unicode (UTF-8) or not. No MarcXML format.
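
A quick sanity check on a saved file is to count its records. In the binary MARC format each record ends with a record terminator byte (hexadecimal 1D, octal 035), so counting those bytes gives the record count for well-formed records. A minimal sketch: count_marc_records is a hypothetical helper and records.mrc is a placeholder file name.

```shell
# count_marc_records FILE: count records in a binary MARC file.
# Hypothetical helper; assumes 0x1D appears only as the record terminator.
count_marc_records() {
    # Keep only the 0x1D bytes, then count them.
    LC_ALL=C tr -cd '\035' < "$1" | wc -c | tr -d ' '
}

# Placeholder file name; substitute the file saved from the catalog:
# count_marc_records records.mrc
```

Comparing the number against the “hits” the catalog reported shows whether every page of results made it into the file.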

The Library of Congress response to my query about the same reads:

At the moment we have no plans to provide an option to save search results as MARCXML. We will consider it for future development projects.

I can understand that in the current climate in Washington, but a way to convert Marc records to the easier (in my view) to manipulate MarcXML format would be a real benefit to readers and researchers alike.

Fortunately there is a solution, MarcEdit.

From the webpage:

This LibGuide attempts to document the features of MarcEdit, which was developed by Terry Reese. It is open source software designed to facilitate the harvesting, editing, and creation of MARC records. This LibGuide was adapted from a standalone document, and while the structure of the original document has been preserved in this LibGuide, it is also available in PDF form at the link below. The original documentation and this LibGuide were written with the idea that it would be consulted on an as-needed basis. As a result, the beginning steps of many processes may be repeated within the same page or across the LibGuide as a whole so that users would be able to understand the entire process of implementing a function within MarcEdit without having to consult other guides to know where to begin. There are also screenshots that are repeated throughout, which may provide a faster reference for users to understand what steps they may already be familiar with.

Of course, installing MarcEdit on Ubuntu isn’t a straightforward task. But I have 961 Marc records and possibly more that would be very useful in MarcXML. Tomorrow I will document the installation steps I followed with Ubuntu 16.04.

PS: I’m not ignoring the suggested A Celebration of Women Writers – African American Writers (http://digital.library.upenn.edu/women/_generate/AFRICAN%20AMERICAN.html). But I have gotten distracted by the technical issue of how to convert all the holdings at the Library of Congress for a publisher into MarcXML. Suggestions on how to best use this resource?

“Tidying” Up Jane Austen (R)

Thursday, February 16th, 2017

Text Mining the Tidy Way by Julia Silge.

Thanks to Julia’s presentation I now know there is an R package with all of Jane Austen’s novels ready for text analysis.

OK, Austen may not be at the top of your reading list, but the Tidy techniques Julia demonstrates are applicable to a wide range of textual data.

Among those mentioned in the presentation, NASA datasets!

Julia, along with Dave Robinson, wrote: Text Mining with R: A Tidy Approach, available online now and later this year from O’Reilly.

Digital Humanities / Studies: U.Pitt.Greenberg

Wednesday, February 1st, 2017

Digital Humanities / Studies: U.Pitt.Greenberg maintained by Elisa E. Beshero-Bondar.

I discovered this syllabus and course materials by accident when one of its modules on XQuery turned up in a search. Backing out of that module I discovered this gem of a digital humanities course.

The course description:

Our course in “digital humanities” and “digital studies” is designed to be interdisciplinary and practical, with an emphasis on learning through “hands-on” experience. It is a computer course, but not a course in which you learn programming for the sake of learning a programming language. It’s a course that will involve programming, and working with coding languages, and “putting things online,” but it’s not a course designed to make you, in fifteen weeks, a professional website designer. Instead, this is a course in which we prioritize what we can investigate in the Humanities and related Social Sciences fields about cultural, historical, and literary research questions through applications in computer coding and programming, which you will be learning and applying as you go in order to make new discoveries and transform cultural objects—what we call “texts” in their complex and multiple dimensions. We think of “texts” as the transmittable, sharable forms of human creativity (mainly through language), and we interface with a particular text in multiple ways through print and electronic “documents.” When we refer to a “document,” we mean a specific instance of a text, and much of our work will be in experimenting with the structures of texts in digital document formats, accessing them through scripts we write in computer code—scripts that in themselves are a kind of text, readable both by humans and machines.

Your professors are scholars and teachers of humanities, not computer programmers by trade, and we teach this course from our backgrounds (in literature and anthropology, respectively). We teach this course to share coding methods that are highly useful to us in our fields, with an emphasis on working with texts as artifacts of human culture shaped primarily with words and letters—the forms of “written” language transferable to many media (including image and sound) that we can study with computer modelling tools that we design for ourselves based on the questions we ask. We work with computers in this course as precision instruments that help us to read and process great quantities of information, and that lead us to make significant connections, ask new kinds of questions, and build models and interfaces to change our reading and thinking experience as people curious about human history, culture, and creativity.

Our focus in this course is primarily analytical: to apply computer technologies to represent and investigate cultural materials. As we design projects together, you will gain practical experience in editing and you will certainly fine-tune your precision in writing and thinking. We will be working primarily with eXtensible Markup Language (XML) because it is a powerful tool for modelling texts that we can adapt creatively to our interests and questions. XML represents a standard in adaptability and human-readability in digital code, and it works together with related technologies with which you will gain working experience: You’ll learn how to write XPath expressions: a formal language for searching and extracting information from XML code which serves as the basis for transforming XML into many publishable forms, using XSLT and XQuery. You’ll learn to write XSLT: a programming “stylesheet” transforming language designed to convert XML to publishable formats, as well as XQuery, a query (or search) language for extracting information from XML files bundled collectively. You will learn how to design your own systematic coding methods to work on projects, and how to write your own rules in schema languages (like Schematron and Relax-NG) to keep your projects organized and prevent errors. You’ll gain experience with an international XML language called TEI (after the Text Encoding Initiative) which serves as the international standard for coding digital archives of cultural materials. Since one of the best and most widely accessible ways to publish XML is on the worldwide web, you’ll gain working experience with HTML code (a markup language that is a kind of XML) and styling HTML with Cascading Stylesheets (CSS). 
We will do all of this with an eye to your understanding how coding works—and no longer relying without question on expensive commercial software as the “only” available solution, because such software is usually not designed with our research questions in mind.

We think you’ll gain enough experience at least to become a little dangerous, and at the very least more independent as investigators and makers who wield computers as fit instruments for your own tasks. Your success will require patience, dedication, and regular communication and interaction with us, working through assignments on a daily basis. Your success will NOT require perfection, but rather your regular efforts throughout the course, your documenting of problems when your coding doesn’t yield the results you want. Homework exercises are a back-and-forth, intensive dialogue between you and your instructors, and we plan to spend a great deal of time with you individually over these as we work together. Our guiding principle in developing assignments and working with you is that the best way for you to learn and succeed is through regular practice as you hone your skills. Our goal is not to make you expert programmers (as we are far from that ourselves)! Rather, we want you to learn how to manipulate coding technologies for your own purposes, how to track down answers to questions, how to think your way algorithmically through problems and find good solutions.

Skimming the syllabus rekindles an awareness of the distinction between the “hard” sciences and the “difficult” ones.

Enjoy!

Update:

After yesterday’s post, Elisa Beshero-Bondar tweeted that this one course is now two:

At a new homepage: newtFire {dh|ds}!

Enjoy!

War and Peace & R

Friday, December 2nd, 2016

No, not a post about R versus Python but about R and Tolstoy’s War and Peace.

Using R to Gain Insights into the Emotional Journeys in War and Peace by Wee Hyong Tok.

From the post:

How do you read a novel in record time, and gain insights into the emotional journey of main characters, as they go through various trials and tribulations, as an exciting story unfolds from chapter to chapter?

I remembered my experiences when I start reading a novel, and I get intrigued by the story, and simply cannot wait to get to the last chapter. I also recall many conversations with friends on some of the interesting novels that I have read awhile back, and somehow have only vague recollection of what happened in a specific chapter. In this post, I’ll work through how we can use R to analyze the English translation of War and Peace.

War and Peace is a novel by Leo Tolstoy, and captures the salient points about Russian history from the period 1805 to 1812. The novel consists of the stories of five families, and captures the trials and tribulations of various characters (e.g. Natasha and Andre). The novel consists of about 1400 pages, and is one of the longest novels that have been written.

We hypothesize that if we can build a dashboard (shown below), this will allow us to gain insights into the emotional journey undertaken by the characters in War and Peace.

Impressive work, even though I would not use it as a short-cut to “read a novel in record time.”

Rather I take this as an alternative way of reading War and Peace, one that can capture insights a casual reader may miss.

Moreover, the techniques demonstrated here could be used with other works of literature, or even non-fictional works.

Imagine conducting this analysis over the reportedly more than 7,000 page full CIA Torture Report, for example.

A heatmap does not connect any dots, but points a user towards places where interesting dots may be found.

Certainly a tool for exploring large releases/leaks of text data.

Enjoy!

PS: Large, tiresome, obscure-on-purpose, government reports to practice on with this method?

Ulysses, Joyce and Stanford CoreNLP

Saturday, November 26th, 2016

Introduction to memory and time usage

From the webpage:

People not infrequently complain that Stanford CoreNLP is slow or takes a ton of memory. In some configurations this is true. In other configurations, this is not true. This section tries to help you understand what you can or can’t do about speed and memory usage. The advice applies regardless of whether you are running CoreNLP from the command-line, from the Java API, from the web service, or from other languages. We show command-line examples here, but the principles are true of all ways of invoking CoreNLP. You will just need to pass in the appropriate properties in different ways. For these examples we will work with chapter 13 of Ulysses by James Joyce. You can download it if you want to follow along.

You have to appreciate the use of a non-trivial text for advice on speed and memory usage of CoreNLP.

How does your text stack up against Chapter 13 of Ulysses?

I’m supposed to be reading Ulysses long distance with a friend. I’m afraid we have both fallen behind. Perhaps this will encourage me to have another go at it.

What favorite or “should read” text would you use to practice with CoreNLP?

Suggestions?

Electronic Literature Organization

Sunday, June 19th, 2016

Electronic Literature Organization

From the “What is E-Lit” page:

Electronic literature, or e-lit, refers to works with important literary aspects that take advantage of the capabilities and contexts provided by the stand-alone or networked computer. Within the broad category of electronic literature are several forms and threads of practice, some of which are:

  • Hypertext fiction and poetry, on and off the Web
  • Kinetic poetry presented in Flash and using other platforms
  • Computer art installations which ask viewers to read them or otherwise have literary aspects
  • Conversational characters, also known as chatterbots
  • Interactive fiction
  • Literary apps
  • Novels that take the form of emails, SMS messages, or blogs
  • Poems and stories that are generated by computers, either interactively or based on parameters given at the beginning
  • Collaborative writing projects that allow readers to contribute to the text of a work
  • Literary performances online that develop new ways of writing

The ELO showcase, created in 2006 and with some entries from 2010, provides a selection of outstanding examples of electronic literature, as do the two volumes of our Electronic Literature Collection.

The field of electronic literature is an evolving one. Literature today not only migrates from print to electronic media; increasingly, “born digital” works are created explicitly for the networked computer. The ELO seeks to bring the literary workings of this network and the process-intensive aspects of literature into visibility.

The confrontation with technology at the level of creation is what distinguishes electronic literature from, for example, e-books, digitized versions of print works, and other products of print authors “going digital.”

Electronic literature often intersects with conceptual and sound arts, but reading and writing remain central to the literary arts. These activities, unbound by pages and the printed book, now move freely through galleries, performance spaces, and museums. Electronic literature does not reside in any single medium or institution.

I was looking for a recent presentation by Allison Parrish on bots when I encountered Electronic Literature Organization (ELO).

I was attracted by the bot discussion at a recent conference but as you can see, the range of activities of the ELO is much broader.

Enjoy!

“Library of Babel” (Jorge Luis Borges)

Wednesday, May 4th, 2016

[Image: “Library of Babel” illustration from Plotted: A Literary Atlas]

Select the image for a larger view. Trust me, it’s worth it.

The illustration is from “Plotted: A Literary Atlas” by Andrew DeGraff and this particular image of the illustration is from the review: 9 Awesome Literary Maps Every Book Lover Needs To See by Krystie Lee Yandoli.

DeGraff has maps for portions of these works:

Adventures of Huckleberry Finn – Mark Twain

Around the World in Eighty Days – Jules Verne

A Christmas Carol – Charles Dickens

A Good Man Is Hard to Find – Flannery O’Connor

Hamlet, Prince of Denmark – William Shakespeare

Invisible Man – Ralph Ellison

The Library of Babel – Jorge Luis Borges

The Lottery – Shirley Jackson

Moby Dick, or, The Whale – Herman Melville

Narrative of the Life of Frederick Douglass, an American Slave – Frederick Douglass

A Narrow Fellow in the Grass – Emily Dickinson

The Odyssey – Homer

The Ones Who Walk Away from Omelas – Ursula K. Le Guin

Pride and Prejudice – Jane Austen

A Report to the Academy – Franz Kafka

Robinson Crusoe – Daniel Defoe

Waiting for Godot – Samuel Beckett

Watership Down – Richard Adams

A Wrinkle in Time – Madeleine L’Engle

Keep a copy of Plotted: A Literary Atlas on hand as inspiration.

At the same time, try your hand at capturing your spatial understanding of a narrative. Your reading experience will be different.

Enjoy!

Advice on Reading Academic Papers [Comments on Reading Case Law/Statutes]

Tuesday, March 1st, 2016

Advice on Reading Academic Papers by Aaron Massey.

From the post:

Graduate students must learn to read academic papers, but in virtually all cases, these same students are not formally taught how to best read academic papers. It is not the same process used to read a newspaper, magazine, or novel. The process of learning how to read academic papers properly can not only be painful, but also waste quite a bit of time. Here are my quick tips on reading papers of all stripes:

Less detailed than How to read and understand a scientific paper…., which includes a worked example, and not as oriented to CS as How to Read a Paper.

In addition to four other guides, Aaron includes this link which returns (as of today), some 384,000,000 “hits” on the search string: “how to read a scientific paper.”

There appears to be no shortage of advice on “how to read a scientific paper.” 😉

Just for grins, a popular search engine returns these results:

“how to read case law” returns 2,070 “hits,” which dwindles down to 80 when similar materials are removed.

Isn’t that interesting? Case law, which in many cases determines who pays, who goes to jail, who wins, has such poor coverage in reading helps?

“how to read statutes” returns 2,500 “hits,” which dwindles down to 97 when similar materials are omitted.

Beyond the barriers of legal “jargon,” be aware that even ordinary words may not have expected meanings in both case law and statutes.

For best and safest results, always consult licensed legal counsel.

That perpetuates the legal guild but its protective mechanisms are harsh and pitiless. Consider yourself forewarned.

If magazine (Internet Archive)

Thursday, February 25th, 2016

Read: The full run of If magazine, scanned at the Internet Archive by Cory Doctorow.

From the post:

The Internet Archive’s amazing Pulp Magazine Archive includes all 176 issues of If, a classic science fiction magazine that ran from 1952 to 1974.

Included in the collection are all of the issues edited by Frederik Pohl from 1966-68, three years that netted him three consecutive Best Editor Hugo awards. If’s Pohl run included significant stories by Larry Niven, Harlan Ellison, Samuel Delany, Alexei Panshin and Gene Wolfe; it was the serialized home of such Heinlein novels as The Moon is a Harsh Mistress, as well as Laumer’s Retief stories and Saberhagen’s Berserker stories.

IF Magazine [Internet Archive]

(via Metafilter)

Good resource when you need to take a break from reading search literature or legal briefs. 😉

Enjoy!

Between the Words [Alternate Visualizations of Texts]

Saturday, February 6th, 2016

Between the Words – Exploring the punctuation in literary classics by Nicholas Rougeux.

From the webpage:

Between the Words is an exploration of visual rhythm of punctuation in well-known literary works. All letters, numbers, spaces, and line breaks were removed from entire texts of classic stories like Alice’s Adventures in Wonderland, Moby Dick, and Pride and Prejudice—leaving only the punctuation in one continuous line of symbols in the order they appear in texts. The remaining punctuation was arranged in a spiral starting at the top center with markings for each chapter and classic illustrations at the center.
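
The first step described there, stripping everything except punctuation, can be approximated at the command line. This is a rough sketch and not Rougeux’s actual process: punctuation_only is a hypothetical name, and in the C locale used here only ASCII letters and digits are removed, so accented letters would need extra handling.

```shell
# punctuation_only: read text on stdin and drop letters, digits, spaces and
# line breaks, leaving the punctuation as one continuous line.
# Hypothetical helper; byte-oriented, so it only recognizes ASCII letters.
punctuation_only() {
    LC_ALL=C tr -d '[:alnum:][:space:]'
}

printf 'Marley was dead: to begin with. There is no doubt whatever about that.\n' |
    punctuation_only
echo    # the filter removes the trailing newline, so add one back
# prints :..
```

Run against the full text of a novel, the same filter yields the single line of symbols that the posters arrange into a spiral.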

The posters are 24″ X 36.”

Some small images to illustrate the concept:

[Images: A Christmas Carol, A Tale of Two Cities, Alice’s Adventures in Wonderland]

I’m not an art critic but I can say that unusual or unexpected visualizations of data can lead to new insights. Or should I say different insights than you may have previously held.

Seeing this visualization reminded me of a presentation too many years ago at Cambridge that argued the cantillation (think crudely “accents”) marks in the Hebrew Bible were a reliable guide to clause boundaries and reading.

FYI, the versification and divisions in the oldest known witnesses to the Hebrew Bible were added centuries after the text stabilized. There are generally accepted positions on the text but at best, they are just that, generally accepted positions.

Any number of alternative presentations of texts suggest themselves.

I haven’t performed the experiment but for numeric data, reordering the data so as to force re-casting of formulas could be a way to explore presumptions that are glossed over by the “usual form.”

Not unlike copying a text by hand as opposed to typing or photocopying the text. Each step of performing the task with less deliberation increases the odds you will miss some decision that you are making unconsciously.

If you like these posters or know an English major/professor who may, pass this site along to them. (I have no interest, financial or otherwise, in this site but I like to encourage creative thinking.)

I first saw this in a tweet by Christopher Phipps.

Paradise Lost (John MILTON, 1608 – 1674) Audio Version

Thursday, December 10th, 2015

Paradise Lost (John MILTON, 1608 – 1674) Audio Version.

As you know, John Milton was blind when he wrote Paradise Lost. His only “interface” for writing, editing and correcting was aural.

Shoppers and worshipers need to attend very closely to the rhetoric of the season. Listening to Paradise Lost, even as Milton did, may sharpen your ear for rhetorical devices and words that would otherwise pass unnoticed.

For example, what are the “good tidings” of Christmas hymns? Are they about the “…new born king…” or are they anticipating the sacrifice of that “…new born king…” instead of ourselves?

The first seems traditional and fairly benign; the second seems more self-centered and selfish than the usual Christmas holiday theme.

If you think that is an aberrant view of the holiday, consider that in A Christmas Carol by Charles Dickens, Scrooge (spoiler alert) ends the tale by keeping Christmas in his heart all year round.

One of the morals being that we should treat others kindly and with consideration every day of the year. Not as some modern Christians do, half-listening at an hour-long service once a week and spending the waking portion of the other 167 hours not being Christians.

Paradise Lost is a complex and nuanced text. Learning to spot its rhetorical moves and devices will make you a more discerning observer of modern discourse.

Enjoy!

Workflow for R & Shakespeare

Friday, October 2nd, 2015

A new data processing workflow for R: dplyr, magrittr, tidyr, ggplot2

From the post:

Over the last year I have changed my data processing and manipulation workflow in R dramatically. Thanks to some great new packages like dplyr, tidyr and magrittr (as well as the less-new ggplot2) I've been able to streamline code and speed up processing. Up until 2014, I had used essentially the same R workflow (aggregate, merge, apply/tapply, reshape etc) for more than 10 years. I have added a few improvements over the years in the form of functions in packages doBy, reshape2 and plyr and I also flirted with the package data.table (which I found to be much faster for big datasets but the syntax made it difficult to work with) — but the basic flow has remained remarkably similar. Until now…

Given how much I've enjoyed the speed and clarity of the new workflow, I thought I would share a quick demonstration.

In this example, I am going to grab data from a sample SQL database provided by Google via Google BigQuery and then give examples of manipulation using dplyr, magrittr and tidyr (and ggplot2 for visualization).

This is a great introduction to a workflow in R that you can generalize for your own purposes.

Word counts won’t impress your English professor but you will have a base for deeper analysis of Shakespeare.
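For a sense of how little code that base requires, here is a minimal sketch of a word count. It is in Python rather than the R packages the post demonstrates, and it works on a single line from Hamlet instead of the BigQuery Shakespeare sample; the function name `word_counts` and the crude tokenizing rule are my assumptions.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lower-cased words; a word is a run of letters or apostrophes."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


sample = "To be, or not to be, that is the question."
counts = word_counts(sample)

# Show each word with its frequency, most frequent first.
for word, n in counts.most_common():
    print(word, n)
```

A real analysis would need better tokenizing (hyphenated words, speaker labels, stage directions), which is where the deeper questions start.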

I first saw this in a tweet by Christophe Lalanne.

Mapping the world of Mark Twain (subject confusion)

Sunday, August 2nd, 2015

Mapping the world of Mark Twain by Andrew Hill.

From the post:

Mapping Mark Twain

This weekend I was looking through Project Gutenberg and found something even better than a single book, I found the complete works of Mark Twain. I remembered how geographic the stories of Twain are and so knew immediately I had found a treasure chest. For the last few days, I’ve been parsing the books line-by-line and trying to find the localities that make up the world of Mark Twain. In the end, the data has over 20,000 localities. Even counting the cases where sir names are mistaken for places, it is a really cool dataset. What I’ll show you here is only the tip of the iceberg. I put the results together as an interactive map that maybe will inspire you to take a journey with Twain on your own, extend your life a little.

Sounds great!

Warning: Subject Confusion

Mapping the world of Mark Twain (the map)!

The blog entry: http://andrewxhill.com/blog/2014/01/26/Mapping-the-world-of-Mark-Twain/ has the same name as the map: http://andrewxhill.com/maps/writers/twain/index.html.

Both are excellent and the blog entry includes details on how you can construct similar maps.

Topic maps disambiguate names that would otherwise lead to confusion!

What names do you need to disambiguate?

Or do you need to avoid subject confusion with names used by others? (Unknown to you.)

Harry Potter eBooks

Sunday, February 1st, 2015

All the Harry Potter ebooks are now on subscription site Oyster by Laura Hazard Owen.

Laura reports the Harry Potter books are available on Oyster and Amazon. She says that Oyster has the spin-off titles from the original series where Amazon does not.

Both offer $9.95 per month subscription rates, where Oyster claims “over a million” books and Amazon over 700,000. After reading David Mason’s How many books will you read in your lifetime?, I am not sure the difference in raw numbers will make much difference.

Access to electronic texts will certainly make creating topic maps for popular literature a good deal easier.

Enjoy!

Modelling Plot: On the “conversional novel”

Tuesday, January 20th, 2015

Modelling Plot: On the “conversional novel” by Andrew Piper.

From the post:

I am pleased to announce the acceptance of a new piece that will be appearing soon in New Literary History. In it, I explore techniques for identifying narratives of conversion in the modern novel in German, French and English. A great deal of new work has been circulating recently that addresses the question of plot structures within different genres and how we might or might not be able to model these computationally. My hope is that this piece offers a compelling new way of computationally studying different plot types and understanding their meaning within different genres.

Looking over recent work, in addition to Ben Schmidt’s original post examining plot “arcs” in TV shows using PCA, there have been posts by Ted Underwood and Matthew Jockers looking at novels, as well as a new piece in LLC that tries to identify plot units in fairy tales using the tools of natural language processing (frame nets and identity extraction). In this vein, my work offers an attempt to think about a single plot “type” (narrative conversion) and its role in the development of the novel over the long nineteenth century. How might we develop models that register the novel’s relationship to the narration of profound change, and how might such narratives be indicative of readerly investment? Is there something intrinsic, I have been asking myself, to the way novels ask us to commit to them? If so, does this have something to do with larger linguistic currents within them – not just a single line, passage, or character, or even something like “style” – but the way a greater shift of language over the course of the novel can be generative of affective states such as allegiance, belief or conviction? Can linguistic change, in other words, serve as an efficacious vehicle of readerly devotion?

While the full paper is available here, I wanted to post a distilled version of what I see as its primary findings. It’s a long essay that not only tries to experiment with the project of modelling plot, but also reflects on the process of model building itself and its place within critical reading practices. In many ways, its a polemic against the unfortunate binariness that surrounds debates in our field right now (distant/close, surface/depth etc.). Instead, I want us to see how computational modelling is in many ways conversional in nature, if by that we understand it as a circular process of gradually approaching some imaginary, yet never attainable centre, one that oscillates between both quantitative and qualitative stances (distant and close practices of reading).

Andrew writes of “…critical reading practices….” I’m not sure that technology will increase the use of “…critical reading practices…” but it certainly offers the opportunity to “read” texts in different ways.

I have done this with IT standards but never with a novel: attempt reading a text from the back forwards, a sentence at a time. At least with a text you are proofing, it provides a radically different perspective than the more normal front to back. The first thing you notice is that it interrupts your reading/skimming speed, so you will catch more errors as well as nuances in the text.
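If you want to try back-to-front reading on screen, a few lines of code will reorder a draft for you. This is only a sketch: the function name `sentences_back_to_front` and the sample draft are hypothetical, and the sentence splitter is deliberately naive (it will stumble on abbreviations such as “e.g.”).

```python
import re


def sentences_back_to_front(text: str) -> list[str]:
    """Split text into sentences and return them last sentence first."""
    # Naive rule: a sentence ends at ".", "!" or "?" followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in reversed(sentences) if s]


# Hypothetical draft text, standing in for a standard under review.
draft = (
    "The parser reads the header. It then checks the version! "
    "Is the length valid?"
)
for sentence in sentences_back_to_front(draft):
    print(sentence)
```

Each sentence arrives without the momentum of the one before it, which is the whole point of the exercise.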

Before you think that literary analysis is a bit far afield from “practical” application, remember that narratives (think literature) are what drive social policy and decision making.

Take the current “war on terrorism” narrative that is so popular and unquestioned in the United States. Ask anyone inside the beltway in D.C. and they will blather on and on about the need to defend against terrorism. But there is an absolute paucity of terrorists, at least by deed, in the United States. Why does the narrative persist in the absence of any evidence to support it?

The various Red Scares in U.S. history were similar narratives that have never completely faded. They too had a radical disconnect between the narrative and the “facts on the ground.”

Piper doesn’t offer answers to those sorts of questions but a deeper understanding of narrative, such as is found in novels, may lead to hints with profound policy implications.

Pride & Prejudice & Word Embedding Distance

Sunday, November 23rd, 2014

Pride & Prejudice & Word Embedding Distance by Lynn Cherny.

From the webpage:

An experiment: Train a word2vec model on Jane Austen’s books, then replace the nouns in P&P with the nearest word in that model. The graph shows a 2D t-SNE distance plot of the nouns in this book, original and replacement. Mouse over the blue words!

In her blog post, Visualizing Word Embeddings in Pride and Prejudice, Lynn explains more about the project and the process she followed.

From that post:

Overall, the project as launched consists of the text of Pride and Prejudice, with the nouns replaced by the most similar word in a model trained on all of Jane Austen’s books’ text. The resulting text is pretty nonsensical. The blue words are the replaced words, shaded by how close a “match” they are to the original word; if you mouse over them, you see a little tooltip telling you the original word and the score.

I don’t agree that: “The resulting text is pretty nonsensical.”

True, it’s not Jane Austen’s original text and it is challenging to read, but that may be because our assumptions about Pride and Prejudice and literature in general are being defeated by the similar word replacements.

The lack of familiarity and smoothness of a received text may (no guarantees) enable us to see the text differently than we would on a casual re-reading.
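The replacement step described in the quoted post can be sketched without any trained model. The code below is not Lynn Cherny's; it is a minimal illustration that assumes a tiny, hand-made table of three-dimensional vectors (the words and numbers in `VECTORS` are invented) in place of a word2vec model trained on Austen. Because the toy vocabulary holds only nouns, a vocabulary lookup stands in for the part-of-speech tagging a real run would need.

```python
import math

# Hypothetical toy embeddings; a real experiment would load trained vectors.
VECTORS = {
    "sister": (0.9, 0.1, 0.0),
    "daughter": (0.8, 0.2, 0.1),
    "carriage": (0.0, 0.9, 0.3),
    "coach": (0.1, 0.8, 0.4),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest(word):
    """Return (neighbour, score) for the most similar other word, or None."""
    if word not in VECTORS:
        return None
    candidates = (
        (other, cosine(VECTORS[word], vec))
        for other, vec in VECTORS.items()
        if other != word
    )
    return max(candidates, key=lambda pair: pair[1])


def replace_words(tokens):
    """Swap every token found in the vocabulary for its nearest neighbour."""
    replaced = []
    for token in tokens:
        hit = nearest(token)
        replaced.append(hit[0] if hit else token)
    return replaced


print(" ".join(replace_words("her sister took the carriage".split())))
```

The score returned by `nearest` is the kind of value the original page uses to shade each replaced word.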

What novel corpus would you use for such an experiment?

Discovering Literature: Romantics and Victorians

Thursday, May 29th, 2014

Discovering Literature: Romantics and Victorians (British Library)

From “About this project:”

Exploring the Romantic and Victorian periods, Discovering Literature brings together, for the first time, a wealth of the British Library’s greatest literary treasures, including numerous original manuscripts, first editions and rare illustrations.

A rich variety of contextual material – newspapers, photographs, advertisements and maps – is presented alongside personal letters and diaries from iconic authors. Together they bring to life the historical, political and cultural contexts in which major works were written: works that have shaped our literary heritage.

William Blake’s notebook, childhood writings of the Brontë sisters, the manuscript of the Preface to Charles Dickens’s Oliver Twist, and an early draft of Oscar Wilde’s The Importance of Being Earnest are just some of the unique collections available on the site.

Discovering Literature features over 8000 pages of collection items and explores more than 20 authors through 165 newly-commissioned articles, 25 short documentary films, and 30 lesson plans. More than 60 experts have contributed interpretation, enriching the website with contemporary research. Designed to enhance the study and enjoyment of English literature, the site contains a dedicated Teachers’ Area supporting the curriculum for GCSE and A Level students.

These great works from the Romantic and Victorian periods form the first phase of a wider project to digitise other literary eras, including the 20th century.

On a whim I searched for Bleak House only to find: Bleak House first edition with illustrations, which includes images of the illustrations and the text. Moreover, it has related links, one of which is a review of Jude the Obscure that appeared in the Morning Post.

From the review:

To write a story of over five hundred pages, and longer by far than the majority of three-volume novels, without allowing one single ray of humour, or even cheerfulness, to dispel for a moment the gloomy atmosphere of hopeless pessimism was no ordinary task, and might have taxed the powers of the most relentless observers of life. Even Euripides, had he been given to the writing of novels, might well have faltered before such a tremendous undertaking.

Can you imagine finding such a review on Amazon.com?

Mapping Bleak House into then-current legal practice or Jude the Obscure into social customs and records of the time would be fascinating summer projects.

Harry Potter (Neo4j GraphGist)

Friday, November 22nd, 2013

Harry Potter (Neo4j GraphGist)

From the webpage:

v0 of this graph models some of Harrys friends, enemies and their parents. Also have some pets and a few killings. The obvious relation missing is the one between Harry Potter and Voldemort- it took us 7 books to figure that one out, so you’ll have to wait till I add more data 🙂

Great start on a graph representation of Harry Potter!

But the graph model has a different perspective than Harry or others in the book series had.

Harry Potter model

I’m a Harry Potter fan. When Harry Potter and the Philosopher’s Stone starts, Harry doesn’t know Ron Weasley, Hermione Granger, Voldemort, or Hedwig.

The graph presents the vantage point of an omniscient observer, who knows facts the rest of us waited seven (7) volumes to discover.

A useful point of view, but it doesn’t show how knowledge and events unfolded to the characters in the story.

We lose any tension over whether Harry will choose Cho Chang or Ginny Weasley.

And certainly the outcomes for Albus Dumbledore and Severus Snape lose their rich texture.

If you object that I am confusing a novel with a graph, are you saying a graph cannot represent the development of information over time?*

That’s a fairly serious shortcoming for any information representation technique.

In stock trading, for example, when I “knew” your shaving lotion causes “purple pustules spelling PIMP” to break out on a user’s face would be critically important.

Did I know before or after I unloaded my shares in your company? 😉

A silly example, but it illustrates that “when” we know information can be very important.

Not to mention that “static” data is only an illusion of our information systems. Or rather information systems that don’t allow for tracking changing information.

Is your information system one of those?


* I’m in the camp that thinks graphs can represent the development of information over time. Depends on your use case whether you need the extra machinery that enables time-based views.

The granularity of time requirements varies when you are talking about Harry Potter versus the Divine Comedy versus leaks from the current White House.

In topic maps, the range of validity for an association was called its “scope.” Scope and time need more than one or two other posts.
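As a small illustration of the “extra machinery” a time-based view needs, here is a sketch in Python. It is neither the GraphGist's Cypher model nor a topic map implementation: the `Edge` class, the relation names, and the volume numbers attached to each edge are my own illustrative choices. Each relationship carries the volume in which a reader learns of it, and a view is simply the set of edges known by a given volume.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str
    target: str
    known_from: int  # volume in which the reader learns of the relation


# Illustrative data only; relation names and volume numbers are assumptions.
EDGES = [
    Edge("Harry", "friend_of", "Ron", 1),
    Edge("Harry", "friend_of", "Hermione", 1),
    Edge("Harry", "owns", "Hedwig", 1),
    Edge("Harry", "linked_to", "Voldemort", 7),
]


def view_as_of(edges, volume):
    """Return only the edges a reader knows about by the given volume."""
    return [edge for edge in edges if edge.known_from <= volume]


for edge in view_as_of(EDGES, 1):
    print(edge.source, edge.relation, edge.target)
```

A fuller treatment would give each edge both a start and an end of validity, and would let different characters hold different views at the same moment.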

Interesting times for literary theory

Sunday, August 4th, 2013

Interesting times for literary theory by Ted Underwood.

From the post:

(…)
This could be the beginning of a beautiful friendship. I realize a marriage between machine learning and literary theory sounds implausible: people who enjoy one of these things are pretty likely to believe the other is fraudulent and evil.** But after reading through a couple of ML textbooks,*** I’m convinced that literary theorists and computer scientists wrestle with similar problems, in ways that are at least loosely congruent. Neither field is interested in the mere accumulation of data; both are interested in understanding the way we think and the kinds of patterns we recognize in language. Both fields are interested in problems that lack a single correct answer, and have to be mapped in shades of gray (ML calls these shades “probability”). Both disciplines are preoccupied with the danger of overgeneralization (literary theorists call this “essentialism”; computer scientists call it “overfitting”). Instead of saying “every interpretation is based on some previous assumption,” computer scientists say “every model depends on some prior probability,” but there’s really a similar kind of self-scrutiny involved.
(…)

Computer science and the humanities could enrich each other greatly.

This could be a starting place for that enrichment.

The Sokal Hoax: At Whom Are We Laughing?

Sunday, May 26th, 2013

The Sokal Hoax: At Whom Are We Laughing? by Mara Beller.

The philosophical pronouncements of Bohr, Born, Heisenberg and Pauli deserve some of the blame for the excesses of the postmodernist critique of science.

The hoax perpetrated by New York University theoretical physicist Alan Sokal in 1996 on the editors of the journal Social Text quickly became widely known and hotly debated. (See Physics Today January 1997, page 61, and March 1997, page 73.) “Transgressing the Boundaries – Toward a Transformative Hermeneutics of Quantum Gravity,” was the title of the parody he slipped past the unsuspecting editors. [1]

Many readers of Sokal’s article characterized it as an ingenious exposure of the decline of the intellectual standards in contemporary academia, and as a brilliant parody of the postmodern nonsense rampant among the cultural studies of science. Sokal’s paper is variously, so we read, “a hilarious compilation of pomo gibberish”, “an imitation of academic babble”, and even “a transformative hermeneutics of total bullshit”. [2] Many scientists reported having “great fun” and “a great laugh” reading Sokal’s article. Yet whom, exactly, are we laughing at?

As telling examples of the views Sokal satirized, one might quote some other statements. Consider the following extrapolation of Heisenberg’s uncertainty and Bohr’s complementarity into the political realm:

“The thesis ‘light consists of particles’ and the antithesis ‘light consists of waves’ fought with one another until they were united in the synthesis of quantum mechanics. …Only why not apply it to the thesis Liberalism (or Capitalism), the antithesis Communism, and expect a synthesis, instead of a complete and permanent victory for the antithesis? There seems to be some inconsistency. But the idea of complementarity goes deeper. In fact, this thesis and antithesis represent two psychological motives and economic forces, both justified in themselves, but, in their extremes, mutually exclusive. …there must exist a relation between the latitudes of freedom df and of regulation dr, of the type df dr=p. …But what is the ‘political constant’ p? I must leave this to a future quantum theory of human affairs.”

Before you burst out laughing at such “absurdities,” let me disclose the author: Max Born, one of the venerated founding fathers of quantum theory [3]. Born’s words were not written tongue in cheek; he soberly declared that “epistemological lessons [from physics] may help towards a deeper understanding of social and political relations”. Such was Born’s enthusiasm to infer from the scientific to the political realm, that he devoted a whole book to the subject, unequivocally titled Physics and Politics [3].
(…)

A helpful illustration that poor or confused writing, accepted on the basis of “authority,” is not limited to the humanities.

The weakness of postmodernism does not lie exclusively in:

While publicly abstaining from criticizing Bohr, many of his contemporaries did not share his peculiar insistence on the impossibility of devising new nonclassical concepts – an insistence that put rigid strictures on the freedom to theorize. It is on this issue that the silence of other physicists had the most far-reaching consequences. This silence created and sustained the illusion that one needed no technical knowledge of quantum mechanics to fully comprehend its revolutionary epistemological lessons. Many postmodernist critics of science have fallen prey to this strategy of argumentation and freely proclaimed that physics itself irrevocably banished the notion of objective reality.

The question of “objective reality” can be answered only within some universe of discourse, such as quantum mechanics for example.

There are no reports of “objective reality” or “subjective reality” that do not originate from some human speaker situated in a cultural, social, epistemological, etc., context.

Postmodernists, Stanley Fish comes to mind, should have made the strong epistemological move of saying that all reports, of whatever nature, from literature to quantum mechanics, are reports situated in human context.

The rules for acceptable argument vary from one domain to another.

But there is no “out there” where anyone stands to judge between domains.

Should anyone lay claim to an “out there,” you should feel free to ask how they escaped the human condition of context?

And for what purpose do they claim an “out there?”

I suspect you will find they are trying to privilege some form of argumentation or to exclude other forms of argument.

That is a question of motive and not of some “out there.”

I first saw this at Pete Warden’s Five short links.

…2,958 Nineteenth-Century British Novels

Monday, March 18th, 2013

A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method by Ryan Heuser and Long Le-Khac.

From the introduction:

The nineteenth century in Britain saw tumultuous changes that reshaped the fabric of society and altered the course of modernization. It also saw the rise of the novel to the height of its cultural power as the most important literary form of the period. This paper reports on a long-term experiment in tracing such macroscopic changes in the novel during this crucial period. Specifically, we present findings on two interrelated transformations in novelistic language that reveal a systemic concretization in language and fundamental change in the social spaces of the novel. We show how these shifts have consequences for setting, characterization, and narration as well as implications for the responsiveness of the novel to the dramatic changes in British society.

This paper has a second strand as well. This project was simultaneously an experiment in developing quantitative and computational methods for tracing changes in literary language. We wanted to see how far quantifiable features such as word usage could be pushed toward the investigation of literary history. Could we leverage quantitative methods in ways that respect the nuance and complexity we value in the humanities? To this end, we present a second set of results, the techniques and methodological lessons gained in the course of designing and running this project.

This branch of the digital humanities, the macroscopic study of cultural history, is a field that is still constructing itself. The right methods and tools are not yet certain, which makes for the excitement and difficulty of the research. We found that such decisions about process cannot be made a priori, but emerge in the messy and non-linear process of working through the research, solving problems as they arise. From this comes the odd, narrative form of this paper, which aims to present the twists and turns of this process of literary and methodological insight. We have divided the paper into two major parts, the development of the methodology (Sections 1 through 3) and the story of our results (Sections 4 and 5). In actuality, these two processes occurred simultaneously; pursuing our literary-historical questions necessitated developing new methodologies. But for the sake of clarity, we present them as separate though intimately related strands.

If this sounds far afield from mining tweets, emails, corporate documents or government archives, can you articulate the difference?

Or do we reflexively treat some genres of texts as “different?”

How useful you will find some of the techniques outlined will depend on the purpose of your analysis.

If you are only doing key-word searching, this isn’t likely to be helpful.

If on the other hand, you are attempting more sophisticated analysis, read on!

I first saw this in Nat Torkington’s Four Short Links: 18 March 2013.

An Interactive Analysis of Tolkien’s Works

Wednesday, February 27th, 2013

An Interactive Analysis of Tolkien’s Works by Emil Johansson.

Description:

Being passionate about both Tolkien and data visualization creating an interactive analysis of Tolkien’s books seemed like a wonderful idea. To the left you will be able to explore character mentions and keyword frequency as well as sentiment analysis of the Silmarillion, the Hobbit and the Lord of the Rings. Information on editions of the books and methods used can be found in the about section.

There you will find:

WORD COUNT AND DENSITY
CHARACTER MENTIONS
KEYWORD FREQUENCY
COMMON WORDS
SENTIMENT ANALYSIS
CHARACTER CO-OCCURENCE
CHAPTER LENGTHS
WORD APPEARANCE
POSTERS

Truly remarkable analysis and visualization!

I suspect users of this portal don’t wonder so much about “how” it is done, but concentrate on the benefits it brings.

Does that sound like a marketing idea for topic maps?

I first saw this in the DashingD3js.com Weekly Newsletter.

The Xenbase literature curation process

Saturday, January 12th, 2013

The Xenbase literature curation process by Jeff B. Bowes, Kevin A. Snyder, Christina James-Zorn, Virgilio G. Ponferrada, Chris J. Jarabek, Kevin A. Burns, Bishnu Bhattacharyya, Aaron M. Zorn and Peter D. Vize.

Abstract:

Xenbase (www.xenbase.org) is the model organism database for Xenopus tropicalis and Xenopus laevis, two frog species used as model systems for developmental and cell biology. Xenbase curation processes centre on associating papers with genes and extracting gene expression patterns. Papers from PubMed with the keyword ‘Xenopus’ are imported into Xenbase and split into two curation tracks. In the first track, papers are automatically associated with genes and anatomy terms, images and captions are semi-automatically imported and gene expression patterns found in those images are manually annotated using controlled vocabularies. In the second track, full text of the same papers are downloaded and indexed by a number of controlled vocabularies and made available to users via the Textpresso search engine and text mining tool.

Which curation workflow will work best for your topic map activities will depend upon a number of factors.

What would you adopt, adapt or alter from the curation workflow in this article?

How would you evaluate the effectiveness of any of your changes?

Les Misérables [Visualized]

Thursday, January 10th, 2013

Novel Views: 4 Static Data Visualizations of the Novel Les Misérables by Andrew Vande Moere.

From the post:

Novel Views [neoformix.com], developed by Jeff Clarck, showcases 4 different visualizations of the text appearing in the novel Les Misérables, which itself spans about 48 books and 365 chapters.

The “Character Mentions” graphic shows where the names of the primary characters are mentioned within the text. The “Radial Word Connections” reveals the connections between the different terms used in the text. The words in the middle are connected using lines of the same color to the chapters where they are used. “Segment Word Clouds” is a small collection of small word clouds, where the size of a word reflects its frequency. Lastly, “Characteristic Verbs” provides an interpretation of the personalities and actions of each character, in that each character is listed with its most common terms and verbs.

Stunning graphics.

In this age of dynamic graphics, I wonder how the depictions would change on a chapter by chapter basis?

So a reader could see how their perception of a character is changing as the novel develops?
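A chapter-by-chapter view starts from per-chapter counts. The sketch below is not how Jeff Clark's graphics were produced; the function names are mine and the three “chapters” are short placeholder strings, not text from the novel. It counts mentions of a character in each chapter and keeps a running total, the sort of series a dynamic graphic could step through.

```python
from itertools import accumulate


def mentions_by_chapter(chapters, name):
    """Count how often a name appears in each chapter, in order."""
    return [chapter.count(name) for chapter in chapters]


def running_total(counts):
    """Cumulative mentions after each chapter."""
    return list(accumulate(counts))


# Placeholder chapters, invented for the example.
chapters = [
    "Valjean arrives. Valjean is turned away.",
    "The bishop welcomes him.",
    "Javert watches. Valjean leaves.",
]

per_chapter = mentions_by_chapter(chapters, "Valjean")
print(per_chapter)                 # [2, 0, 1]
print(running_total(per_chapter))  # [2, 2, 3]
```

Plain substring counting will miss pronouns and aliases, so a reader's changing perception of a character would need far richer features than this.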

Accelerating literature curation with text-mining tools:…

Monday, November 19th, 2012

Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts by Chih-Hsuan Wei, Bethany R. Harris, Donghui Li, Tanya Z. Berardini, Eva Huala, Hung-Yu Kao and Zhiyong Lu.

Abstract:

Today’s biomedical research has become heavily dependent on access to the biological knowledge encoded in expert curated biological databases. As the volume of biological literature grows rapidly, it becomes increasingly difficult for biocurators to keep up with the literature because manual curation is an expensive and time-consuming endeavour. Past research has suggested that computer-assisted curation can improve efficiency, but few text-mining systems have been formally evaluated in this regard. Through participation in the interactive text-mining track of the BioCreative 2012 workshop, we developed PubTator, a PubMed-like system that assists with two specific human curation tasks: document triage and bioconcept annotation. On the basis of evaluation results from two external user groups, we find that the accuracy of PubTator-assisted curation is comparable with that of manual curation and that PubTator can significantly increase human curatorial speed. These encouraging findings warrant further investigation with a larger number of publications to be annotated.

Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/

Presentation on PubTator (slides, PDF).

Hmmm, curating abstracts. That sounds like annotating subjects in documents doesn’t it? Or something very close. 😉

If we start off with a set of subjects, that eases topic map authoring because users are assisted by automatic creation of topic map machinery. Creation triggered by identification of subjects and associations.

Users don’t have to start with bare ground to build a topic map.

Clever users build (and sell) forms, frames, components and modules that serve as the scaffolding for other topic maps.

Data Curation in the Networked Humanities [Semantic Curation?]

Tuesday, October 16th, 2012

Data Curation in the Networked Humanities by Michael Ullyot.

From the post:

These talks are the first phase of Encoding Shakespeare: my SSHRC-funded project for the next three years. Between now and 2015, I’m working to improve the automated encoding of early modern English texts, to enable text analysis.

This post’s three parts are brought to you by the letter p. First I outline the potential of algorithmic text analysis; then the problem of messy data; and finally the protocols for a networked-humanities data curation system.

This third part is the most tentative, as of this writing; Fall 2012 is about defining my protocols and identifying which tags the most text-analysis engines require for the best results — whatever that entails. (So I welcome your comments and resource links.)

A project that promises to touch on many of the issues in modern digital humanities. Do review and contribute if possible.

I have a lingering uneasiness with the notion of “data curation.” With the “data” part, not the “curation” part.

To say “data curation” implies we can identify the “data” that merits curation.

I don’t doubt we can identify some data that needs curation. The question is whether it is the only data that merits curation.

We know from the early textual history of the Bible that the text was curated and in that process, variant traditions and entire works were lost.

Just my take on it but rather than “data curation,” with the implication of a “correct” text, we need semantic curation.

Semantic curation attempts to preserve the semantics we see in a text, without attempting to find the correct semantics.

Wolfram Plays In Streets of Shakespeare’s London

Monday, April 23rd, 2012

I should have been glad to read: To Compute or Not to Compute—Wolfram|Alpha Analyzes Shakespeare’s Plays. Promoting Shakespeare has to be a first for Wolfram.

But the post reports word counts, unique words, and similar measures as master strokes of engineering, all things familiar since SNOBOL and before. And then makes this “bold” suggestion:

Asking Wolfram|Alpha for information about specific characters is where things really begin to get interesting. We took the dialog from each play and organized them into dialog timelines that show when each character talks within a specific play. For example, if you look at the dialog timeline of Julius Caesar, you’ll notice that Brutus and Cassius have steady dialog throughout the whole play, but Caesar’s dialog stops about halfway through. I wonder why that is?

That sort of analysis was old hat in the 1980s.
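To show how modest the underlying computation is, here is a sketch of a dialog timeline in Python. It is not Wolfram|Alpha's implementation: the function names are mine, and the speeches use placeholder speakers “A,” “B,” and “C” rather than lines from any play. Given speeches in order, it records the positions at which each speaker talks and how far through the play each speaker's last speech falls.

```python
def dialog_timeline(speeches):
    """Map each speaker to the 0-based positions at which they speak."""
    timeline = {}
    for position, (speaker, _line) in enumerate(speeches):
        timeline.setdefault(speaker, []).append(position)
    return timeline


def last_spoken_fraction(timeline, total):
    """For each speaker, the fraction of the play elapsed at their last speech."""
    return {
        speaker: (positions[-1] + 1) / total
        for speaker, positions in timeline.items()
    }


# Placeholder speeches: (speaker, line) pairs in playing order.
speeches = [
    ("A", "..."), ("B", "..."), ("A", "..."),
    ("C", "..."), ("B", "..."), ("B", "..."),
]

timeline = dialog_timeline(speeches)
print(timeline)  # {'A': [0, 2], 'B': [1, 4, 5], 'C': [3]}
print(last_spoken_fraction(timeline, len(speeches)))
```

A speaker whose fraction is about one half is one whose dialog “stops about halfway through.”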

Wolfram needs to catch up on the history of literary and linguistic computing rather than repeating it.

The back issues of Computational Linguistics or Literary and Linguistic Computing should help in that regard. To say nothing of Shakespeare, Computers, and the Mystery of Authorship and similar works.

On digital humanities projects in general, see: Digital Humanities Spotlight: 7 Important Digitization Projects by Maria Popova, for a small sample.