Archive for the ‘Literature’ Category

Accelerating literature curation with text-mining tools:…

Monday, November 19th, 2012

Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts by Chih-Hsuan Wei, Bethany R. Harris, Donghui Li, Tanya Z. Berardini, Eva Huala, Hung-Yu Kao and Zhiyong Lu.

Abstract:

Today’s biomedical research has become heavily dependent on access to the biological knowledge encoded in expert curated biological databases. As the volume of biological literature grows rapidly, it becomes increasingly difficult for biocurators to keep up with the literature because manual curation is an expensive and time-consuming endeavour. Past research has suggested that computer-assisted curation can improve efficiency, but few text-mining systems have been formally evaluated in this regard. Through participation in the interactive text-mining track of the BioCreative 2012 workshop, we developed PubTator, a PubMed-like system that assists with two specific human curation tasks: document triage and bioconcept annotation. On the basis of evaluation results from two external user groups, we find that the accuracy of PubTator-assisted curation is comparable with that of manual curation and that PubTator can significantly increase human curatorial speed. These encouraging findings warrant further investigation with a larger number of publications to be annotated.

Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/

Presentation on PubTator (slides, PDF).

Hmmm, curating abstracts. That sounds like annotating subjects in documents doesn’t it? Or something very close. 😉

If we start off with a set of subjects, that eases topic map authoring because users are assisted by automatic creation of topic map machinery. Creation triggered by identification of subjects and associations.

Users don’t have to start with bare ground to build a topic map.

Clever users build (and sell) forms, frames, components and modules that serve as the scaffolding for other topic maps.

Data Curation in the Networked Humanities [Semantic Curation?]

Tuesday, October 16th, 2012

Data Curation in the Networked Humanities by Michael Ullyot.

From the post:

These talks are the first phase of Encoding Shakespeare: my SSHRC-funded project for the next three years. Between now and 2015, I’m working to improve the automated encoding of early modern English texts, to enable text analysis.

This post’s three parts are brought to you by the letter p. First I outline the potential of algorithmic text analysis; then the problem of messy data; and finally the protocols for a networked-humanities data curation system.

This third part is the most tentative, as of this writing; Fall 2012 is about defining my protocols and identifying which tags the most text-analysis engines require for the best results — whatever that entails. (So I welcome your comments and resource links.)

A project that promises to touch on many of the issues in modern digital humanities. Do review and contribute if possible.

I have a lingering uneasiness with the notion of “data curation.” With the data and not curation part.

To say “data curation” implies we can identify the “data” that merits curation.

I don’t doubt we can identify some data that needs curation. The question being is it the only data that merits curation?

We know from the early textual history of the Bible that the text was curated and in that process, variant traditions and entire works were lost.

Just my take on it but rather than “data curation,” with the implication of a “correct” text, we need semantic curation.

Semantic curation attempts to preserve the semantics we see in a text, without attempting to find the correct semantics.

Wolfram Plays In Streets of Shakespeare’s London

Monday, April 23rd, 2012

I should have been glad to read: To Compute or Not to Compute—Wolfram|Alpha Analyzes Shakespeare’s Plays. Promoting Shakespeare has to be a first for Wolfram.

But the post reports word counts, unique words, and similar measures as master strokes of engineering, all things familiar since SNOBOL and before. And then makes this “bold” suggestion:

Asking Wolfram|Alpha for information about specific characters is where things really begin to get interesting. We took the dialog from each play and organized them into dialog timelines that show when each character talks within a specific play. For example, if you look at the dialog timeline of Julius Caesar, you’ll notice that Brutus and Cassius have steady dialog throughout the whole play, but Caesar’s dialog stops about halfway through. I wonder why that is?

That sort of analysis was old hat in the 1980’s.

Wolfram needs to catch up on the history of literary and linguistic computing rather than repeating it.

The back issues of Computational Linguistics or Literary and Linguistic Computing should help in that regard. To say nothing of Shakespeare, Computers, and the Mystery of Authorship and similar works.

On digital humanities projects in general, see: Digital Humanities Spotlight: 7 Important Digitization Projects by Maria Popova, for a small sample.