August 19, 2011

MONK

Filed under: Data Mining,Digital Library,Semantics,Text Analytics — Patrick Durusau @ 8:32 pm

From the Introduction:

The MONK Project provides access to the digitized texts described above along with tools to enable literary research through the discovery, exploration, and visualization of patterns. Users typically start a project with one of the toolsets that has been predefined by the MONK team. Each toolset is made up of individual tools (e.g., a search tool, a browsing tool, a rating tool, and a visualization), and these tools are applied to worksets of texts selected by the user from the MONK datastore. Worksets and results can be saved for later use or modification, and results can be exported in some standard formats (e.g., CSV files).

The public data set:

This instance of the MONK Project includes approximately 525 works of American literature from the 18th and 19th centuries, and 37 plays and 5 works of poetry by William Shakespeare provided by the scholars and libraries at Northwestern University, Indiana University, the University of North Carolina at Chapel Hill, and the University of Virginia. These texts are available to all users, regardless of institutional affiliation.

Digging a bit further:

Each of these texts is normalized (using Abbot, a complex XSL stylesheet) to a TEI schema designed for analytic purposes (TEI-A), and each text has been “adorned” (using MorphAdorner) with tokenization, sentence boundaries, standard spellings, parts of speech and lemmata, before being ingested (using Prior) into a database that provides Java access methods for extracting data for many purposes, including searching for objects; direct presentation in end-user applications as tables, lists, concordances, or visualizations; getting feature counts and frequencies for analysis by data-mining and other analytic procedures; and getting tokenized streams of text for working with n-gram and other collocation analyses, repetition analyses, and corpus query-language pattern-matching operations. Finally, MONK’s quantitative analytics (naive Bayesian analysis, support vector machines, Dunning’s log likelihood, and raw frequency comparisons) are run through the SEASR environment.
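Of the analytics MONK names, Dunning’s log likelihood is worth a quick illustration: it asks whether a word occurs significantly more often in one corpus (say, a workset of Shakespeare plays) than in a reference corpus. The sketch below is not MONK’s code, just a minimal Python rendering of the usual two-corpus G² formulation; the counts in the example are invented for illustration.

    import math

    def dunning_g2(count_a, count_b, total_a, total_b):
        # Dunning's log-likelihood (G2) for one word observed
        # count_a times in a corpus of total_a tokens and
        # count_b times in a corpus of total_b tokens.
        expected_a = total_a * (count_a + count_b) / (total_a + total_b)
        expected_b = total_b * (count_a + count_b) / (total_a + total_b)
        g2 = 0.0
        if count_a > 0:
            g2 += count_a * math.log(count_a / expected_a)
        if count_b > 0:
            g2 += count_b * math.log(count_b / expected_b)
        return 2.0 * g2

    # Invented example: a word appearing 120 times in a 50,000-token
    # workset versus 30 times in a 200,000-token reference corpus.
    print(round(dunning_g2(120, 30, 50_000, 200_000), 2))  # about 249.53

A large G², as here, flags the word as strongly overrepresented in the first corpus; MONK exposes this kind of comparison through its predefined toolsets rather than through raw code.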

Here’s my topic maps question: how do I reliably combine results from a subfield that uses a different vocabulary than my own? For that matter, how do I discover that work in the first place?

I think the MONK project is quite remarkable, but I lament the impending repetition of research across such a vast archive simply because prior work is unknown or expressed in a “foreign” tongue.
