ReadMe: Software for Automated Content Analysis by Daniel Hopkins, Gary King, Matthew Knowles, and Steven Melendez.
From the homepage:
The ReadMe software package for R takes as input a set of text documents (such as speeches, blog posts, newspaper articles, judicial opinions, movie reviews, etc.), a categorization scheme chosen by the user (e.g., ordered positive to negative sentiment ratings, unordered policy topics, or any other mutually exclusive and exhaustive set of categories), and a small subset of text documents hand classified into the given categories. If used properly, ReadMe will report, normally within sampling error of the truth, the proportion of documents within each of the given categories among those not hand coded. ReadMe computes quantities of interest to the scientific community based on the distribution within categories but does so by skipping the more error prone intermediate step of classifying individual documents. Other procedures are also included to make processing text easy.
Handy in case you tire of hand tagging documents before feeding them into a topic map.
Quite interesting even if it doesn’t address the primary weaknesses in semantic annotation.
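To make the "proportions without classifying individual documents" idea concrete, here is a toy sketch in Python. It is not ReadMe's implementation and makes several simplifying assumptions of my own: only two categories, a hand-picked vocabulary, and word-presence rates as the only features. It treats the unlabeled word rates as a mixture of the per-category rates seen in the hand-coded set and solves for the mixing proportion by least squares. The function names are hypothetical.

```python
def feature_rates(docs, vocab):
    """Fraction of documents that contain each vocabulary word."""
    n = len(docs)
    return [sum(1 for d in docs if w in d.lower().split()) / n for w in vocab]


def estimate_proportion(labeled, unlabeled, vocab, positive):
    """Estimate the share of `positive` documents among `unlabeled`.

    Assumes word rates within a category are the same in the hand-coded
    and unlabeled sets, so that for each word j:
        rate_unlabeled[j] = p * rate_pos[j] + (1 - p) * rate_neg[j]
    No individual document is ever assigned a category.
    """
    pos = [text for text, label in labeled if label == positive]
    neg = [text for text, label in labeled if label != positive]
    a1 = feature_rates(pos, vocab)
    a0 = feature_rates(neg, vocab)
    b = feature_rates(unlabeled, vocab)

    # Closed-form least-squares solution for the single unknown p.
    num = sum((bj - a0j) * (a1j - a0j) for bj, a0j, a1j in zip(b, a0, a1))
    den = sum((a1j - a0j) ** 2 for a0j, a1j in zip(a0, a1))
    if den == 0:
        raise ValueError("vocabulary does not distinguish the categories")
    return min(1.0, max(0.0, num / den))  # clip to a valid proportion


labeled = [
    ("good great", "pos"), ("good fine", "pos"),
    ("bad awful", "neg"), ("bad poor", "neg"),
]
unlabeled = ["good movie", "good show", "good one", "bad film"]
share = estimate_proportion(labeled, unlabeled, ["good", "bad"], "pos")
print(f"estimated positive share: {share:.2f}")
```

The estimate is only as good as the assumption that words are used the same way within a category in both sets, which is presumably part of what "if used properly" covers.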
Semantic annotation presently is:
- after the fact, and
- removed from the author (who I presume knew what they meant).
Rather than ranting at the mountain of legacy data as too complex, large, difficult, etc., to adequately annotate, why not turn our attention to the present day creation of data?
Imagine if every copy of MS™ Word and OpenOffice did something as simple as inserting, in every document it produced today, a metadata pointer to a vocabulary for that document. There could even be defaults for all the documents created by particular offices or divisions. Then, when search engines index those documents, they could use the declared vocabularies for search and disambiguation purposes.
ODF 1.2 already has that capacity and one hopes MS™ would follow that lead and use the same technique to avoid creating extra work for search engines.
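As a rough sketch of what a consumer of such a pointer might do, the Python below reads a declared vocabulary out of an ODF package's meta.xml. The assumptions are mine: the pointer is stored as a user-defined metadata field, and the field name "vocabulary" is hypothetical, not a convention defined by ODF. ODF 1.2 also supports RDF-based metadata, which may be the better home for this; the user-defined field is used here only because it is the simplest to show.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
VOCAB_FIELD = "vocabulary"  # hypothetical field name, not defined by ODF


def declared_vocabulary(odf_file):
    """Return the vocabulary pointer declared in meta.xml, or None."""
    with zipfile.ZipFile(odf_file) as package:
        root = ET.fromstring(package.read("meta.xml"))
    for field in root.iter(f"{{{META_NS}}}user-defined"):
        if field.get(f"{{{META_NS}}}name") == VOCAB_FIELD:
            return (field.text or "").strip() or None
    return None


def demo_package(vocab_url):
    """Build a stripped-down, in-memory stand-in for an ODF file.

    A real ODF package also carries a mimetype entry, a manifest, and
    content; only meta.xml is included here for illustration.
    """
    meta = (
        f'<office:document-meta xmlns:office="{OFFICE_NS}" '
        f'xmlns:meta="{META_NS}"><office:meta>'
        f'<meta:user-defined meta:name="{VOCAB_FIELD}">{vocab_url}'
        f'</meta:user-defined></office:meta></office:document-meta>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("meta.xml", meta)
    buffer.seek(0)
    return buffer


pointer = declared_vocabulary(demo_package("http://example.org/vocab/legal"))
print(pointer)
```

A search engine could run something like this at indexing time and fall back to its usual guesswork when no vocabulary is declared.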
That would not cover all data, and would not even fully annotate all the data in those documents.
But it would be a start towards creating smarter documents, and creating them at the outset, at the instigation of their authors. The people who cared enough to author them are much better choices to declare their meanings.
As we develop better techniques, such as ReadMe, and/or when the ROI is there, we can then address legacy data issues.