Archive for the ‘Content Analysis’ Category

Tika – A content analysis toolkit

Sunday, May 13th, 2012

From the webpage:

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. You can find the latest release on the download page. See the Getting Started guide for instructions on how to start using Tika.

From the supported formats page:

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document formats
  • OpenDocument Format
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Audio formats
  • Image formats
  • Video formats
  • Java class files and archives
  • The mbox format

One suspects that even the vastness of “dark data” has a finite number of formats.

Tika may not cover all of them, but perhaps enough to get you started.
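For a feel of what "detects" involves, here is a minimal Python sketch of one detection technique: matching the leading "magic" bytes of a file. It is a toy and not Tika's API. The function name `detect_media_type` and the short signature table are assumptions made for illustration only.

```python
# Toy magic-byte detector. NOT Tika's API: names and table are illustrative.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"{\\rtf", "application/rtf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xca\xfe\xba\xbe", "application/java-vm"),  # compiled Java class file
    (b"PK\x03\x04", "application/zip"),  # also the container for ODF, EPUB, JAR
    (b"From ", "application/mbox"),
]


def detect_media_type(data: bytes) -> str:
    """Return a media type guessed from the first bytes of a file."""
    for signature, media_type in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return media_type
    # Nothing matched: fall back to the generic binary type.
    return "application/octet-stream"
```

A zip signature alone cannot tell an OpenDocument file from an EPUB or a Java archive, so a real detector has to look inside the container as well.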

ReadMe: Software for Automated Content Analysis

Friday, January 6th, 2012

ReadMe: Software for Automated Content Analysis by Daniel Hopkins, Gary King, Matthew Knowles, and Steven Melendez.

From the homepage:

The ReadMe software package for R takes as input a set of text documents (such as speeches, blog posts, newspaper articles, judicial opinions, movie reviews, etc.), a categorization scheme chosen by the user (e.g., ordered positive to negative sentiment ratings, unordered policy topics, or any other mutually exclusive and exhaustive set of categories), and a small subset of text documents hand classified into the given categories. If used properly, ReadMe will report, normally within sampling error of the truth, the proportion of documents within each of the given categories among those not hand coded. ReadMe computes quantities of interest to the scientific community based on the distribution within categories but does so by skipping the more error prone intermediate step of classifying individual documents. Other procedures are also included to make processing text easy.
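To make "skipping the intermediate step" concrete, here is a simplified two-category sketch in Python. It is not the ReadMe package or its API, and the quoted text does not say that ReadMe works exactly this way. The assumption is that the unlabeled documents' distribution of keyword profiles is a mixture of the two hand-coded distributions, so the mixing weight can be solved for by least squares without labeling any single document. All function names and the keyword-presence profile are hypothetical.

```python
from collections import Counter


def profile(text, keywords):
    """Reduce a document to which keywords it contains."""
    words = set(text.lower().split())
    return tuple(k in words for k in keywords)


def profile_distribution(texts, keywords):
    """Share of documents showing each keyword profile."""
    counts = Counter(profile(t, keywords) for t in texts)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}


def estimate_proportion(labeled_a, labeled_b, unlabeled, keywords):
    """Estimate the share of category A among the unlabeled documents.

    Solves P_unlabeled(s) ~ p * P_a(s) + (1 - p) * P_b(s) for p by
    least squares over all observed profiles s.
    """
    pa = profile_distribution(labeled_a, keywords)
    pb = profile_distribution(labeled_b, keywords)
    pu = profile_distribution(unlabeled, keywords)
    num = den = 0.0
    for s in set(pa) | set(pb) | set(pu):
        diff = pa.get(s, 0.0) - pb.get(s, 0.0)
        num += (pu.get(s, 0.0) - pb.get(s, 0.0)) * diff
        den += diff * diff
    if den == 0.0:
        raise ValueError("hand-coded categories are indistinguishable")
    # Clip to a valid proportion.
    return min(1.0, max(0.0, num / den))
```

The estimate is only as good as the assumption that the hand-coded documents use language the same way the uncoded ones do within each category.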

ReadMe is worth a look in case you tire of hand-tagging documents before processing them further to feed into a topic map.

Quite interesting even if it doesn’t address the primary weaknesses in semantic annotation.

Semantic annotation presently is:

  1. after the fact, and
  2. removed from the author (who I presume knew what they meant).

Rather than ranting at the mountain of legacy data as too complex, large, difficult, etc., to adequately annotate, why not turn our attention to the present-day creation of data?

Imagine if every copy of MS™ Word and OpenOffice, for every document produced today, did something as simple as inserting a metadata pointer to a vocabulary for that document. There could even be defaults for all the documents created by particular offices or divisions. Then, when search engines search those documents, they could use the declared vocabularies for search and disambiguation.

ODF 1.2 already has that capacity and one hopes MS™ would follow that lead and use the same technique to avoid creating extra work for search engines.
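A metadata pointer of this kind can be very small. The Python sketch below stamps a vocabulary URI into a generic XML document and reads it back. It does not reproduce ODF 1.2's actual metadata mechanism: the `head`/`meta` element layout, the `vocabulary` name, the function names, and the example.org URI are all hypothetical, chosen only to show the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-office defaults, as suggested in the text above.
OFFICE_DEFAULTS = {"legal": "http://example.org/vocab/legal"}


def declare_vocabulary(document_xml: str, vocabulary_uri: str) -> str:
    """Return the document with a single vocabulary pointer in its head."""
    root = ET.fromstring(document_xml)
    head = root.find("head")
    if head is None:
        head = ET.Element("head")
        root.insert(0, head)
    # Drop any earlier declaration so the pointer stays unambiguous.
    for meta in head.findall("meta"):
        if meta.get("name") == "vocabulary":
            head.remove(meta)
    ET.SubElement(head, "meta", {"name": "vocabulary", "content": vocabulary_uri})
    return ET.tostring(root, encoding="unicode")


def read_vocabulary(document_xml: str):
    """Return the declared vocabulary URI, or None if there is none."""
    root = ET.fromstring(document_xml)
    for meta in root.iter("meta"):
        if meta.get("name") == "vocabulary":
            return meta.get("content")
    return None
```

A search engine that finds the pointer can load the vocabulary once and use it for every term in the document; one that finds nothing is no worse off than it is today.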

Declared vocabularies would not cover all data, and would not even fully annotate all the data in those documents.

But it would be a start towards creating smarter documents, and creating them at the outset, at the instigation of their authors. The people who cared enough to author them are much better choices to declare their meanings.

As we develop better techniques, such as ReadMe, and/or when ROI is present, we can then address legacy data issues.

Content Analysis

Saturday, December 17th, 2011

Content Analysis by Michael Heise.

From the post:

Dan Katz (MSU) let me know about a beta release of a new website, Legal Language Explorer, that will likely interest anyone who does content analysis as well as those looking for a neat (and, according to Jason Mazzone, addictive) toy to burn some time. The site, according to Dan, allows users: “the chance [free of charge] to search the history of the United States Supreme Court (1791-2005) for any phrase and get a frequency plot and the full text case results for that phrase.” Dan also reports that the developers hope to expand coverage beyond Supreme Court decisions in the future.
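The counting behind a frequency plot like that is simple to sketch. The Python below is not the site's implementation, which the post does not describe. It assumes a hypothetical corpus supplied as `(year, text)` pairs and a made-up function name, and it returns the per-year counts a plot would be drawn from.

```python
from collections import Counter


def phrase_frequency_by_year(opinions, phrase):
    """Count case-insensitive occurrences of a phrase per year.

    `opinions` is an iterable of (year, text) pairs. Runs of whitespace
    are collapsed so line breaks inside a phrase do not hide a match.
    """
    needle = " ".join(phrase.lower().split())
    counts = Counter()
    for year, text in opinions:
        normalized = " ".join(text.lower().split())
        # Years with no match still get an entry of zero.
        counts[year] += normalized.count(needle)
    return dict(sorted(counts.items()))
```

Matching on the literal string is exactly why the results deserve caution: the same words in an 1805 opinion and a 2005 opinion may not mean the same thing.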

The site needs a For Amusement Only sticker. Legal language changes over time and probably no place more so than in Supreme Court decisions.

It was a standing joke in law school that the bar association sponsored the “Avoid Probate” sort of books. If you really want to incur legal fees, just try self-help. The same is true for this site. Use it to argue with your friends, settle bets during football games, etc. Don’t rely on it during nighttime, roadside encounters with folks carrying weapons and radios to summon help (police).