Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

May 13, 2012

Tika – A content analysis toolkit

Filed under: Content Analysis,Tika — Patrick Durusau @ 9:59 pm

Tika – A content analysis toolkit

From the webpage:

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. You can find the latest release on the download page. See the Getting Started guide for instructions on how to start using Tika.

From the supported formats page:

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document formats
  • OpenDocument Format
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Audio formats
  • Image formats
  • Video formats
  • Java class files and archives
  • The mbox format

One suspects that even the vastness of “dark data” has a finite number of formats.

Tika may not cover all of them, but perhaps enough to get you started.
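
If you want a quick feel for what Tika hands back, here is a minimal sketch. Tika itself is a Java toolkit; the sketch assumes the third-party tika-python wrapper (pip install tika), which downloads and talks to a local Tika server and therefore needs Java on the machine. The file name is just a placeholder.

    from tika import parser   # third-party wrapper around Apache Tika

    def extract(path):
        """Return (metadata, plain_text) for any format Tika can parse."""
        parsed = parser.from_file(path)   # format is auto-detected
        return parsed.get("metadata"), parsed.get("content")

    if __name__ == "__main__":
        meta, text = extract("report.pdf")    # placeholder file name
        print(meta.get("Content-Type"))       # e.g. application/pdf
        print((text or "")[:500])             # first 500 characters of extracted text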

January 6, 2012

ReadMe: Software for Automated Content Analysis

Filed under: Content Analysis,ReadMe — Patrick Durusau @ 11:39 am

ReadMe: Software for Automated Content Analysis by Daniel Hopkins, Gary King, Matthew Knowles, and Steven Melendez.

From the homepage:

The ReadMe software package for R takes as input a set of text documents (such as speeches, blog posts, newspaper articles, judicial opinions, movie reviews, etc.), a categorization scheme chosen by the user (e.g., ordered positive to negative sentiment ratings, unordered policy topics, or any other mutually exclusive and exhaustive set of categories), and a small subset of text documents hand classified into the given categories. If used properly, ReadMe will report, normally within sampling error of the truth, the proportion of documents within each of the given categories among those not hand coded. ReadMe computes quantities of interest to the scientific community based on the distribution within categories but does so by skipping the more error prone intermediate step of classifying individual documents. Other procedures are also included to make processing text easy.
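
ReadMe itself ships as an R package, but the core idea (report the distribution over categories directly, rather than classifying documents one by one) is easy to sketch. The Python sketch below is illustrative only: it assumes binary word-stem feature matrices and uses a plain non-negative least squares solve, not ReadMe’s actual estimator or API.

    import numpy as np
    from scipy.optimize import nnls

    def estimate_proportions(X_labeled, y_labeled, X_unlabeled, n_categories):
        """Estimate P(category) in the uncoded corpus without classifying
        individual documents (a rough sketch of the Hopkins/King idea)."""
        y_labeled = np.asarray(y_labeled)

        # Index every distinct binary word-stem profile seen in either set.
        all_docs = np.vstack([X_labeled, X_unlabeled])
        _, inverse = np.unique(all_docs, axis=0, return_inverse=True)
        inverse = inverse.ravel()
        n_profiles = inverse.max() + 1
        lab = inverse[: len(X_labeled)]
        unl = inverse[len(X_labeled):]

        # P(profile | category), estimated from the hand-coded subset.
        P_s_given_c = np.zeros((n_profiles, n_categories))
        for c in range(n_categories):
            rows = lab[y_labeled == c]
            if len(rows):
                P_s_given_c[:, c] = np.bincount(rows, minlength=n_profiles) / len(rows)

        # P(profile), observed in the uncoded corpus.
        P_s = np.bincount(unl, minlength=n_profiles) / len(unl)

        # Solve P(s) = sum_c P(s | c) * P(c) for non-negative P(c), then normalize.
        p_c, _ = nnls(P_s_given_c, P_s)
        return p_c / p_c.sum()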

Just in case you tire of hand tagging documents before processing them further and feeding them into a topic map.

Quite interesting even if it doesn’t address the primary weaknesses in semantic annotation.

Semantic annotation presently is:

  1. after the fact, and
  2. removed from the author (who I presume knew what they meant).

Rather than ranting that the mountain of legacy data is too complex, too large, too difficult, etc., to adequately annotate, why not turn our attention to the present-day creation of data?

Imagine if every copy of MS™ Word and OpenOffice, for every document produced today, did something as simple as insert a metadata pointer to a vocabulary for that document. There could even be defaults for all the documents created by a particular office or division. Then, when search engines index those documents, they could use the declared vocabularies for search and disambiguation purposes.

ODF 1.2 already has that capacity, and one hopes MS™ will follow that lead and use the same technique, to avoid creating extra work for search engines.
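
For the curious, here is a minimal Python sketch of what reading such a declared vocabulary could look like, using the generic meta:user-defined fields that ODF packages already carry. The “Vocabulary” field name is my own invention, not part of the standard, and ODF 1.2’s richer RDF metadata would be the more natural home for it.

    import zipfile
    import xml.etree.ElementTree as ET

    META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"

    def declared_vocabulary(odf_path):
        """Return the value of a user-defined 'Vocabulary' field, if the
        document declares one. (The field name is hypothetical.)"""
        with zipfile.ZipFile(odf_path) as pkg:
            root = ET.fromstring(pkg.read("meta.xml"))
        for field in root.iter(f"{{{META_NS}}}user-defined"):
            if field.get(f"{{{META_NS}}}name") == "Vocabulary":
                return (field.text or "").strip()
        return None

    # e.g. declared_vocabulary("report.odt") -> "http://example.com/vocab/finance"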

It would not cover all data, and it would not even fully annotate all the data in those documents.

But it would be a start towards creating smarter documents at the outset, at the instigation of their authors. The people who cared enough to write them are much better placed to declare their meanings.

As better techniques such as ReadMe develop, and/or when the ROI is there, we can then turn to the legacy data issues.


December 17, 2011

Content Analysis

Filed under: Content Analysis,Law - Sources,Legal Informatics,Text Analytics — Patrick Durusau @ 6:33 am

Content Analysis by Michael Heise.

From the post:

Dan Katz (MSU) let me know about the beta release of a new website, Legal Language Explorer, that will likely interest anyone who does content analysis, as well as those looking for a neat (and, according to Jason Mazzone, addictive) toy to burn some time. The site, according to Dan, allows users: “the chance [free of charge] to search the history of the United States Supreme Court (1791-2005) for any phrase and get a frequency plot and the full text case results for that phrase.” Dan also reports that the developers hope to expand coverage beyond Supreme Court decisions in the future.

The site needs a “For Amusement Only” sticker. Legal language changes over time, probably nowhere more so than in Supreme Court decisions.

It was a standing joke in law school that the bar association sponsored the “Avoid Probate” sort of books. If you really want to incur legal fees, just try self-help. The same is true for this site. Use it to argue with your friends, settle bets during football games, etc. Don’t rely on it during nighttime, roadside encounters with folks carrying weapons and radios to summon help (the police).
