Apache Tika – a content analysis toolkit

Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 2, 2010

Filed under: Authoring Topic Maps, Data Mining, Software — Patrick Durusau @ 7:57 pm

From the Apache Tika website:

Apache Tika™ is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Formats include:

HyperText Markup Language

XML and derived formats

Microsoft Office document format

OpenDocument Format

Portable Document Format

Electronic Publication Format

Rich Text Format

Compression and packaging formats

Text formats

Audio formats

Image formats

Video formats

Java class files and archives

The mbox format

Sounds like we are getting close to pipelines for topic map production.
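
To make the pipeline idea concrete, here is a minimal sketch of the step that would follow extraction: turning per-document metadata into merged topics. It assumes the metadata has already been pulled out by Tika or a similar tool. The sample records, the field names ("resource", "creator", "format"), and the helpers subject_key and build_topics are all hypothetical and invented for illustration, and the identity rule is deliberately a toy.

```python
from collections import defaultdict

# Hypothetical extractor output: one dict of metadata per document.
# In a real pipeline these would come from Tika or a similar tool;
# the keys and values below are invented for illustration.
EXTRACTED = [
    {"resource": "a.pdf", "creator": "Patrick Durusau", "format": "application/pdf"},
    {"resource": "b.html", "creator": "patrick durusau ", "format": "text/html"},
    {"resource": "c.mp3", "creator": "Marijane White", "format": "audio/mpeg"},
]


def subject_key(name):
    """Toy identity rule: two names denote the same subject if they
    match after whitespace normalization and case folding."""
    return " ".join(name.split()).casefold()


def build_topics(records):
    """Merge records into topics keyed by creator, collecting the
    name variants seen and the resources each creator occurs in."""
    topics = defaultdict(lambda: {"names": set(), "occurrences": []})
    for record in records:
        topic = topics[subject_key(record["creator"])]
        topic["names"].add(record["creator"].strip())
        topic["occurrences"].append(record["resource"])
    return dict(topics)


topics = build_topics(EXTRACTED)
```

The extraction half is the part a toolkit can automate. The identity rule in subject_key is where the human analysis comes in, and a real topic map would need something far more careful than case folding.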

Comments?

4 Comments

I played with Tika about a year ago to try extracting MP3 metadata into a search index. It didn’t meet my needs, but it seems like a really promising project. I hadn’t thought about its potential for topic map production, probably because I haven’t completely figured out how to automate topic map production yet.

Comment by Marijane White — December 2, 2010 @ 9:01 pm

Marijane, well, depends on what part you are trying to automate. 😉

The analysis part, what makes one topic map useful and another less so, that can be augmented but ultimately is a human task.

The syntax/encoding part depends on both the syntactic and the semantic variability of the text. You could have a text that is very regular from a syntax perspective but has wildly varying semantics.

Or uniform semantics with varying syntax. And mixtures in between.

I will try to find some time to play with it between now and the end of the year.

Hope you are looking forward to the holiday season!

Comment by Patrick Durusau — December 2, 2010 @ 9:13 pm

We have been looking closely at the Tika project with Wandora integration in mind. So far the set of supported formats overlaps the formats supported by Wandora extractors, but we are waiting.

Comment by Aki — December 3, 2010 @ 5:52 am
