Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 13, 2012

Tika – A content analysis toolkit

Filed under: Content Analysis,Tika — Patrick Durusau @ 9:59 pm

From the webpage:

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. You can find the latest release on the download page. See the Getting Started guide for instructions on how to start using Tika.

From the supported formats page:

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document formats
  • OpenDocument Format
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Audio formats
  • Image formats
  • Video formats
  • Java class files and archives
  • The mbox format

One suspects that even the vastness of “dark data” has a finite number of formats.

Tika may not cover all of them, but perhaps enough to get you started.
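
To make the "detects" part of that description a little more concrete, here is a minimal sketch of one common detection idea: matching the leading "magic" bytes of a file against known signatures. This is not Tika's API and not its detection tables. The `MAGIC_SIGNATURES` table, the `detect_type` function, and the type labels are hypothetical names chosen for illustration, and the sketch covers only a handful of the format families listed above.

```python
import codecs

# Hypothetical illustration only: NOT Tika's API or its signature database.
# Each entry pairs a leading byte sequence with an illustrative type label.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),            # Portable Document Format
    (b"PK\x03\x04", "application/zip"),       # ZIP container (also wraps OOXML, ODF, EPUB, JAR)
    (b"\x89PNG\r\n\x1a\n", "image/png"),      # PNG image
    (b"\xff\xd8\xff", "image/jpeg"),          # JPEG image
    (b"{\\rtf", "application/rtf"),           # Rich Text Format
    (b"\xca\xfe\xba\xbe", "application/java-vm"),  # Java class file (label is an assumption)
]


def detect_type(data: bytes) -> str:
    """Guess a type label from the first bytes of a document."""
    for magic, label in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return label

    # No signature matched: fall back to "is this decodable as UTF-8 text?".
    # The incremental decoder with final=False tolerates a multi-byte
    # character cut off at the end of a truncated sample.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return "application/octet-stream"
    return "text/plain"
```

Calling `detect_type` on the first few hundred bytes of a file is enough for these signatures. A ZIP match only identifies the container, so telling an OpenDocument file from an EPUB or a JAR would need a look inside the archive, which this sketch does not attempt.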
