Apache Tika – a content analysis toolkit
From the website:
Apache Tika™ is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Formats include:
- HyperText Markup Language
- XML and derived formats
- Microsoft Office document formats
- OpenDocument Format
- Portable Document Format
- Electronic Publication Format
- Rich Text Format
- Compression and packaging formats
- Text formats
- Audio formats
- Image formats
- Video formats
- Java class files and archives
- The mbox format
With detection and extraction covered for this many formats, it sounds like we are getting close to pipelines for topic map production.
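To make the pipeline idea a little more concrete, here is a minimal sketch of what a "detect, then map to topics" step might look like. It does not use Tika. Python's standard `mimetypes` module stands in for the detection step and guesses from the file extension only, which is much weaker than what a content analysis toolkit does. The `Topic` class, the `detect_type`, `to_topic` and `pipeline` functions, and the sample paths are all hypothetical names invented for illustration.

```python
import mimetypes
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A very small stand-in for a topic map topic (hypothetical shape)."""
    subject_locator: str   # where the document lives
    name: str              # human-readable topic name
    occurrences: dict = field(default_factory=dict)  # extracted metadata


def detect_type(path: str) -> str:
    """Stand-in for the detection step: guesses from the file extension only."""
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or "application/octet-stream"


def to_topic(path: str) -> Topic:
    """Turn one document into one topic carrying its detected media type."""
    name = path.rsplit("/", 1)[-1]
    return Topic(
        subject_locator=path,
        name=name,
        occurrences={"media-type": detect_type(path)},
    )


def pipeline(paths):
    """Detect, then map to a topic, for every document in turn."""
    return [to_topic(p) for p in paths]


# Assumed sample paths; nothing is read from disk.
topics = pipeline(["docs/report.pdf", "docs/index.html", "docs/unknown.zzz"])
for t in topics:
    print(t.name, "->", t.occurrences["media-type"])
```

In a real pipeline the stand-in `detect_type` would be replaced by a call into the extraction toolkit, and the extracted metadata fields (author, title, dates and so on) would become further occurrences or associations rather than a single media-type entry.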