Tika – A content analysis toolkit
From the webpage:
The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. You can find the latest release on the download page. See the Getting Started guide for instructions on how to start using Tika.
From the supported formats page:
- HyperText Markup Language
- XML and derived formats
- Microsoft Office document formats
- OpenDocument Format
- Portable Document Format
- Electronic Publication Format
- Rich Text Format
- Compression and packaging formats
- Text formats
- Audio formats
- Image formats
- Video formats
- Java class files and archives
- The mbox format
One suspects that even the vastness of “dark data” has a finite number of formats.
Tika may not cover all of them, but perhaps enough to get you started.