CERMINE: Content ExtRactor and MINEr
From the webpage:
CERMINE is a Java library and a web service for extracting metadata and content from scientific articles in born-digital form. The system analyses the content of a PDF file and attempts to extract information such as:
- Title of the article
- Journal information (title, etc.)
- Bibliographic information (volume, issue, page numbers, etc.)
- Authors and affiliations
- Keywords
- Abstract
- Bibliographic references
I ran a very subjective test of the online interface (http://cermine.ceon.pl/cermine/index.html) using the following three papers:
- Subgraph Frequencies: Mapping the Empirical and Extremal Geography of Large Graph Collections
- Synonym Extraction of Medical Terms from Clinical Text Using Combinations of Word Space Models
- The Nature of Legal Concepts: Inferential Nodes and Ontological Categories
I am mostly interested in extraction of bibliographic entries and can report that while CERMINE made some mistakes, it is quite useful.
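For readers who want to check extracted references for themselves, here is a minimal Python sketch. It assumes the results are available as NLM/JATS-style XML, where each reference sits in a `ref` element containing a `mixed-citation`. The sample document and the `extract_references` helper are hypothetical and written by hand for illustration; they are not real CERMINE output or part of CERMINE, and actual output may use different or additional elements.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the style of NLM/JATS XML.
# This is an assumption for illustration, NOT real CERMINE output.
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>An Example Article</article-title>
      </title-group>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><given-names>A.</given-names> <surname>Author</surname></string-name>.
          <article-title>First cited work</article-title>.
          <source>Journal of Examples</source>, <year>2010</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><given-names>B.</given-names> <surname>Writer</surname></string-name>.
          <article-title>Second cited work</article-title>.
          <source>Example Proceedings</source>, <year>2012</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>"""


def extract_references(xml_text):
    """Return one dict per <ref> element (hypothetical helper, not a CERMINE API)."""
    root = ET.fromstring(xml_text)
    references = []
    for ref in root.iter("ref"):
        citation = ref.find("mixed-citation")
        if citation is None:
            # Skip references that do not follow the assumed layout.
            continue
        references.append({
            "id": ref.get("id"),
            # Surnames are nested inside <string-name>, so search all descendants.
            "surnames": [s.text for s in citation.iter("surname")],
            "title": citation.findtext("article-title"),
            "source": citation.findtext("source"),
            "year": citation.findtext("year"),
            # Whitespace-normalised full text, handy for spotting extraction mistakes.
            "raw": " ".join("".join(citation.itertext()).split()),
        })
    return references


if __name__ == "__main__":
    for entry in extract_references(SAMPLE_XML):
        print(entry["id"], entry["year"], entry["title"])
```

Keeping the `raw` string next to the parsed fields makes it easy to compare each entry by eye against the reference list in the original PDF.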
I first saw this in a tweet by Docear.