From the webpage:
BioNLP-Corpora is a repository of biologically and linguistically annotated corpora and biological datasets.
It is one of the projects of the BioNLP initiative by the Center for Computational Pharmacology at the University of Colorado Denver Health Sciences Center to create and distribute code, software, and data for applying natural language processing techniques to biomedical texts.
There are many resources available for download at BioNLP-Corpora:
- CRAFT: Colorado Richly Annotated Full-Text Corpus
- Protein Residue Corpora: several corpora relevant to extraction of protein residues in text
- PICorpus: Protein Interaction Corpus, a corpus of annotated protein-protein interactions
- GeneHomonym: Gene identifier values that mean different things or refer to multiple genes by inference
- The Annotation Projects: human annotated text on biological subject matter
- The Medline Mining Projects: automatically mined data from Medline
- The Anaphora Corpus: sample of GeneRIFs annotated with pronominal anaphora and their antecedents
- Test Suite Corpora: structured test suites for natural language processing applications
Like the guy says in the original Star Wars, “…almost there….”
These are really useful resources in their own right, but I am also following a path that arose from the discovery of one resource.
One more website and then the article I found that led to all the BioNLP* resources.