NLTK 2.1 – Working with Text Corpora by Vsevolod Dyomkin.
From the post:
Let’s return to the start of chapter 2 and explore the tools needed to work easily and efficiently with various linguistic resources.
What are the most used and useful corpora? This is a difficult question to answer because different problems will likely require specific annotations and often a specific corpus. There are even special conferences dedicated to corpus linguistics.
Here’s a list of the most well-known general-purpose corpora:
- Brown Corpus – one of the first big corpora and the only one in the list really easily accessible – we’ve already worked with it in the first chapter
- Penn Treebank – Treebank is a corpus of sentences annotated with their constituency parse trees so that they can be used to train and evaluate parsers
- Reuters Corpus (not to be confused with the ApteMod version provided with NLTK)
- British National Corpus (BNC) – a really huge corpus, but, unfortunately, not freely available
Another very useful resource is Wikipedia. It isn’t structured like the academic corpora mentioned above, but it has other dimensions of useful connections and annotations, and a lot of interesting linguistic research has been performed with it.
Besides, there are two additional valuable language resources that can’t be classified as text corpora at all, but rather as language databases: WordNet and Wiktionary. We have already discussed the CL-NLP interface to WordNet, and we’ll touch on working with Wiktionary in this part.
Vsevolod continues to recast NLTK in Lisp.
Learning corpus processing along with Lisp. How can you lose?