Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 9, 2013

NLTK 2.1 – Working with Text Corpora

Filed under: Natural Language Processing,NLTK,Text Corpus — Patrick Durusau @ 5:46 pm

NLTK 2.1 – Working with Text Corpora by Vsevolod Dyomkin.

From the post:

Let’s return to start of chapter 2 and explore the tools needed to easily and efficiently work with various linguistic resources.

What are the most used and useful corpora? This is a difficult question to answer because different problems will likely require specific annotations and often a specific corpus. There are even special conferences dedicated to corpus linguistics.

Here’s a list of the most well-known general-purpose corpora:

  • Brown Corpus – one of the first big corpora and the only one in the list really easily accessible – we’ve already worked with it in the first chapter
  • Penn Treebank – Treebank is a corpus of sentences annotated with their constituency parse trees so that they can be used to train and evaluate parsers
  • Reuters Corpus (not to be confused with the ApteMod version provided with NLTK)
  • British National Corpus (BNC) – a really huge corpus, but, unfortunately, not freely available

Another very useful resource which isn’t structured specifically as academic corpora mentioned above, but at the same time has other dimensions of useful connections and annotations is Wikipedia. And there’s being a lot of interesting linguistic research performed with it.

Besides there are two additional valuable language resources that can’t be classified as text corpora at all, but rather as language databases: WordNet and Wiktionary. We have already discussed CL-NLP interface to Wordnet. And we’ll touch working with Wiktionary in this part.

Vsevolod continues to recast the NLTK into Lisp.

Learning corpus processing along with Lisp. How can you lose?

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress