Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

April 11, 2012

Calculating Word and N-Gram Statistics from the Gutenberg Corpus

Filed under: Gutenberg Corpus,N-Gram,NLTK,Statistics — Patrick Durusau @ 6:16 pm

Calculating Word and N-Gram Statistics from the Gutenberg Corpus by Richard Marsden.

From the post:

Following on from the previous article about scanning text files for word statistics, I shall extend this to use real large corpora. First we shall use this script to create statistics for the entire Gutenberg English language corpus. Next I shall do the same with the entire English language Wikipedia.

A “get your feet wet” sort of exercise with the script included.

Project Gutenberg isn’t “big data,” but it is more than your usual inbox.

Think of it as getting to know the data set before applying more sophisticated algorithms.
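If you want a feel for what “word and n-gram statistics” amounts to before running the real script, here is a minimal sketch in plain Python. It is not Marsden’s script: the function names, the crude tokenizer and the sample sentence are my own assumptions, chosen only to show the counting step.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out word tokens (letters, optional inner apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

def corpus_stats(text, n=2):
    """Count single words and n-grams in one pass over the tokens."""
    tokens = tokenize(text)
    return Counter(tokens), Counter(ngrams(tokens, n))

# Hypothetical sample text; a real run would read files from the corpus instead.
sample = "the cat sat on the mat and the cat slept"
words, bigrams = corpus_stats(sample)

print(words.most_common(2))
print(bigrams.most_common(1))
```

For a corpus the size of Gutenberg you would feed files in one at a time and add the resulting counters together rather than holding all of the text in memory.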

December 17, 2011

NLTK Trees

Filed under: NLTK — Patrick Durusau @ 7:49 pm

NLTK Trees by Richard Marsden.

A number of NLTK functions work with Tree objects. For example, part of speech tagging and chunking classifiers, naturally return trees. Sentence manipulation functions also work with trees. Although Natural Language Processing with Python (Bird et al) includes a couple of pages about NLTK’s Tree module, coverage is generally sparse. The online documentation actually contains some good coverage although it is not always in the most logical location (e.g. the unit tests contain some very good documentation). This article is intended as a quick introduction, and the more informative documentation pages are listed under Further Reading.

A handy introduction to NLTK trees.

September 15, 2011

Statistical machine learning for text classification

Filed under: Natural Language Processing,NLTK,Python — Patrick Durusau @ 7:51 pm

Statistical machine learning for text classification with scikit-learn and NLTK by Olivier Grisel. (PyCon 2011)

The goal of this talk is to give a state-of-the-art overview of machine learning algorithms applied to text classification tasks ranging from language and topic detection in tweets and web pages to sentiment analysis in consumer products reviews.

The first third is a review of basic NLP, followed by reviews of the basic functions of scikit-learn and of NLTK. It also covers, briefly, the Google Prediction API.

The talk compares all three on the movie review database, then discusses analysis of newsgroups (for topics) and identifying the language of webpages.

I would not say “state-of-the-art” as much as “an intro to text classification and its potential.”
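To make “text classification” concrete at its very simplest, here is a sketch of a multinomial naive Bayes classifier in plain Python. It is not code from the talk and it uses neither scikit-learn nor NLTK; the class name, the training sentences and the labels are all invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing. Hypothetical helper."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training documents
        self.vocab = set()                       # every word seen in training

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods for each token.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training data standing in for a movie review collection.
clf = TinyNaiveBayes()
clf.train("a wonderful moving film", "pos")
clf.train("great acting and a great story", "pos")
clf.train("a dull boring film", "neg")
clf.train("terrible acting and a weak story", "neg")

print(clf.classify("great film"))
print(clf.classify("boring story"))
```

Four sentences of training data prove nothing, of course. The libraries covered in the talk add the parts that matter in practice: better feature extraction, more classifiers and proper evaluation.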
