Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 12, 2013

Finding Significant Phrases in Tweets with NLTK

Filed under: Natural Language Processing, NLTK, Tweets — Patrick Durusau @ 3:17 pm

Finding Significant Phrases in Tweets with NLTK by Sujit Pal.

From the post:

Earlier this week, there was a question about finding significant phrases in text on the Natural Language Processing People (login required) group on LinkedIn. I suggested looking at this LingPipe tutorial. The idea is to find statistically significant word collocations, i.e., those that occur more frequently than we can explain away as due to chance. I first became aware of this approach from the LLG Book, where two approaches are described: one based on Log-Likelihood Ratios (LLR) and one based on the Chi-Squared test of independence; the latter is used by LingPipe.
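For a concrete sense of the two measures, NLTK ships both as association scorers for its collocation finders. A minimal sketch (my own, not from the post), run over any tokenized text:

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# nltk.download("punkt")  # tokenizer models, needed on first run

text = "new york is in new york state and new york city is in new york"
tokens = nltk.word_tokenize(text.lower())

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # drop very rare pairs before scoring

measures = BigramAssocMeasures()
print(finder.nbest(measures.likelihood_ratio, 5))  # LLR ranking
print(finder.nbest(measures.chi_sq, 5))            # chi-squared ranking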

I had originally set out to provide an implementation for my suggestion (to answer a follow-up question). However, the SciPy documentation notes that the chi-squared test may be invalid when the observed or expected frequencies in any category are too small. Our algorithm compares just two observed and two expected frequencies, so it probably qualifies. Hence I went with the LLR approach, even though it is slightly more involved.
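To make the caveat concrete, here is the two-category case in SciPy with invented counts; an expected frequency below the usual rule-of-thumb minimum of 5 is exactly what the documentation warns about:

from scipy.stats import chisquare

# Two observed vs. two expected frequencies, as in the bigram test above.
# The expected count of 1 is well under the rule-of-thumb minimum of 5,
# so the chi-squared approximation is unreliable here.
observed = [4, 96]
expected = [1, 99]
print(chisquare(f_obs=observed, f_exp=expected))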

The idea is to find, for each bigram, the likelihood that its components are dependent on each other versus the likelihood that they are not. For bigrams with a positive LLR, we repeat the analysis after adding a neighboring word, arriving at a list of trigrams with positive LLR, and so on, until we reach the n-gram level that makes sense for the corpus. You can find an explanation of the math in one of my earlier posts, but you will probably find a better explanation in the LLG book.
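The LLR in question reduces to Dunning's G-squared statistic over a 2x2 contingency table of bigram counts. A self-contained sketch (the function is mine and the counts are invented for illustration):

import math

def xlogx(x):
    return x * math.log(x) if x > 0 else 0.0

def llr_2x2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11 = count(w1 followed by w2), k12 = count(w1 followed by anything else),
    k21 = count(anything else followed by w2), k22 = all remaining bigrams.
    """
    n = k11 + k12 + k21 + k22
    rows = xlogx(k11 + k12) + xlogx(k21 + k22)
    cols = xlogx(k11 + k21) + xlogx(k12 + k22)
    cells = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
    return 2.0 * (xlogx(n) + cells - rows - cols)

# Invented counts for a bigram like ("new", "york") in a 10,000-bigram corpus:
print(llr_2x2(30, 10, 12, 9948))  # large positive value => likely dependent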

For input data, I decided to use Twitter. I’m not that familiar with the Twitter API, but I’m taking the Introduction to Data Science course on Coursera, and the first assignment provided some code to pull data from the Twitter 1% feed, so I just reused that. I preprocess the feed so I am left with about 65k English tweets using the following code:
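The excerpt stops before the code, so here is a stand-in sketch of that preprocessing step, assuming a file of line-delimited tweet JSON captured from the sample feed (the "delete", "lang", and "text" fields follow Twitter's streaming API; the post's actual code may differ):

import json

def english_tweets(path):
    """Yield the text of English tweets from a line-delimited JSON capture."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                tweet = json.loads(line)
            except ValueError:
                continue  # skip truncated or malformed lines
            if "delete" in tweet:  # deletion notices carry no text
                continue
            if tweet.get("lang") == "en" and "text" in tweet:
                yield tweet["text"]

tweets = list(english_tweets("sample.json"))
print(len(tweets))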

An interesting look “behind the glass” at n-grams.

I am using AntConc to generate n-grams for proofing standards prose.

But as a finished tool, AntConc doesn’t give you insight into the technical side of the process.
