Archive for the ‘Lexicon’ Category

Sentiment Lexicons (a list)

Friday, April 13th, 2012

From the post:

For those interested in sentiment analysis, I culled some of the sentiment lexicons mentioned in Jurafsky’s NLP class lecture 7-3 and also discussed in Chris Potts’ notes here:

Suggestions of other sentiment or other lexicons? The main ones are fairly well known.

The main ones are just that, the main ones. They may or may not reflect sentiment in particular locales.

Twitter Current English Lexicon

Thursday, March 8th, 2012

From the description:

Twitter Current English Lexicon: Based on the Twitter Stratified Random Sample Corpus, we regularly extract the Twitter Current English Lexicon. Basically, we’re 1) pulling all tweets from the last three months of corpus entries that have been marked as “English” by the collection process (we have to make that call because there is no reliable means provided by Twitter), 2) removing all #hash, @at, and http items, 3) breaking the tweets into tokens, 4) building descriptive and summary statistics for all token-based 1-grams and 2-grams, and 5) pushing the top 10,000 N-grams from each set into a database and text files for review. So, for every top 1-gram and 2-gram, you know how many times it occurred in the corpus, and in how many tweets (plus associated percentages).

This is an interesting set of data, particularly when you compare it with a “regular” English corpus, something traditional like the Brown Corpus. Unlike most corpora, the top token (1-gram) for Twitter is “i” (as in me, myself, and I), there are a lot of intentional misspellings, and you find an undue amount of, shall we say, “callous” language (be forewarned). It’s a brave new world if you’re willing.

To use this data set, we recommend using the database version and KwicData, but you can also use the text version. Download the ZIP file you want, unzip it, then read the README file for more explanation about what’s included.

I grabbed a copy yesterday but haven’t had the time to look at it.
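
For a feel of what steps 2 through 5 of the description involve, here is a rough Python sketch. It is not the project's code: the function names, the token pattern, and the whitespace-based filtering of #hash, @at, and http items are all my assumptions, and it takes for granted that the tweets have already been filtered to English.

```python
import re
from collections import Counter

# Assumed token pattern: runs of lowercase letters, digits, and apostrophes.
# The real lexicon's tokenizer is not described, so this is a guess.
TOKEN = re.compile(r"[a-z0-9']+")

def tokenize(tweet):
    """Step 2 and 3: drop #hash, @at, and http items, then split into tokens."""
    kept = [w for w in tweet.lower().split()
            if not w.startswith(("#", "@", "http"))]
    return TOKEN.findall(" ".join(kept))

def ngram_stats(tweets, n, top=10_000):
    """Step 4 and 5: return (ngram, occurrences, tweets_containing) tuples
    for the `top` most frequent n-grams."""
    occurrences = Counter()   # how many times each n-gram occurs overall
    tweet_counts = Counter()  # how many tweets contain each n-gram
    for tweet in tweets:
        tokens = tokenize(tweet)
        grams = [" ".join(tokens[i:i + n])
                 for i in range(len(tokens) - n + 1)]
        occurrences.update(grams)
        tweet_counts.update(set(grams))  # count each tweet at most once
    return [(g, c, tweet_counts[g]) for g, c in occurrences.most_common(top)]
```

Calling `ngram_stats(tweets, 1)` and `ngram_stats(tweets, 2)` on a list of tweet strings gives the two ranked sets; dividing the counts by the total number of tokens or tweets would give the percentages the description mentions.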

Is there Twitter feed pipeline software you would recommend?