Archive for the ‘Sentiment Analysis’ Category

Open Sentiment Analysis

Thursday, May 2nd, 2013

Open Sentiment Analysis by Pete Warden.

From the post:

Sentiment analysis is fiendishly hard to solve well, but easy to solve to a first approximation. I’ve been frustrated that there have been no easy free libraries that make the technology available to non-specialists like me. The problem isn’t with the code, there are some amazing libraries like NLTK out there, but everyone guards their training sets of word weights jealously. I was pleased to discover that SentiWordNet is now CC-BY-SA, but even better I found that Finn Årup has made a drop-dead simple list of words available under an Open Database License!

With that in hand, I added some basic tokenizing code and was able to implement a new text2sentiment API endpoint for the Data Science Toolkit:

http://www.datasciencetoolkit.org/developerdocs#text2sentiment

BTW, while you are there, take a look at the Data Science Toolkit more generally.

Glad to hear about the open set of word weights.

Sentiment analysis with undisclosed word weights sounds iffy to me.

It’s like getting a list of rounded numbers without knowing the rounding factor.

It is even worse with sentiment analysis, because every rounding factor may be different.
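
To make the "word weights plus basic tokenizing" idea concrete, here is a minimal sketch. It is not the Data Science Toolkit code. I am assuming the word list maps each word to a signed number, and the entries, the function names, and the choice to average the matched weights are all mine, for illustration only.

```python
import re

# Illustrative word weights only; these are NOT copied from the published list.
# Assumption: the real list maps each word to a signed number in the same way.
WORD_WEIGHTS = {
    "good": 3,
    "great": 3,
    "happy": 3,
    "bad": -3,
    "terrible": -3,
    "hate": -3,
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def text_to_sentiment(text, weights=WORD_WEIGHTS):
    """Return the mean weight of the tokens found in the word list, or 0.0 if none match."""
    scores = [weights[t] for t in tokenize(text) if t in weights]
    return sum(scores) / len(scores) if scores else 0.0

print(text_to_sentiment("What a great, happy day"))
print(text_to_sentiment("I hate this terrible weather"))
```

With the weights in plain view, anyone can see why a text scored the way it did, which is the point of an open list. The sketch ignores negation and sarcasm, so it is only the "first approximation" the post talks about.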

mSDA: A fast and easy-to-use way to improve bag-of-words features

Friday, June 15th, 2012

mSDA: A fast and easy-to-use way to improve bag-of-words features by Kilian Weinberger.

From the description:

Machine learning algorithms rely heavily on the representation of the data they are presented with. In particular, text documents (and often images) are traditionally expressed as bag-of-words feature vectors (e.g. as tf-idf). Recently Glorot et al. showed that stacked denoising autoencoders (SDA), a deep learning algorithm, can learn representations that are far superior over variants of bag-of-words. Unfortunately, training SDAs often requires a prohibitive amount of computation time and is non-trivial for non-experts. In this work, we show that with a few modifications of the SDA model, we can relax the optimization over the hidden weights into convex optimization problems with closed form solutions. Further, we show that the expected value of the hidden weights after infinitely many training iterations can also be computed in closed form. The resulting transformation (which we call marginalized-SDA) can be computed in no more than 20 lines of straight-forward Matlab code and requires no prior expertise in machine learning. The representations learned with mSDA behave similar to those obtained with SDA, but the training time is reduced by several orders of magnitudes. For example, mSDA matches the world-record on the Amazon transfer learning benchmark, however the training time shrinks from several days to a few minutes.

The Glorot et al. reference is to: Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach by Xavier Glorot, Antoine Bordes, and Yoshua Bengio, Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011.

Superficial searching reveals this to be a very active area of research.

I rather like the idea of training being reduced from days to minutes.
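
The description says the transformation fits in about 20 lines of Matlab. As a rough illustration of what a closed-form layer can look like, here is a NumPy sketch based on my reading of the marginalized denoising idea. Each feature is assumed to be dropped with probability p, and the reconstruction weights come from solving one linear system instead of iterating. The function names, the small ridge term, and the choice to concatenate the layer outputs are my assumptions, not the authors' code.

```python
import numpy as np

def mda_layer(X, p, reg=1e-5):
    """One marginalized denoising layer. X has shape (d, n): features by documents.

    p is the assumed probability that a feature is zeroed out by the corruption.
    Returns the hidden representation tanh(W @ [X; 1]) and the weights W.
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])      # append a constant bias feature
    q = np.full(d + 1, 1.0 - p)               # probability each feature survives
    q[-1] = 1.0                               # the bias feature is never corrupted
    S = Xb @ Xb.T                             # scatter matrix of the clean data
    Q = S * np.outer(q, q)                    # expected scatter of the corrupted data
    np.fill_diagonal(Q, q * np.diag(S))       # a feature always co-occurs with itself
    P = S[:d, :] * q                          # expected clean-by-corrupted cross term
    # W = P (Q + reg*I)^-1, computed with a solve rather than an explicit inverse
    W = np.linalg.solve(Q + reg * np.eye(d + 1), P.T).T
    return np.tanh(W @ Xb), W

def msda(X, p, n_layers):
    """Stack layers, feeding each layer's output to the next, and concatenate all of them."""
    reps = [X]
    for _ in range(n_layers):
        hidden, _ = mda_layer(reps[-1], p)
        reps.append(hidden)
    return np.vstack(reps)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 50))              # toy data: 4 features, 50 documents
print(msda(X, p=0.5, n_layers=2).shape)
```

There is no training loop here, only one linear solve per layer. That is the kind of change that would turn days of training into minutes.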

Sentiment Lexicons (a list)

Friday, April 13th, 2012

Sentiment Lexicons (a list)

From the post:

For those interested in sentiment analysis, I culled some of the sentiment lexicons mentioned in Jurafsky’s NLP class lecture 7-3 and also discussed in Chris Potts’ notes here:

Suggestions of other sentiment or other lexicons? The main ones are fairly well known.

The main ones are just that, the main ones. They may or may not reflect sentiment in particular locales.

Social Media Monitoring with CEP, pt. 2: Context As Important As Sentiment

Sunday, February 5th, 2012

Social Media Monitoring with CEP, pt. 2: Context As Important As Sentiment by Chris Carlson.

From the post:

When I last wrote about social media monitoring, I made a case for using a technology like Complex Event Processing (“CEP”) to detect rapidly growing and geospatially-oriented social media mentions that can provide early warning detection for the public good (Social Media Monitoring for Early Warning of Public Safety Issues, Oct. 27, 2011).

A recent article by Chris Matyszczyk of CNET highlights the often conflicting and confusing nature of monitoring social media. A 26-year old British citizen, Leigh Van Bryan, gearing up for a holiday of partying in Los Angeles, California (USA), tweeted in British slang his intention to have a good time: “Free this week, for quick gossip/prep before I go and destroy America.” Since I’m not too far removed from the culture of youth, I did take this to mean partying, cutting loose, having a good time (and other not-so-current definitions.)

This story does not end happily, as Van Bryan and his friend Emily Bunting were arrested and then sent back to Blighty.

This post will not increase American confidence in the TSA, but it does illustrate how context can influence whether a subject (or “person of interest”) is identified or excluded.

Context is captured in topic maps using associations. In this particular case, a view of the information on the young man in question would reveal a lack of associations with any known terror suspects, people on the no-fly list, suspicious travel patterns, etc.

This is not to imply that having good information leads to good decisions; technology can’t correct that particular disconnect.

Be careful with dictionary-based text analysis

Wednesday, October 12th, 2011

Be careful with dictionary-based text analysis

Brendan O’Connor writes:

OK, everyone loves to run dictionary methods for sentiment and other text analysis — counting words from a predefined lexicon in a big corpus, in order to explore or test hypotheses about the corpus. In particular, this is often done for sentiment analysis: count positive and negative words (according to a sentiment polarity lexicon, which was derived from human raters or previous researchers’ intuitions), and then proclaim the output yields sentiment levels of the documents. More and more papers come out every day that do this. I’ve done this myself. It’s interesting and fun, but it’s easy to get a bunch of meaningless numbers if you don’t carefully validate what’s going on. There are certainly good studies in this area that do further validation and analysis, but it’s hard to trust a study that just presents a graph with a few overly strong speculative claims as to its meaning. This happens more than it ought to.

How does “measurement” of sentiment in a document differ from “measurement” of the semantics of terms in that document?

Have we traded validated collections for “access” to large numbers of documents (think about the usual Internet search engine)? By validated collections I mean the discipline-based indexes where the user did not have to weed out completely irrelevant results.
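
O’Connor’s warning about validation can be made concrete with a small check: run the dictionary method over a few hand-labeled texts and see how often it agrees with the human labels. The lexicon, the sample texts, and the function names below are all invented for illustration. A real check would need a proper sample drawn from the corpus under study.

```python
import re

# Hypothetical polarity lexicon, invented for this example.
POSITIVE = {"love", "great", "fun"}
NEGATIVE = {"awful", "hate", "boring"}

def polarity(text):
    """Label a text +1, -1, or 0 by comparing counts of positive and negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos > neg) - (pos < neg)

def agreement(labeled_docs):
    """Fraction of (text, human_label) pairs where the lexicon label matches the human one."""
    hits = sum(polarity(text) == label for text, label in labeled_docs)
    return hits / len(labeled_docs)

# Hand-labeled examples, also invented.
SAMPLE = [
    ("I love this, great fun", 1),
    ("Awful and boring", -1),
    ("Not great at all", -1),        # negation: the word count says positive
    ("I would hate to miss it", 1),  # idiom: the word count says negative
]

print(agreement(SAMPLE))
```

On this toy sample the word counts match the human labels only half the time. A graph built from such counts deserves the same scrutiny before anyone reads meaning into it.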