Archive for the ‘Sentiment Analysis’ Category

Stanford CoreNLP v3.7.0 beta is out! [Time is short, comments, bug reports, now!]

Thursday, November 3rd, 2016

Stanford CoreNLP v3.7.0 beta

The tweets I saw from Stanford NLP Group read:

Stanford CoreNLP v3.7.0 beta is out—improved coreference, dep parsing—KBP relation annotator—Arabic pipeline #NLProc

We‘re doing an official CoreNLP beta release this time, so bugs, comments, and fixes especially appreciated over the next two weeks!

OK, so, what are you waiting for? 😉

Oh, the standard blurb for your boss on why Stanford CoreNLP should be taking up your time:

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract open-class relations between mentions, etc.

Choose Stanford CoreNLP if you need:

  • An integrated toolkit with a good range of grammatical analysis tools
  • Fast, reliable analysis of arbitrary texts
  • The overall highest quality text analytics
  • Support for a number of major (human) languages
  • Interfaces available for various major modern programming languages
  • Ability to run as a simple web service

Stanford CoreNLP is an integrated framework. Its goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A CoreNLP tool pipeline can be run on a piece of plain text with just two lines of code. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Using the standard blurb about the Stanford CoreNLP has these advantages:

  • It’s copy-n-paste, you didn’t have to write it
  • It’s appeal to authority (Stanford)
  • It’s truthful

The truthful point is a throw-away these days but thought I should mention it. 😉

Toneapi helps your writing pack an emotional punch [Not For The Ethically Sensitive]

Thursday, February 4th, 2016

Toneapi helps your writing pack an emotional punch by Martin Bryant.

From the post:

Language analysis is a rapidly developing field and there are some interesting startups working on products that help you write better.

Take Toneapi, for example. This product from Northern Irish firm Adoreboard is a Web-based app that analyzes (and potentially improves) the emotional impact of your writing.

Paste in some text, and it will offer a detailed visualization of your writing.

If you aren’t overly concerned about manipulating, sorry, persuading your readers to your point of view, you might want to give Toneapi a spin. Martin reports that IBM’s Watson has Tone Analyzer and you should also consider Textio and Relative Insight.

Before this casts an Orwellian pale over your evening/day, remember that focus groups and testing messages have been the staple of advertising for decades.

What these software services do is make a crude form of that capability available to the average citizen.

Some people have a knack for emotional language, like Donald Trump, but I can’t force myself to write in incomplete sentences or with one syllable words. Maybe there’s an app for that? Suggestions?

Bible vs. Quran – Who’s More Violent?

Friday, January 22nd, 2016

Bible vs. Quran – Text analysis answers: Is the Quran really more violent than the Bible? by Tom H. C. Anderson.

Tom’s series appears in three parts, but sharing the common title:

Part I: The Project

From part 1:

With the proliferation of terrorism connected to Islamic fundamentalism in the late-20th and early 21st centuries, the question of whether or not there is something inherently violent about Islam has become the subject of intense and widespread debate.

Even before 9/11—notably with the publication of Samuel P Huntington’s “Clash of Civilizations” in 1996—pundits have argued that Islam incites followers to violence on a level that sets it apart from the world’s other major religions.

The November 2015 Paris attacks and the politicking of a U.S. presidential election year—particularly candidate Donald Trump’s call for a ban on Muslim’s entering the country and President Obama’s response in the State of the Union address last week—have reanimated the dispute in the mainstream media, and proponents and detractors, alike, have marshalled “experts” to validate their positions.

To understand a religion, it’s only logical to begin by examining its literature. And indeed, extensive studies in a variety of academic disciplines are routinely conducted to scrutinize and compare the texts of the world’s great religions.

We thought it would be interesting to bring to bear the sophisticated data mining technology available today through natural language processing and unstructured text analytics to objectively assess the content of these books at the surface level.

So, we’ve conducted a shallow but wide comparative analysis using OdinText to determine with as little bias as possible whether the Quran is really more violent than its Judeo-Christian counterparts.

Part II: Emotional Analysis Reveals Bible is “Angriest”

From part 2:

In my previous post, I discussed our potentially hazardous plan to perform a comparative analysis using an advanced data mining platform—OdinText—across three of the most important texts in human history: The Old Testament, The New Testament and the Quran.

Author’s note: For more details about the data sources and methodology, please see Part I of this series.

The project was inspired by the ongoing public debate around whether or not terrorism connected with Islamic fundamentalism reflects something inherently and distinctly violent about Islam compared to other major religions.

Before sharing the first set of results with you here today, due to the sensitive nature of this topic, I feel obliged to reiterate that this analysis represents only a cursory, superficial view of just the texts, themselves. It is in no way intended to advance any agenda or to conclusively prove anyone’s point.

Part III – Violence, Mercy and Non-Believers – to appear soon.

A comparison that may be an inducement for some to learn text/sentiment analysis but I would view its results with a great deal of caution.

Two of the comments to the first post read:


(comment) If you’re not completing the analysis in the native language, you’re just analyzing the translators’ understanding and interpretation of the texts; this is very different than the actual texts.

(to which a computational linguist replies) Technically, that is certainly true. However, if you are looking at broad categories of sentiment or topic, as this analysis does, there should be little variation in the results between translations, or by using the original. As well, it could be argued that what is most of interest is the viewpoint of the interpreters of the text, hence the translations may be *more* of interest, to some extent. But I would not expect that this analysis would be very sensitive at all to variations in translation or even language.

I find the position taken by the computational linguist almost incomprehensible.

Not only do we lack anything approaching a full social context for any of the texts in their original languages, moreover, terms that occur once (hapaxes) number approximately 1,300 in the Hebrew Bible and over 3,500 in the New Testament. For a discussion of the Qur’ān, see: Hapaxes in the Qur’ān: identifying and cataloguing lone words (and loadwords) by Shawkat M. Toorawa. Toorawa includes a list of hapaxes for the Qur’ān, a discussion of why they are important and a comparison to other texts.

Here is a quick example of where social context can change how you read a text:

23 The priest is to write these curses on a scroll and then wash them off into the bitter water. 24 He shall have the woman drink the bitter water that brings a curse, and this water will enter her and cause bitter suffering. 25 The priest is to take from her hands the grain offering for jealousy, wave it before the LORD and bring it to the altar. 26 The priest is then to take a handful of the grain offering as a memorial offering and burn it on the altar; after that, he is to have the woman drink the water. 27 If she has defiled herself and been unfaithful to her husband, then when she is made to drink the water that brings a curse, it will go into her and cause bitter suffering; her abdomen will swell and her thigh waste away, and she will become accursed among her people. (Numbers 5:23-27)

Does that sound sexist to you?

Interesting because a Hebrew Bible professor of my argued that it is one of the earliest pro-women passages in the text.

Think about the social context. There are no police, no domestic courts, short of retribution from the wife’s family members, there are no constraints on what a husband can do to his wife. Even killing her wasn’t beyond the pale.

Given that context, setting up a test that no one can fail, in the presence of a priest, which also deters resorting to a violent remedy, sounds like it gets the wife out of a dangerous situation where the priest can say: “See, you were jealous for no reason, etc.”

There’s no guarantee that is the correct interpretation either but it does accord with present understandings of law and custom at the time. The preservation of order in the community, no mean thing in the absence of an organized police force, was an important thing.

The English words used in translations also have their own context, which may be resolved differently from those in the original languages.

As I said, interesting but consider with a great deal of caution.

There’s More Than One Kind Of Reddit Comment?

Friday, December 18th, 2015

‘Sarcasm detection on Reddit comments’

Contest ends: 15th of February, 2016.

From the webpage:

Sentiment analysis is a fairly well-developed field, but on the Internet, people often don’t say exactly what they mean. One of the toughest modes of communication for both people and machines to identify is sarcasm. Sarcastic statements often sound positive if interpreted literally, but through context and other cues the speaker indicates that they mean the opposite of what they say. In English, sarcasm is primarily communicated through verbal cues, meaning that it is difficult, even for native speakers, to determine it in text.

Sarcasm detection is a subtask of opinion mining. It aims at correctly identifying the user opinions expressed in the written text. Sarcasm detection plays a critical role in sentiment analysis by correctly identifying sarcastic sentences which can incorrectly flip the polarity of the sentence otherwise. Understanding sarcasm, which is often a difficult task even for humans, is a challenging task for machines. Common approaches for sarcasm detection are based on machine learning classifiers trained on simple lexical or dictionary based features. To date, some research in sarcasm detection has been done on collections of tweets from Twitter, and reviews on Amazon.com. For this task, we are interested in looking at a more conversational medium—comments on Reddit—in order to develop an algorithm that can use the context of the surrounding text to help determine whether a specific comment is sarcastic or not.

The premise of this competition is there is more than one kind of comment on Reddit, aside from sarcasm.

A surprising assumption I know but there you have it.

I wonder if participants will have to separate sarcastic + sexist, sarcastic + misogynistic, sarcastic + racist, sarcastic + abusive, into separate categories or will all sarcastic comments be classified as sarcasm?

I suppose the default case would be to assume all Reddit comments are some form of sarcasm and see how accurate that model proves to be when judged against the results of the competition.

Training data for sarcasm? Pointers anyone?

Political Futures Tracker

Wednesday, May 20th, 2015

Political Futures Tracker.

From the webpage:

The Political Futures Tracker tells us the top political themes, how positive or negative people feel about them, and how far parties and politicians are looking to the future.

This software will use ground breaking language analysis methods to examine data from Twitter, party websites and speeches. We will also be conducting live analysis on the TV debates running over the next month, seeing how the public respond to what politicians are saying in real time. Leading up to the 2015 UK General Election we will be looking across the political spectrum for emerging trends and innovation insights.

If that sounds interesting, consider the following from: Introducing… the Political Futures Tracker:

We are exploring new ways to analyse a large amount of data from various sources. It is expected that both the amount of data and the speed that it is produced will increase dramatically the closer we get to election date. Using a semi-automatic approach, text analytics technology will sift through content and extract the relevant information. This will then be examined and analysed by the team at Nesta to enable delivery of key insights into hotly debated issues and the polarisation of political opinion around them.

The team at the University of Sheffield has extensive experience in the area of social media analytics and Natural Language Processing (NLP). Technical implementation has started already, firstly with data collection which includes following the Twitter accounts of existing MPs and political parties. Once party candidate lists become available, data harvesting will be expanded accordingly.

In parallel, we are customising the University of Sheffield’s General Architecture for Text Engineering (GATE); an open source text analytics tool, in order to identify sentiment-bearing and future thinking tweets, as well as key target topics within these.

One thing we’re particularly interested in is future thinking. We describe this as making statements concerning events or issues in the future. Given these measures and the views expressed by a certain person, we can model how forward thinking that person is in general, and on particular issues, also comparing this with other people. Sentiment, topics, and opinions will then be aggregated and tracked over time.

Personally I suspect that “future thinking” is used in difference senses by the general population and political candidates. For a political candidate, however the rhetoric is worded, the “future” consists of reaching election day with 50% plus 1 vote. For the general population, the “future” probably includes a longer time span.

I mention this in case you can sell someone on the notion that what political candidates say today has some relevance to what they will do after election. President Obmana has been in office for six (6) years on office, the Guantanamo Bay detention camp remains open, no one has been held accountable for years of illegal spying on U.S. citizens, banks and other corporate interests have all but been granted keys to the U.S. Treasury, to name a few items inconsistent with his previous “future thinking.”

Unless you accept my suggestion that “future thinking” for a politician means election day and no further.

₳ustral Blog

Tuesday, April 14th, 2015

₳ustral Blog

From the post:

We’re software developers and entrepreneurs who wondered what Reddit might be able to tell us about our society.

Social network data have revolutionized advertising, brand management, political campaigns, and more. They have also enabled and inspired vast new areas of research in the social and natural sciences.

Traditional social networks like Facebook focus on mostly-private interactions between personal acquaintances, family members, and friends. Broadcast-style social networks like Twitter enable users at “hubs” in the social graph (those with many followers) to disseminate their ideas widely and interact directly with their “followers”. Both traditional and broadcast networks result in explicit social networks as users choose to associate themselves with other users.

Reddit and similar services such as Hacker News are a bit different. On Reddit, users vote for, and comment on, content. The social network that evolves as a result is implied based on interactions rather than explicit.

Another important difference is that, on Reddit, communication between users largely revolves around external topics or issues such as world news, sports teams, or local events. Instead of discussing their own lives, or topics randomly selected by the community, Redditors discuss specific topics (as determined by community voting) in a structured manner.

This is what we’re trying to harness with Project Austral. By combining Reddit stories, comments, and users with technologies like sentiment analysis and topic identification (more to come soon!) we’re hoping to reveal interesting trends and patterns that would otherwise remain hidden.

Please, check it out and let us know what you think!

Bad assumption on my part! Since ₳ustral uses Neo4j to store the Reddit graph, I was expecting a graph-type visualization. If that was intended, that isn’t what I found. 😉

Most of my searching is content oriented and not so much concerned with trends or patterns. An upsurge in hypergraph queries could happen in Reddit, but aside from references to publications and projects, the upsurge itself would be a curiosity to me.

Nothing against trending, patterns, etc. but just not my use case. May be yours.

Modern Methods for Sentiment Analysis

Monday, April 13th, 2015

Modern Methods for Sentiment Analysis by Michael Czerny.

From the post:

Sentiment analysis is a common application of Natural Language Processing (NLP) methodologies, particularly classification, whose goal is to extract the emotional content in text. In this way, sentiment analysis can be seen as a method to quantify qualitative data with some sentiment score. While sentiment is largely subjective, sentiment quantification has enjoyed many useful implementations, such as businesses gaining understanding about consumer reactions to a product, or detecting hateful speech in online comments.

The simplest form of sentiment analysis is to use a dictionary of good and bad words. Each word in a sentence has a score, typically +1 for positive sentiment and -1 for negative. Then, we simply add up the scores of all the words in the sentence to get a final sentiment total. Clearly, this has many limitations, the most important being that it neglects context and surrounding words. For example, in our simple model the phrase “not good” may be classified as 0 sentiment, given “not” has a score of -1 and “good” a score of +1. A human would likely classify “not good” as negative, despite the presence of “good”….

Great discussion of Word2Vec and Doc2Vec, along with worked examples of both as well as analyzing sentiment in Emoji tweets.

Another limitation of the +1 / -1 approach is that human sentiments are rarely that sharply defined. Moreover, however strong or weak the “likes” or “dislikes” of a group of users, they are all collapsed into one score.

Be mindful that modeling is a lossy process.

LT-Accelerate

Tuesday, December 16th, 2014

LT-Accelerate: LT-Accelerate is a conference designed to help businesses, researchers and public administrations discover business value via Language Technology.

From the about page:

LT-Accelerate is a joint production of LT-Innovate, the European Association of the Language Technology Industry, and Alta Plana Corporation, a Washington DC based strategy consultancy headed by analyst Seth Grimes.

Held December 4-5, 2014 in Brussels, the website reports seven (7) interviews with key speakers and slides from thirty-eight speakers.

Not as in depth as papers nor as useful as videos of the presentations but still capable of sparking new ideas as you review the slides.

For example, the slides from Multi-Dimensional Sentiment Analysis by Stephen Pulman made me wonder what sentiment detection design would be appropriate for the Michael Brown grand jury transcripts?

Sentiment detection has been successfully used with tweets (140 character limit) and I am reliably informed that most of the text strings in the Michael Brown grand jury transcript are far longer than one hundred and forty (140) characters. 😉

Any sentiment detectives in the audience?

Practical Sentiment Analysis Tutorial

Wednesday, March 5th, 2014

Practical Sentiment Analysis Tutorial by Jason Baldridge.

Slides for tutorial on sentiment analysis.

Includes such classics as:

I saw her duck with a telescope.

How many interpretations do you get? Check yourself against Jason’s slides.

Quite a slide deck, my reader reports four hundred and thirty-five pages, 435 pages.

Enjoy!

Spring XD – Tweets – Hadoop – Sentiment Analysis

Saturday, February 15th, 2014

Using Spring XD to stream Tweets to Hadoop for Sentiment Analysis

From the webpage:

This tutorial will build on the previous tutorial – 13 – Refining and Visualizing Sentiment Data – by using Spring XD to stream in tweets to HDFS. Once in HDFS, we’ll use Apache Hive to process and analyze them, before visualizing in a tool.

I re-ordered the text:

This tutorial is from the Community part of tutorial for Hortonworks Sandbox (1.3) – a single-node Hadoop cluster running in a virtual machine. Download to run this and other tutorials in the series.

This community tutorial submitted by mehzer with source available at Github. Feel free to contribute edits or your own tutorial and help the community learn Hadoop.

not to take anything away from Spring XD or Sentiment Analysis but to emphasize the community tutorial aspects of the Hortonworks Sandbox.

At present count on tutorials:

Hortonworks: 14

Partners: 12

Community: 6

Thoughts on what the next community tutorial should be?

Bridging Semantic Gaps

Tuesday, November 19th, 2013

OK, the real title is: Cross-Language Opinion Lexicon Extraction Using Mutual-Reinforcement Label Propagation by Zheng Lin, Songbo Tan, Yue Liu, Xueqi Cheng, Xueke Xu. (Lin Z, Tan S, Liu Y, Cheng X, Xu X (2013) Cross-Language Opinion Lexicon Extraction Using Mutual-Reinforcement Label Propagation. PLoS ONE 8(11): e79294. doi:10.1371/journal.pone.0079294)

Abstract:

There is a growing interest in automatically building opinion lexicon from sources such as product reviews. Most of these methods depend on abundant external resources such as WordNet, which limits the applicability of these methods. Unsupervised or semi-supervised learning provides an optional solution to multilingual opinion lexicon extraction. However, the datasets are imbalanced in different languages. For some languages, the high-quality corpora are scarce or hard to obtain, which limits the research progress. To solve the above problems, we explore a mutual-reinforcement label propagation framework. First, for each language, a label propagation algorithm is applied to a word relation graph, and then a bilingual dictionary is used as a bridge to transfer information between two languages. A key advantage of this model is its ability to make two languages learn from each other and boost each other. The experimental results show that the proposed approach outperforms baseline significantly.

I have always wondered when someone would notice the WordNet database is limited to the English language. 😉

The authors are seeking to develop “…a language-independent approach for resource-poor language,” saying:

Our approach differs from existing approaches in the following three points: first, it does not depend on rich external resources and it is language-independent. Second, our method is domain-specific since the polarity of opinion word is domain-aware. We aim to extract the domain-dependent opinion lexicon (i.e. an opinion lexicon per domain) instead of a universal opinion lexicon. Third, the most importantly, our approach can mine opinion lexicon for a target language by leveraging data and knowledge available in another language…

Our approach propagates information back and forth between source language and target language, which is called mutual-reinforcement label propagation. The mutual-reinforcement label propagation model follows a two-stage framework. At the first stage, for each language, a label propagation algorithm is applied to a large word relation graph to produce a polarity estimate for any given word. This stage solves the problem of external resource dependency, and can be easily transferred to almost any language because all we need are unlabeled data and a couple of seed words. At the second stage, a bilingual dictionary is introduced as a bridge between source and target languages to start a bootstrapping process. Initially, information about the source language can be utilized to improve the polarity assignment in target language. In turn, the updated information of target language can be utilized to improve the polarity assignment in source language as well.

Two points of particular interest:

  1. The authors focus on creating domain specific lexicons and don’t attempt to boil the ocean. Useful semantic results will arrive sooner if you avoid attempts at universal solutions.
  2. English speakers are a large market, but the target of this exercise is the #1 language of the world, Mandarin Chinese.

    Taking the numbers for English speakers at face value, approximately 0.8 billion speakers, with a world population of 7.125 billion, that leaves 6.3 billion potential customers.

You’ve heard what they say: A billion potential customers here and a billion potential customers there, pretty soon you are talking about a real market opportunity. (The original quote misattributed to Sen. Everett Dirksen.)

Sentiment Analysis and “Human Analytics” (Conference)

Tuesday, October 15th, 2013

Call for Speakers: Sentiment Analysis and “Human Analytics” (March 6, NYC) by Seth Grimes.

Call for Speakers: Closes October 28, 2013.

Symposium: March 6, 2014 New York City.

From the post:

Sentiment, mood, opinion, and emotion play a central role in social and online media, enterprise feedback, and the range of consumer, business, and public data sources. Together with connection, expressed as influence and advocacy in and across social and business networks, they capture immense business value.

“Human Analytics” solutions unlock this value and are the focus of the next Sentiment Analysis Symposium, March 6, 2014 in New York. The Call for Speakers is now open, through October 28.

The key to a great conference is great speakers. Whether you’re a business visionary, experienced user, or technologist, please consider proposing a presentation. Submit your proposal at sentimentsymposium.com/call-for-speakers.html. Choose from among the suggested topics or surprise us.

The New York symposium will be the 7th, covering solutions that measure and exploit emotion, attitude, opinion, and connection in online, social, and enterprise sources. It will be a great program… with your participation!

(For those not familiar with the symposium: Check out FREE videos of presentations and panels from the May, 2013 New York symposium and from prior symposiums.)

More conference material for your enjoyment!

As you know, bot traffic accounts for a large percentage of tweets but if the customer wants sentiment analysis of bots trading tweets, why not?

Opens an interesting potential of botnets, not in a malicious sense but that are organized to simulate public dialogue on current issues.

Recursive Deep Models for Semantic Compositionality…

Tuesday, October 1st, 2013

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank by Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng and Christopher Potts.

Abstract:

Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag of features baselines. Lastly, it is the only model that can accurately capture the effect of contrastive conjunctions as well as negation and its scope at various tree levels for both positive and negative phrases.

You will no doubt want to see the webpage with the demo.

Along with possibly the data set and the code.

I was surprised by “fine-grained sentiment labels” meaning:

  1. Positive
  2. Somewhat positive
  3. Neutral
  4. Somewhat negative
  5. Negative

But then for many purposes, subject recognition on that level of granularity may be sufficient.

Open Sentiment Analysis

Thursday, May 2nd, 2013

Open Sentiment Analysis by Pete Warden.

From the post:

Sentiment analysis is fiendishly hard to solve well, but easy to solve to a first approximation. I’ve been frustrated that there have been no easy free libraries that make the technology available to non-specialists like me. The problem isn’t with the code, there are some amazing libraries like NLTK out there, but everyone guards their training sets of word weights jealously. I was pleased to discover that SentiWordNet is now CC-BY-SA, but even better I found that Finn Årup has made a drop-dead simple list of words available under an Open Database License!

With that in hand, I added some basic tokenizing code and was able to implement a new text2sentiment API endpoint for the Data Science Toolkit:

http://www.datasciencetoolkit.org/developerdocs#text2sentiment

BTW, while you are there, take a look at the Data Science Toolkit more generally.

Glad to hear about the open set of word weights.

Sentiment analysis with undisclosed word weights sounds iffy to me.

It’s like getting a list of rounded numbers but you don’t know the rounding factor.

Even worse with sentiment analysis because every rounding factor may be different.

mSDA: A fast and easy-to-use way to improve bag-of-words features

Friday, June 15th, 2012

mSDA: A fast and easy-to-use way to improve bag-of-words features by Kilian Weinberger.

From the description:

Machine learning algorithms rely heavily on the representation of the data they are presented with. In particular, text documents (and often images) are traditionally expressed as bag-of-words feature vectors (e.g. as tf-idf). Recently Glorot et al. showed that stacked denoising autoencoders (SDA), a deep learning algorithm, can learn representations that are far superior over variants of bag-of-words. Unfortunately, training SDAs often requires a prohibitive amount of computation time and is non-trivial for non-experts. In this work, we show that with a few modifications of the SDA model, we can relax the optimization over the hidden weights into convex optimization problems with closed form solutions. Further, we show that the expected value of the hidden weights after infinitely many training iterations can also be computed in closed form. The resulting transformation (which we call marginalized-SDA) can be computed in no more than 20 lines of straight-forward Matlab code and requires no prior expertise in machine learning. The representations learned with mSDA behave similar to those obtained with SDA, but the training time is reduced by several orders of magnitudes. For example, mSDA matches the world-record on the Amazon transfer learning benchmark, however the training time shrinks from several days to a few minutes.

The Glorot et. al. reference is to: Domain Adaptation for Large-Scale Sentiment Classi cation: A Deep Learning Approach by Xavier Glorot, Antoine Bordes, and Yoshua Bengio, Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011.

Superficial searching reveals this to be a very active area of research.

I rather like the idea of training being reduced from days to minutes.

Sentiment Lexicons (a list)

Friday, April 13th, 2012

Sentiment Lexicons (a list)

From the post:

For those interested in sentiment analysis, I culled some of the sentiment lexicons mentioned in Jurafsky’s NLP class lecture 7-3 and also discussed in Chris Potts’ notes here:

Suggestions of other sentiment or other lexicons? The main ones are fairly well known.

The main ones are just that, the main ones. May or may not reflect the sentiment in particular locales.

Social Media Monitoring with CEP, pt. 2: Context As Important As Sentiment

Sunday, February 5th, 2012

Social Media Monitoring with CEP, pt. 2: Context As Important As Sentiment by Chris Carlson.

From the post:

When I last wrote about social media monitoring, I made a case for using a technology like Complex Event Processing (“CEP”) to detect rapidly growing and geospatially-oriented social media mentions that can provide early warning detection for the public good (Social Media Monitoring for Early Warning of Public Safety Issues, Oct. 27, 2011).

A recent article by Chris Matyszczyk of CNET highlights the often conflicting and confusing nature of monitoring social media. A 26-year old British citizen, Leigh Van Bryan, gearing up for a holiday of partying in Los Angeles, California (USA), tweeted in British slang his intention to have a good time: “Free this week, for quick gossip/prep before I go and destroy America.” Since I’m not too far removed the culture of youth, I did take this to mean partying, cutting loose, having a good time (and other not-so-current definitions.)

This story does not end happily, as Van Bryan and his friend Emily Bunting were arrested and then sent back to Blighty.

This post will not increase American confidence in the TSAbut does illustrate how context can influence the identification of a subject (or “person of interest”) or to exclude the same.

Context is captured in topic maps using associations. In this particular case, a view of the information on the young man in question would reveal a lack of associations with any known terror suspects, people on the no-fly list, suspicious travel patterns, etc.

Not to imply that having good information leads to good decisions, technology can’t correct that particular disconnect.

Be careful with dictionary-based text analysis

Wednesday, October 12th, 2011

Be careful with dictionary-based text analysis

Brendan O’Connor writes:

OK, everyone loves to run dictionary methods for sentiment and other text analysis — counting words from a predefined lexicon in a big corpus, in order to explore or test hypotheses about the corpus. In particular, this is often done for sentiment analysis: count positive and negative words (according to a sentiment polarity lexicon, which was derived from human raters or previous researchers’ intuitions), and then proclaim the output yields sentiment levels of the documents. More and more papers come out every day that do this. I’ve done this myself. It’s interesting and fun, but it’s easy to get a bunch of meaningless numbers if you don’t carefully validate what’s going on. There are certainly good studies in this area that do further validation and analysis, but it’s hard to trust a study that just presents a graph with a few overly strong speculative claims as to its meaning. This happens more than it ought to.

How does “measurement” of sentiment in a document differ from “measurement” of the semantics of terms in that document?

Have we traded “access” to large numbers of documents (think about the usual Internet search engine) for validated collections? By validated collections I mean the discipline-based indexes where the user did not have to weed out completely irrelevant results.