Archive for the ‘Lexicon’ Category

Bridging Semantic Gaps

Tuesday, November 19th, 2013

OK, the real title is: Cross-Language Opinion Lexicon Extraction Using Mutual-Reinforcement Label Propagation by Zheng Lin, Songbo Tan, Yue Liu, Xueqi Cheng, Xueke Xu. (Lin Z, Tan S, Liu Y, Cheng X, Xu X (2013) Cross-Language Opinion Lexicon Extraction Using Mutual-Reinforcement Label Propagation. PLoS ONE 8(11): e79294. doi:10.1371/journal.pone.0079294)


There is a growing interest in automatically building opinion lexicon from sources such as product reviews. Most of these methods depend on abundant external resources such as WordNet, which limits the applicability of these methods. Unsupervised or semi-supervised learning provides an optional solution to multilingual opinion lexicon extraction. However, the datasets are imbalanced in different languages. For some languages, the high-quality corpora are scarce or hard to obtain, which limits the research progress. To solve the above problems, we explore a mutual-reinforcement label propagation framework. First, for each language, a label propagation algorithm is applied to a word relation graph, and then a bilingual dictionary is used as a bridge to transfer information between two languages. A key advantage of this model is its ability to make two languages learn from each other and boost each other. The experimental results show that the proposed approach outperforms baseline significantly.

I have always wondered when someone would notice the WordNet database is limited to the English language. 😉

The authors are seeking to develop “…a language-independent approach for resource-poor language,” saying:

Our approach differs from existing approaches in the following three points: first, it does not depend on rich external resources and it is language-independent. Second, our method is domain-specific since the polarity of opinion word is domain-aware. We aim to extract the domain-dependent opinion lexicon (i.e. an opinion lexicon per domain) instead of a universal opinion lexicon. Third, the most importantly, our approach can mine opinion lexicon for a target language by leveraging data and knowledge available in another language…

Our approach propagates information back and forth between source language and target language, which is called mutual-reinforcement label propagation. The mutual-reinforcement label propagation model follows a two-stage framework. At the first stage, for each language, a label propagation algorithm is applied to a large word relation graph to produce a polarity estimate for any given word. This stage solves the problem of external resource dependency, and can be easily transferred to almost any language because all we need are unlabeled data and a couple of seed words. At the second stage, a bilingual dictionary is introduced as a bridge between source and target languages to start a bootstrapping process. Initially, information about the source language can be utilized to improve the polarity assignment in target language. In turn, the updated information of target language can be utilized to improve the polarity assignment in source language as well.

Two points of particular interest:

  1. The authors focus on creating domain specific lexicons and don’t attempt to boil the ocean. Useful semantic results will arrive sooner if you avoid attempts at universal solutions.
  2. English speakers are a large market, but the target of this exercise is the #1 language of the world, Mandarin Chinese.

    Taking the numbers for English speakers at face value, approximately 0.8 billion speakers, with a world population of 7.125 billion, that leaves 6.3 billion potential customers.

You’ve heard what they say: A billion potential customers here and a billion potential customers there, pretty soon you are talking about a real market opportunity. (The original quote misattributed to Sen. Everett Dirksen.)


Monday, November 18th, 2013

jLemmaGen by Michal Hlaváč.

From the webpage:

JLemmaGen is java implmentation of LemmaGen project. It’s open source lemmatizer with 15 prebuilded european lexicons. Of course you can build your own lexicon.

LemmaGen project aims at providing standardized open source multilingual platform for lemmatisation.

Project contains 2 libraries:

  • lemmagen.jar – implementation of lemmatizer and API for building own lemmatizers
  • lemmagen-lang.jar – prebuilded lemmatizers from Multext Eastern dictionaries

Whether you want to expand your market or just to avoid officious U.S. officials for the next decade or so, multilingual resources are the key to making that happen.


Sentiment Lexicons (a list)

Friday, April 13th, 2012

Sentiment Lexicons (a list)

From the post:

For those interested in sentiment analysis, I culled some of the sentiment lexicons mentioned in Jurafsky’s NLP class lecture 7-3 and also discussed in Chris Potts’ notes here:

Suggestions of other sentiment or other lexicons? The main ones are fairly well known.

The main ones are just that, the main ones. May or may not reflect the sentiment in particular locales.

Twitter Current English Lexicon

Thursday, March 8th, 2012

Twitter Current English Lexicon

From the description:

Twitter Current English Lexicon: Based on the Twitter Stratified Random Sample Corpus, we regularly extract the Twitter Current English Lexicon. Basically, we’re 1) pulling all tweets from the last three months of corpus entries that have been marked as “English” by the collection process (we have to make that call because there is no reliable means provided by Twitter), 2) removing all #hash, @at, and http items, 3) breaking the tweets into tokens, 4) building descriptive and summary statistics for all token-based 1-grams and 2-grams, and 5) pushing the top 10,000 N-grams from each set into a database and text files for review. So, for every top 1-gram and 2-gram, you know how many times it occurred in the corpus, and in how many tweets (plus associated percentages).

This is an interesting set of data, particularly when you compare it with a “regular” English corpus, something traditional like the Brown Corpus. Unlike most corpora, the top token (1-gram) for Twitter is “i” (as in me, myself, and I), there are a lot of intentional misspellings, and you find an undue amount of, shall we say, “callus” language (be forewarned). It’s a brave new world if you’re willing.

To use this data set, we recommend using the database version and KwicData, but you can also use the text version. Download the ZIP file you want, unzip it, then read the README file for more explanation about what’s included.

I grabbed a copy yesterday but haven’t had the time to look at it.

Twitter feed pipeline software you would recommend?