Archive for the ‘Latent Dirichlet Allocation (LDA)’ Category

In-browser topic modeling

Friday, April 26th, 2013

In-browser topic modeling by David Mimno.

From the post:

Many people have found topic modeling a useful (and fun!) way to explore large text collections. Unfortunately, running your own models usually requires installing statistical tools like R or Mallet. The goals of this project are to (a) make running topic models easy for anyone with a modern web browser, (b) explore the limits of statistical computing in Javascript and (c) allow tighter integration between models and web-based visualizations.

About as easy an introduction/exploration as I can imagine.

Enjoy!

Topic Discovery With Apache Pig and Mallet

Friday, February 1st, 2013

Topic Discovery With Apache Pig and Mallet

Only one of two posts from this blog in 2012 but it is a useful one.

From the post:

A common desire when working with natural language is topic discovery. That is, given a set of documents (eg. tweets, blog posts, emails) you would like to discover the topics inherent in those documents. Often this method is used to summarize a large corpus of text so it can be quickly understood what that text is ‘about’. You can go further and use topic discovery as a way to classify new documents or to group and organize the documents you’ve done topic discovery on.

Walks through the use of Pig and Mallet on a newsgroup data set.

I have been thinking about getting one of those unlimited download newsgroup accounts.

Maybe I need to go ahead and start building some newsgroup data sets.

My Intro to Multiple Classification…

Saturday, December 29th, 2012

My Intro to Multiple Classification with Random Forests, Conditional Inference Trees, and Linear Discriminant Analysis

From the post:

After the work I did for my last post, I wanted to practice doing multiple classification. I first thought of using the famous iris dataset, but felt that was a little boring. Ideally, I wanted to look for a practice dataset where I could successfully classify data using both categorical and numeric predictors. Unfortunately it was tough for me to find such a dataset that was easy enough for me to understand.

The dataset I use in this post comes from a textbook called Analyzing Categorical Data by Jeffrey S Simonoff, and lends itself to basically the same kind of analysis done by blogger “Wingfeet” in his post predicting authorship of Wheel of Time books. In this case, the dataset contains counts of stop words (function words in English, such as “as”, “also, “even”, etc.) in chapters, or scenes, from books or plays written by Jane Austen, Jack London (I’m not sure if “London” in the dataset might actually refer to another author), John Milton, and William Shakespeare. Being a textbook example, you just know there’s something worth analyzing in it!! The following table describes the numerical breakdown of books and chapters from each author:

Introduction to authorship studies as they were known (may still be) in the academic circles of my youth.

I wonder if the same techniques are as viable today as on the Federalist Papers?

The Wheel of Time example demonstrates the technique remains viable for novel authors.

But what about authorship more broadly?

Can we reliably distinguish between news commentary from multiple sources?

Or between statements by elected officials?

How would your topic map represent purported authorship versus attributed authorship?

Or even a common authorship for multiple purported authors? (speech writers)

Prioritizing PubMed articles…

Wednesday, November 21st, 2012

Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information by Sun Kim, Won Kim, Chih-Hsuan Wei, Zhiyong Lu and W. John Wilbur.

Abstract:

The Comparative Toxicogenomics Database (CTD) contains manually curated literature that describes chemical–gene interactions, chemical–disease relationships and gene–disease relationships. Finding articles containing this information is the first and an important step to assist manual curation efficiency. However, the complex nature of named entities and their relationships make it challenging to choose relevant articles. In this article, we introduce a machine learning framework for prioritizing CTD-relevant articles based on our prior system for the protein–protein interaction article classification task in BioCreative III. To address new challenges in the CTD task, we explore a new entity identification method for genes, chemicals and diseases. In addition, latent topics are analyzed and used as a feature type to overcome the small size of the training set. Applied to the BioCreative 2012 Triage dataset, our method achieved 0.8030 mean average precision (MAP) in the official runs, resulting in the top MAP system among participants. Integrated with PubTator, a Web interface for annotating biomedical literature, the proposed system also received a positive review from the CTD curation team.

An interesting summary of entity recognition issues in bioinformatics occurs in this article:

The second problem is that chemical and disease mentions should be identified along with gene mentions. Named entity recognition (NER) has been a main research topic for a long time in the biomedical text-mining community. The common strategy for NER is either to apply certain rules based on dictionaries and natural language processing techniques (5–7) or to apply machine learning approaches such as support vector machines (SVMs) and conditional random fields (8–10). However, most NER systems are class specific, i.e. they are designed to find only objects of one particular class or set of classes (11). This is natural because chemical, gene and disease names have specialized terminologies and complex naming conventions. In particular, gene names are difficult to detect because of synonyms, homonyms, abbreviations and ambiguities (12,13). Moreover, there are no specific rules of how to name a gene that are actually followed in practice (14). Chemicals have systematic naming conventions, but finding chemical names from text is still not easy because there are various ways to express chemicals (15,16). For example, they can be mentioned as IUPAC names, brand names, generic names or even molecular formulas. However, disease names in literature are more standardized (17) compared with gene and chemical names. Hence, using terminological resources such as Medical Subject Headings (MeSH) and Unified Medical Language System (UMLS) Metathesaurus help boost the identification performance (17,18). But, a major drawback of identifying disease names from text is that they often use general English terms.

Having a common representative for a group of identifiers for a single entity, should simplify the creation of mappings between entities.

Yes?

Topic Modeling Tool

Monday, November 5th, 2012

Topic Modeling Tool

From the webpage:

A graphical user interface tool for Latent Dirichlet Allocation topic modeling.

A very easy tool for exploring the use of Latent Dirichlet Allocation topic modeling.

Of course, on non-Mac machines, there is no “Double-click” on the jar file to run it, so use:

java -jar TopicModelingTool.jar

Oh, and the documentation is missing the link to the test files, see:

http://code.google.com/p/topic-modeling-tool/downloads/list

  • testdatanews_music_2084docs.txt 13.3 MB
  • testdata_news_economy_2073docs.txt 13.0 MB
  • testdata_news_fuel_845docs.txt 5.3 MB
  • testdata_braininjury_10000docs.txt 9.6 MB

I used testdata_news_music_2048docs.txt file, set to 100 topics with the default options and the learning process took 52 seconds and the complete process 66.056 seconds. Your mileage will vary but fast enough for smallish data sets.

At least in a session, you can’t change the output directory.

I could see using this in a class to explore a body of material for creation of topic maps.

GraphChi parsers toolkit

Tuesday, September 11th, 2012

GraphChi parsers toolkit by Danny Bickson.

From the post:

To the request of Baoqiang Cao I have started a parsers toolkits in GraphChi to be used for preparing data to be used in GraphLab/ Graphchi. The parsers should be used as template which can be easy customized to user specific needs.

Danny starts us off with an LDA parser (with worked example of its use) and then adds a Twitter parser that creates a graph of retweets.

Enjoy!

A Simple Topic Model

Friday, July 27th, 2012

A Simple Topic Model by Allen Beye Riddell.

From the post:

NB: This is an extended version of the appendix of my paper exploring trends in German Studies in the US between 1928 and 2006. In that paper I used a topic model (Latent Dirichlet Allocation); this tutorial is intended to help readers understand how LDA works.

Topic models typically start with two banal assumptions. The first is that in a large collection of texts there exist a number of distinct groups (or sources) of texts. In the case of academic journal articles, these groups might be associated with different journals, authors, research subfields, or publication periods (e.g. the 1950s and 1980s). The second assumption is that texts from different sources tend to use different vocabulary. If we are presented with an article selected from one of two different academic journals, one dealing with literature and another with archeology, and we are told only that the word “plot” appears frequently in the article, we would be wise to guess the article comes from the literary studies journal.1

A major obstacle to understanding the remaining details about how topic models work is that their description relies on the abstract language of probability. Existing introductions to Latent Dirichlet Allocation (LDA) tend to be pitched either at an audience already fluent in statistics or at an audience with minimal background.2 This being the case, I want to address an audience that has some background in probability and statistics, perhaps at the level of the introductory texts of Hoff (2009), Lee (2004), or Kruschke (2010).

A good walk through on using a topic model (Latent Dirichlet Allocation).

I saw this in Christophe Lalanne’s Bag of Tweets for July 2012.

Learning Topic Models – Going beyond SVD

Wednesday, April 18th, 2012

Learning Topic Models – Going beyond SVD by Sanjeev Arora, Rong Ge, and Ankur Moitra.

Abstract:

Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation or LDA model of Blei et al. is especially popular.

Theoretical studies of topic modeling focus on learning the model’s parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition(SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves.

This paper formally justifies Nonnegative Matrix Factorization(NMF) as a main tool in this context, which is an analog of SVD where all vectors are nonnegative. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. A compelling feature of our algorithm is that it generalizes to models that incorporate topic-topic correlations, such as the Correlated Topic Model and the Pachinko Allocation Model.

We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD – just as NMF has come to replace SVD in many applications.

The proposal hinges on the following assumption:

Separability requires that each topic has some near-perfect indicator word – a word that we call the anchor word for this topic— that appears with reasonable probability in that topic but with negligible probability in all other topics (e.g., “soccer” could be an anchor word for the topic “sports”). We give a formal definition in Section 1.1. This property is particularly natural in the context of topic modeling, where the number of distinct words (dictionary size) is very large compared to the number of topics. In a typical application, it is common to have a dictionary size in the thousands or tens of thousands, but the number of topics is usually somewhere in the range from 50 to 100. Note that separability does not mean that the anchor word always occurs (in fact, a typical document may be very likely to contain no anchor words). Instead, it dictates that when an anchor word does occur, it is a strong indicator that the corresponding topic is in the mixture used to generate the document.

The notion of an “anchor word” (or multiple anchor words per topics as the authors point out in the conclusion) resonates with the idea of identifying a subject. It is at least a clue that an author/editor should take into account.

Topic Models

Saturday, December 31st, 2011

Topic Models

From the post:

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. An early topic model was probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann in 1999.[1] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael Jordan in 2002, allowing documents to have a mixture of topics.[2] Other topic models are generally extensions on LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Although topic models were first described and implemented in the context of natural language processing, they have applications in other fields such as bioinformatics.

Just in case you need some starter materials on discovering “topics” (non-topic map sense) in documents.

DYI – Topic Modeling

Saturday, October 8th, 2011

How to do Your Own Topic Modeling

From the post:

In the first Teaching with Technology Tuesday of the fall 2011 semester, David Newman delivered a presentation on topic modeling to a full house in Bass’s L01 classroom. His research concentrates on data mining and machine learning, and he has been working with Yale for the past three years in an IMLS funded project on the applications of topic modeling in museum and library collections. In Tuesday’s talk, David broke down what topic modeling is, how it can be useful, and introduced a tool he designed to make the process accessible to anyone who can use a computer.

Summary of what sounds like an interesting presentation on the use of topic modeling (Latent Dirichlet Allocation/LDA) along with links to software. Enough detail that if topic modeling is unfamiliar, you will get the gist of it.

The usual cautions about LDA apply: It can’t model what’s not present, works at the document level (too coarse for many purposes), your use of the software has a dramatic impact on the results, etc. Useful tool, just be careful how much you rely upon it without checking the results.

Introduction to Latent Dirichlet Allocation

Saturday, October 1st, 2011

Introduction to Latent Dirichlet Allocation by Edwin Chen.

From the introduction:

Suppose you have the following set of sentences:

  • I like to eat broccoli and bananas.
  • I ate a banana and spinach smoothie for breakfast.
  • Chinchillas and kittens are cute.
  • My sister adopted a kitten yesterday.
  • Look at this cute hamster munching on a piece of broccoli.

What is latent Dirichlet allocation? It’s a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like

  • Sentences 1 and 2: 100% Topic A
  • Sentences 3 and 4: 100% Topic B
  • Sentence 5: 60% Topic A, 40% Topic B
  • Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
  • Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)

The question, of course, is: how does LDA perform this discovery?

About as smooth an explanation of Latent Dirichlet Allocation as you are going to find.

Learning Topic Models by Belief Propagation

Friday, September 16th, 2011

Learning Topic Models by Belief Propagation by Jia Zeng, William K. Cheung, and Jiming Liu.

Abstract:

Latent Dirichlet allocation (LDA) is an important class of hierarchical Bayesian models for probabilistic topic modeling, which attracts worldwide interests and touches many important applications in text mining, computer vision and computational biology. This paper proposes a novel tree-structured factor graph representation for LDA within the Markov random field (MRF) framework, which enables the classic belief propagation (BP) algorithm for exact inference and parameter estimation. Although two commonly-used approximation inference methods, such as variational Bayes (VB) and collapsed Gibbs sampling (GS), have gained great successes in learning LDA, the proposed BP is competitive in both speed and accuracy validated by encouraging experimental results on four large-scale document data sets. Furthermore, the BP algorithm has the potential to become a generic learning scheme for variants of LDA-based topic models. To this end, we show how to learn two typical variants of LDA-based topic models, such as author-topic models (ATM) and relational topic models (RTM), using belief propagation based on the factor graph representation.

I have just started reading this paper but wanted to bring it to your attention. I peeked at the results and it looks quite promising.

This work was tested against the following data sets:

1) CORA [30] contains abstracts from the CORA research paper search engine in machine learning area, where the documents can be classified into 7 major categories.

2) MEDL [31] contains abstracts from the MEDLINE biomedical paper search engine, where the documents fall broadly into 4 categories.

3) NIPS [32] includes papers from the conference “Neural Information Processing Systems”, where all papers are grouped into 13 categories. NIPS has no citation link information.

4) BLOG [33] contains a collection of political blogs on the subject of American politics in the year 2008. where all blogs can be broadly classified into 6 categories. BLOG has no author information.

with positive results.

Topic Modeling Bibliography

Friday, September 16th, 2011

Topic Modeling Bibliography

An extensive bibliography on topic modeling (LDA) by David Mimno.

There are a number of related resources on his homepage.

SIGKDD 2011 Conference

Tuesday, September 6th, 2011

A pair of posts from Ryan Rosario on the SIGKDD 2011 Conference.

Day 1 (Graph Mining and David Blei/Topic Models)

Tough sledding on Probabilistic Topic Models but definitely worth the effort to follow.

Days 2/3/4 Summary

Useful summaries and pointers to many additional resources.

If you attended SIGKDD 2011, do you have pointers to other reviews of the conference or other resources?

I added a category for SIGKDD.

What is a good explanation of Latent Dirichlet Allocation? (Quora)

Friday, September 2nd, 2011

What is a good explanation of Latent Dirichlet Allocation? (Quora)

If you need to explain topic modeling to your boss, department chair or funder, you would be hard pressed to find a better source of inspiration.

The explanation here ranges from technical to layman to actual example (Sarah Palin’s emails so you might better check on the audience’s political persuasion). Actually it would not hurt to have LDA examples on hand that run the gamut of political persuasions. (Or national perspectives if you are in the international market.)

BTW, if you not familiar with Quora, give it a look.

This link was forwarded to my attention by Jack Park.

Topic Modeling Sarah Palin’s Emails

Wednesday, June 29th, 2011

Topic Modeling Sarah Palin’s Emails from Edwin Chen.

From the post:

LDA-based Email Browser

Earlier this month, several thousand emails from Sarah Palin’s time as governor of Alaska were released. The emails weren’t organized in any fashion, though, so to make them easier to browse, I did some topic modeling (in particular, using latent Dirichlet allocation) to separate the documents into different groups.

Interesting analysis and promise of more to follow.

With a US presidential election next year, there is little doubt there will be friendly as well as hostile floods of documents.

Time to sharpen your data extraction tools.

An Architecture for Parallel Topic Models

Wednesday, June 15th, 2011

An Architecture for Parallel Topic Models by Alexander Smola and Shravan Narayanamurthy.

Abstract:

This paper describes a high performance sampling architecture for inference of latent topic models on a cluster of workstations. Our system is faster than previous work by over an order of magnitude and it is capable of dealing with hundreds of millions of documents and thousands of topics.

The algorithm relies on a novel communication structure, namely the use of a distributed (key, value) storage for synchronizing the sampler state between computers. Our architecture entirely obviates the need for separate computation and synchronization phases. Instead, disk, CPU, and network are used simultaneously to achieve high performance. We show that this architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.

Interesting how this key, value stuff keeps coming up these days.

The authors plan on making the codebase available for public use.


Updated 30 June 2011 to include the URL supplied by Sam Hunting. (Thanks Sam!)

Topic Modeling Browser (LDA)

Saturday, March 26th, 2011

Topic Modeling Browser (LDA)

From a post by David Blei:

allison chaney has created the “topic model visualization engine,” which can be used to create browsers of document collections based on a topic model. i think this will become a very useful tool for us. the code is on google code:
http://code.google.com/p/tmve/
as an example, here is a browser built from a 50-topic model fit to 100K articles from wikipedia:
http://www.sccs.swarthmore.edu/users/08/ajb/tmve/wiki100k/browse/topic-list.html
allison describes how she built the browser in the README for her code:
http://code.google.com/p/tmve/wiki/TMVE01
finally, to check out the code and build your own browser, see here:
http://code.google.com/p/tmve/source/checkout

Take a look.

As I have mentioned before, LDA could be a good exploration tool for document collections, preparatory to building a topic map.

Latent Dirichlet Allocation in C

Wednesday, March 16th, 2011

Latent Dirichlet Allocation in C

From the website:

This is a C implementation of variational EM for latent Dirichlet allocation (LDA), a topic model for text or other discrete data. LDA allows you to analyze of corpus, and extract the topics that combined to form its documents. For example, click here to see the topics estimated from a small corpus of Associated Press documents. LDA is fully described in Blei et al. (2003) .

This code contains:

  • an implementation of variational inference for the per-document topic proportions and per-word topic assignments
  • a variational EM procedure for estimating the topics and exchangeable Dirichlet hyperparameter

Do be aware that the use of topic in this technique and papers discussing it is not the same thing as topic as defined by ISO 13250-2.

It comes closer to the notion of subject as defined in ISO 13250-2.

Update:

I was sent a pointer to David M. Blei’s
http://www.cs.princeton.edu/~blei/topicmodeling.html, which has more code and other goodies.

PyBrain: The Python Machine Learning Library

Thursday, February 3rd, 2011

PyBrain: The Python Machine Learning Library

From the website:

PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms.

PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive “Backronym”.

How is PyBrain different?

While there are a few machine learning libraries out there, PyBrain aims to be a very easy-to-use modular library that can be used by entry-level students but still offers the flexibility and algorithms for state-of-the-art research. We are constantly working on more and faster algorithms, developing new environments and improving usability.

What PyBrain can do

PyBrain, as its written-out name already suggests, contains algorithms for neural networks, for reinforcement learning (and the combination of the two), for unsupervised learning, and evolution. Since most of the current problems deal with continuous state and action spaces, function approximators (like neural networks) must be used to cope with the large dimensionality. Our library is built around neural networks in the kernel and all of the training methods accept a neural network as the to-be-trained instance. This makes PyBrain a powerful tool for real-life tasks.

Another tool kit to assist in the construction of topic maps.

And another likely contender for the Topic Map Competition!

MALLET: MAchine Learning for LanguagE Toolkit
Topic Map Competition (TMC) Contender?

Thursday, February 3rd, 2011

MALLET: MAchine Learning for LanguagE Toolkit

From the website:

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

MALLET includes sophisticated tools for document classification: efficient routines for converting text to “features”, a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

Topic models are useful for analyzing large collections of unlabeled text. The MALLET topic modeling toolkit contains efficient, sampling-based implementations of Latent Dirichlet Allocation, Pachinko Allocation, and Hierarchical LDA.

Many of the algorithms in MALLET depend on numerical optimization. MALLET includes an efficient implementation of Limited Memory BFGS, among many other optimization methods.

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of “pipes”, which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.

An add-on package to MALLET, called GRMM, contains support for inference in general graphical models, and training of CRFs with arbitrary graphical structure.

Another tool to assist in the authoring of a topic map from a large data set.

It would be interesting but beyond the scope of the topic maps class, to organize a competition around several of the natural language processing packages.

To have a common data set, to be released on X date, with topic maps due say within 24 hours (there is a TV show with that in the title or so I am told).

Will have to give that some thought.

Could be both interesting and entertaining.

Reading Tea Leaves: How Humans Interpret Topic Models

Wednesday, December 22nd, 2010

Reading Tea Leaves: How Humans Interpret Topic Models Authors: Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, David M. Blei

Abstract:

Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summarize the corpus, and guide exploration of its contents. However, whether the latent space is interpretable is in need of quantitative evaluation. In this paper, we present new quantitative methods for measuring semantic meaning in inferred topics. We back these measures with large-scale user studies, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood. Surprisingly, topic models which perform better on held-out likelihood may infer less semantically meaningful topics.

Read the article first but then see the LingPipe Blog review of the same.

Topic Models (warning – not what you may think)

Tuesday, December 21st, 2010

Topic Models Authors: David M. Blei, John D. Lafferty

The term topic and topic model refer to sets of highly probable words that are found to characterize a text.

The sets of words are called topics.

A topic model is the technique applied to a set of texts to extract topics.

In this particular article, latent Dirichlet allocation (LDA).

Apologies for the mis-use of the term topic but I suspect if we looked closely, someone was using the term topic before ISO 13250.

Good introduction to topic models.

Named Entity Mining from Click-Through Data Using Weakly Supervised Latent Dirichlet Allocation

Monday, December 20th, 2010

Named Entity Mining from Click-Through Data Using Weakly Supervised Latent Dirichlet Allocation (video) Authors: Shuang-Hong Yang, Gu Xu, Hang Li slides KDD ’09 paper

Abstract:

This paper addresses Named Entity Mining (NEM), in which we mine knowledge about named entities such as movies, games, and books from a huge amount of data. NEM is potentially useful in many applications including web search, online advertisement, and recommender system. There are three challenges for the task: finding suitable data source, coping with the ambiguities of named entity classes, and incorporating necessary human supervision into the mining process. This paper proposes conducting NEM by using click-through data collected at a web search engine, employing a topic model that generates the click-through data, and learning the topic model by weak supervision from humans. Specifically, it characterizes each named entity by its associated queries and URLs in the click-through data. It uses the topic model to resolve ambiguities of named entity classes by representing the classes as topics. It employs a method, referred to as Weakly Supervised Latent Dirichlet Allocation (WS-LDA), to accurately learn the topic model with partially labeled named entities. Experiments on a large scale click-through data containing over 1.5 billion query-URL pairs show that the proposed approach can conduct very accurate NEM and significantly outperforms the baseline.

With some slight modifications, almost directly applicable to the construction of topic maps.

Questions:

  1. What presumptions underlie the use of supervision to assist with Named Entity Mining? (2-3 pages, no citations)
  2. Are those valid presumptions for click-through data? (2-3 pages, no citations)
  3. How would you suggest investigating the characteristics of click-through data? (2-3 pages, no citations)

The Sensitivity of Latent Dirichlet Allocation for Information Retrieval

Monday, December 20th, 2010

The Sensitivity of Latent Dirichlet Allocation for Information Retrieval Author: Laurence A. F. Park, The University of Melbourne. slides

Abstract:

It has been shown that the use of topic models for Information retrieval provides an increase in precision when used in the appropriate form. Latent Dirichlet Allocation (LDA) is a generative topic model that allows us to model documents using a Dirichlet prior. Using this topic model, we are able to obtain a fitted Dirichlet parameter that provides the maximum likelihood for the document set. In this article, we examine the sensitivity of LDA with respect to the Dirichlet parameter when used for Information retrieval. We compare the topic model computation times, storage requirements and retrieval precision of fitted LDA to LDA with a uniform Dirichlet prior. The results show there there is no significant benefit of using fitted LDA over the LDA with a constant Dirichlet parameter, hence showing that LDA is insensitive with respect to the Dirichlet parameter when used for Information retrieval.

Note that topic is used in semantic analysis (of various kinds) to mean highly probable words and not in the technical sense of the TMDM or XTM.

Extraction of highly probably words from documents can be useful in the construction of topic maps for those documents.