Archive for the ‘Topic Models’ Category

Learning Topic Models – Going beyond SVD

Wednesday, April 18th, 2012

Learning Topic Models – Going beyond SVD by Sanjeev Arora, Rong Ge, and Ankur Moitra.

Abstract:

Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation or LDA model of Blei et al. is especially popular.

Theoretical studies of topic modeling focus on learning the model’s parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves.

This paper formally justifies Nonnegative Matrix Factorization (NMF), an analog of SVD in which all vectors are nonnegative, as a main tool in this context. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. A compelling feature of our algorithm is that it generalizes to models that incorporate topic-topic correlations, such as the Correlated Topic Model and the Pachinko Allocation Model.

We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD – just as NMF has come to replace SVD in many applications.
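The SVD/NMF contrast is easy to see in a few lines of numpy. This is a toy illustration only, not the paper's algorithm: the update rule below is the classic Lee–Seung multiplicative heuristic, and the matrix is a made-up word-document count table.

```python
import numpy as np

def nmf(V, k, iters=200, seed=0):
    """Basic multiplicative-update NMF: V ~ W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    eps = 1e-9  # avoid division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# A small nonnegative word-document count matrix (illustrative data).
V = np.array([[4., 0., 3.],
              [2., 1., 2.],
              [0., 5., 1.]])

# NMF keeps both factors nonnegative, so they can be read as
# (unnormalized) topic distributions.
W, H = nmf(V, k=2)
print("NMF factors nonnegative:", np.min(W) >= 0 and np.min(H) >= 0)

# SVD factors of the same matrix generally contain negative entries,
# which is what blocks the direct "factors = topics" reading.
U, s, Vt = np.linalg.svd(V)
print("SVD factors have negatives:", np.min(U) < 0 or np.min(Vt) < 0)
```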

The proposal hinges on the following assumption:

Separability requires that each topic has some near-perfect indicator word – a word that we call the anchor word for this topic – that appears with reasonable probability in that topic but with negligible probability in all other topics (e.g., “soccer” could be an anchor word for the topic “sports”). We give a formal definition in Section 1.1. This property is particularly natural in the context of topic modeling, where the number of distinct words (dictionary size) is very large compared to the number of topics. In a typical application, it is common to have a dictionary size in the thousands or tens of thousands, but the number of topics is usually somewhere in the range from 50 to 100. Note that separability does not mean that the anchor word always occurs (in fact, a typical document may be very likely to contain no anchor words). Instead, it dictates that when an anchor word does occur, it is a strong indicator that the corresponding topic is in the mixture used to generate the document.
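The anchor-word condition can be sketched in a few lines of numpy. The matrix, vocabulary, and the 0.95 threshold below are all made up for illustration; the paper's formal definition (its Section 1.1) is what actually matters.

```python
import numpy as np

# Hypothetical topic-word matrix: rows are topics, columns are words,
# each row a distribution over the vocabulary. "soccer" and "election"
# are built to be near-anchor words for their respective topics.
vocab = ["soccer", "game", "team", "election", "vote"]
A = np.array([
    [0.40, 0.30, 0.25, 0.01, 0.04],   # "sports" topic
    [0.01, 0.10, 0.05, 0.48, 0.36],   # "politics" topic
])

def anchor_words(A, threshold=0.95):
    """Return indices of words whose mass concentrates in one topic.

    For each word w, look at column A[:, w] (how heavily each topic
    uses w) and flag w if a single topic accounts for at least
    `threshold` of that column's total mass.
    """
    col_mass = A.sum(axis=0)
    share = A.max(axis=0) / col_mass
    return [w for w in range(A.shape[1]) if share[w] >= threshold]

print([vocab[w] for w in anchor_words(A)])
```

Note that "vote" is not flagged: it leans heavily toward one topic but still carries non-negligible mass in the other, which is exactly the distinction the separability assumption draws.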

The notion of an “anchor word” (or multiple anchor words per topic, as the authors point out in the conclusion) resonates with the idea of identifying a subject. It is at least a clue that an author/editor should take into account.

Topic Models

Saturday, December 31st, 2011

Topic Models

From the post:

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. An early topic model was probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann in 1999.[1] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael Jordan in 2002, allowing documents to have a mixture of topics.[2] Other topic models are generally extensions on LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Although topic models were first described and implemented in the context of natural language processing, they have applications in other fields such as bioinformatics.
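The “mixture of topics” idea that distinguishes LDA can be made concrete with a short sketch of its generative story. This is a minimal numpy illustration; the dimensions, priors, and document length are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and priors (not from the post):
n_topics, vocab_size, doc_len = 3, 8, 20
alpha = 0.5  # Dirichlet concentration for per-document topic mixtures

# Each topic is a distribution over the vocabulary.
topics = rng.dirichlet(np.ones(vocab_size), size=n_topics)

def generate_document():
    """LDA's generative story: draw a topic mixture for the document,
    then for each word position draw a topic, then draw a word from
    that topic's distribution."""
    theta = rng.dirichlet(alpha * np.ones(n_topics))  # per-doc mixture
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)        # topic for this position
        w = rng.choice(vocab_size, p=topics[z])  # word from that topic
        words.append(w)
    return theta, words

theta, doc = generate_document()
print("topic mixture:", np.round(theta, 2))
print("document (word ids):", doc)
```

Fitting a topic model is the inverse problem: given only the documents, recover the topics and mixtures.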

Just in case you need some starter materials on discovering “topics” (non-topic map sense) in documents.

Topic Modeling Browser (LDA)

Saturday, March 26th, 2011

Topic Modeling Browser (LDA)

From a post by David Blei:

allison chaney has created the “topic model visualization engine,” which can be used to create browsers of document collections based on a topic model. i think this will become a very useful tool for us. the code is on google code:
http://code.google.com/p/tmve/
as an example, here is a browser built from a 50-topic model fit to 100K articles from wikipedia:
http://www.sccs.swarthmore.edu/users/08/ajb/tmve/wiki100k/browse/topic-list.html
allison describes how she built the browser in the README for her code:
http://code.google.com/p/tmve/wiki/TMVE01
finally, to check out the code and build your own browser, see here:
http://code.google.com/p/tmve/source/checkout

Take a look.

As I have mentioned before, LDA could be a good exploration tool for document collections, preparatory to building a topic map.
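As a concrete starting point for that kind of exploration, here is a minimal, dependency-free sketch of LDA inference via collapsed Gibbs sampling. It is illustrative and unoptimized, with made-up toy documents; a real exploration would use a proper library implementation.

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, iters=100,
              alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA (illustrative only)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # doc-topic counts
    nkw = np.zeros((n_topics, vocab_size))  # topic-word counts
    nk = np.zeros(n_topics)                 # words per topic
    z = []                                  # z[d][i]: topic of word i in doc d
    for d, doc in enumerate(docs):          # random initial assignments
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                 # remove current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # resample from the conditional posterior
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + beta * vocab_size)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return nkw  # unnormalized topic-word counts

# Two toy "documents" over a 6-word vocabulary: words 0-2 co-occur,
# words 3-5 co-occur, so two topics should tend to separate them.
docs = [[0, 1, 2, 0, 1, 2, 0], [3, 4, 5, 3, 4, 5, 3]]
nkw = lda_gibbs(docs, n_topics=2, vocab_size=6)
for k, row in enumerate(nkw):
    print("topic", k, "top words:", np.argsort(row)[::-1][:3])
```

The recovered topic-word counts are the raw material one would inspect before deciding which discovered “topics” deserve to become subjects in a topic map.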

NLP (Natural Language Processing) tools

Friday, January 28th, 2011

Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources

From Stanford University.

It may not list every NLP resource, but it is the place to start if you are looking for a new tool.

This should give you an idea of the range of tools that could be applied to the AF war diaries for example.

Topic Models (warning – not what you may think)

Tuesday, December 21st, 2010

Topic Models, by David M. Blei and John D. Lafferty.

In this article, the terms topic and topic model carry specific technical meanings.

A topic is a set of highly probable words found to characterize a text.

A topic model is the technique applied to a set of texts to extract such topics.

In this particular article, that technique is latent Dirichlet allocation (LDA).

Apologies for the misuse of the term topic, but I suspect that, if we looked closely, someone was using the term topic before ISO 13250.

Good introduction to topic models.