Archive for the ‘Topic Models (LDA)’ Category

In-browser topic modeling

Friday, April 26th, 2013

In-browser topic modeling by David Mimno.

From the post:

Many people have found topic modeling a useful (and fun!) way to explore large text collections. Unfortunately, running your own models usually requires installing statistical tools like R or Mallet. The goals of this project are to (a) make running topic models easy for anyone with a modern web browser, (b) explore the limits of statistical computing in Javascript and (c) allow tighter integration between models and web-based visualizations.

About as easy an introduction/exploration as I can imagine.

Enjoy!

A Simple Topic Model

Friday, July 27th, 2012

A Simple Topic Model by Allen Beye Riddell.

From the post:

NB: This is an extended version of the appendix of my paper exploring trends in German Studies in the US between 1928 and 2006. In that paper I used a topic model (Latent Dirichlet Allocation); this tutorial is intended to help readers understand how LDA works.

Topic models typically start with two banal assumptions. The first is that in a large collection of texts there exist a number of distinct groups (or sources) of texts. In the case of academic journal articles, these groups might be associated with different journals, authors, research subfields, or publication periods (e.g. the 1950s and 1980s). The second assumption is that texts from different sources tend to use different vocabulary. If we are presented with an article selected from one of two different academic journals, one dealing with literature and another with archeology, and we are told only that the word “plot” appears frequently in the article, we would be wise to guess the article comes from the literary studies journal.

A major obstacle to understanding the remaining details about how topic models work is that their description relies on the abstract language of probability. Existing introductions to Latent Dirichlet Allocation (LDA) tend to be pitched either at an audience already fluent in statistics or at an audience with minimal background. This being the case, I want to address an audience that has some background in probability and statistics, perhaps at the level of the introductory texts of Hoff (2009), Lee (2004), or Kruschke (2010).

A good walk-through of using a topic model (Latent Dirichlet Allocation).

I saw this in Christophe Lalanne’s Bag of Tweets for July 2012.
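The “plot” example in the excerpt is Bayes' rule in disguise. Here is a minimal sketch of that reasoning. The priors and word frequencies are numbers I made up for illustration; nothing here comes from Riddell's tutorial.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a finite set of candidate sources."""
    joint = {source: prior[source] * likelihood[source] for source in prior}
    total = sum(joint.values())
    return {source: value / total for source, value in joint.items()}

# Hypothetical values: both journals equally likely a priori,
# and "plot" five times more frequent in the literary journal.
prior = {"literature": 0.5, "archeology": 0.5}
p_plot = {"literature": 0.010, "archeology": 0.002}

post = posterior(prior, p_plot)
for source, p in post.items():
    print(f"{source}: {p:.3f}")
```

A topic model makes the same kind of guess, except that the sources and their vocabularies have to be learned from the texts rather than given up front.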

Auto Tagging Articles using Semantic Analysis and Machine Learning

Wednesday, May 2nd, 2012

Auto Tagging Articles using Semantic Analysis and Machine Learning

Description:

The idea is to implement an auto tagging feature that provides tags automatically to the user depending upon the content of the post. The tags will get populated as soon as the user leaves the focus on the content text area or via ajax on the press of a button. I’ll be using semantic analysis and topic modeling techniques to judge the topic of the article and extract keywords also from it. Based on an algorithm and a ranking mechanism the user will be provided with a list of tags from which he can select those that best describe the article and also train a user-content specific semi-supervised machine learning model in the background.

A Drupal sandbox for work on auto tagging posts.

Or, topic map authoring without being “in your face.”

Depends on how you read “tags.”

Learning Topic Models – Going beyond SVD

Wednesday, April 18th, 2012

Learning Topic Models – Going beyond SVD by Sanjeev Arora, Rong Ge, and Ankur Moitra.

Abstract:

Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation or LDA model of Blei et al. is especially popular.

Theoretical studies of topic modeling focus on learning the model’s parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves.

This paper formally justifies Nonnegative Matrix Factorization (NMF) as a main tool in this context, which is an analog of SVD where all vectors are nonnegative. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. A compelling feature of our algorithm is that it generalizes to models that incorporate topic-topic correlations, such as the Correlated Topic Model and the Pachinko Allocation Model.

We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD – just as NMF has come to replace SVD in many applications.

The proposal hinges on the following assumption:

Separability requires that each topic has some near-perfect indicator word – a word that we call the anchor word for this topic – that appears with reasonable probability in that topic but with negligible probability in all other topics (e.g., “soccer” could be an anchor word for the topic “sports”). We give a formal definition in Section 1.1. This property is particularly natural in the context of topic modeling, where the number of distinct words (dictionary size) is very large compared to the number of topics. In a typical application, it is common to have a dictionary size in the thousands or tens of thousands, but the number of topics is usually somewhere in the range from 50 to 100. Note that separability does not mean that the anchor word always occurs (in fact, a typical document may be very likely to contain no anchor words). Instead, it dictates that when an anchor word does occur, it is a strong indicator that the corresponding topic is in the mixture used to generate the document.

The notion of an “anchor word” (or multiple anchor words per topic, as the authors point out in the conclusion) resonates with the idea of identifying a subject. It is at least a clue that an author/editor should take into account.
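To make the separability assumption concrete, here is a minimal sketch that looks for anchor words in a known topic-word table. It is not the authors' algorithm, which has to learn the topics first. The thresholds `min_p` and `eps` and the toy probabilities are my own assumptions standing in for “reasonable” and “negligible” probability.

```python
def anchor_words(topics, min_p=0.05, eps=1e-3):
    """Return, per topic, the words that are probable there and negligible elsewhere."""
    vocab = sorted({word for dist in topics.values() for word in dist})
    anchors = {name: [] for name in topics}
    for word in vocab:
        probs = {name: dist.get(word, 0.0) for name, dist in topics.items()}
        for name, p in probs.items():
            others = [q for other, q in probs.items() if other != name]
            if p >= min_p and all(q <= eps for q in others):
                anchors[name].append(word)
    return anchors

def is_separable(topics, **thresholds):
    """Separable here means every topic has at least one anchor word."""
    return all(anchor_words(topics, **thresholds).values())

# Hypothetical topic-word probabilities, for illustration only.
topics = {
    "sports": {"soccer": 0.30, "game": 0.40, "team": 0.30},
    "politics": {"election": 0.35, "vote": 0.40, "game": 0.15, "team": 0.10},
}

anchors = anchor_words(topics)
print(anchors)
print(is_separable(topics))
```

Words such as “game” and “team” fail the test because both topics use them, which is exactly why they are poor clues to the subject.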

Topic Models

Saturday, December 31st, 2011

Topic Models

From the post:

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. An early topic model was probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann in 1999.[1] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael Jordan in 2002, allowing documents to have a mixture of topics.[2] Other topic models are generally extensions on LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Although topic models were first described and implemented in the context of natural language processing, they have applications in other fields such as bioinformatics.

Just in case you need some starter materials on discovering “topics” (non-topic map sense) in documents.

Introduction to Latent Dirichlet Allocation

Saturday, October 1st, 2011

Introduction to Latent Dirichlet Allocation by Edwin Chen.

From the introduction:

Suppose you have the following set of sentences:

  • I like to eat broccoli and bananas.
  • I ate a banana and spinach smoothie for breakfast.
  • Chinchillas and kittens are cute.
  • My sister adopted a kitten yesterday.
  • Look at this cute hamster munching on a piece of broccoli.

What is latent Dirichlet allocation? It’s a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like

  • Sentences 1 and 2: 100% Topic A
  • Sentences 3 and 4: 100% Topic B
  • Sentence 5: 60% Topic A, 40% Topic B
  • Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
  • Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)

The question, of course, is: how does LDA perform this discovery?

About as smooth an explanation of Latent Dirichlet Allocation as you are going to find.
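One standard answer to “how does LDA perform this discovery?” is collapsed Gibbs sampling. What follows is a minimal sketch of that technique run on the five sentences above, not a reproduction of Chen's post. The stop-word removal, the hyperparameters `alpha` and `beta`, and the iteration count are all my assumptions, and on so little text the result will vary with the seed.

```python
import random

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]            # words in doc d assigned to topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics                          # words assigned to topic k overall
    assign = []

    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by how much the document likes the topic
                # times how much the topic likes the word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + len(vocab) * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word

# The five sentences, lowercased with common stop words dropped (my choice).
docs = [
    ["like", "eat", "broccoli", "bananas"],
    ["ate", "banana", "spinach", "smoothie", "breakfast"],
    ["chinchillas", "kittens", "cute"],
    ["sister", "adopted", "kitten", "yesterday"],
    ["look", "cute", "hamster", "munching", "piece", "broccoli"],
]

theta, topic_word = lda_gibbs(docs)
for row in theta:
    print([round(p, 2) for p in row])
```

Each printed row is one sentence's estimated topic mixture, comparable in form to the “60% Topic A, 40% Topic B” line in the excerpt.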

Topic Modeling Bibliography

Friday, September 16th, 2011

Topic Modeling Bibliography

An extensive bibliography on topic modeling (LDA) by David Mimno.

There are a number of related resources on his homepage.

SIGKDD 2011 Conference

Tuesday, September 6th, 2011

A pair of posts from Ryan Rosario on the SIGKDD 2011 Conference.

Day 1 (Graph Mining and David Blei/Topic Models)

Tough sledding on Probabilistic Topic Models but definitely worth the effort to follow.

Days 2/3/4 Summary

Useful summaries and pointers to many additional resources.

If you attended SIGKDD 2011, do you have pointers to other reviews of the conference or other resources?

I added a category for SIGKDD.

What is a good explanation of Latent Dirichlet Allocation? (Quora)

Friday, September 2nd, 2011

What is a good explanation of Latent Dirichlet Allocation? (Quora)

If you need to explain topic modeling to your boss, department chair or funder, you would be hard pressed to find a better source of inspiration.

The explanation here ranges from technical to layman to actual example (Sarah Palin’s emails, so you had better check on your audience’s political persuasion). Actually it would not hurt to have LDA examples on hand that run the gamut of political persuasions. (Or national perspectives if you are in the international market.)

BTW, if you are not familiar with Quora, give it a look.

This link was forwarded to my attention by Jack Park.

Getting Started with MALLET and Topic Modeling

Thursday, September 1st, 2011

Getting Started with MALLET and Topic Modeling

If you don’t remember MALLET, take a look at: MALLET: MAchine Learning for LanguagE Toolkit Topic Map Competition (TMC) Contender?

Shawn is very interested in applying topic modeling to a variety of historical texts.

His blog, Electric Archaeology: Digital Media for Learning and Research, looks very interesting. It covers: “Agent based modeling, games, virtual worlds, and online education for archaeology and history.”

This is the sort of person who might be interested in topic maps and related technologies.

As far as I know, there is still a real lack of example-driven texts that would introduce most humanists to modern software.

An Architecture for Parallel Topic Models

Wednesday, June 15th, 2011

An Architecture for Parallel Topic Models by Alexander Smola and Shravan Narayanamurthy.

Abstract:

This paper describes a high performance sampling architecture for inference of latent topic models on a cluster of workstations. Our system is faster than previous work by over an order of magnitude and it is capable of dealing with hundreds of millions of documents and thousands of topics.

The algorithm relies on a novel communication structure, namely the use of a distributed (key, value) storage for synchronizing the sampler state between computers. Our architecture entirely obviates the need for separate computation and synchronization phases. Instead, disk, CPU, and network are used simultaneously to achieve high performance. We show that this architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.

Interesting how this key, value stuff keeps coming up these days.
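For a feel of what synchronizing sampler state through a (key, value) store might look like, here is a toy, single-process sketch. It is my own illustration of the general idea and not the authors' architecture: `CountStore` and `Worker` are hypothetical names, the “store” is an in-memory counter, and the real system is distributed and asynchronous.

```python
from collections import Counter

class CountStore:
    """Stand-in for a distributed (key, value) store; keys are (word, topic) pairs."""
    def __init__(self):
        self.counts = Counter()

    def push(self, deltas):
        # Apply a worker's accumulated changes to the global counts.
        for key, delta in deltas.items():
            self.counts[key] += delta

    def pull(self, keys):
        return {key: self.counts[key] for key in keys}

class Worker:
    """Holds a local copy of the counts plus the changes not yet sent."""
    def __init__(self, store):
        self.store = store
        self.local = Counter()
        self.pending = Counter()

    def reassign(self, word, old_topic, new_topic):
        # Record a sampler move; old_topic is None for a first assignment.
        for topic, delta in ((old_topic, -1), (new_topic, +1)):
            if topic is not None:
                self.local[(word, topic)] += delta
                self.pending[(word, topic)] += delta

    def sync(self):
        # Send local changes, then refresh the keys this worker knows about.
        self.store.push(self.pending)
        self.pending.clear()
        for key, value in self.store.pull(list(self.local)).items():
            self.local[key] = value

store = CountStore()
a, b = Worker(store), Worker(store)
a.reassign("soccer", None, 0)
b.reassign("soccer", None, 0)
b.reassign("vote", None, 1)
a.sync()
b.sync()
a.sync()
print(store.counts)
```

The point of the pattern is that workers exchange only count changes through the store, so no worker has to wait on any other.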

The authors plan on making the codebase available for public use.


Updated 30 June 2011 to include the URL supplied by Sam Hunting. (Thanks Sam!)

Reading Tea Leaves: How Humans Interpret Topic Models

Wednesday, December 22nd, 2010

Reading Tea Leaves: How Humans Interpret Topic Models by Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei.

Abstract:

Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summarize the corpus, and guide exploration of its contents. However, whether the latent space is interpretable is in need of quantitative evaluation. In this paper, we present new quantitative methods for measuring semantic meaning in inferred topics. We back these measures with large-scale user studies, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood. Surprisingly, topic models which perform better on held-out likelihood may infer less semantically meaningful topics.

Read the article first but then see the LingPipe Blog review of the same.
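For readers new to the “held-out likelihood” the abstract mentions, here is a minimal sketch of the quantity. It assumes the topic mixture `theta` for the held-out document is already known, which in practice is the hard part and has to be estimated. The toy distributions and the `floor` guard against zero probabilities are my own.

```python
import math

def heldout_log_likelihood(doc, theta, phi, floor=1e-12):
    """Average per-word log probability of a document under fixed topics."""
    total = 0.0
    for word in doc:
        # Mix the topics' word probabilities by the document's topic proportions.
        p = sum(theta[k] * phi[k].get(word, 0.0) for k in range(len(theta)))
        total += math.log(max(p, floor))
    return total / len(doc)

# Hypothetical two-topic model and held-out document.
phi = [{"a": 1.0}, {"b": 1.0}]
theta = [0.5, 0.5]
doc = ["a", "b"]

avg_ll = heldout_log_likelihood(doc, theta, phi)
perplexity = math.exp(-avg_ll)
print(avg_ll, perplexity)
```

Higher likelihood (lower perplexity) means the model predicts unseen text better, which, as the abstract notes, is not the same thing as producing topics people find meaningful.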

The Sensitivity of Latent Dirichlet Allocation for Information Retrieval

Monday, December 20th, 2010

The Sensitivity of Latent Dirichlet Allocation for Information Retrieval by Laurence A. F. Park, The University of Melbourne (slides).

Abstract:

It has been shown that the use of topic models for Information retrieval provides an increase in precision when used in the appropriate form. Latent Dirichlet Allocation (LDA) is a generative topic model that allows us to model documents using a Dirichlet prior. Using this topic model, we are able to obtain a fitted Dirichlet parameter that provides the maximum likelihood for the document set. In this article, we examine the sensitivity of LDA with respect to the Dirichlet parameter when used for Information retrieval. We compare the topic model computation times, storage requirements and retrieval precision of fitted LDA to LDA with a uniform Dirichlet prior. The results show that there is no significant benefit of using fitted LDA over the LDA with a constant Dirichlet parameter, hence showing that LDA is insensitive with respect to the Dirichlet parameter when used for Information retrieval.
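If it is not obvious what the Dirichlet parameter controls, the following minimal sketch may help. It draws topic mixtures from a symmetric Dirichlet for a few values of the parameter and reports how dominant the largest topic is on average. This only illustrates the prior itself, not the paper's retrieval experiments; the number of topics, sample size and seed are arbitrary choices of mine.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one k-dimensional sample from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [x / total for x in draws]

def mean_largest_share(alpha, k=5, n_samples=2000, seed=0):
    """Average weight of the single largest topic across sampled mixtures."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples))
    return sum(shares) / n_samples

for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(mean_largest_share(alpha), 3))
```

Small values favor documents dominated by one topic, while large values favor documents that spread evenly across topics.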

Note that topic is used in semantic analysis (of various kinds) to mean highly probable words and not in the technical sense of the TMDM or XTM.

Extraction of highly probable words from documents can be useful in the construction of topic maps for those documents.
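Once a model has been fitted, pulling out the highly probable words is the easy part. Here is a minimal sketch, using the Topic A percentages from Edwin Chen's example quoted earlier on this page as stand-in data; the helper name `top_words` is my own.

```python
def top_words(topic, n=3):
    """Return the n most probable words, breaking ties alphabetically."""
    ranked = sorted(topic.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]

# Word probabilities for one topic (only the words listed in the example).
topic_a = {"broccoli": 0.30, "bananas": 0.15, "breakfast": 0.10, "munching": 0.10}

candidates = top_words(topic_a, n=2)
print(candidates)
```

Such a list is only a set of candidate names or occurrences for a topic map; deciding what subject the words actually identify is still the author's job.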