Archive for the ‘Topic Models (LDA)’ Category

Applications of Topic Models [Monograph, Free Until 12 August 2017]

Monday, August 7th, 2017

Applications of Topic Models by Jordan Boyd-Graber, Yuening Hu, and David Mimno. (Jordan Boyd-Graber, Yuening Hu and David Mimno (2017), “Applications of Topic Models”, Foundations and Trends® in Information Retrieval: Vol. 11, No. 2-3, pp. 143-296.)


How can a single person understand what’s going on in a collection of millions of documents? This is an increasingly common problem: sifting through an organization’s e-mails, understanding a decade worth of newspapers, or characterizing a scientific field’s research. Topic models are a statistical framework that help users understand large document collections: not just to find individual documents but to understand the general themes present in the collection.

This survey describes the recent academic and industrial applications of topic models with the goal of launching a young researcher capable of building their own applications of topic models. In addition to topic models’ effective application to traditional problems like information retrieval, visualization, statistical inference, multilingual modeling, and linguistic understanding, this survey also reviews topic models’ ability to unlock large text collections for qualitative analysis. We review their successful use by researchers to help understand fiction, non-fiction, scientific publications, and political texts.

The authors discuss the use of topic models for historical documents (Chapter 4), understanding scientific publications (Chapter 5), fiction and literature (Chapter 6), computational social science (Chapter 7), and multilingual data and machine translation (Chapter 8), and provide further guidance in Chapter 9, Building a Topic Model.

If you have haystacks of documents to mine, Applications of Topic Models is a must-have on your short reading list.

UNIX, Bi-Grams, Tri-Grams, and Topic Modeling

Sunday, April 17th, 2016

UNIX, Bi-Grams, Tri-Grams, and Topic Modeling by Greg Brown.

From the post:

I’ve built up a list of UNIX commands over the years for doing basic text analysis on written language. I’ve built this list from a number of sources (Jim Martin‘s NLP class, StackOverflow, web searches), but haven’t seen it much in one place. With these commands I can analyze everything from log files to user poll responses.

Mostly this just comes down to how cool UNIX commands are (which you probably already know). But the magic is how you mix them together. Hopefully you find these recipes useful. I’m always looking for more so please drop into the comments to tell me what I’m missing.

For all of these examples I assume that you are analyzing a series of user responses with one response per line in a single file: data.txt. With a few cut and paste commands I often apply the same methods to CSV files and log files.
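The post's recipes are shell one-liners, but the same counting pattern is easy to sketch with Python's standard library. The sample responses below are invented stand-ins for data.txt:

```python
from collections import Counter
from itertools import islice

def ngrams(tokens, n):
    """Yield n-grams as tuples from a token list."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

def top_ngrams(lines, n=2, k=3):
    """Count n-grams across responses, one response per line."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.lower().split(), n))
    return counts.most_common(k)

# Invented responses, standing in for data.txt:
responses = [
    "the app is slow",
    "the app crashes often",
    "love the app",
]
print(top_ngrams(responses, n=2, k=1))  # → [(('the', 'app'), 3)]
```

Set `n=3` for tri-grams, or `n=6` if, like the reader in the comments, you have government reports to analyze.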

My favorite comment on this post was a reader who extended the tri-gram generator to build a hexagram!

If that sounds unreasonable, you haven’t read very many government reports. 😉

While you are at Greg’s blog, notice a number of useful posts on Elasticsearch.

Topic Extraction and Bundling of Related Scientific Articles

Wednesday, May 6th, 2015

Topic Extraction and Bundling of Related Scientific Articles by Shameem A Puthiya Parambath.


Automatic classification of scientific articles based on common characteristics is an interesting problem with many applications in digital library and information retrieval systems. Properly organized articles can be useful for automatic generation of taxonomies in scientific writings, textual summarization, efficient information retrieval etc. Generating article bundles from a large number of input articles, based on the associated features of the articles, is a tedious and computationally expensive task. In this report we propose an automatic two-step approach for topic extraction and bundling of related articles from a set of scientific articles in real-time. For topic extraction, we make use of Latent Dirichlet Allocation (LDA) topic modeling techniques and for bundling, we make use of hierarchical agglomerative clustering techniques.

We run experiments to validate our bundling semantics and compare it with existing models in use. We make use of an online crowdsourcing marketplace provided by Amazon called Amazon Mechanical Turk to carry out experiments. We explain our experimental setup and empirical results in detail and show that our method is advantageous over existing ones.

On “bundling” from the introduction:

Effective grouping of data requires a precise definition of closeness between a pair of data items, and the notion of closeness always depends on the data and the problem context. Closeness is defined in terms of similarity of the data pairs, which in turn is measured in terms of dissimilarity or distance between pairs of items. In this report we use the terms similarity, dissimilarity and distance to denote the measure of closeness between data items. Most bundling schemes start with identifying the common attributes (metadata) of the data set, here scientific articles, and create bundling semantics based on the combination of these attributes. Here we suggest a two-step algorithm to bundle scientific articles. In the first step we group articles based on the latent topics in the documents and in the second step we carry out agglomerative hierarchical clustering based on the inter-textual distance and co-authorship similarity between articles. We run experiments to validate the bundling semantics and to compare it with content-only based similarity. We used 19937 articles related to Computer Science from arXiv [htt12a] for our experiments.
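The two-step semantics quoted above can be sketched with toy data: per-article topic distributions stand in for the LDA step, and a single-linkage agglomerative merge combines inter-textual distance with co-authorship similarity. All articles, authors, weights, and thresholds below are invented for illustration:

```python
def topic_dist(p, q):
    """Total-variation distance between two topic distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def coauthor_sim(a, b):
    """Jaccard similarity of two author sets."""
    return len(a & b) / len(a | b)

def distance(x, y, w=0.5):
    """Blend topic distance with co-authorship dissimilarity."""
    (topics_x, authors_x), (topics_y, authors_y) = x, y
    return (w * topic_dist(topics_x, topics_y)
            + (1 - w) * (1 - coauthor_sim(authors_x, authors_y)))

def bundle(articles, threshold=0.4):
    """Single-linkage agglomerative clustering: repeatedly merge the
    closest pair of bundles until no pair is closer than threshold."""
    bundles = [[i] for i in range(len(articles))]
    def link(b1, b2):
        return min(distance(articles[i], articles[j]) for i in b1 for j in b2)
    while len(bundles) > 1:
        d, i, j = min((link(bundles[i], bundles[j]), i, j)
                      for i in range(len(bundles))
                      for j in range(i + 1, len(bundles)))
        if d > threshold:
            break
        bundles[i] += bundles.pop(j)
    return bundles

articles = [
    ([0.9, 0.1], {"ann", "bob"}),  # mostly topic 0
    ([0.8, 0.2], {"ann"}),         # mostly topic 0, shared author
    ([0.1, 0.9], {"carol"}),       # mostly topic 1
]
print(bundle(articles))  # → [[0, 1], [2]]
```

The report's actual distance measures differ; the point is only the shape of the second step, once the first step has produced topic distributions.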

Is a “bundle” the same thing as a topic that represents “all articles on subject X?”

I have seen a number of topic map examples that use the equivalent of a proper noun, a “proper subject,” that is, a singular and unique subject.

But there is no reason why I could not have a topic that represents all the articles on deep learning written in 2014, for example. Methods such as the bundling techniques described here could prove to be quite useful in such cases.

₳ustral Blog

Tuesday, April 14th, 2015

₳ustral Blog

From the post:

We’re software developers and entrepreneurs who wondered what Reddit might be able to tell us about our society.

Social network data have revolutionized advertising, brand management, political campaigns, and more. They have also enabled and inspired vast new areas of research in the social and natural sciences.

Traditional social networks like Facebook focus on mostly-private interactions between personal acquaintances, family members, and friends. Broadcast-style social networks like Twitter enable users at “hubs” in the social graph (those with many followers) to disseminate their ideas widely and interact directly with their “followers”. Both traditional and broadcast networks result in explicit social networks as users choose to associate themselves with other users.

Reddit and similar services such as Hacker News are a bit different. On Reddit, users vote for, and comment on, content. The social network that evolves as a result is implied based on interactions rather than explicit.

Another important difference is that, on Reddit, communication between users largely revolves around external topics or issues such as world news, sports teams, or local events. Instead of discussing their own lives, or topics randomly selected by the community, Redditors discuss specific topics (as determined by community voting) in a structured manner.

This is what we’re trying to harness with Project Austral. By combining Reddit stories, comments, and users with technologies like sentiment analysis and topic identification (more to come soon!) we’re hoping to reveal interesting trends and patterns that would otherwise remain hidden.

Please, check it out and let us know what you think!

Bad assumption on my part! Since ₳ustral uses Neo4j to store the Reddit graph, I was expecting a graph-type visualization. If that was intended, that isn’t what I found. 😉

Most of my searching is content oriented and not so much concerned with trends or patterns. An upsurge in hypergraph queries could happen in Reddit, but aside from references to publications and projects, the upsurge itself would be a curiosity to me.

Nothing against trending, patterns, etc. but just not my use case. May be yours.

Neo4j: Building a topic graph with Prismatic Interest Graph API

Sunday, February 22nd, 2015

Neo4j: Building a topic graph with Prismatic Interest Graph API by Mark Needham.

From the post:

Over the last few weeks I’ve been using various NLP libraries to derive topics for my corpus of How I met your mother episodes without success and was therefore enthused to see the release of Prismatic’s Interest Graph API.

The Interest Graph API exposes a web service to which you feed a block of text and get back a set of topics and associated scores.

It has been trained over the last few years with millions of articles that people share on their social media accounts and in my experience using Prismatic the topics have been very useful for finding new material to read.

A great walk through from accessing the Interest Graph API to loading the data into Neo4j and querying it with Cypher.

I can’t profess a lot of interest in How I Met Your Mother episodes but the techniques can be applied to other content. 😉

LDAvis: Interactive Visualization of Topic Models

Tuesday, January 27th, 2015

LDAvis: Interactive Visualization of Topic Models by Carson Sievert and Kenny Shirley.

From the webpage:

Tools to create an interactive web-based visualization of a topic model that has been fit to a corpus of text data using Latent Dirichlet Allocation (LDA). Given the estimated parameters of the topic model, it computes various summary statistics as input to an interactive visualization built with D3.js that is accessed via a browser. The goal is to help users interpret the topics in their LDA topic model.

From the description:

This video (recorded September 2014) shows how interactive visualization is used to help interpret a topic model using LDAvis. LDAvis is an R package which extracts information from a topic model and creates a web-based visualization where users can interactively explore the model. More details, examples, and instructions for using LDAvis can be found here —

Excellent exploration of a data set using LDAvis.
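One of the summary statistics behind LDAvis is the “relevance” ranking of terms within a topic: an interpolation between a term's probability within the topic and its lift over its corpus-wide probability. A minimal sketch with invented probabilities (the default weight of 0.6 follows the authors' suggestion):

```python
import math

def relevance(p_w_given_t, p_w, lam=0.6):
    """LDAvis-style term relevance:
    lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w)).
    lam=1 ranks by raw topic probability; lam=0 ranks by lift."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Invented example: a word common everywhere vs. a topic-specific word.
common = relevance(0.05, 0.04)     # probable in topic, but little lift
specific = relevance(0.03, 0.001)  # rarer overall, high lift
print(common < specific)  # → True: the topic-specific word ranks higher
```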

With all due respect to “agile” programming, modeling before you understand a data set isn’t a winning proposition.

Topics and xkcd Comics

Wednesday, June 18th, 2014

Finding structure in xkcd comics with Latent Dirichlet Allocation by Carson Sievert.

From the post:

xkcd is self-proclaimed as “a webcomic of romance, sarcasm, math, and language”. There was a recent effort to quantify whether or not these “topics” agree with topics derived from the xkcd text corpus using Latent Dirichlet Allocation (LDA). That analysis makes the all too common folly of choosing an arbitrary number of topics. Maybe xkcd’s tagline does provide a strong prior belief of a small number of topics, but here we take a more objective approach and let the data choose the number of topics. An “optimal” number of topics is found using the Bayesian model selection approach (with uniform prior belief on the number of topics) suggested by Griffiths and Steyvers (2004). After an optimal number is decided, topic interpretations and trends over time are explored.

Great interactive visualization, code for extracting data for xkcd comics, exploring “keywords that are most ‘relevant’ or ‘informative’ to a given topic’s meaning.”

Easy to see this post forming the basis for several sessions on LDA, starting with extracting the data, exploring the choices that influence the results and then visualizing the results of analysis.


I first saw this in a tweet by Zoltan Varju.

Text Coherence

Tuesday, May 13th, 2014

Christopher Phipps mentioned Automatic Evaluation of Text Coherence: Models and Representations by Mirella Lapata and Regina Barzilay in a tweet today. Running that article down, I discovered it was published in the proceedings of International Joint Conferences on Artificial Intelligence in 2005.

Useful but a bit dated.

A more recent resource: A Bibliography of Coherence and Cohesion, Wolfram Bublitz (Universität Augsburg). Last updated: 2010.

The Bublitz bibliography is more recent but current bibliography would be even more useful.

Can you suggest a more recent bibliography on text coherence/cohesion?

I ask because while looking for such a bibliography, I encountered: Improving Topic Coherence with Regularized Topic Models by David Newman, Edwin V. Bonilla, and Wray Buntine.

The abstract reads:

Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To overcome this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflect broad patterns in the external data. Using thirteen datasets we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data.

I don’t think the “…small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful” is a surprise to anyone. I take that as the traditional “garbage in, garbage out.”

However, “regularizers” may be useful for automatic/assisted authoring of topics in the topic map sense of the word topic. Assuming you want to mine “small or small and noisy texts.” The authors say the technique should apply to large texts and promise future research on applying “regularizers” to large texts.

I checked the authors’ recent publications but didn’t see anything I would call a “large” text application of “regularizers.” Open area of research if you want to take the lead.

Provable Algorithms for Machine Learning Problems

Tuesday, December 31st, 2013

Provable Algorithms for Machine Learning Problems by Rong Ge.


Modern machine learning algorithms can extract useful information from text, images and videos. All these applications involve solving NP-hard problems in the average case using heuristics. What properties of the input allow it to be solved efficiently? Theoretically analyzing the heuristics is very challenging. Few results were known.

This thesis takes a different approach: we identify natural properties of the input, then design new algorithms that provably work assuming the input has these properties. We are able to give new, provable and sometimes practical algorithms for learning tasks related to text corpus, images and social networks.

The first part of the thesis presents new algorithms for learning thematic structure in documents. We show that, under a reasonable assumption, it is possible to provably learn many topic models, including the famous Latent Dirichlet Allocation. Our algorithm is the first provable algorithm for topic modeling. An implementation runs 50 times faster than the latest MCMC implementation and produces comparable results.

The second part of the thesis provides ideas for provably learning deep, sparse representations. We start with sparse linear representations, and give the first algorithm for the dictionary learning problem with provable guarantees. Then we apply similar ideas to deep learning: under reasonable assumptions our algorithms can learn a deep network built by denoising autoencoders.

The final part of the thesis develops a framework for learning latent variable models. We demonstrate how various latent variable models can be reduced to orthogonal tensor decomposition, and then be solved using the tensor power method. We give a tight sample complexity analysis for the tensor power method, which reduces the number of samples required for learning many latent variable models.

In theory, the assumptions in this thesis help us understand why intractable problems in machine learning can often be solved; in practice, the results suggest inherently new approaches for machine learning. We hope the assumptions and algorithms inspire new research problems and learning algorithms.

Admittedly an odd notion, starting with the data rather than with an answer and working back towards the data, but it does happen. 😉

Given the performance improvements for LDA (50X), I anticipate this approach being applied to algorithms for “big data.”

I first saw this in a tweet by Chris Deihl.

What is xkcd all about?…

Sunday, December 15th, 2013

What is xkcd all about? Text mining a web comic by Jonathan Stray.

From the post:

I recently ran into a very cute visualization of the topics of XKCD comics. It’s made using a topic modeling algorithm where the computer automatically figures out what topics xkcd covers, and the relationships between them. I decided to compare this xkcd topic visualization to Overview, which does a similar sort of thing in a different way (here’s how Overview’s clustering works).

Stand back, I’m going to try science!

I knew that topic modeling had to have some practical use. 😉

Jonathan uses the wildly popular xkcd comic to illustrate some of the features of Overview.

Emphasis on “some.”

Something fun to start the week with!

Besides, you are comparing topic modeling algorithms on a known document base.

What could be more work related than that?

Foundations of Data Science

Sunday, September 29th, 2013

Foundations of Data Science by John Hopcroft and Ravindran Kannan.

From the introduction:

Computer science as an academic discipline began in the 60’s. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered finite automata, regular expressions, context free languages, and computability. In the 70’s, algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place and the focus is more on applications. There are many reasons for this change. The merging of computing and communications has played an important role. The enhanced ability to observe, collect and store data in the natural sciences, in commerce, and in other fields calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks, which are by far the largest such structures, presents both opportunities and challenges for theory.

While traditional areas of computer science are still important and highly skilled individuals are needed in these areas, the majority of researchers will be involved with using computers to understand and make usable massive data arising in applications, not just how to make computers useful on specific well-defined problems. With this in mind we have written this book to cover the theory likely to be useful in the next 40 years, just as automata theory, algorithms and related topics gave students an advantage in the last 40 years. One of the major changes is the switch from discrete mathematics to more of an emphasis on probability, statistics, and numerical methods.

In draft form but impressive!

Current chapters:

  1. Introduction
  2. High-Dimensional Space
  3. Random Graphs
  4. Singular Value Decomposition (SVD)
  5. Random Walks and Markov Chains
  6. Learning and the VC-dimension
  7. Algorithms for Massive Data Problems
  8. Clustering
  9. Topic Models, Hidden Markov Process, Graphical Models, and Belief Propagation
  10. Other Topics [Rankings, Hare System for Voting, Compressed Sensing and Sparse Vectors]
  11. Appendix

I am certain the authors would appreciate comments and suggestions concerning the text.

I first saw this in a tweet by CompSciFact.

In-browser topic modeling

Friday, April 26th, 2013

In-browser topic modeling by David Mimno.

From the post:

Many people have found topic modeling a useful (and fun!) way to explore large text collections. Unfortunately, running your own models usually requires installing statistical tools like R or Mallet. The goals of this project are to (a) make running topic models easy for anyone with a modern web browser, (b) explore the limits of statistical computing in Javascript and (c) allow tighter integration between models and web-based visualizations.

About as easy an introduction/exploration as I can imagine.


A Simple Topic Model

Friday, July 27th, 2012

A Simple Topic Model by Allen Beye Riddell.

From the post:

NB: This is an extended version of the appendix of my paper exploring trends in German Studies in the US between 1928 and 2006. In that paper I used a topic model (Latent Dirichlet Allocation); this tutorial is intended to help readers understand how LDA works.

Topic models typically start with two banal assumptions. The first is that in a large collection of texts there exist a number of distinct groups (or sources) of texts. In the case of academic journal articles, these groups might be associated with different journals, authors, research subfields, or publication periods (e.g. the 1950s and 1980s). The second assumption is that texts from different sources tend to use different vocabulary. If we are presented with an article selected from one of two different academic journals, one dealing with literature and another with archeology, and we are told only that the word “plot” appears frequently in the article, we would be wise to guess the article comes from the literary studies journal.

A major obstacle to understanding the remaining details about how topic models work is that their description relies on the abstract language of probability. Existing introductions to Latent Dirichlet Allocation (LDA) tend to be pitched either at an audience already fluent in statistics or at an audience with minimal background. This being the case, I want to address an audience that has some background in probability and statistics, perhaps at the level of the introductory texts of Hoff (2009), Lee (2004), or Kruschke (2010).
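The “plot” guess in the excerpt is just Bayes’ rule over sources. A minimal sketch, with invented word probabilities and equal priors on the two journals:

```python
def posterior(word_prob_by_source, prior):
    """P(source | word) from per-source word probabilities and priors."""
    joint = {s: prior[s] * p for s, p in word_prob_by_source.items()}
    total = sum(joint.values())
    return {s: j / total for s, j in joint.items()}

# Invented probabilities: P("plot" | journal).
p_plot = {"literature": 0.02, "archeology": 0.002}
prior = {"literature": 0.5, "archeology": 0.5}

post = posterior(p_plot, prior)
print(post["literature"] > post["archeology"])  # → True
```

With these numbers the literary studies journal is ten times as probable, which is the entire intuition the tutorial builds on.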

A good walk through on using a topic model (Latent Dirichlet Allocation).

I saw this in Christophe Lalanne’s Bag of Tweets for July 2012.

Auto Tagging Articles using Semantic Analysis and Machine Learning

Wednesday, May 2nd, 2012

Auto Tagging Articles using Semantic Analysis and Machine Learning


The idea is to implement an auto tagging feature that provides tags automatically to the user depending upon the content of the post. The tags will get populated as soon as the user leaves the focus on the content text area or via ajax on the press of a button. I’ll be using semantic analysis and topic modeling techniques to judge the topic of the article and also extract keywords from it. Based on an algorithm and a ranking mechanism the user will be provided with a list of tags from which he can select those that best describe the article and also train a user-content specific semi-supervised machine learning model in the background.

A Drupal sandbox for work on auto tagging posts.

Or, topic map authoring without being “in your face.”

Depends on how you read “tags.”

Learning Topic Models – Going beyond SVD

Wednesday, April 18th, 2012

Learning Topic Models – Going beyond SVD by Sanjeev Arora, Rong Ge, and Ankur Moitra.


Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation or LDA model of Blei et al. is especially popular.

Theoretical studies of topic modeling focus on learning the model’s parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves.

This paper formally justifies Nonnegative Matrix Factorization (NMF) as a main tool in this context, which is an analog of SVD where all vectors are nonnegative. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. A compelling feature of our algorithm is that it generalizes to models that incorporate topic-topic correlations, such as the Correlated Topic Model and the Pachinko Allocation Model.

We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD – just as NMF has come to replace SVD in many applications.

The proposal hinges on the following assumption:

Separability requires that each topic has some near-perfect indicator word – a word that we call the anchor word for this topic – that appears with reasonable probability in that topic but with negligible probability in all other topics (e.g., “soccer” could be an anchor word for the topic “sports”). We give a formal definition in Section 1.1. This property is particularly natural in the context of topic modeling, where the number of distinct words (dictionary size) is very large compared to the number of topics. In a typical application, it is common to have a dictionary size in the thousands or tens of thousands, but the number of topics is usually somewhere in the range from 50 to 100. Note that separability does not mean that the anchor word always occurs (in fact, a typical document may be very likely to contain no anchor words). Instead, it dictates that when an anchor word does occur, it is a strong indicator that the corresponding topic is in the mixture used to generate the document.
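The separability condition can be checked mechanically on a topic-word matrix: look for words with reasonable probability in exactly one topic and negligible probability elsewhere. A sketch over an invented two-topic matrix, with thresholds chosen arbitrarily:

```python
def anchor_words(topics, min_p=0.05, eps=0.001):
    """Return {word: topic index} for words satisfying separability:
    probability >= min_p in exactly one topic, <= eps in all others."""
    anchors = {}
    for w in topics[0]:
        probs = [t[w] for t in topics]
        owners = [k for k, p in enumerate(probs) if p >= min_p]
        if len(owners) == 1 and all(
                p <= eps for k, p in enumerate(probs) if k != owners[0]):
            anchors[w] = owners[0]
    return anchors

# Invented topic-word probabilities over a shared vocabulary.
sports = {"soccer": 0.06, "game": 0.04, "election": 0.0005}
politics = {"soccer": 0.0002, "game": 0.01, "election": 0.07}
print(anchor_words([sports, politics]))  # → {'soccer': 0, 'election': 1}
```

“game” is probable in both topics, so it anchors neither; “soccer” and “election” behave exactly as the paper's example suggests.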

The notion of an “anchor word” (or multiple anchor words per topics as the authors point out in the conclusion) resonates with the idea of identifying a subject. It is at least a clue that an author/editor should take into account.

Topic Models

Saturday, December 31st, 2011

Topic Models

From the post:

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. An early topic model was probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann in 1999.[1] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael Jordan in 2002, allowing documents to have a mixture of topics.[2] Other topic models are generally extensions on LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Although topic models were first described and implemented in the context of natural language processing, they have applications in other fields such as bioinformatics.

Just in case you need some starter materials on discovering “topics” (non-topic map sense) in documents.

Introduction to Latent Dirichlet Allocation

Saturday, October 1st, 2011

Introduction to Latent Dirichlet Allocation by Edwin Chen.

From the introduction:

Suppose you have the following set of sentences:

  • I like to eat broccoli and bananas.
  • I ate a banana and spinach smoothie for breakfast.
  • Chinchillas and kittens are cute.
  • My sister adopted a kitten yesterday.
  • Look at this cute hamster munching on a piece of broccoli.

What is latent Dirichlet allocation? It’s a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like

  • Sentences 1 and 2: 100% Topic A
  • Sentences 3 and 4: 100% Topic B
  • Sentence 5: 60% Topic A, 40% Topic B
  • Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
  • Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)

The question, of course, is: how does LDA perform this discovery?
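One common answer, sketched here under heavy simplification (tiny corpus, no stop-word removal, fixed hyperparameters, no convergence checks), is collapsed Gibbs sampling: repeatedly reassign each word occurrence to a topic in proportion to how well that topic currently explains both the word and the document:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.
    Returns per-word topic assignments and topic-word counts."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})       # vocabulary size
    ndk = [[0] * n_topics for _ in docs]        # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                         # topic totals
    z = []                                      # assignment per word
    for d, doc in enumerate(docs):              # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                     # remove current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # P(topic) ∝ (doc-topic + alpha) * (topic-word + beta) / (topic total + V*beta)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z, nkw

docs = [
    "i like to eat broccoli and bananas".split(),
    "i ate a banana and spinach smoothie for breakfast".split(),
    "chinchillas and kittens are cute".split(),
    "my sister adopted a kitten yesterday".split(),
    "look at this cute hamster munching on a piece of broccoli".split(),
]
z, topic_words = lda_gibbs(docs)
```

On Chen's five sentences this tends to separate the food words from the animal words, though with a corpus this tiny and one random seed nothing is guaranteed; real implementations (MALLET, gensim) add burn-in, hyperparameter optimization, and much more.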

About as smooth an explanation of Latent Dirichlet Allocation as you are going to find.

Topic Modeling Bibliography

Friday, September 16th, 2011

Topic Modeling Bibliography

An extensive bibliography on topic modeling (LDA) by David Mimno.

There are a number of related resources on his homepage.

SIGKDD 2011 Conference

Tuesday, September 6th, 2011

A pair of posts from Ryan Rosario on the SIGKDD 2011 Conference.

Day 1 (Graph Mining and David Blei/Topic Models)

Tough sledding on Probabilistic Topic Models but definitely worth the effort to follow.

Days 2/3/4 Summary

Useful summaries and pointers to many additional resources.

If you attended SIGKDD 2011, do you have pointers to other reviews of the conference or other resources?

I added a category for SIGKDD.

What is a good explanation of Latent Dirichlet Allocation? (Quora)

Friday, September 2nd, 2011

What is a good explanation of Latent Dirichlet Allocation? (Quora)

If you need to explain topic modeling to your boss, department chair or funder, you would be hard pressed to find a better source of inspiration.

The explanation here ranges from technical to layman to actual example (Sarah Palin’s emails so you might better check on the audience’s political persuasion). Actually it would not hurt to have LDA examples on hand that run the gamut of political persuasions. (Or national perspectives if you are in the international market.)

BTW, if you not familiar with Quora, give it a look.

This link was forwarded to my attention by Jack Park.

Getting Started with MALLET and Topic Modeling

Thursday, September 1st, 2011

Getting Started with MALLET and Topic Modeling

If you don’t remember MALLET, take a look at: MALLET: MAchine Learning for LanguagE Toolkit Topic Map Competition (TMC) Contender?

Shawn is very interested in applying topic modeling to a variety of historical texts.

His blog, Electric Archaeology: Digital Media for Learning and Research looks very interesting. Covers: “Agent based modeling, games, virtual worlds, and online education for archaeology and history.”

This is the sort of person who might be interested in topic maps and related technologies.

As far as I know, there is still a real lack of example driven texts that would introduce most humanists to modern software.

An Architecture for Parallel Topic Models

Wednesday, June 15th, 2011

An Architecture for Parallel Topic Models by Alexander Smola and Shravan Narayanamurthy.


This paper describes a high performance sampling architecture for inference of latent topic models on a cluster of workstations. Our system is faster than previous work by over an order of magnitude and it is capable of dealing with hundreds of millions of documents and thousands of topics.

The algorithm relies on a novel communication structure, namely the use of a distributed (key, value) storage for synchronizing the sampler state between computers. Our architecture entirely obviates the need for separate computation and synchronization phases. Instead, disk, CPU, and network are used simultaneously to achieve high performance. We show that this architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.

Interesting how this key, value stuff keeps coming up these days.

The authors plan on making the codebase available for public use.

Updated 30 June 2011 to include the URL supplied by Sam Hunting. (Thanks Sam!)

Reading Tea Leaves: How Humans Interpret Topic Models

Wednesday, December 22nd, 2010

Reading Tea Leaves: How Humans Interpret Topic Models by Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei.


Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summarize the corpus, and guide exploration of its contents. However, whether the latent space is interpretable is in need of quantitative evaluation. In this paper, we present new quantitative methods for measuring semantic meaning in inferred topics. We back these measures with large-scale user studies, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood. Surprisingly, topic models which perform better on held-out likelihood may infer less semantically meaningful topics.

Read the article first but then see the LingPipe Blog review of the same.
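The paper's user studies rest on intrusion tasks; in word intrusion, subjects see a topic's top words plus one “intruder” drawn from another topic, and a coherent topic is one whose intruder is easy to spot. A minimal sketch of building such a question, with invented topics:

```python
import random

def word_intrusion_question(topic_top_words, other_topic_words, rng):
    """Build a word-intrusion task: five high-probability words from
    one topic plus one intruder that is probable in another topic."""
    intruder = rng.choice(other_topic_words)
    words = topic_top_words[:5] + [intruder]
    rng.shuffle(words)
    return words, intruder

rng = random.Random(42)
food = ["broccoli", "banana", "spinach", "breakfast", "smoothie"]
animals = ["kitten", "hamster", "chinchilla"]
words, intruder = word_intrusion_question(food, animals, rng)
print(words)
```

The fraction of subjects who pick the intruder correctly then serves as the quantitative coherence score, which is the measure the abstract contrasts with held-out likelihood.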

The Sensitivity of Latent Dirichlet Allocation for Information Retrieval

Monday, December 20th, 2010

The Sensitivity of Latent Dirichlet Allocation for Information Retrieval Author: Laurence A. F. Park, The University of Melbourne. slides


It has been shown that the use of topic models for Information retrieval provides an increase in precision when used in the appropriate form. Latent Dirichlet Allocation (LDA) is a generative topic model that allows us to model documents using a Dirichlet prior. Using this topic model, we are able to obtain a fitted Dirichlet parameter that provides the maximum likelihood for the document set. In this article, we examine the sensitivity of LDA with respect to the Dirichlet parameter when used for Information retrieval. We compare the topic model computation times, storage requirements and retrieval precision of fitted LDA to LDA with a uniform Dirichlet prior. The results show that there is no significant benefit of using fitted LDA over LDA with a constant Dirichlet parameter, hence showing that LDA is insensitive with respect to the Dirichlet parameter when used for Information retrieval.
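The comparison of fitted versus constant Dirichlet parameters turns on how the symmetric parameter shapes topic mixtures: small values concentrate a document's mass on a few topics, large values spread it nearly uniformly. A sketch using the standard gamma-normalization construction (topic count and sample sizes invented):

```python
import random

def dirichlet(alpha, k, rng):
    """Sample from a symmetric Dirichlet(alpha) over k topics by
    normalizing k independent Gamma(alpha, 1) draws."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(g)
    return [x / total for x in g]

def avg_max(alpha, k=5, n=300, seed=0):
    """Average largest topic proportion over n sampled mixtures."""
    rng = random.Random(seed)
    return sum(max(dirichlet(alpha, k, rng)) for _ in range(n)) / n

# Small alpha piles mass on one topic; large alpha approaches 1/k.
print(avg_max(0.1), avg_max(10.0))
```

Seeing how little the mixtures matter to final retrieval precision is exactly the kind of sensitivity question the slides examine.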

Note that topic is used in semantic analysis (of various kinds) to mean highly probable words and not in the technical sense of the TMDM or XTM.

Extraction of highly probable words from documents can be useful in the construction of topic maps for those documents.