Archive for the ‘Latent Dirichlet Allocation (LDA)’ Category
Friday, April 26th, 2013
In-browser topic modeling by David Mimno.
From the post:
Many people have found topic modeling a useful (and fun!) way to explore large text collections. Unfortunately, running your own models usually requires installing statistical tools like R or Mallet. The goals of this project are to (a) make running topic models easy for anyone with a modern web browser, (b) explore the limits of statistical computing in Javascript and (c) allow tighter integration between models and web-based visualizations.
About as easy an introduction/exploration as I can imagine.
Enjoy!
Posted in Latent Dirichlet Allocation (LDA), Topic Models (LDA) | No Comments »
Friday, February 1st, 2013
Topic Discovery With Apache Pig and Mallet
One of only two posts from this blog in 2012, but it is a useful one.
From the post:
A common desire when working with natural language is topic discovery. That is, given a set of documents (eg. tweets, blog posts, emails) you would like to discover the topics inherent in those documents. Often this method is used to summarize a large corpus of text so it can be quickly understood what that text is ‘about’. You can go further and use topic discovery as a way to classify new documents or to group and organize the documents you’ve done topic discovery on.
Walks through the use of Pig and Mallet on a newsgroup data set.
I have been thinking about getting one of those unlimited download newsgroup accounts.
Maybe I need to go ahead and start building some newsgroup data sets.
Posted in Latent Dirichlet Allocation (LDA), MALLET, Pig | No Comments »
Saturday, December 29th, 2012
My Intro to Multiple Classification with Random Forests, Conditional Inference Trees, and Linear Discriminant Analysis
From the post:
After the work I did for my last post, I wanted to practice doing multiple classification. I first thought of using the famous iris dataset, but felt that was a little boring. Ideally, I wanted to look for a practice dataset where I could successfully classify data using both categorical and numeric predictors. Unfortunately it was tough for me to find such a dataset that was easy enough for me to understand.
The dataset I use in this post comes from a textbook called Analyzing Categorical Data by Jeffrey S Simonoff, and lends itself to basically the same kind of analysis done by blogger “Wingfeet” in his post predicting authorship of Wheel of Time books. In this case, the dataset contains counts of stop words (function words in English, such as “as”, “also”, “even”, etc.) in chapters, or scenes, from books or plays written by Jane Austen, Jack London (I’m not sure if “London” in the dataset might actually refer to another author), John Milton, and William Shakespeare. Being a textbook example, you just know there’s something worth analyzing in it!! The following table describes the numerical breakdown of books and chapters from each author:
Introduction to authorship studies as they were known (may still be) in the academic circles of my youth.
I wonder if the same techniques are as viable today as on the Federalist Papers?
The Wheel of Time example demonstrates the technique remains viable for novel authors.
But what about authorship more broadly?
Can we reliably distinguish between news commentary from multiple sources?
Or between statements by elected officials?
How would your topic map represent purported authorship versus attributed authorship?
Or even a common authorship for multiple purported authors? (speech writers)
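The feature extraction behind this kind of authorship study is simple to sketch. The snippet below is a minimal illustration, not the code from the post: the six-word FUNCTION_WORDS list, the function names, and the sample sentence are all assumptions made up for the example, and a real study would use a much longer word list and whole chapters.

```python
import re
from collections import Counter

# Hypothetical, tiny function-word list; the dataset in the post uses its own.
FUNCTION_WORDS = ["a", "also", "as", "even", "the", "to"]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def function_word_counts(text, vocabulary=FUNCTION_WORDS):
    """Count how often each function word occurs in a text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

def relative_frequencies(text, vocabulary=FUNCTION_WORDS):
    """Scale counts by text length so chapters of different sizes compare."""
    total = len(tokenize(text)) or 1
    return [c / total for c in function_word_counts(text, vocabulary)]

sample = "As the day ended, the crew also went to sleep, even the cook."
print(function_word_counts(sample))  # -> [0, 1, 1, 1, 3, 1]
```

Each chapter becomes one row of such counts, and those rows are what a classifier (random forest, discriminant analysis, and so on) is trained on.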
Posted in Classification, Inference, Latent Dirichlet Allocation (LDA), Random Forests | 1 Comment »
Monday, November 5th, 2012
Topic Modeling Tool
From the webpage:
A graphical user interface tool for Latent Dirichlet Allocation topic modeling.
A very easy tool for exploring the use of Latent Dirichlet Allocation topic modeling.
Of course, on non-Mac machines, there is no “Double-click” on the jar file to run it, so use:
java -jar TopicModelingTool.jar
Oh, and the documentation is missing the link to the test files, see:
http://code.google.com/p/topic-modeling-tool/downloads/list
- testdata_news_music_2084docs.txt 13.3 MB
- testdata_news_economy_2073docs.txt 13.0 MB
- testdata_news_fuel_845docs.txt 5.3 MB
- testdata_braininjury_10000docs.txt 9.6 MB
I used the testdata_news_music_2084docs.txt file, set to 100 topics with the default options; the learning process took 52 seconds and the complete process 66.056 seconds. Your mileage will vary, but that is fast enough for smallish data sets.
At least in a session, you can’t change the output directory.
I could see using this in a class to explore a body of material for creation of topic maps.
Posted in Latent Dirichlet Allocation (LDA), Machine Learning | No Comments »
Friday, July 27th, 2012
A Simple Topic Model by Allen Beye Riddell.
From the post:
NB: This is an extended version of the appendix of my paper exploring trends in German Studies in the US between 1928 and 2006. In that paper I used a topic model (Latent Dirichlet Allocation); this tutorial is intended to help readers understand how LDA works.
Topic models typically start with two banal assumptions. The first is that in a large collection of texts there exist a number of distinct groups (or sources) of texts. In the case of academic journal articles, these groups might be associated with different journals, authors, research subfields, or publication periods (e.g. the 1950s and 1980s). The second assumption is that texts from different sources tend to use different vocabulary. If we are presented with an article selected from one of two different academic journals, one dealing with literature and another with archeology, and we are told only that the word “plot” appears frequently in the article, we would be wise to guess the article comes from the literary studies journal.
A major obstacle to understanding the remaining details about how topic models work is that their description relies on the abstract language of probability. Existing introductions to Latent Dirichlet Allocation (LDA) tend to be pitched either at an audience already fluent in statistics or at an audience with minimal background. This being the case, I want to address an audience that has some background in probability and statistics, perhaps at the level of the introductory texts of Hoff (2009), Lee (2004), or Kruschke (2010).
A good walk-through of using a topic model (Latent Dirichlet Allocation).
I saw this in Christophe Lalanne’s Bag of Tweets for July 2012.
Posted in Latent Dirichlet Allocation (LDA), Topic Models (LDA) | No Comments »
Wednesday, April 18th, 2012
Learning Topic Models – Going beyond SVD by Sanjeev Arora, Rong Ge, and Ankur Moitra.
Abstract:
Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation or LDA model of Blei et al. is especially popular.
Theoretical studies of topic modeling focus on learning the model’s parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition(SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves.
This paper formally justifies Nonnegative Matrix Factorization(NMF) as a main tool in this context, which is an analog of SVD where all vectors are nonnegative. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. A compelling feature of our algorithm is that it generalizes to models that incorporate topic-topic correlations, such as the Correlated Topic Model and the Pachinko Allocation Model.
We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD – just as NMF has come to replace SVD in many applications.
The proposal hinges on the following assumption:
Separability requires that each topic has some near-perfect indicator word – a word that we call the anchor word for this topic— that appears with reasonable probability in that topic but with negligible probability in all other topics (e.g., “soccer” could be an anchor word for the topic “sports”). We give a formal definition in Section 1.1. This property is particularly natural in the context of topic modeling, where the number of distinct words (dictionary size) is very large compared to the number of topics. In a typical application, it is common to have a dictionary size in the thousands or tens of thousands, but the number of topics is usually somewhere in the range from 50 to 100. Note that separability does not mean that the anchor word always occurs (in fact, a typical document may be very likely to contain no anchor words). Instead, it dictates that when an anchor word does occur, it is a strong indicator that the corresponding topic is in the mixture used to generate the document.
The notion of an “anchor word” (or multiple anchor words per topic, as the authors point out in the conclusion) resonates with the idea of identifying a subject. It is at least a clue that an author/editor should take into account.
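To make the separability idea concrete, here is a small sketch that looks for anchor words in a topic-by-word probability table that is already known. This is not the paper's algorithm, which has to recover the topics from documents; the thresholds min_prob and max_other, the function name, and the toy numbers are my own assumptions for illustration.

```python
def anchor_words(topic_word, vocabulary, min_prob=0.05, max_other=1e-3):
    """Return {topic index: [anchor words]} for a topics-by-words table.

    A word counts as an anchor for topic k when its probability in topic k
    is at least min_prob and its probability in every other topic is at
    most max_other. Both thresholds are illustrative assumptions.
    """
    anchors = {k: [] for k in range(len(topic_word))}
    for j, word in enumerate(vocabulary):
        column = [row[j] for row in topic_word]
        for k, p in enumerate(column):
            others = [q for i, q in enumerate(column) if i != k]
            if p >= min_prob and all(q <= max_other for q in others):
                anchors[k].append(word)
    return anchors

# Toy table: each row is one topic's distribution over the vocabulary.
vocabulary = ["soccer", "goal", "election", "vote", "win"]
topic_word = [
    [0.30, 0.35, 0.00, 0.00, 0.35],  # a "sports" topic
    [0.00, 0.05, 0.40, 0.30, 0.25],  # a "politics" topic
]
print(anchor_words(topic_word, vocabulary))
# -> {0: ['soccer'], 1: ['election', 'vote']}
```

Note that “goal” and “win” appear in both topics, so neither qualifies, which is the point: an anchor word is evidence for exactly one topic.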
Posted in BigData, Latent Dirichlet Allocation (LDA), Topic Models, Topic Models (LDA) | No Comments »
Saturday, December 31st, 2011
Topic Models
From the post:
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. An early topic model was probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann in 1999.[1] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael Jordan in 2002, allowing documents to have a mixture of topics.[2] Other topic models are generally extensions on LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Although topic models were first described and implemented in the context of natural language processing, they have applications in other fields such as bioinformatics.
Just in case you need some starter materials on discovering “topics” (non-topic map sense) in documents.
Posted in Latent Dirichlet Allocation (LDA), Topic Models, Topic Models (LDA) | No Comments »
Saturday, October 8th, 2011
How to do Your Own Topic Modeling
From the post:
In the first Teaching with Technology Tuesday of the fall 2011 semester, David Newman delivered a presentation on topic modeling to a full house in Bass’s L01 classroom. His research concentrates on data mining and machine learning, and he has been working with Yale for the past three years in an IMLS funded project on the applications of topic modeling in museum and library collections. In Tuesday’s talk, David broke down what topic modeling is, how it can be useful, and introduced a tool he designed to make the process accessible to anyone who can use a computer.
Summary of what sounds like an interesting presentation on the use of topic modeling (Latent Dirichlet Allocation/LDA) along with links to software. Enough detail that if topic modeling is unfamiliar, you will get the gist of it.
The usual cautions about LDA apply: It can’t model what’s not present, works at the document level (too coarse for many purposes), your use of the software has a dramatic impact on the results, etc. Useful tool, just be careful how much you rely upon it without checking the results.
Posted in Latent Dirichlet Allocation (LDA), Software | No Comments »
Saturday, October 1st, 2011
Introduction to Latent Dirichlet Allocation by Edwin Chen.
From the introduction:
Suppose you have the following set of sentences:
- I like to eat broccoli and bananas.
- I ate a banana and spinach smoothie for breakfast.
- Chinchillas and kittens are cute.
- My sister adopted a kitten yesterday.
- Look at this cute hamster munching on a piece of broccoli.
What is latent Dirichlet allocation? It’s a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like
- Sentences 1 and 2: 100% Topic A
- Sentences 3 and 4: 100% Topic B
- Sentence 5: 60% Topic A, 40% Topic B
- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
- Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)
The question, of course, is: how does LDA perform this discovery?
About as smooth an explanation of Latent Dirichlet Allocation as you are going to find.
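For readers who want to see the discovery step run, below is a minimal sketch of collapsed Gibbs sampling, one common way to fit LDA. It is not taken from the post: the function name, the hyperparameter values (alpha, beta), the iteration count, and the hand-picked content words for the five sentences are all assumptions. Results depend on the random seed, and on a corpus this tiny they will not reproduce the tidy percentages quoted above.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    num_words = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]            # tokens per doc, topic
    topic_word = [[0] * num_words for _ in range(num_topics)]  # tokens per topic, word
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Weight each topic by (topic share in doc) * (word share in topic).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + num_words * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    proportions = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return proportions, topic_word, vocab

# Content words picked by hand from the five sentences (an assumption).
docs = [
    ["eat", "broccoli", "bananas"],
    ["ate", "banana", "spinach", "smoothie", "breakfast"],
    ["chinchillas", "kittens", "cute"],
    ["sister", "adopted", "kitten", "yesterday"],
    ["cute", "hamster", "munching", "broccoli"],
]
proportions, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for number, shares in enumerate(proportions, start=1):
    print(number, [round(s, 2) for s in shares])
```

The sampler repeatedly asks, for one word at a time, “given every other assignment, which topic best explains this word in this document?” The topic labels (A and B) are arbitrary, so the same grouping can come out with the labels swapped.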
Posted in Latent Dirichlet Allocation (LDA), Topic Models (LDA) | No Comments »
Friday, September 16th, 2011
Learning Topic Models by Belief Propagation by Jia Zeng, William K. Cheung, and Jiming Liu.
Abstract:
Latent Dirichlet allocation (LDA) is an important class of hierarchical Bayesian models for probabilistic topic modeling, which attracts worldwide interests and touches many important applications in text mining, computer vision and computational biology. This paper proposes a novel tree-structured factor graph representation for LDA within the Markov random field (MRF) framework, which enables the classic belief propagation (BP) algorithm for exact inference and parameter estimation. Although two commonly-used approximation inference methods, such as variational Bayes (VB) and collapsed Gibbs sampling (GS), have gained great successes in learning LDA, the proposed BP is competitive in both speed and accuracy validated by encouraging experimental results on four large-scale document data sets. Furthermore, the BP algorithm has the potential to become a generic learning scheme for variants of LDA-based topic models. To this end, we show how to learn two typical variants of LDA-based topic models, such as author-topic models (ATM) and relational topic models (RTM), using belief propagation based on the factor graph representation.
I have just started reading this paper but wanted to bring it to your attention. I peeked at the results and it looks quite promising.
This work was tested against the following data sets:
1) CORA [30] contains abstracts from the CORA research paper search engine in machine learning area, where the documents can be classified into 7 major categories.
2) MEDL [31] contains abstracts from the MEDLINE biomedical paper search engine, where the documents fall broadly into 4 categories.
3) NIPS [32] includes papers from the conference “Neural Information Processing Systems”, where all papers are grouped into 13 categories. NIPS has no citation link information.
4) BLOG [33] contains a collection of political blogs on the subject of American politics in the year 2008, where all blogs can be broadly classified into 6 categories. BLOG has no author information.
with positive results.
Posted in Bayesian Models, Latent Dirichlet Allocation (LDA) | No Comments »
Friday, September 16th, 2011
Topic Modeling Bibliography
An extensive bibliography on topic modeling (LDA) by David Mimno.
There are a number of related resources on his homepage.
Posted in Latent Dirichlet Allocation (LDA), Topic Models (LDA) | No Comments »
Tuesday, September 6th, 2011
A pair of posts from Ryan Rosario on the SIGKDD 2011 Conference.
Day 1 (Graph Mining and David Blei/Topic Models)
Tough sledding on Probabilistic Topic Models but definitely worth the effort to follow.
Days 2/3/4 Summary
Useful summaries and pointers to many additional resources.
If you attended SIGKDD 2011, do you have pointers to other reviews of the conference or other resources?
I added a category for SIGKDD.
Posted in Conferences, Knowledge Capture, Knowledge Management, Knowledge Representation, Latent Dirichlet Allocation (LDA), Probalistic Models, SIGKDD, Topic Models (LDA) | No Comments »
Friday, September 2nd, 2011
What is a good explanation of Latent Dirichlet Allocation? (Quora)
If you need to explain topic modeling to your boss, department chair or funder, you would be hard pressed to find a better source of inspiration.
The explanation here ranges from technical to layman to actual example (Sarah Palin’s emails, so you had better check on the audience’s political persuasion). Actually it would not hurt to have LDA examples on hand that run the gamut of political persuasions. (Or national perspectives if you are in the international market.)
BTW, if you are not familiar with Quora, give it a look.
This link was forwarded to my attention by Jack Park.
Posted in Latent Dirichlet Allocation (LDA), Topic Models (LDA) | No Comments »
Wednesday, June 29th, 2011
Topic Modeling Sarah Palin’s Emails from Edwin Chen.
From the post:
LDA-based Email Browser
Earlier this month, several thousand emails from Sarah Palin’s time as governor of Alaska were released. The emails weren’t organized in any fashion, though, so to make them easier to browse, I did some topic modeling (in particular, using latent Dirichlet allocation) to separate the documents into different groups.
Interesting analysis and promise of more to follow.
With a US presidential election next year, there is little doubt there will be friendly as well as hostile floods of documents.
Time to sharpen your data extraction tools.
Posted in Latent Dirichlet Allocation (LDA), Linguistics | No Comments »
Wednesday, June 15th, 2011
An Architecture for Parallel Topic Models by Alexander Smola and Shravan Narayanamurthy.
Abstract:
This paper describes a high performance sampling architecture for inference of latent topic models on a cluster of workstations. Our system is faster than previous work by over an order of magnitude and it is capable of dealing with hundreds of millions of documents and thousands of topics.
The algorithm relies on a novel communication structure, namely the use of a distributed (key, value) storage for synchronizing the sampler state between computers. Our architecture entirely obviates the need for separate computation and synchronization phases. Instead, disk, CPU, and network are used simultaneously to achieve high performance. We show that this architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.
Interesting how this key, value stuff keeps coming up these days.
The authors plan on making the codebase available for public use.
Updated 30 June 2011 to include the URL supplied by Sam Hunting. (Thanks Sam!)
Posted in Latent Dirichlet Allocation (LDA), Topic Models (LDA) | 1 Comment »
Saturday, March 26th, 2011
Topic Modeling Browser (LDA)
From a post by David Blei:
allison chaney has created the “topic model visualization engine,” which can be used to create browsers of document collections based on a topic model. i think this will become a very useful tool for us. the code is on google code:
http://code.google.com/p/tmve/
as an example, here is a browser built from a 50-topic model fit to 100K articles from wikipedia:
http://www.sccs.swarthmore.edu/users/08/ajb/tmve/wiki100k/browse/topic-list.html
allison describes how she built the browser in the README for her code:
http://code.google.com/p/tmve/wiki/TMVE01
finally, to check out the code and build your own browser, see here:
http://code.google.com/p/tmve/source/checkout
Take a look.
As I have mentioned before, LDA could be a good exploration tool for document collections, preparatory to building a topic map.
Posted in Data Mining, Interface Research/Design, Latent Dirichlet Allocation (LDA), Topic Models | No Comments »
Wednesday, March 16th, 2011
Latent Dirichlet Allocation in C
From the website:
This is a C implementation of variational EM for latent Dirichlet allocation (LDA), a topic model for text or other discrete data. LDA allows you to analyze a corpus, and extract the topics that combined to form its documents. For example, click here to see the topics estimated from a small corpus of Associated Press documents. LDA is fully described in Blei et al. (2003).
This code contains:
- an implementation of variational inference for the per-document topic proportions and per-word topic assignments
- a variational EM procedure for estimating the topics and exchangeable Dirichlet hyperparameter
Do be aware that the use of topic in this technique and papers discussing it is not the same thing as topic as defined by ISO 13250-2.
It comes closer to the notion of subject as defined in ISO 13250-2.
Update:
I was sent a pointer to David M. Blei’s
http://www.cs.princeton.edu/~blei/topicmodeling.html, which has more code and other goodies.
Posted in Latent Dirichlet Allocation (LDA) | No Comments »
Wednesday, December 22nd, 2010
Reading Tea Leaves: How Humans Interpret Topic Models Authors: Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, David M. Blei
Abstract:
Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summarize the corpus, and guide exploration of its contents. However, whether the latent space is interpretable is in need of quantitative evaluation. In this paper, we present new quantitative methods for measuring semantic meaning in inferred topics. We back these measures with large-scale user studies, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood. Surprisingly, topic models which perform better on held-out likelihood may infer less semantically meaningful topics.
Read the article first but then see the LingPipe Blog review of the same.
Posted in Latent Dirichlet Allocation (LDA), Topic Models (LDA) | No Comments »
Tuesday, December 21st, 2010
Topic Models Authors: David M. Blei, John D. Lafferty
The terms topic and topic model refer to sets of highly probable words that are found to characterize a text.
The sets of words are called topics.
A topic model is the technique applied to a set of texts to extract topics.
In this particular article, the technique is latent Dirichlet allocation (LDA).
Apologies for the misuse of the term topic, but I suspect that if we looked closely, someone was using the term topic before ISO 13250.
Good introduction to topic models.
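Since a “topic” here is just a probability distribution over words, reading one off amounts to listing its most probable words. A minimal sketch, with a made-up topic whose words and probabilities are purely illustrative:

```python
def top_words(word_probs, n=3):
    """Return the n most probable words of a topic, highest probability first."""
    ranked = sorted(word_probs.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]

# Hypothetical fragment of one topic's word distribution.
topic = {"gene": 0.04, "dna": 0.02, "genetic": 0.01, "the": 0.005}
print(top_words(topic))  # -> ['gene', 'dna', 'genetic']
```

Ties are broken alphabetically so the listing is stable from run to run.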
Posted in Latent Dirichlet Allocation (LDA), Topic Models | No Comments »
Monday, December 20th, 2010
Named Entity Mining from Click-Through Data Using Weakly Supervised Latent Dirichlet Allocation (video, slides, KDD ’09 paper) Authors: Shuang-Hong Yang, Gu Xu, Hang Li
Abstract:
This paper addresses Named Entity Mining (NEM), in which we mine knowledge about named entities such as movies, games, and books from a huge amount of data. NEM is potentially useful in many applications including web search, online advertisement, and recommender system. There are three challenges for the task: finding suitable data source, coping with the ambiguities of named entity classes, and incorporating necessary human supervision into the mining process. This paper proposes conducting NEM by using click-through data collected at a web search engine, employing a topic model that generates the click-through data, and learning the topic model by weak supervision from humans. Specifically, it characterizes each named entity by its associated queries and URLs in the click-through data. It uses the topic model to resolve ambiguities of named entity classes by representing the classes as topics. It employs a method, referred to as Weakly Supervised Latent Dirichlet Allocation (WS-LDA), to accurately learn the topic model with partially labeled named entities. Experiments on a large scale click-through data containing over 1.5 billion query-URL pairs show that the proposed approach can conduct very accurate NEM and significantly outperforms the baseline.
With some slight modifications, almost directly applicable to the construction of topic maps.
Questions:
- What presumptions underlie the use of supervision to assist with Named Entity Mining? (2-3 pages, no citations)
- Are those valid presumptions for click-through data? (2-3 pages, no citations)
- How would you suggest investigating the characteristics of click-through data? (2-3 pages, no citations)
Posted in Latent Dirichlet Allocation (LDA), NEM, Named Entity Mining, WS-LDA | No Comments »
Monday, December 20th, 2010
The Sensitivity of Latent Dirichlet Allocation for Information Retrieval (slides) Author: Laurence A. F. Park, The University of Melbourne.
Abstract:
It has been shown that the use of topic models for Information retrieval provides an increase in precision when used in the appropriate form. Latent Dirichlet Allocation (LDA) is a generative topic model that allows us to model documents using a Dirichlet prior. Using this topic model, we are able to obtain a fitted Dirichlet parameter that provides the maximum likelihood for the document set. In this article, we examine the sensitivity of LDA with respect to the Dirichlet parameter when used for Information retrieval. We compare the topic model computation times, storage requirements and retrieval precision of fitted LDA to LDA with a uniform Dirichlet prior. The results show that there is no significant benefit of using fitted LDA over the LDA with a constant Dirichlet parameter, hence showing that LDA is insensitive with respect to the Dirichlet parameter when used for Information retrieval.
Note that topic is used in semantic analysis (of various kinds) to mean highly probable words and not in the technical sense of the TMDM or XTM.
Extraction of highly probable words from documents can be useful in the construction of topic maps for those documents.
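If the Dirichlet parameter in the abstract is unfamiliar, the sketch below shows what it controls in the prior: how concentrated a document's topic proportions tend to be. It does not reproduce the paper's experiment, which concerns retrieval precision. The function names, the choice of 5 topics, and the two alpha values are my own assumptions, and the sampling uses the standard construction of a Dirichlet draw from normalized gamma draws.

```python
import random

def sample_dirichlet(alpha, num_topics, rng):
    """Draw topic proportions from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [x / total for x in draws]

def mean_largest_share(alpha, num_topics=5, trials=2000, seed=0):
    """Average share of the dominant topic over many sampled documents."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(trials))
    return sum(shares) / trials

for alpha in (0.1, 10.0):
    print(alpha, round(mean_largest_share(alpha), 2))
```

A small alpha favors documents dominated by one topic, while a large alpha favors documents that spread evenly over all topics. The paper's finding is that, for retrieval, fitting this parameter brings no significant benefit over holding it constant.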
Posted in Information Retrieval, Latent Dirichlet Allocation (LDA), Topic Models (LDA) | No Comments »