Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 21, 2010

Topic Models (warning – not what you may think)

Filed under: Latent Dirichlet Allocation (LDA),Topic Models — Patrick Durusau @ 5:06 pm

Topic Models Authors: David M. Blei, John D. Lafferty

In this literature, a topic is a set of highly probable words found to characterize a text.

The sets of words are called topics.

A topic model is the technique applied to a set of texts to extract those topics.

In this particular article, that technique is latent Dirichlet allocation (LDA).

Apologies for the misuse of the term topic, but I suspect that if we looked closely, someone was using the term topic before ISO 13250.

Good introduction to topic models.
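To make the idea concrete, here is a toy collapsed Gibbs sampler for LDA. This is an illustrative sketch, not code from Blei and Lafferty's article; the function name and parameter defaults are my own, and a real analysis would use far more data and iterations.

```python
import random

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of word strings.
    Returns the 3 most probable words per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]      # doc-topic counts
    nkw = [[0] * V for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                       # tokens per topic
    z = []                                    # topic of each token
    for d, doc in enumerate(docs):            # random initial assignments
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w2i[w]] += 1
            nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wi = z[d][i], w2i[w]
                ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
                # resample this token's topic from its conditional
                weights = [(ndk[d][t] + alpha) * (nkw[t][wi] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                r = rng.random() * sum(weights)
                acc, k = 0.0, n_topics - 1    # fallback guards rounding
                for t, wt in enumerate(weights):
                    acc += wt
                    if r < acc:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1
    # each "topic" is just its most probable words
    return [[vocab[w] for w in sorted(range(V), key=lambda w: -nkw[t][w])[:3]]
            for t in range(n_topics)]
```

Run on a handful of documents with, say, n_topics=2, the sampler tends to pull co-occurring words into the same topic, which is exactly the sense of topic at issue here.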

December 20, 2010

Named Entity Mining from Click-Through Data Using Weakly Supervised Latent Dirichlet Allocation

Filed under: Latent Dirichlet Allocation (LDA),Named Entity Mining,NEM,WS-LDA — Patrick Durusau @ 6:19 am

Named Entity Mining from Click-Through Data Using Weakly Supervised Latent Dirichlet Allocation (video) Authors: Shuang-Hong Yang, Gu Xu, Hang Li (slides, KDD ’09 paper)

Abstract:

This paper addresses Named Entity Mining (NEM), in which we mine knowledge about named entities such as movies, games, and books from a huge amount of data. NEM is potentially useful in many applications including web search, online advertisement, and recommender systems. There are three challenges for the task: finding a suitable data source, coping with the ambiguities of named entity classes, and incorporating necessary human supervision into the mining process. This paper proposes conducting NEM by using click-through data collected at a web search engine, employing a topic model that generates the click-through data, and learning the topic model by weak supervision from humans. Specifically, it characterizes each named entity by its associated queries and URLs in the click-through data. It uses the topic model to resolve ambiguities of named entity classes by representing the classes as topics. It employs a method, referred to as Weakly Supervised Latent Dirichlet Allocation (WS-LDA), to accurately learn the topic model with partially labeled named entities. Experiments on large-scale click-through data containing over 1.5 billion query-URL pairs show that the proposed approach can conduct very accurate NEM and significantly outperforms the baseline.

With some slight modifications, almost directly applicable to the construction of topic maps.
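The abstract's key move, characterizing each named entity by its associated query contexts and clicked URLs, can be sketched as follows. This is my own illustrative reconstruction from (query, url, count) triples, not the authors' code; the function name, the "#" placeholder, and the example log are all hypothetical.

```python
from collections import defaultdict

def entity_representation(click_log, seed_entities):
    """Characterize each seed entity by the query contexts and clicked
    URLs it co-occurs with in click-through data.

    click_log: iterable of (query, url, click_count) triples.
    The '#' placeholder marks where the entity occurred in the query."""
    contexts = defaultdict(lambda: defaultdict(int))
    urls = defaultdict(lambda: defaultdict(int))
    for query, url, count in click_log:
        for entity in seed_entities:
            if entity in query:
                ctx = query.replace(entity, "#").strip()
                contexts[entity][ctx] += count
                urls[entity][url] += count
    return contexts, urls

log = [
    ("harry potter walkthrough", "gamefaqs.example.com", 12),
    ("harry potter review", "books.example.com", 7),
    ("kung fu panda walkthrough", "gamefaqs.example.com", 3),
]
contexts, urls = entity_representation(log, ["harry potter", "kung fu panda"])
# contexts["harry potter"] → {"# walkthrough": 12, "# review": 7}
```

In the paper these context and URL distributions become the observed data for WS-LDA, with entity classes (movie, game, book) playing the role of topics.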

Questions:

  1. What presumptions underlie the use of supervision to assist with Named Entity Mining? (2-3 pages, no citations)
  2. Are those valid presumptions for click-through data? (2-3 pages, no citations)
  3. How would you suggest investigating the characteristics of click-through data? (2-3 pages, no citations)

The Sensitivity of Latent Dirichlet Allocation for Information Retrieval

The Sensitivity of Latent Dirichlet Allocation for Information Retrieval Author: Laurence A. F. Park, The University of Melbourne (slides)

Abstract:

It has been shown that the use of topic models for information retrieval provides an increase in precision when used in the appropriate form. Latent Dirichlet Allocation (LDA) is a generative topic model that allows us to model documents using a Dirichlet prior. Using this topic model, we are able to obtain a fitted Dirichlet parameter that provides the maximum likelihood for the document set. In this article, we examine the sensitivity of LDA with respect to the Dirichlet parameter when used for information retrieval. We compare the topic model computation times, storage requirements and retrieval precision of fitted LDA to LDA with a uniform Dirichlet prior. The results show that there is no significant benefit to using fitted LDA over LDA with a constant Dirichlet parameter, hence showing that LDA is insensitive with respect to the Dirichlet parameter when used for information retrieval.
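The Dirichlet parameter the paper varies is the concentration vector of the document-topic prior: a uniform (symmetric) prior uses one shared value, while a fitted prior estimates a per-topic vector from the corpus. A self-contained sketch of the difference, using the standard construction of Dirichlet samples from normalized Gamma draws; the alpha values below are purely illustrative.

```python
import random

def sample_dirichlet(alphas, seed=None):
    """Draw one sample from a Dirichlet distribution by normalizing
    independent Gamma(alpha_i, 1) draws (the standard construction)."""
    rng = random.Random(seed)
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

# Uniform (symmetric) prior: every topic shares one concentration value.
uniform_theta = sample_dirichlet([1.0] * 5, seed=42)
# A fitted prior would instead use per-topic values estimated from the
# document set; this asymmetric vector is made up for illustration.
fitted_theta = sample_dirichlet([0.5, 2.0, 0.1, 1.5, 0.9], seed=42)
```

Each sample is a valid topic-proportion vector (non-negative, summing to 1); the paper's finding is that, for retrieval precision, the extra cost of fitting the alphas buys little over the symmetric choice.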

Note that topic is used in semantic analysis (of various kinds) to mean a set of highly probable words, not topic in the technical sense of the TMDM or XTM.

Extraction of highly probable words from documents can be useful in the construction of topic maps for those documents.
