Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 21, 2010

Language Pyramid and Multi-Scale Text Analysis

Filed under: Bag-of-Words (BOW),Language Pyramid (LaP) — Patrick Durusau @ 3:22 pm

Language Pyramid and Multi-Scale Text Analysis

Authors: Shuang-Hong Yang, Hongyuan Zha

Keywords: bag of word, language pyramid, multi-scale language models, multi-scale text analysis, multi-scale text kernel, text spatial contents modeling

Abstract:

The classical Bag-of-Word (BOW) model represents a document as a histogram of word occurrence, losing the spatial information that is invaluable for many text analysis tasks. In this paper, we present the Language Pyramid (LaP) model, which casts a document as a probabilistic distribution over the joint semantic-spatial space and motivates a multi-scale 2D local smoothing framework for nonparametric text coding. LaP efficiently encodes both semantic and spatial contents of a document into a pyramid of matrices that are smoothed both semantically and spatially at a sequence of resolutions, providing a convenient multi-scale imagic view for natural language understanding. The LaP representation can be used in text analysis in a variety of ways, among which we investigate two instantiations in the current paper: (1) multi-scale text kernels for document categorization, and (2) multi-scale language models for ad hoc text retrieval. Experimental results illustrate that: for classification, LaP outperforms BOW by (up to) 4% on moderate-length texts (RCV1 text benchmark) and 15% on short texts (Yahoo! queries); and for retrieval, LaP gains 12% MAP improvement over uni-gram language models on the OHSUMED data set.

The text that stands out for me reads:

More pessimistically, different concepts usually stands at different scales, making it impossible to capture all the right meanings with a single scale. For example, named entities usually range from unigram (e.g., “new”) to bigram (e.g., “New York”), to multigram (e.g., “New York Times”), and even to a whole long sequence (e.g., a song name “Another Lonely Night In New York”).

OK, so if you put it that way, then BOW (bag-of-words) is clearly not the best idea.
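To make the point concrete, here is a minimal Python sketch. It is my own illustration, not code from the paper, and the two sample strings and the helper names `bag_of_words` and `ngrams` are made up for the example. It shows two word sequences that a bag-of-words histogram cannot tell apart, and how looking at the same tokens at several n-gram scales brings the difference back.

```python
from collections import Counter


def bag_of_words(text):
    """Histogram of lowercased, whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())


def ngrams(tokens, n):
    """All contiguous runs of n tokens, each joined into a single string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Two hypothetical "documents" with the same words in a different order.
doc_a = "new york times reports on another lonely night"
doc_b = "another lonely night times new york reports on"

# The bags are identical, so BOW cannot distinguish the two.
same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)

# At the trigram scale the named entity survives in only one of them.
in_a = "new york times" in ngrams(doc_a.split(), 3)
in_b = "new york times" in ngrams(doc_b.split(), 3)

print(same_bag, in_a, in_b)
```

Plain n-grams are only the crudest way to look at several scales at once. The paper's pyramid of smoothed matrices is a different and richer construction, which this sketch does not attempt.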

But we already treat text at different scales.

All together now: markup!

I checked the documentation on the latest MarkLogic release and found:

By default, MarkLogic Server assumes that any XML element constructor acts as a phrase boundary.
(Administrator’s Guide, 4.2, page 224)
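As a rough illustration of what "element as phrase boundary" means in practice, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. It is not MarkLogic's implementation. The sample document and the helper names `phrase_segments` and `contains_phrase` are my own assumptions. The idea is that every element start or end closes the current run of tokens, and a phrase only matches if it falls entirely inside one run.

```python
import xml.etree.ElementTree as ET


def phrase_segments(root):
    """Collect token lists, starting a new segment at every element start or end."""
    segments = []

    def walk(element):
        # Text directly after the element's start tag.
        if element.text and element.text.split():
            segments.append(element.text.lower().split())
        for child in element:
            walk(child)
            # Text after the child's end tag, still inside the parent.
            if child.tail and child.tail.split():
                segments.append(child.tail.lower().split())

    walk(root)
    return segments


def contains_phrase(segments, phrase):
    """True if the phrase occurs contiguously inside a single segment."""
    words = phrase.lower().split()
    n = len(words)
    return any(
        segment[i:i + n] == words
        for segment in segments
        for i in range(len(segment) - n + 1)
    )


# Hypothetical sample document.
doc = ET.fromstring(
    "<doc><title>Lonely Night in New</title>"
    "<para>York is a big city. The New York Times agrees.</para></doc>"
)

segments = phrase_segments(doc)
# The same tokens with markup ignored, as one flat run.
flat = [token for segment in segments for token in segment]

print(contains_phrase(segments, "new york times"))  # inside one element
print(contains_phrase(segments, "in new york"))     # straddles title/para
print(contains_phrase([flat], "in new york"))       # markup ignored
```

With boundaries respected, "in new york" does not match because it straddles the title and the paragraph. Flatten the markup away and it does. So the markup is already doing scale-marking work that a flat token stream throws out.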

Can someone supply the behavior for some of the other XML indexing engines? Thanks!
