Language Pyramid and Multi-Scale Text Analysis Authors: Shuang-Hong Yang, Hongyuan Zha Keywords: bag of word, language pyramid, multi-scale language models, multi-scale text analysis, multi-scale text kernel, text spatial contents modeling
Abstract:
The classical Bag-of-Word (BOW) model represents a document as a histogram of word occurrence, losing the spatial information that is invaluable for many text analysis tasks. In this paper, we present the Language Pyramid (LaP) model, which casts a document as a probabilistic distribution over the joint semantic-spatial space and motivates a multi-scale 2D local smoothing framework for nonparametric text coding. LaP efficiently encodes both semantic and spatial contents of a document into a pyramid of matrices that are smoothed both semantically and spatially at a sequence of resolutions, providing a convenient multi-scale imagic view for natural language understanding. The LaP representation can be used in text analysis in a variety of ways, among which we investigate two instantiations in the current paper: (1) multi-scale text kernels for document categorization, and (2) multi-scale language models for ad hoc text retrieval. Experimental results illustrate that: for classification, LaP outperforms BOW by (up to) 4% on moderate-length texts (RCV1 text benchmark) and 15% on short texts (Yahoo! queries); and for retrieval, LaP gains 12% MAP improvement over uni-gram language models on the OHSUMED data set.
The text that stands out for me reads:
More pessimistically, different concepts usually stands at different scales, making it impossible to capture all the right meanings with a single scale. For example, named entities usually range from unigram (e.g., “new”) to bigram (e.g., “New York”), to multigram (e.g., “New York Times”), and even to a whole long sequence (e.g., a song name “ Another Lonely Night In New York”).
OK, so if you put it that way, then BOW (bag-of-words) is clearly not the best idea.
But we already treat text at different scales.
All together now: markup!
I checked the documentation on the latest Marklogic release and found:
By default, MarkLogic Server assumes that any XML element constructor acts as a phrase boundary.
(Administrator’s Guide, 4.2, page 224)
Can someone supply the behavior for some of the other XML indexing engines? Thanks!