Archive for the ‘Bag-of-Words (BOW)’ Category

mSDA: A fast and easy-to-use way to improve bag-of-words features

Friday, June 15th, 2012

mSDA: A fast and easy-to-use way to improve bag-of-words features by Kilian Weinberger.

From the description:

Machine learning algorithms rely heavily on the representation of the data they are presented with. In particular, text documents (and often images) are traditionally expressed as bag-of-words feature vectors (e.g. as tf-idf). Recently Glorot et al. showed that stacked denoising autoencoders (SDA), a deep learning algorithm, can learn representations that are far superior over variants of bag-of-words. Unfortunately, training SDAs often requires a prohibitive amount of computation time and is non-trivial for non-experts. In this work, we show that with a few modifications of the SDA model, we can relax the optimization over the hidden weights into convex optimization problems with closed form solutions. Further, we show that the expected value of the hidden weights after infinitely many training iterations can also be computed in closed form. The resulting transformation (which we call marginalized-SDA) can be computed in no more than 20 lines of straight-forward Matlab code and requires no prior expertise in machine learning. The representations learned with mSDA behave similar to those obtained with SDA, but the training time is reduced by several orders of magnitudes. For example, mSDA matches the world-record on the Amazon transfer learning benchmark, however the training time shrinks from several days to a few minutes.

The Glorot et al. reference is to: Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach by Xavier Glorot, Antoine Bordes, and Yoshua Bengio, Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011.

Superficial searching reveals this to be a very active area of research.

I rather like the idea of training being reduced from days to minutes.
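For a sense of why training is so fast, here is a rough NumPy sketch of a single marginalized denoising layer, following the closed-form solution the description alludes to (corrupted features are marginalized out in expectation, so the weights come from solving one linear system instead of iterating). The function name and the small ridge constant are my own choices, not taken from the paper's Matlab code:

```python
import numpy as np

def mda_layer(X, p, reg=1e-5):
    """One marginalized denoising autoencoder (mDA) layer.

    X   : (d, n) feature matrix, one column per document (e.g. tf-idf)
    p   : corruption (feature-dropout) probability
    reg : small ridge term to keep the linear system well-conditioned
    Returns the learned mapping W of shape (d, d+1) and the hidden
    representation h of shape (d, n).
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])   # append a constant bias row
    q = np.full(d + 1, 1.0 - p)            # survival probability per feature
    q[-1] = 1.0                            # the bias row is never corrupted
    S = Xb @ Xb.T                          # scatter matrix of the inputs
    # Expected scatter of the corrupted inputs, E[x~ x~^T]:
    Q = S * np.outer(q, q)
    np.fill_diagonal(Q, q * np.diag(S))    # diagonal scales by q_i, not q_i^2
    # Expected cross-scatter E[x x~^T]:
    P = S * q[np.newaxis, :]
    # Closed-form weights: W = P Q^{-1} (Q is symmetric, so solve directly)
    W = np.linalg.solve(Q + reg * np.eye(d + 1), P[:d].T).T
    h = np.tanh(W @ Xb)                    # nonlinear hidden representation
    return W, h
```

Stacking several such layers (feeding each h into the next) gives the "marginalized-SDA" the description talks about; each layer costs one matrix solve, which is where the days-to-minutes speedup comes from.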

Language Pyramid and Multi-Scale Text Analysis

Tuesday, December 21st, 2010

Language Pyramid and Multi-Scale Text Analysis

Authors: Shuang-Hong Yang, Hongyuan Zha

Keywords: bag of word, language pyramid, multi-scale language models, multi-scale text analysis, multi-scale text kernel, text spatial contents modeling


The classical Bag-of-Word (BOW) model represents a document as a histogram of word occurrence, losing the spatial information that is invaluable for many text analysis tasks. In this paper, we present the Language Pyramid (LaP) model, which casts a document as a probabilistic distribution over the joint semantic-spatial space and motivates a multi-scale 2D local smoothing framework for nonparametric text coding. LaP efficiently encodes both semantic and spatial contents of a document into a pyramid of matrices that are smoothed both semantically and spatially at a sequence of resolutions, providing a convenient multi-scale imagic view for natural language understanding. The LaP representation can be used in text analysis in a variety of ways, among which we investigate two instantiations in the current paper: (1) multi-scale text kernels for document categorization, and (2) multi-scale language models for ad hoc text retrieval. Experimental results illustrate that: for classification, LaP outperforms BOW by (up to) 4% on moderate-length texts (RCV1 text benchmark) and 15% on short texts (Yahoo! queries); and for retrieval, LaP gains 12% MAP improvement over uni-gram language models on the OHSUMED data set.

The text that stands out for me reads:

More pessimistically, different concepts usually stands at different scales, making it impossible to capture all the right meanings with a single scale. For example, named entities usually range from unigram (e.g., “new”) to bigram (e.g., “New York”), to multigram (e.g., “New York Times”), and even to a whole long sequence (e.g., a song name “Another Lonely Night In New York”).
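The quoted example is easy to make concrete: a plain BOW histogram only keeps the unigram scale, while collecting n-grams up to some maximum length recovers the larger "scales" where the named entities live. A minimal sketch (the helper name is mine, not from the paper):

```python
def ngrams(tokens, max_n):
    """Collect all 1..max_n grams of a token list, i.e. its multiple scales."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

# A unigram histogram cannot tell "New York Times" from a text that merely
# mentions "new", "york", and "times" in scattered places; larger scales can.
print(ngrams(["new", "york", "times"], 3))
# → ['new', 'york', 'times', 'new york', 'york times', 'new york times']
```

LaP goes further than fixed n-grams (smoothing over the joint semantic-spatial space), but the intuition is the same: keep information at more than one scale.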

OK, so if you put it that way, then BOW (bag-of-words) is clearly not the best idea.

But we already treat text at different scales.

All together now: markup!

I checked the documentation for the latest MarkLogic release and found:

By default, MarkLogic Server assumes that any XML element constructor acts as a phrase boundary.
(Administrator’s Guide, 4.2, page 224)

Can someone supply the behavior for some of the other XML indexing engines? Thanks!