Similarity Search on Bregman Divergence: Towards NonMetric Indexing by Zhenjie Zhang, Beng Chin, Ooi Srinivasan, Parthasarathy Anthony, K. H. Tung. (2009)
Abstract:
In this paper, we examine the problem of indexing over non-metric distance functions. In particular, we focus on a general class of distance functions, namely Bregman Divergence [6], to support nearest neighbor and range queries. Distance functions such as KL-divergence and Itakura-Saito distance, are special cases of Bregman divergence, with wide applications in statistics, speech recognition and time series analysis among others. Unlike in metric spaces, key properties such as triangle inequality and distance symmetry do not hold for such distance functions. A direct adaptation of existing indexing infrastructure developed for metric spaces is thus not possible. We devise a novel solution to handle this class of distance measures by expanding and mapping points in the original space to a new extended space. Subsequently, we show how state-of-the-art tree-based indexing methods, for low to moderate dimensional datasets, and vector approximation file (VA-file) methods, for high dimensional datasets, can be adapted on this extended space to answer such queries efficiently. Improved distance bounding techniques and distribution-based index optimization are also introduced to improve the performance of query answering and index construction respectively, which can be applied on both the R-trees and VA files. Extensive experiments are conducted to validate our approach on a variety of datasets and a range of Bregman divergence functions.
This paper hasn’t been cited a lot (yet) but I think it will be of particular interest to the topic map community. Mostly because natural languages, the stuff most users use to describe/identify subjects, are inherently nonmetric.
Just as a placeholder for future conversation, it occurs to me that there is a performance continuum for subject-sameness tests. How complex must matching become with some number X of topics before performance degrades? Or perhaps better,, what are the comparative performance characteristics of different subject-sameness tests?