Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 23, 2011

Bed-tree: an all-purpose index structure for string similarity search based on edit distance

Filed under: Edit Distance,Indexing,Similarity,String Matching — Patrick Durusau @ 4:28 pm

Bed-tree: an all-purpose index structure for string similarity search based on edit distance by Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, and Divesh Srivastava.

Abstract:

Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the Bed-tree, a B+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.

The authors classify similarity queries as:

Type Example
range address in customer database
top-k results of search engine
all-pairs joins
pairs of proteins or genes

There are a couple of things that trouble me about the paper.

First:

6.3 Top-K construction

In many cases, top-k queries are more practical than range queries. However, existing indexing schemes with inverted lists do not naturally support such queries. To illustrate
the performance benefits of our proposals, we implemented a simple strategy with Flamingo, by increasing the range query threshold gradually until more than k string results
are found. Notice that we use the same Bed-tree structures to support all different types of queries. Thus, we skip the performance comparisons on index construction but focus on query processing efficiency. (emphasis added)

I am not sure what is meant by inverted lists “…do not naturally support …[top-k queries].” Inverted list structures are wildly popular among WWW search engines so I would like to know more about this notion of “…naturally support….”

Moreover, indexes aren’t simply used, they are created as well. Puzzling that we are left to wonder how long it will take to have a Bed-tree database that we can use.

Second, there are a couple of fairly serious mis-citation errors in the paper. The authors refer to “Flamingo” and “Mismatch” (from 2008) as comparisons but the articles cited: “[15] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257–266, 2008” and “C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944, 2008, respectively, are innocent of any such implementations.

C. Li is the faculty adviser for the Flamingo project, which has a new release since I mentioned it at: The FLAMINGO Project on Data Cleaning, but you don’t cite a project by picking a paper at random that doesn’t mention the project. (If you haven’t looked at the FLAMINGO project, its software and papers you really should.)

C. Xiao and company have a “mismatch” filter but it isn’t ever used as the name of an implementation.

Tracing the history of advances in computer science is easier or at least less time consuming if researchers don’t have to chase rabbits in the form of bad citations. Not to mention that if you aren’t cited properly, you may not get full credit for all the work you have actually done. Good citation practices are in everyone’s interest.

2 Comments

  1. Free PDF version of paper here: http://www.comp.nus.edu.sg/~ooibc/sigmod10bedtree.pdf

    Presentation slides (PDF) here: http://cse.vnit.ac.in/comad2010/pdf/SIGMOD%20Highlight%20Session/sigmod10-bed-slides.pdf

    Comment by larsga@garshol.priv.no — January 5, 2012 @ 6:44 am

  2. Lars: Thanks! I stayed logged in most places and don’t always notice when I have cited “protected” material.

    Comment by Patrick Durusau — January 5, 2012 @ 11:14 am

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress