Bed-tree: an all-purpose index structure for string similarity search based on edit distance by Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, and Divesh Srivastava.
Abstract:
Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the Bed-tree, a B+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.
The authors classify similarity queries as:
Type |
Example |
range |
address in customer database |
top-k |
results of search engine |
all-pairs joins |
pairs of proteins or genes |
There are a couple of things that trouble me about the paper.
First:
6.3 Top-K construction
In many cases, top-k queries are more practical than range queries. However, existing indexing schemes with inverted lists do not naturally support such queries. To illustrate
the performance benefits of our proposals, we implemented a simple strategy with Flamingo, by increasing the range query threshold gradually until more than k string results
are found. Notice that we use the same Bed-tree structures to support all different types of queries. Thus, we skip the performance comparisons on index construction but focus on query processing efficiency. (emphasis added)
I am not sure what is meant by inverted lists “…do not naturally support …[top-k queries].” Inverted list structures are wildly popular among WWW search engines so I would like to know more about this notion of “…naturally support….”
Moreover, indexes aren’t simply used, they are created as well. Puzzling that we are left to wonder how long it will take to have a Bed-tree database that we can use.
Second, there are a couple of fairly serious mis-citation errors in the paper. The authors refer to “Flamingo” and “Mismatch” (from 2008) as comparisons but the articles cited: “[15] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257–266, 2008” and “C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944, 2008, respectively, are innocent of any such implementations.
C. Li is the faculty adviser for the Flamingo project, which has a new release since I mentioned it at: The FLAMINGO Project on Data Cleaning, but you don’t cite a project by picking a paper at random that doesn’t mention the project. (If you haven’t looked at the FLAMINGO project, its software and papers you really should.)
C. Xiao and company have a “mismatch” filter but it isn’t ever used as the name of an implementation.
Tracing the history of advances in computer science is easier or at least less time consuming if researchers don’t have to chase rabbits in the form of bad citations. Not to mention that if you aren’t cited properly, you may not get full credit for all the work you have actually done. Good citation practices are in everyone’s interest.