Archive for the ‘Edit Distance’ Category

Encyclopedia of Distances

Thursday, November 3rd, 2016

Encyclopedia of Distances (4th edition) by Michel Marie Deza and Elena Deza.

Springer description:

This 4-th edition of the leading reference volume on distance metrics is characterized by updated and rewritten sections on some items suggested by experts and readers, as well a general streamlining of content and the addition of essential new topics. Though the structure remains unchanged, the new edition also explores recent advances in the use of distances and metrics for e.g. generalized distances, probability theory, graph theory, coding theory, data analysis.

New topics in the purely mathematical sections include e.g. the Vitanyi multiset-metric, algebraic point-conic distance, triangular ratio metric, Rossi-Hamming metric, Taneja distance, spectral semimetric between graphs, channel metrization, and Maryland bridge distance. The multidisciplinary sections have also been supplemented with new topics, including: dynamic time wrapping distance, memory distance, allometry, atmospheric depth, elliptic orbit distance, VLBI distance measurements, the astronomical system of units, and walkability distance.

Leaving aside the practical questions that arise during the selection of a ‘good’ distance function, this work focuses on providing the research community with an invaluable comprehensive listing of the main available distances.

As well as providing standalone introductions and definitions, the encyclopedia facilitates swift cross-referencing with easily navigable bold-faced textual links to core entries. In addition to distances themselves, the authors have collated numerous fascinating curiosities in their Who’s Who of metrics, including distance-related notions and paradigms that enable applied mathematicians in other sectors to deploy research tools that non-specialists justly view as arcane. In expanding access to these techniques, and in many cases enriching the context of distances themselves, this peerless volume is certain to stimulate fresh research.

Ransomed for $149 (US) per digital copy, this remarkable work should have a broad readership.

From the introduction to the 2009 edition:


Distance metrics and distances have now become an essential tool in many areas of Mathematics and its applications including Geometry, Probability, Statistics, Coding/Graph Theory, Clustering, Data Analysis, Pattern Recognition, Networks, Engineering, Computer Graphics/Vision, Astronomy, Cosmology, Molecular Biology, and many other areas of science. Devising the most suitable distance metrics and similarities, to quantify the proximity between objects, has become a standard task for many researchers. Especially intense ongoing search for such distances occurs, for example, in Computational Biology, Image Analysis, Speech Recognition, and Information Retrieval.

Often the same distance metric appears independently in several different areas; for example, the edit distance between words, the evolutionary distance in Biology, the Levenstein distance in Coding Theory, and the Hamming+Gap or shuffle-Hamming distance.

(emphasis added)

I highlighted that last sentence to emphasize that Encyclopedia of Distances is a static and undisclosed topic map.

While readers familiar with the concepts:

edit distance between words, the evolutionary distance in Biology, the Levenstein distance in Coding Theory, and the Hamming+Gap or shuffle-Hamming distance.

could enumerate why those merit being spoken of as being “the same distance metric,” no indexing program can accomplish the same feat.

If each of those concepts had enumerated properties, which could be compared by an indexing program, readers could not only discover those “same distance metrics” but could also discover new rediscoveries of that same metric.
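To make that concrete, here is a minimal Python sketch of what "enumerated properties, compared by an indexing program" might look like. Everything in it is an assumption for illustration: the property names (`domain`, `operations`, `cost`), the three sample entries, and the rule that identical property sets mean "the same metric" are mine, not anything disclosed in the Encyclopedia.

```python
from collections import defaultdict

# Hypothetical property records; the keys and values are illustrative only.
ENTRIES = {
    "edit distance between words": {
        "domain": "strings",
        "operations": frozenset({"insert", "delete", "substitute"}),
        "cost": "unit",
    },
    "Levenshtein distance": {
        "domain": "strings",
        "operations": frozenset({"insert", "delete", "substitute"}),
        "cost": "unit",
    },
    "Hamming distance": {
        "domain": "equal-length strings",
        "operations": frozenset({"substitute"}),
        "cost": "unit",
    },
}


def group_by_properties(entries):
    """Return groups of names whose enumerated properties are identical."""
    groups = defaultdict(list)
    for name, props in entries.items():
        # The full property set acts as the identity key for the subject.
        groups[frozenset(props.items())].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


print(group_by_properties(ENTRIES))
```

With records like these, a newly "rediscovered" metric would be merged automatically the moment its properties were entered, with no editor's judgment required.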

As it stands, readers must rely upon the undisclosed judgments of the Dezas and hope they continue to revise and extend this work.

When they cease to do so, successive editors will be forced to re-acquire the basis for adding new/re-discovered metrics to it.

PS: Suggestions of similar titles that deal with non-metric distances? I’m familiar with works that impose metrics on non-metric distances but that’s not what I have in mind. That’s an arbitrary and opaque mapping from non-metric to metric.

Damerau-Levenshtein Edit Distance

Saturday, September 22nd, 2012

Damerau-Levenshtein Edit Distance by Kevin Stern.

From the post:

The Damerau-Levenshtein distance admits all of the operations from the Levenshtein distance and further allows for swapping of adjacent characters, with the caveat that cost of two adjacent character swaps be at least the cost of a character deletion plus the cost of a character insertion (this caveat enables a fast dynamic programming solution to the problem). There is a sub-variant of the Damerau-Levenshtein distance known as the restricted edit distance which further specifies that no substring be modified more than once, which is primarily what I found when searching for algorithms for computing Damerau-Levenshtein distance, since, I presume, this sub-variant is a bit more straight forward to compute. In addition, I’ve had a difficult time finding a good explanation of the algorithm for computing the full Damerau-Levenshtein distance – hence, the motivation behind this blog post.

A variation on the Levenshtein edit distance algorithm that you may find useful.
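For readers who want to see the difference between the two variants Stern describes, here is a minimal Python sketch. It is not Stern's code. It assumes unit costs for every operation (which satisfies the caveat that two swaps cost at least a deletion plus an insertion), and the function names are mine. The first function is the restricted variant, often called optimal string alignment; the second is the full distance, using the familiar dynamic program that remembers the last row in which each character was seen.

```python
def restricted_distance(a, b):
    """Restricted edit distance: no substring is edited more than once."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            # Swap of two adjacent characters, counted as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def damerau_levenshtein(a, b):
    """Full Damerau-Levenshtein distance with unit costs."""
    last_row = {}                # last row of `a` in which each character occurred
    big = len(a) + len(b)        # sentinel larger than any real distance
    d = [[big] * (len(b) + 2) for _ in range(len(a) + 2)]
    for i in range(len(a) + 1):
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[1][j + 1] = j
    for i in range(1, len(a) + 1):
        last_col = 0             # last column in this row where the characters matched
        for j in range(1, len(b) + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_col
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,      # match or substitution
                d[i + 1][j] + 1,     # insertion
                d[i][j + 1] + 1,     # deletion
                # Swap, plus the edits needed between the swapped characters.
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


print(restricted_distance("ca", "abc"), damerau_levenshtein("ca", "abc"))
```

The two disagree on inputs such as "ca" and "abc": the restricted variant gives 3, while the full distance gives 2 (swap to "ac", then insert "b" between the two characters).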

I first saw this at DZone.

An Efficient Trie-based Method for Approximate Entity Extraction…

Monday, March 12th, 2012

An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints by Dong Deng, Guoliang Li, and Jianhua Feng. (PDF)

Abstract:

Dictionary-based entity extraction has attracted much attention from the database community recently, which locates substrings in a document into predefined entities (e.g., person names or locations). To improve extraction recall, a recent trend is to provide approximate matching between substrings of the document and entities by tolerating minor errors. In this paper we study dictionary-based approximate entity extraction with edit-distance constraints. Existing methods have several limitations. Firstly, they need to tune many parameters to achieve a high performance. Secondly, they are inefficient for large editdistance thresholds. We propose a trie-based method to address these problems. We partition each entity into a set of segments. We prove that if a substring of the document is similar to an entity, it must contain a segment of the entity. Based on this observation, we first search segments from the document, and then extend the matching segments in both entities and the document to find similar pairs. To facilitate searching segments, we use a trie structure to index segments and develop an efficient trie-based algorithm. We develop an extension-based method to efficiently find similar string pairs by extending the matching segments. We optimize our partition scheme and select the best partition strategy to improve the extraction performance. The experimental results show that our method achieves much higher performance compared with state-of-the-art studies.

Project page with author contact information. Code coming soon.

In case you are wondering about the path for the project including the word “taste:”

To address these problems, we propose a trie-based method for dictionary-based approximate entity extraction with edit distance constraints, called TASTE. TASTE does not need to tune parameters. Moreover TASTE achieves much higher performance, even for large edit-distance thresholds.

Is there a word for a person who creates acronyms? Acronymist perhaps?

Deeply interesting paper on the use of tries for entity extraction. Interesting due to its performance but also because of its approach.
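The core observation in the abstract (a similar substring must contain one of the entity's segments) can be sketched in a few lines of Python. This is only my illustration under stated assumptions, not the TASTE implementation: I assume each entity is cut into tau + 1 near-equal segments, so that tau edits cannot touch all of them, I use a plain dictionary where the authors use a trie, and I verify candidates by brute force rather than by their extension-based method. All function names are hypothetical.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Plain Levenshtein distance, two rows at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def partition(entity, tau):
    """Cut entity into tau + 1 contiguous segments; yields (offset, segment)."""
    base, extra = divmod(len(entity), tau + 1)
    pos = 0
    for s in range(tau + 1):
        length = base + (1 if s < extra else 0)
        yield pos, entity[pos:pos + length]
        pos += length


def extract(document, entities, tau):
    """Return (start, substring, entity) triples within edit distance tau."""
    index = defaultdict(list)    # segment -> [(entity, offset in entity)]
    for entity in entities:
        if len(entity) <= tau:
            raise ValueError("entity too short to give tau + 1 non-empty segments")
        for offset, segment in partition(entity, tau):
            index[segment].append((entity, offset))

    results = set()
    for segment, postings in index.items():
        start = document.find(segment)
        while start != -1:
            for entity, offset in postings:
                # The substring can begin at most tau positions away from
                # where the segment occurrence would place it.
                first = max(0, start - offset - tau)
                for s in range(first, start - offset + tau + 1):
                    for length in range(len(entity) - tau, len(entity) + tau + 1):
                        sub = document[s:s + length]
                        if len(sub) == length and edit_distance(sub, entity) <= tau:
                            results.add((s, sub, entity))
            start = document.find(segment, start + 1)
    return results


print(extract("the bostn office", ["boston"], 1))
```

Here "boston" is cut into "bos" and "ton"; the misspelled "bostn" still contains "bos", so it is found as a candidate and then confirmed at distance 1.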

You do remember that tries were what made the original e-version of the OED (Oxford English Dictionary) possible? It was extremely responsive on machines far less powerful than your smartphone.

One wonders how starting from a set of entities this approach would fare against the TREC legal archives? But that would be “seeding” the application with entities. It may be that being given queries against a dark corpus isn’t all that realistic.

Bed-tree: an all-purpose index structure for string similarity search based on edit distance

Friday, December 23rd, 2011

Bed-tree: an all-purpose index structure for string similarity search based on edit distance by Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, and Divesh Srivastava.

Abstract:

Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the Bed-tree, a B+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.

The authors classify similarity queries as:

Type             Example
range            address in customer database
top-k            results of search engine
all-pairs joins  pairs of proteins or genes

There are a couple of things that trouble me about the paper.

First:

6.3 Top-K construction

In many cases, top-k queries are more practical than range queries. However, existing indexing schemes with inverted lists do not naturally support such queries. To illustrate the performance benefits of our proposals, we implemented a simple strategy with Flamingo, by increasing the range query threshold gradually until more than k string results are found. Notice that we use the same Bed-tree structures to support all different types of queries. Thus, we skip the performance comparisons on index construction but focus on query processing efficiency. (emphasis added)

I am not sure what is meant by inverted lists “…do not naturally support …[top-k queries].” Inverted list structures are wildly popular among WWW search engines so I would like to know more about this notion of “…naturally support….”
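For what it is worth, the workaround the authors describe, widening a range query until enough results turn up, is easy to sketch. The Python below is my own illustration, not Flamingo or Bed-tree code: the range query is a brute-force scan, the function names are hypothetical, and I stop at "at least k" results rather than "more than k."

```python
def edit_distance(a, b):
    """Plain Levenshtein distance, two rows at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def range_query(strings, query, threshold):
    """All (distance, string) pairs within the threshold; a stand-in for an index lookup."""
    hits = []
    for s in strings:
        dist = edit_distance(s, query)
        if dist <= threshold:
            hits.append((dist, s))
    return hits


def top_k(strings, query, k):
    """Answer a top-k query by repeating range queries with a growing threshold."""
    # No edit distance can exceed the length of the longer string.
    limit = max([len(query)] + [len(s) for s in strings])
    threshold = 0
    while True:
        hits = range_query(strings, query, threshold)
        if len(hits) >= k or threshold >= limit:
            return sorted(hits)[:k]
        threshold += 1


addresses = ["main street", "maine street", "mane st", "elm street"]
print(top_k(addresses, "main stret", 2))
```

Each pass repeats the work of the one before it, which is presumably why the authors regard this as a poor fit for top-k queries.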

Moreover, indexes aren’t simply used, they are created as well. Puzzling that we are left to wonder how long it will take to have a Bed-tree database that we can use.

Second, there are a couple of fairly serious citation errors in the paper. The authors refer to “Flamingo” and “Mismatch” (from 2008) as comparisons but the articles cited: “[15] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257–266, 2008” and “C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944, 2008,” respectively, are innocent of any such implementations.

C. Li is the faculty adviser for the Flamingo project, which has had a new release since I mentioned it at: The FLAMINGO Project on Data Cleaning, but you don’t cite a project by picking a paper at random that doesn’t mention the project. (If you haven’t looked at the FLAMINGO project, its software and papers, you really should.)

C. Xiao and company have a “mismatch” filter but it isn’t ever used as the name of an implementation.

Tracing the history of advances in computer science is easier, or at least less time consuming, if researchers don’t have to chase rabbits in the form of bad citations. Not to mention that if you aren’t cited properly, you may not get full credit for all the work you have actually done. Good citation practices are in everyone’s interest.

Nearest Neighbor Search: the Old, the New, and the Impossible

Monday, October 10th, 2011

Nearest Neighbor Search: the Old, the New, and the Impossible, the MIT PhD thesis of Alexandr Andoni.

To be honest, it is the discovery of gems like this one that keeps me prowling journals, pre-publication sites, homepages, etc.

Alexandr walks the reader through a very complete review of the literature on nearest neighbor search, all the while laying a foundation for the significant progress he has made.

Not for the faint of heart but it promises to be well worth the effort.

Approximating Edit Distance in Near-Linear Time

Sunday, October 9th, 2011

Approximating Edit Distance in Near-Linear Time

Abstract:

We show how to compute the edit distance between two strings of length n up to a factor of $2^{\tilde{O}(\sqrt{\log n})}$ in $n^{1+o(1)}$ time. This is the first sub-polynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art $n^{1/3+o(1)}$ approximation. Previously, approximation of $2^{\tilde{O}(\sqrt{\log n})}$ was known only for embedding edit distance into $\ell_1$, and it is not known if that embedding can be computed in less than quadratic time.

Deeply important research for bioinformatics and text searching. The edit distance is “approximated.”

If you are not familiar with this area, Levenshtein Distance, in Three Flavors by Michael Gilleland is a nice starting point with source code in three languages.
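For orientation, the textbook recurrence behind the exact Levenshtein distance can be written as follows. The notation is mine, not Gilleland's: $a$ and $b$ are the two strings, of lengths $m$ and $n$; $D(i,j)$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$; and $[a_i \neq b_j]$ is 1 when the two characters differ and 0 otherwise.

$$
D(i,0) = i, \qquad D(0,j) = j, \qquad
D(i,j) = \min\bigl\{\, D(i-1,j) + 1,\; D(i,j-1) + 1,\; D(i-1,j-1) + [a_i \neq b_j] \,\bigr\}
$$

The answer is $D(m,n)$. Filling in the table takes time proportional to $mn$, which is the quadratic cost that a near-linear approximation is meant to avoid.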