Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 9, 2011

Approximating Edit Distance in Near-Linear Time

Filed under: Algorithms,Edit Distance,Levenshtein Distance — Patrick Durusau @ 6:44 pm

Approximating Edit Distance in Near-Linear Time

Abstract:

We show how to compute the edit distance between two strings of length n up to a factor of $2^{\~O(sqrt(log n))} in n^(1+o(1))$ time. This is the first sub-polynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art $n^(1/3+o(1))$ approximation. Previously, approximation of $2^{\~O(sqrt(log n))}$ was known only for embedding edit distance into $l_1$, and it is not known if that embedding can be computed in less than quadratic time.

Deeply important research for bioinformatics, text searching. The edit distance is “approximated.”

If you are not familiar with this area, Levenshtein Distance, in Three Flavors by Michael Gilleland is a nice starting point with source code in three languages.

3 Comments

  1. This is important for record linkage and deduplication, too, as much of the actuall processing time is spent in string comparison. So this is very useful. Thank you!

    Comment by larsga@garshol.priv.no — October 10, 2011 @ 1:19 am

  2. if the dataset is large, computational order for pairwise comparison is misleading, because there are algorithms that avoid explicit pairwise comparisons completely, lucene uses http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652

    Comment by rcmuir — October 10, 2011 @ 6:39 am

  3. @larsga – Your welcome! Another goodie (dissertation) from the same author a bit later today)

    @rcmuir – Thanks for the pointer!

    Comment by Patrick Durusau — October 10, 2011 @ 3:59 pm

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress