Approximating Edit Distance in Near-Linear Time

Approximating Edit Distance in Near-Linear Time

Abstract:

We show how to compute the edit distance between two strings of length n up to a factor of $2^{\~O(sqrt(log n))} in n^(1+o(1))$ time. This is the first sub-polynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art $n^(1/3+o(1))$ approximation. Previously, approximation of $2^{\~O(sqrt(log n))}$ was known only for embedding edit distance into $l_1$, and it is not known if that embedding can be computed in less than quadratic time.

Deeply important research for bioinformatics, text searching. The edit distance is “approximated.”

If you are not familiar with this area, Levenshtein Distance, in Three Flavors by Michael Gilleland is a nice starting point with source code in three languages.

3 Responses to “Approximating Edit Distance in Near-Linear Time”

  1. larsga@garshol.priv.no says:

    This is important for record linkage and deduplication, too, as much of the actuall processing time is spent in string comparison. So this is very useful. Thank you!

  2. rcmuir says:

    if the dataset is large, computational order for pairwise comparison is misleading, because there are algorithms that avoid explicit pairwise comparisons completely, lucene uses http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652

  3. Patrick Durusau says:

    @larsga – Your welcome! Another goodie (dissertation) from the same author a bit later today)

    @rcmuir – Thanks for the pointer!

Leave a Reply

You must be logged in to post a comment.