Archive for the ‘Levenshtein Distance’ Category

Jellyfish

Saturday, September 6th, 2014

Jellyfish by James Turk and Michael Stephens.

From the webpage:

Jellyfish is a python library for doing approximate and phonetic matching of strings.

String comparison:

  • Levenshtein Distance
  • Damerau-Levenshtein Distance
  • Jaro Distance
  • Jaro-Winkler Distance
  • Match Rating Approach Comparison
  • Hamming Distance

Phonetic encoding:

  • American Soundex
  • Metaphone
  • NYSIIS (New York State Identification and Intelligence System)
  • Match Rating Codex

You might want to consider the string matching offered by Duke (written on top of Lucene):

String comparators

  • Levenshtein
  • WeightedLevenshtein
  • JaroWinkler
  • QGramComparator

Simple comparators

  • ExactComparator
  • DifferentComparator

Specialized comparators

  • GeopositionComparator
  • NumericComparator
  • PersonNameComparator

Phonetic comparators

  • SoundexComparator
  • MetaphoneComparator
  • NorphoneComparator

Token set comparators

  • DiceCoefficientComparator
  • JaccardIndexComparator

Enjoy!

Damerau-Levenshtein Edit Distance

Saturday, September 22nd, 2012

Damerau-Levenshtein Edit Distance by Kevin Stern.

From the post:

The Damerau-Levenshtein distance admits all of the operations from the Levenshtein distance and further allows for swapping of adjacent characters, with the caveat that cost of two adjacent character swaps be at least the cost of a character deletion plus the cost of a character insertion (this caveat enables a fast dynamic programming solution to the problem). There is a sub-variant of the Damerau-Levenshtein distance known as the restricted edit distance which further specifies that no substring be modified more than once, which is primarily what I found when searching for algorithms for computing Damerau-Levenshtein distance, since, I presume, this sub-variant is a bit more straight forward to compute. In addition, I’ve had a difficult time finding a good explanation of the algorithm for computing the full Damerau-Levenshtein distance – hence, the motivation behind this blog post.

A variation on the Levenshtein edit distance algorithm that you may find useful.

I first saw this at DZone.

Damn Cool Algorithms: Levenshtein Automata

Saturday, September 22nd, 2012

Damn Cool Algorithms: Levenshtein Automata by Nick Johnson.

From the post:

In a previous Damn Cool Algorithms post, I talked about BK-trees, a clever indexing structure that makes it possible to search for fuzzy matches on a text string based on Levenshtein distance – or any other metric that obeys the triangle inequality. Today, I’m going to describe an alternative approach, which makes it possible to do fuzzy text search in a regular index: Levenshtein automata.

Introduction

The basic insight behind Levenshtein automata is that it’s possible to construct a Finite state automaton that recognizes exactly the set of strings within a given Levenshtein distance of a target word. We can then feed in any word, and the automaton will accept or reject it based on whether the Levenshtein distance to the target word is at most the distance specified when we constructed the automaton. Further, due to the nature of FSAs, it will do so in O(n) time with the length of the string being tested. Compare this to the standard Dynamic Programming Levenshtein algorithm, which takes O(mn) time, where m and n are the lengths of the two input words! It’s thus immediately apparrent that Levenshtein automaton provide, at a minimum, a faster way for us to check many words against a single target word and maximum distance – not a bad improvement to start with!

Of course, if that were the only benefit of Levenshtein automata, this would be a short article. There’s much more to come, but first let’s see what a Levenshtein automaton looks like, and how we can build one.

Not recent but I think you will enjoy the post anyway.

I first saw this at DZone.

Levenshtein distance in C++ and code profiling in R

Monday, March 26th, 2012

Levenshtein distance in C++ and code profiling in R by Dzidorius Martinaitis.

From the post:

At work, the client requested, if existing search engine could accept singular and plural forms equally, e. g. “partner” and “partners” would lead to the same result.

The first option – stemming. In that case, search engine would use root of a word, e. g. “partn”. However, stemming has many weaknesses: two different words might have same root, a user can misspell the root of the word, except English and few others languages it is not that trivial to implement stemming.

Levenshtein distance comes as the second option. The algorithm is simple – you have two words and you calculate the difference between them. You can insert, delete or replace any character, but it will cost you. Let’s imagine, an user enters “Levenstin distances” into search engine and expects to find revalent information. However, he just made 2 errors by misspeling the author’s name and he used plural form of “distance”. If search engine accepts 3 errors – the user will get relevant information.

The challenge comes, when you have a dictionary of terms (e. g. more that 1 mil.) and you want to get similar terms based on Levenshtein distance. You can visit every entry in the dictionary (very costly) or you can push dictionary into the trie. Do you need a proof for the cost? There we go:

Appreciate the comparison of approaches based on data but wondering why “professional” stemming, like you find in Solr was not investigated? Will post a comment asking and report back.

You are likely to encounter this sort of issue in almost all topic map authoring activities.

Nearest Neighbor Search: the Old, the New, and the Impossible

Monday, October 10th, 2011

Nearest Neighbor Search: the Old, the New, and the Impossible, the MIT PhD thesis of Alexandr Andoni.

To be honest, it is the discovery of gems like this one that keep me prowling journals, pre-publication sites, homepages, etc.

Alexandr walks the reader through a very complete review of the literature on nearest neighbor search, all the while laying a foundation for the significant progress he has made.

Not for the faint of heart but it promises to be well worth the effort.

Approximating Edit Distance in Near-Linear Time

Sunday, October 9th, 2011

Approximating Edit Distance in Near-Linear Time

Abstract:

We show how to compute the edit distance between two strings of length n up to a factor of $2^{\~O(sqrt(log n))} in n^(1+o(1))$ time. This is the first sub-polynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art $n^(1/3+o(1))$ approximation. Previously, approximation of $2^{\~O(sqrt(log n))}$ was known only for embedding edit distance into $l_1$, and it is not known if that embedding can be computed in less than quadratic time.

Deeply important research for bioinformatics, text searching. The edit distance is “approximated.”

If you are not familiar with this area, Levenshtein Distance, in Three Flavors by Michael Gilleland is a nice starting point with source code in three languages.