Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 17, 2014

Understanding Classic SoundEx Algorithms

Filed under: Algorithms,SoundEx,Subject Identity — Patrick Durusau @ 8:53 pm

Understanding Classic SoundEx Algorithms

From the webpage:

Terms that are often misspelled can be a problem for database designers. Names, for example, are variable length, can have strange spellings, and they are not unique. American names have a diversity of ethnic origins, which give us names pronounced the same way but spelled differently and vice versa.

Words too, can be misspelled or have multiple spellings, especially across different cultures or national sources.

To help solve this problem, we need phonetic algorithms which can find similar sounding terms and names. Just such a family of algorithms exist and have come to be called SoundEx algorithms.

A Soundex search algorithm takes a written word, such as a person’s name, as input, and produces a character string that identifies a set of words that are (roughly) phonetically alike. It is very handy for searching large databases when the user has incomplete data.

The method used by Soundex is based on the six phonetic classifications of human speech sounds (bilabial, labiodental, dental, alveolar, velar, and glottal), which are themselves based on where you put your lips and tongue to make the sounds.

The algorithm itself is fairly straight forward to code and requires no backtracking or multiple passes over the input word. In fact, it is so straight forward, I will start (after a history section) by presenting it as an outline. Further on, I will give C, JavaScript, Perl, and VB code that implements the two standard algorithms used in the American Census as well as an enhanced version, which is described in this article.

A timely reminder that knowing what is likely to be confused can be more powerful than the details of any particular confusion.

Even domain level semantics may be too difficult to capture. What if we were to capture only the known cases of confusion?

That would be a much smaller set than the domain in general and easier to maintain. (As well as to distinguish in a solution.)

Powered by WordPress