Software Development

Consider algorithms other than Soundex

Soundex is a phonetic normalization function that gets around the problem of sorting information by last names with different spellings but similar or identical sounds. Find out how alternate algorithms, such as Metaphone, have overcome some of Soundex's weaknesses.

Oracle SQL string functions have included the Soundex function for a long time. Soundex is a phonetic normalization function that was invented for the 1880 U.S. Census to get around the problem of sorting information by last names with different spellings but similar or identical sounds, such as Smith and Smythe. Since then, it's become one of the more popular ways of searching for similar sounding names in genealogy and government applications.

However, this algorithm was created when such algorithms were performed by hand, so there are some weaknesses that have been overcome by alternate algorithms. One of the problems is that it always takes the first letter, despite the sound. So, Cline and Kline are completely different codes. Soundex reduces all English sounds to six-integer values after the first letter, so the number of possible matches is quite large and might not be obvious.

One popular alternative to Soundex is Metaphone, which was first described by Lawrence Philips in the December 1990 edition of Computer Language magazine. It uses a rough phonetics algorithm that reduces names to 16 consonants. While keeping the accuracy the same, the ranges of values are greatly reduced. (Metaphone has actually been used as the basis of several spell checkers.) Philips has gone on to write Double Metaphone, which also deals with extended accent characters and more rules.

For this article, I'll stick to the original Metaphone specifications for brevity. Many people have ported the original version (in BASIC) to other languages, including Java and PL/SQL. It can be scary to look at the algorithms out there, so check out Listing A, which shows my different approach of encoding a set of rules. Listing B offers an example of its usage.

TechRepublic's Oracle newsletter covers automating Oracle utilities, generating database alerts, solving directed graph problems, and more. Automatically subscribe today!

0 comments

Editor's Picks