Implementing a Soundex Function

by John Midwinter

No, this is not grunting and grinding sound effects for all you Multimedia-heads - though it's a thought for later maybe? This is about making our database lightening fast at finding names that sound the same as the name we have in mind.

How about this scenario

I have a 300Mb database holding 300,000 contacts, I can identify that there are 3,439 records for surnames that sound like "SmiTh" in 0.77 seconds and list the 82 different names that sound alike ranging from "SMEETH" to "Synott" in 1.51 seconds; and note that this is not case sensitive. Compare this with a time on the same database of 2.95secs to find an exact match for "SMITH" on an indexed field of Upper_Surname even though there were fewer matching records (3,161). i.e. The sounding like method is twice as fast as an exact match! That surely cannot be right? But it is! These are real figures on a real database.

So, now I've got your interest huh?

This is all achieved by utilising the SOUNDEX method of indexing, encapsulating it in a DLL, using that as an Interbase UDF and modifying our Interbase table to use this key. That may sound hefty, but really is so simple and straightforward you should be able to modify your application to perform this miracle of lookup within just "a few short moments".

How it works

The SOUNDEX method of indexing was developed for use with the U.S. Federal census records since 1880 as a card index system for grouping similar sounding names. The implementation given here goes a step further to make it really compact and faster.

The SOUNDEX method works by taking each word to be indexed (surnames in this example) and reducing it to a short code based on the letters in the word according to a simple set of RULES:

  • Take the first letter.

  • Then, for each of the subsequent letters, score a code for the group it belongs to. There are 6 groups:

    Group             Letters
    1               BFPV
    2               CGJKQSXZ
    3               DT
    4               L
    5               MN
    6               R
    
  • Ignore ALL other characters (vowels, h,y,w, spaces and punctuation)

  • If the scoring code is the same as the last one, ignore it, otherwise add it to the Code

  • Do this until you get up to 4 characters. If you do not get to 4 characters, then add 0's up to 4 characters.

Thus:

SMITH becomes S5300
Smitt becomes S5330
sMythe becomes S5300

and hey! they all sound the same!

That's how it works. It is that simple. Though it sometimes it gets quite strange matches, it generally gets realistic results.

Thus all we need to do to use this is to have a new column in our database into which we generate the soundex code for the name and keep it up to date. Then, to find matches we just look for those names that have the same Soundex code. As this new column will be indexed, this search will be fast.

How to implement it

We need to do the following :

  1. Produce a function in a DLL that we can call from Interbase (and also call from our search routines)
  2. Declare the DLL and its function to Interbase as a User Defined Function (UDF)
  3. Add a column (via a domain) to our master table.
  4. Initialise the data in the master table to create the Soundex index codes.
  5. Put in some code to keep our database up to date.

The Function

The function is just a few lines of Delphi. The FULL source code for the DLL and its function SoundBytes (so called as it returns bytes not characters) is attached. I hope it makes easy reading. Notice that rather than use the Soundex 4 character result I have modified this to generate just 2 bytes (as a smallint) to give more compactness. This halves the size of the index used to lookup our values thus providing faster performance. I did this on realising that, though the natural Soundex Code has 4 characters, it only has 26 * 6 * 6 * 6 (=5,616) possible values which can be held by a small int. I experimented with improving the "accuracy" of the match by increasing the size of the Soundex code to 5 characters (which could still be held within 2 bytes (33,696) as a Word). This resulted in fewer matches being returned but, looking at the results, it seemed better to have a wider net for general use. However, it could be worth bearing in mind if your application has a lot of similar sounding words and you need to have finer selection.

For those not familiar with DLL writing, the "tricks" to pick up in the code are :

  • The heading as a library rather than a project.
  • Only USE SysUtils - we do not need anything else. By default, Delphi will put other stuff in there - knock it out, you do not need it and removing it will reduce the size of the DLL.
  • The declaration of the function with the MAGIC word cdecl at the end. That tells Delphi which way round to handle its parameters and how to compile it for Windows use (by pretending it's a "C" thingy - it's more complex than that - but this is the declaration we need).
  • The exports clause at the end.
  • The use of pointer type variables for the parameter and return (do not use Strings).

Notes in the code describe why some things have been done the way they have. The aim has been for minimal processing so that the function (which can be used in general queries) will operate as quickly as possible. Get it tighter and I'd be delighted to hear - it's always a challenge.

Compiling produces a 27k DLL which is a tiny thing for Windows but works.

Making it available to InterBase

For guaranteed results, put the SoundBytes.DLL into the BIN directory of Interbase (normally "Program filesBorlandinterbasebin") The necessity is that it is visible to Interbase. The bin directory ensures this but you are not restricted to this location if you need to put it elsewhere.

Then tell Interbase about it. Using WISQL, connect to your database, then give the command:

Declare external function SoundBytes Char(40) \What we will call it in Interbase
Returns Integer by VALUE \What it returns.
Entry_Point "SoundBts" \The name of the function in the DLL
Module_Name "Soundbytes.DLL" \The name of the DLL

Then commit. That's it.

You now have your own User Defined Function that you can call like any other function (e.g. Upper()) in InterBase.

A word of caution when using UDF's. Be aware that when a database has a UDF declared, then the DLL it calls MUST be visible to the SERVER when it attempts to connect to the database. If the DLL is not visible the database will not connect. This can be painful when setting up on another machine if you forget to bring the DLL with you! Do not underestimate this one. It can be a real Gotcha, you cannot even get into your database in order to remove the reference to it! (You're right, that's personal experience talking!)

Making InterBase use it

First we need somewhere to hold this index so give your table an extra column.

In my case the table was called "Main" and had the field [Surname] that I wanted to index. So from WISQL, I first created a domain (OK, you don't need to do this before adding a column, but it's the right way to do it!)

Create Domain SoundBytesDomain
SmallInt
Default 0
Not Null

Then use this domain to add a column to the table:

Alter Table Main
Add SoundBytes SoundBytesDomain

Now generate data into it:

Update Main
Set SoundBytes = GetSoundBytes(Surname)

That took about 10 mins on my 300,000 record database.

Now, put an index on this column

NB DO NOT put the index on the column before doing the data update. If you do the data update on a table of this size would take approximately 1 hour, 17mins and 44.4 secs (guess how I know this!)

Create Index SoundByteIndex on Main (SoundBytes)

At this point the database is ready to tryout!

If you have not committed as you go along now would be a good time to commit and save all this hard (?) work.

Try:

Select Distinct Surname from Main //use distinct to get the different names
where SoundBytes = GetSoundBytes("sMitH") //I use mixed case just to show it does not matter!

What you get depends on what you have got on your database. I got 82 records from my 300,000 record database in 1.54 seconds and was mightily impressed.

Now all we need to do is to make our database keep this new field up to date. We want the Server to do this and not bother us with it. This is an obvious task for a trigger.

We need this field to be updated whenever the Surname is changed. So trap on UPDATE:

Create Trigger SoundBytes_update for Main
BEFORE UPDATE
as begin
New. SoundBytes = GetSoundBytes(New.Surname);
end;

Note the use of the New.Fieldname syntax. We want the New index value to be set to the new value of the Surname. We could put in code to check to see if the Surname has changed, but the operation is so fast that I let it reset the index regardless of change to the Surname. You might already have a trigger on the UPDATE event (it is probably the most common trigger used) so you could just add code to that trigger.

Depending on how your Database operates, you might also want to trap on INSERT for cases where the whole record gets banged in one shot. Note that you can only trap for one event for each trigger, so an INSERT trigger would have to be separate.

That's it. All done. You now have a SoundBytes index maintained in your table. All you need to do is use it.

Using It

Though it's so impressive to use this new function from WISQL that you could probably play with it all day, we need to get down to using it from the real world of our applications. This is easy, you probably do not need me to tell you how to do this, but this is how I did it from Delphi.

Most commonly this index would be used to produce a list of records that have surnames sounding like our sample. Taking this as my example, the simple pseudo query is:

select surname from mytable where
soundbytes = the soundbyte of my sample.

Three approaches came to mind

  1. Pass the query as text to Interbase and get the server to do the conversion of the sample and return the result:

    'Select ID,Surname,FirstName  from Main where
    SoundBytes = GetSoundBytes( "'+ SearchPattern + '")'
    

    Now this should have worked exactly the same as if the command had been given from WISQL with the search pattern filled in (as it did above). But I found that I got GPs on this. I guessed that InterBase (remote) when addressed from a client was having problems with the path of the DLL? Anyway, I gave up on this as there was an easier solution.

  2. To overcome the path thing, use a stored procedure that would first call the UDF to get the search pattern as a small int and then use this in a Select command. This worked but I did not like calling a stored proc to call a UDF to return a select command. It did not "feel right"

  3. Use a normal select query, but first get the SoundByte of the search pattern and pass that:

    'Select Surname from Main where
    SoundBytes = '+ SearchIndex + ')'
    

    This worked best of all (like it worked!) and had the neatness that it could be held by a TQuery and be pre-prepared, thus just needing to change the parameter each time the query was called.

To get the SoundByte index, I called the same DLL from my application. To do this the function must be declared in the interface section of the unit from which you want to call the DLL:

function SountBts(Name : PChar) : SmallInt ; cdecl; external 'SoundBytes.DLL';

Note the use of "cdecl" again and the declaration of the DLL file in which it is to be found. In this example Delphi will need to find the DLL within its default paths. So :

  • a copy should be it in the same directory as your executable
  • or it should be in a common directory that Delphi and InterBase can see
  • or you could put the Interbase/Bin directory in your default path
  • · or you could give the full path in the declaration - but this is messy - and "we don't like hard paths in OUR applications" do we?

Then it can be used it just like any other function. Thus:

SoundIndex := SoundBts(SearchPattern);

works to put the Number into SoundIndex (which is a SmallInt) for the SearchPattern variable which is a PChar.

In my test application, I had a TQuery, attached to a DataBase component which was set to my InterBase Database. I set the query strings:

Select Surname,Contact_ID
from Main
where Soundbytes = :SXCode

By defining my strings thus, I automatically got a parameter of SXCode.

Using the Params editor I set this to a type of SmallInt

In the FormCreate event I prepared the Query:

SoundBytesQuery.Prepare

so that it would go faster when called

Then to activate the query, I just needed to put in the SoundBytes value into the Parameter and activate the query:

With SoundByteQuery do
begin
   Active := False;
   ParamByName('SXCode').AsSmallInt := SoundBts(PChar(SearchString.Text));
   Active := True;
end;

Note that this calls the external function and that the string passed to it is first TypeCast as a PChar. That's it. All that was left was to play and be impressed!

Conclusion

Well that was all pretty easy wasn't it? Now you can have that snappy response to what to the observer may seem a daunting task and bask in the glory of InterBase at it's best :

Perplexed User : "Let me get this straight, you are searching 300,000 surnames of my database and comparing each with your pattern to see if it sounds the same and then you show all of those?"

Happy Developer : "Yup." (well it's near enough, why break the spell?) Imressed User: "In that time?"

Happier Developer : "Nothing up my sleeve, I promise."

Very impressed User : " It's amazing!"

Happiest Developer :"Just something I knocked up! By the way, here is my latest invoice"

Now don't you agree that is how to make a database sound better!

Source code of SoundBytes.DPR to create SoundBytes.DLL

{This DLL is for use as a UDF for Interbase but may be called by any
application as a normal DLL it is based on the Soundex method of compressing
names released to the public domain by Natural Systems for free
distribution This version compresses the sound of the name into a 2 byte
word rather than the 4 characters of the original version in order for
improved performance.

JDM 19 Nov 97}

library SoundBytes;

uses
SysUtils;

function SoundBts(Name : PChar) : SmallInt; cdecl;
Const
{This table gives the SoundEX SCORE for all characters Upper and Lower Case
hence no need to convert. This is faster than doing an UpCase on the whole input string
The 5 NON Chars in middle are just given 0}

SoundExTable : Array[65..122] Of Byte
//A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ / ] ^ _ '
=(0,1,2,3,0,1,2,0,0,2,2,4,5,5,0,1,2,6,2,3,0,1,0,2,0,2,0,0,0,0,0,0,
//a b c d e f g h i j k l m n o p q r s t u v w x y z
0,1,2,3,0,1,2,0,0,2,2,4,5,5,0,1,2,6,2,3,0,1,0,2,0,2);

Var
i, l, s, SO, x : Byte;
Multiple : Word;

begin
If Name >  ''                           //do nothing if nothing is passedthen begin
   Result := Ord(UpCase(Name[0]));       //initialise to first character
   SO := 0;                              //initialise last char as 0
   Multiple := 26;                       //initialise to 26 char of alphabet
   l := Pred(StrLen(Name));              //get into var to save repeating function
   For i := 1 to l do                    //for each char of input str
   begin
      s := Ord(name[i]);                  //*
      If (s > 64) and (s < 123)           //see notes * below
      then begin
      x := SoundExTable[s];             //get soundex value
      If (x > 0)                        //it is a scoring char
      AND (x <> SO)                     //is different from previous char
      then begin
         Result := Result + (x * Multiple); //avoid use of POW as it needs maths unit
         If (Multiple = 26 * 6 * 6)      //we have done enough (NB compiles to a const
         then break;                          //We have done, so leave loop
         Multiple := Multiple * 6;
         SO := x;                        //save for next round
      end;                              // of if a scoring char
      end;                                //of if in range of SoundEx table
   end;                                  //of for loop
end else result := 0;
end;                                      //of function SoundBts

exports
SoundBts;
begin
end.

{*Notes  I originally used a Try/except block for getting the index code
but found that although SoundexTable[45] will error as out of range,
SoundExTable[Ord(Name[i])] where OrdName[i] was 45 returned 0, or what
ever happened to be at that ILLEGAL address. It did NOT error and fall
out of the block! So, I put in checks that letter was in range - I could
have kept my try/except by using using SoundExTable[s], but went for this
way instead}