Anastasia and the team have been using some real-world data to test algorithms for performance at identifying obscured matches, and at avoiding false positives. We hope this is a useful reference for anyone embarking on a project for which they need the help of an algorithm, or for anyone (like us) who finds this just fascinating!
Choosing the best “fuzzy matching” algorithms to make human-like decisions
Data is often messy. Frequently, the same basic information gets written in many ways, which makes it difficult to see that two similar looking pieces of data are actually talking about the same thing.
Compare these two addresses:
Signals Ltd, 47A Market Pl, Henley-on-Thames RG9 2AA
Signals Limited, 47A Market Place, Henley-on-Thames, Oxfordshire, RG9 2AA
Both point to the same company, but a simple piece of software/string matching program would not be able to see it. As soon as the comparison reaches the second word, the match fails. This can lead to duplication of information, or errors, depending on how the data is used.
Detecting matching data
What is required in these cases is “fuzzy matching”, whereby the result that we get back from the match is not a binary result of pass or fail, but a percentage score of how likely the two strings of text are to be a match.
There are several algorithms that provide this sort of matching – fortunately here is a site which can quickly run this sort of match for many of them, and output a score: https://asecuritysite.com/forensics/simstring.
Below is a list of algorithms. Each shows the % likelihood that the two addresses we mentioned earlier are the same. As you can see, some algorithms are less certain of this than others.
- 87% – Jaro Winkler
- 86% – Smith-Waterman Gotoh
- 86% – Smith-Waterman Gotoh Windowed Affine
- 80% – Smith-Waterman
- 79% – Jaro
- 73% – QGrams Distance
- 71% – Levenshtein
- 71% – NeedlemanWunch
- 71% – Chapman Length Deviation
- 62% – Overlap Coefficient
- 59% – Block Distance
- 59% – Cosine Similarity
- 36% – Euclidean Distance
In this case, the Smith-Waterman and the Jaro Winkler algorithms provide a high degree of confidence that these two addresses are the same.
Detecting non-matching data
Another requirement of our algorithm is that it should be able to identify when we are not looking at the same company, even though the address might be very similar.
Here are two addresses that do not match, though some elements are the same or similar:
Signals Ltd, Broadgates, 47A Market Pl, Henley-on-Thames RG9 2AA
An Example Company, 999 Market Place, Henley-on-Thames, Oxfordshire, RG9 2AA
When we compare these addresses, the algorithms produce the following results:
- 84% – Chapman Length Deviation
- 69% – Jaro
- 69% – Jaro Winkler
- 61% – NeedlemanWunch
- 49% – Levenshtein
- 49% – Smith-Waterman Gotoh
- 49% – Smith-Waterman Gotoh Windowed Affine
- 47% – QGrams Distance
- 43% – Smith-Waterman
- 33% – Overlap Coefficient
- 32% – Block Distance
- 32% – Cosine Similarity
- 17% – Euclidean Distance
The Smith Waterman algorithms are only 49% confident that the two strings are a match, however the Jaro Winkler returns a relatively high confidence rate of 69% – a false positive.
Not only is accurate matching important, but so is our confidence that we are not getting false positives.
We ran through a set of tests with some real-world data to see which algorithms did well at matching correctly without returning too many false positives. The Smith Waterman algorithms seemed best at this task, closely followed by Overlap, Block, Cosine, and Q grams.
Algorithm speed matters
Another important factor is the speed of implementation.
Using the SimMetrics library (https://www.nuget.org/packages/SimMetrics/) we tested a dataset of 4000 items to check for duplicates amongst them, using the top five algorithms from the list above. This meant doing around 16 million comparisons.
The top algorithms we chose produced a wide range of timings:
- Smith-Waterman: 8 minutes 19 seconds
- OverlapCoefficient: 1 minute 40 seconds
- Block Distance: 1 minute 32 seconds
- Cosine Similarity: 1 minute 51 seconds
- QGrams Distance: 38 minutes 20 seconds (!)
Again, we saw a large range of speeds from these different algorithms. Some, like Smith-Waterman are accurate but slow compared to other algorithms which are similar quality. Though less certain to produce high % likelihoods of a match, Overlap Coefficient is exceedingly fast.
When set to a fairly low threshold of match detection like 60%, this makes Overlap Coefficient both a time-efficient and effective way to identify duplicates in addresses.
Depending on your requirements, you may choose to focus on % likelihoods of matching, not-matching or algorithm speed to choose the best algorithm for the task at hand. For many projects, all of these factors come into play; In such cases, running tests like the ones above will help you identify the best detection algorithm for the job. Happy testing!