What is fuzzy matching?

Fuzzy matching is also known as “Approximate String Matching”, and it is useful in cases where you want to compare or match two strings that are not identical.

Typical Input: Two strings of text

Typical Output: A similarity score

If the score exceeds some user-defined threshold, the two strings are considered a match. This match can be used in whatever decision-making logic the user wants!

Use Cases

  • Data merging, where the “by” variables are quasi-identical in both datasets but might differ slightly, such as name, date of birth, address, etc. Manually resolving these differences would be infeasible.
  • Connecticut Zoning project: identifying names of zoning districts in two different documents, where convention can differ
  • Optical Character Recognition software: can often make tiny transcription errors that throw off exact matches
  • Twitter Harassment Detection: cleaning my dataset by removing Tweets that are similar enough to bias the sample (e.g. “fuzzy deduplication”)

How does fuzzy matching work?

You can generally think of a similarity score as the number of edits that need to be made to one string of text in order to convert it to the other string of text. What an algorithm considers to be an “edit” varies, but generally is some combination of insertions, deletions, substitutions, or transpositions of characters. For example:

  • bed to red requires 1 substitution
  • HAHA to HAHAHA requires 2 insertions
  • happily to happy requires 2 deletions
  • blade to baled requires 2 transpositions

A few common algorithms include:

  • The Jaro-Winkler similarity score, which considers only transpositions
  • Levenshtein distance, which considers insertions, deletions, or substitutions
  • Hamming distance, which considers only substitutions
  • Damerau-Levenshtein distance, which includes all four of these kinds of edits

The similarity score is a function not just of the edits, but also of the number of common characters. “far” to “bar” and “revenue” to “revenge” both require only one edit, but the latter pair would rightfully receive a higher fuzzy matching score because a larger proportion of the word is identical.
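To make the edit counts and normalization concrete, here is a minimal sketch of plain Levenshtein distance plus one possible 0–100 score. The `levenshtein` and `similarity` names and the normalization choice (penalizing edits against the longer string's length) are illustrative assumptions, not the exact formula any particular library uses:

```python
# Plain Levenshtein distance: insertions, deletions, and substitutions only.
# (Transpositions, as in Damerau-Levenshtein, are not counted here.)
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))          # edit distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

# One way to normalize edits into a 0-100 score: divide by the longer length,
# so the same number of edits hurts short strings more than long ones.
def similarity(a: str, b: str) -> int:
    return round(100 * (1 - levenshtein(a, b) / max(len(a), len(b))))

print(levenshtein('bed', 'red'))              # 1 substitution
print(levenshtein('HAHA', 'HAHAHA'))          # 2 insertions
print(similarity('far', 'bar'))               # 67
print(similarity('revenue', 'revenge'))       # 86
```

Note how “far”/“bar” and “revenue”/“revenge” are both one edit apart, but the longer pair scores higher.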

thefuzz package for fuzzy matching

Disclaimer: This is not the definitive or best way to do fuzzy matching. Graham led a Data Science Series session on fuzzy matching, the resources for which are found here.

This package mainly relies on matching rules developed by SeatGeek and, along with other algorithms like Jaro-Winkler and Levenshtein distance, should be thought of as another tool for the toolkit.

Motivation

  • Simple: takes in two strings of any length and outputs a similarity score between 0 and 100
  • Fast, especially with the right packages installed
  • Flexible, with 4 different techniques for calculating similarity, so there is room to experiment for various use cases

Background and not-too-onerous Setup

  • Developed by SeatGeek, originally called fuzzywuzzy
  • Requires Python 2.7 or greater, plus the standard library’s difflib module and, optionally, python-Levenshtein for a speedup
  • See their repo and excellent walkthrough guide for details on installation and usage
  • An R library called fuzzywuzzyr does exist that ports this Python code
  • Note that fuzz.ratio and fuzz.partial_ratio are case-sensitive, while the token-based functions lowercase their inputs by default!
from thefuzz import fuzz

1. Ratio

This is the simplest function, which replicates the matching procedure used by difflib (similar to Levenshtein distance). It essentially computes the number of edits needed to convert string 1 into string 2, as a ratio of the length of the larger string.

Best for short, simple examples. Struggles when word order is inverted or small words/sets of characters are missing.

fuzz.ratio('cart', 'cast')
75
fuzz.ratio('United States', 'United States of America')
70
fuzz.ratio('The quick brown fox jumped over the yellow dog', 'The yellow dog jumped over the quick brown fox')
63
fuzz.ratio('A canal, a plan, a man', 'A man, a plan, a canal. Panama')
69
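For intuition, the scores above can be approximated with only the standard library’s difflib, whose SequenceMatcher computes 2·M/T (M = matching characters, T = total length of both strings). This is close to fuzz.ratio but is a sketch, and won’t agree with it on every input:

```python
from difflib import SequenceMatcher

# Rough stand-in for fuzz.ratio: SequenceMatcher's ratio scaled to 0-100.
def simple_ratio(s1: str, s2: str) -> int:
    return round(100 * SequenceMatcher(None, s1, s2).ratio())

print(simple_ratio('cart', 'cast'))                            # 75
print(simple_ratio('United States', 'United States of America'))  # 70
```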

2. Partial Ratio

This function developed by SeatGeek is great for matching shorter substrings within longer strings. If the shorter string is length m and the longer string is length n, this calculates the similarity (using the previous Ratio function) between the shorter string and the best substring of length m within the longer string.

fuzz.partial_ratio('cart', 'cast')
75
fuzz.partial_ratio('United States', 'United States of America')
100
fuzz.partial_ratio('The quick brown fox jumped over the yellow dog', 'The yellow dog jumped over the quick brown fox')
63
fuzz.partial_ratio('A canal, a plan, a man', 'A man, a plan, a canal. Panama')
82
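The idea behind partial matching can be sketched with a simple sliding window: score the shorter string against every length-m window of the longer one and keep the best. This is an illustrative approximation; thefuzz chooses candidate alignments more cleverly via matching blocks, so results can differ:

```python
from difflib import SequenceMatcher

# Sketch of the partial-ratio idea: slide a window the length of the
# shorter string across the longer one, keep the best plain ratio.
def partial_ratio_sketch(s1: str, s2: str) -> int:
    shorter, longer = sorted((s1, s2), key=len)
    m = len(shorter)
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + m]).ratio()
        for i in range(len(longer) - m + 1)
    )
    return round(100 * best)

print(partial_ratio_sketch('United States', 'United States of America'))  # 100
```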

3. Token Sort Ratio

This function developed by SeatGeek splits two sequences of text into their individual tokens (roughly, the whitespace-separated words, with punctuation stripped), sorts them alphabetically, and then rejoins them together. It then calculates the Ratio above on these new strings. This is great for order-agnostic matching where you don’t care about inverted words/characters.

fuzz.token_sort_ratio('cart', 'cast')
75
fuzz.token_sort_ratio('United States', 'United States of America')
70
fuzz.token_sort_ratio('The quick brown fox jumped over the yellow dog', 'The yellow dog jumped over the quick brown fox')
100
fuzz.token_sort_ratio('A canal, a plan, a man', 'A man, a plan, a canal. Panama')
85

4. Token Set Ratio

This last function developed by SeatGeek is similar to the previous one, but a little more complicated, creating ‘intersection’ and ‘remainder’ groups after tokenizing. Generally, it’s more flexible and useful when one sequence has lots of extra words/tokens that don’t affect the meaning but mess with the sorting process of token sort ratio. Refer to SeatGeek’s explanation and examples, or just try yourself to see what works!

fuzz.token_set_ratio('cart', 'cast')
75
fuzz.token_set_ratio('United States', 'United States of America')
100
fuzz.token_set_ratio('The quick brown fox jumped over the yellow dog', 'The yellow dog jumped over the quick brown fox')
100
fuzz.token_set_ratio('A canal, a plan, a man', 'A man, a plan, a canal. Panama')
100

Experiment and combine! You can try setting arbitrary score thresholds. For example:

  • If token set ratio > 85, this constitutes a match
  • If ratio and token sort ratio both > 75, this constitutes a match

Other Resources Shared

  • rapidfuzz package, a faster alternative to thefuzz: https://medium.com/mlearning-ai/all-about-rapidfuzz-string-similarity-and-matching-cd26fdc963d8
  • Bringing in NLP techniques from the nltk package - particularly useful for “needle in a haystack” applications: https://stackoverflow.com/questions/17740833/checking-fuzzy-approximate-substring-existing-in-a-longer-string-in-python/31433394#31433394
  • Jess Kelly’s technical report on linking program data: https://www.ojp.gov/pdffiles1/bjs/grants/239536.pdf