from thefuzz import fuzz
What is fuzzy matching?
Fuzzy matching is also known as “Approximate String Matching”, and it is useful in cases where you want to compare or match two strings that are not identical.
Typical Input: Two strings of text
Typical Output: A similarity score
If the score exceeds some user-defined threshold, the two strings are considered a match. This match can be used in whatever decision-making logic the user wants!
Use Cases
- Data merging, where the
by
variables are quasi-identical in both datasets but might differ slightly, such as name, date of birth, address, etc. Manually resolving these differences would be unfeasible. - Connecticut Zoning project: identifying names of zoning districts in two different documents, where convention can differ
- Optical Character Recognition software: can often make tiny transcription errors that throw off exact matches
- Twitter Harassment Detection: cleaning my dataset by removing Tweets that are similar enough to bias the sample (e.g. “fuzzy deduplication”)
How does fuzzy matching work?
You can generally think of a similarity score as the number of edits that need to be made to one string of text in order to convert it to the other string of text. What an algorithm considers to be an “edit” varies, but generally is some combination of insertions, deletions, substitutions, or transpositions of characters. For example:
- bed to red requires 1 substitution
- HAHA to HAHAHA requires 2 insertions
- happily to happy requires 2 deletions
- blade to baled requires 1 transposition
A few common algorithms include: - The Jaro-Winkler similarity score, which considers only transpositions - Levenshtein distance, which considers insertions, deletions, or substitutions - Hamming distance, which considers only substitutions - Damerau-Levenshtein distance, which includes all four of these kinds of edits
Similarity score will be a function not just of the edits, but also of the number of common characters. “far” to “bar” and “revenue” to “revenge” both require only one edit, but the latter two words would rightfully receive a higher fuzzy matching score because a larger proportion of the word is identical.
thefuzz
package for fuzzy matching
Disclaimer: This is not the definitive or best way to do fuzzy matching. Graham led a Data Science Series session on fuzzy matching, the resources for which are found here.
This package mainly relies on rules developed by SeatGeek, and along with other algorithms like Jaro-Winkler and Levenshtein distance, should be thought of as another tool for the toolkit.
Motivation
- Simple, takes in two strings of any length and output a similarity score between 0 and 100
- Fast, especially with right packages installed
- Flexible, with 4 different techniques for calculating similarity, so room to experiment for various use cases
Background and not-too-onerous Setup
- Developed by SeatGeek, originally called
fuzzywuzzy
- Requires Python 2.7 or greater, plus the
difflib
package and optionallypython-Levenshtein
for speedup - See their repo and excellent walkthrough guide for details on installation and usage
- An R library called
fuzzywuzzyr
does exist that ports this Python code - Note that all of these methods are case-sensitive!
1. Ratio
This is the simplest function, which replicates the matching procedure used by difflib
(similar to Levenshtein distance). It essentially computes the number of edits needed to convert string 1 into string 2, as a ratio of the length of the larger string.
Best for short, simple examples. Struggles when word order is inverted or small words/sets of characters are missing.
'cart', 'cast') fuzz.ratio(
75
'United States', 'United States of America') fuzz.ratio(
70
'The quick brown fox jumped over the yellow dog', 'The yellow dog jumped over the quick brown fox') fuzz.ratio(
63
'A canal, a plan, a man', 'A man, a plan, a canal. Panama') fuzz.ratio(
69
2. Partial Ratio
This function developed by SeatGeek is great for matching shorter substrings within longer strings. If the shorter string is length m
and the longer string is length n
, this calculates the similarity (using the previous Ratio function) between the shorter string and the best substring of length m
within the longer string.
'cart', 'cast') fuzz.partial_ratio(
75
'United States', 'United States of America') fuzz.partial_ratio(
100
'The quick brown fox jumped over the yellow dog', 'The yellow dog jumped over the quick brown fox') fuzz.partial_ratio(
63
'A canal, a plan, a man', 'A man, a plan, a canal. Panama') fuzz.partial_ratio(
82
3. Token Sort Ratio
This function developed by SeatGeek splits two sequences of text into their individual tokens (e.g. letters, digits, or punctuation), sorts them alphabetically, and then rejoins them together. It then calculates the Ratio above on these new strings. This is great for order-agnostic matching where you don’t care about inverted words/characters.
'cart', 'cast') fuzz.token_sort_ratio(
75
'United States', 'United States of America') fuzz.token_sort_ratio(
70
'The quick brown fox jumped over the yellow dog', 'The yellow dog jumped over the quick brown fox') fuzz.token_sort_ratio(
100
'A canal, a plan, a man', 'A man, a plan, a canal. Panama') fuzz.token_sort_ratio(
85
4. Token Set Ratio
This last function developed by SeatGeek is similar to the previous one, but a little more complicated, creating ‘intersection’ and ‘remainder’ groups after tokenizing. Generally, it’s more flexible and useful when one sequence has lots of extra words/tokens that don’t affect the meaning but mess with the sorting process of token sort ratio. Refer to SeatGeek’s explanation and examples, or just try yourself to see what works!
'cart', 'cast') fuzz.token_set_ratio(
75
'United States', 'United States of America') fuzz.token_set_ratio(
100
'The quick brown fox jumped over the yellow dog', 'The yellow dog jumped over the quick brown fox') fuzz.token_set_ratio(
100
'A canal, a plan, a man', 'A man, a plan, a canal. Panama') fuzz.token_set_ratio(
100
Experiment and combine! Can try setting arbitrary score thresholds. For example:
- If token set ratio > 85, this constitutes a match
- If ratio and token sort ratio both > 75, this constitutes a match