# For installation help with this package, visit this link: https://github.com/seatgeek/thefuzz
from thefuzz import fuzz


def intersect_fuzzy(a, b, threshold=70):
    '''
    Function:
    The set.intersection() method built into Python finds the intersection of two sets.
    For example, the intersection of {"apple", "pear"} and {"pear", "strawberry"} is {"pear"}.
    This function does the same, but rather than exact matching, it
    does intersection based on fuzzy matching with Levenshtein distance.
    For example, it handles non-exact cases like {"apple", "pear"} and {"pears", "strawberry"},
    where "pear" and "pears" count as a match.
    The result contains the matching strings from the smaller of the two sets.
    Inputs:
    a and b (sets): two sets of strings
    threshold (integer): the threshold for considering two strings to be a match, from 0 to 100.
    (Trial and error with various examples is the best way to set the threshold)
    '''
    # Iterate over the smaller set, so the result is drawn from it
    if len(a) > len(b):
        a, b = b, a
    c = set()
    for x in a:
        for y in b:
            # fuzz.ratio returns a similarity score from 0 to 100
            if fuzz.ratio(x, y) >= threshold:
                c.add(x)
                # One match is enough; move on to the next x
                break
    return c
Fuzzy Intersect
Please use all code samples responsibly. These are samples and will likely require adjustments to work correctly for your specific needs. Read through the documentation and comments to understand any caveats or limitations of the code or data, and follow up with the code author or Code Library admins if you have questions on how to adapt the sample to your specific use case.
Purpose: Given two sets of words, characters, or strings, Python's built-in set.intersection() method finds their intersection using exact matches. This function does the same for non-exact matches using fuzzy matching (e.g. matching “apple” and “apples”).
Data: Any two sets of strings being used in text analysis, data cleaning, or matching tasks.
Author: Judah Axelrod (November 2022)
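Example: The sketch below illustrates the behavior described under Purpose and how the threshold changes the result. So that it runs without installing thefuzz, it swaps in the standard library's difflib.SequenceMatcher as the scorer; that substitution, the `similarity` helper, the `scorer` parameter, and the name `intersect_fuzzy_sketch` are all assumptions made for illustration and are not part of the original sample. Scores from difflib may differ from those of fuzz.ratio, so treat the thresholds here as illustrative only.

```python
from difflib import SequenceMatcher


def similarity(x, y):
    """Stand-in scorer (assumption): difflib's ratio scaled to 0-100."""
    return 100 * SequenceMatcher(None, x, y).ratio()


def intersect_fuzzy_sketch(a, b, threshold=70, scorer=similarity):
    """Same loop as intersect_fuzzy, with a pluggable scorer (hypothetical parameter)."""
    # Iterate over the smaller set, so the result is drawn from it
    if len(a) > len(b):
        a, b = b, a
    # Keep x if at least one y in the other set scores at or above the threshold
    return {x for x in a if any(scorer(x, y) >= threshold for y in b)}


fruits = {"apple", "pear"}
basket = {"pears", "strawberry", "kiwi"}

# "pear" vs "pears" scores about 88.9 with difflib, so it passes the default 70
result_default = intersect_fuzzy_sketch(fruits, basket)

# Argument order does not matter: matches always come from the smaller set
result_swapped = intersect_fuzzy_sketch(basket, fruits)

# A stricter threshold of 90 rejects the "pear"/"pears" pair
result_strict = intersect_fuzzy_sketch(fruits, basket, threshold=90)

print(result_default, result_swapped, result_strict)
```

Because matches are reported from the smaller set, the output contains "pear" rather than "pears" in both call orders. Trying a few thresholds on representative strings, as in the last call, is one way to carry out the trial and error the docstring recommends.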