Fuzzy Intersect

Reminder

Please use all code samples responsibly. These are samples and will likely require adjustments to work correctly for your specific needs. Read through the documentation and comments to understand any caveats or limitations of the code and/or data, and follow up with the code author or Code Library admins if you have questions on how to adapt the sample to your specific use case.

Purpose: Given two sets of words, characters, or strings, Python's built-in set.intersection() method finds the exact intersection of those sets. For non-exact matches, this function does the same using fuzzy matching (e.g. matching “apple” and “apples”).

Data: Any two sets of strings being used in text analysis, data cleaning, or matching tasks.

Author: Judah Axelrod (November 2022)

# For installation help with this package, visit this link: https://github.com/seatgeek/thefuzz
from thefuzz import fuzz

def intersect_fuzzy(a, b, threshold=70):
    '''
    Function:
        The set.intersection() method built into Python finds the intersection of two sets.
        For example, the intersection of {"apple", "pear"} and {"pear", "strawberry"} is {"pear"}.

        This function does the same, but rather than exact matching, it
        matches strings using fuzz.ratio, a 0-100 similarity score based on Levenshtein distance.
        For example, {"apple", "pear"} and {"pears", "strawberry"} have no exact intersection,
        but "pear" and "pears" score above the default threshold of 70, so the result is {"pear"}.
    Inputs:
        a and b (sets): two sets of strings
        threshold (integer): the minimum similarity score, from 0 to 100, for two strings to count as a match.
            (Trial and error with various examples is the best way to set the threshold)
    Returns:
        A set of the strings from the smaller input set (or from a, if both sets are the same size)
        that match at least one string in the other set.
    '''
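    # Loop over the smaller set, so the result is drawn from it
    if len(a) > len(b):
        a, b = b, a

    c = set()
    for x in a:
        for y in b:
            # Keep x as soon as any string in b is similar enough
            if fuzz.ratio(x, y) >= threshold:
                c.add(x)
                break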
    return c
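
Example: The sketch below is one dependency-free way to try the idea and experiment with thresholds. It is an illustration rather than the author's function: it assumes the standard-library difflib.SequenceMatcher as a stand-in for fuzz.ratio (scores may differ from thefuzz for some inputs), and the names similarity and intersect_fuzzy_sketch are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(x, y):
    """Hypothetical stand-in for fuzz.ratio: a 0-100 score from the standard library."""
    return round(100 * SequenceMatcher(None, x, y).ratio())


def intersect_fuzzy_sketch(a, b, threshold=70, scorer=similarity):
    """Same loop as intersect_fuzzy, with the scorer passed in (an assumption of this sketch)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller set; results are drawn from it
    return {x for x in a if any(scorer(x, y) >= threshold for y in b)}


fruits = {"apple", "pear"}
others = {"pears", "strawberry", "kiwi"}

exact = fruits & others                                         # exact matching finds nothing
fuzzy = intersect_fuzzy_sketch(fruits, others)                  # "pear" ~ "pears"
strict = intersect_fuzzy_sketch(fruits, others, threshold=95)   # threshold too high to match

print(similarity("pear", "pears"))  # 89
print(exact)                        # set()
print(fuzzy)                        # {'pear'}
print(strict)                       # set()

# With equal-sized sets, the result comes from the first argument
print(intersect_fuzzy_sketch({"pears", "strawberry"}, {"apple", "pear"}))  # {'pears'}
```

Because the result is drawn from the smaller set, and from the first argument when the sizes tie, swapping equal-sized arguments can change which spelling is returned.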