Build Custom Stopwords

Reminder

Please use all code samples responsibly. These are samples and will likely require adjustments to work correctly for your specific needs. Read through the documentation and comments to understand any caveats or limitations of the code and/or data, and follow up with the code author or Code Library admins if you have questions on how to adapt the sample to your specific use case.

Purpose: Build a custom list of stopwords (words to filter out) for use in text analysis. (See here for more information)

Data: Any text data being used for NLP or text analysis

Author: Judah Axelrod (November 2022)

from collections import Counter
from nltk.corpus import stopwords
import string

def get_stopwords():
    '''
    Function: 
        Builds a custom list of stopwords, in this case for a zoning application.
    Context:
        Stopwords are words to be filtered out as extraneous or irrelevant in text analysis.
        This function combines a base list of stopwords from the nltk package (e.g. common words like "the" or "and")
        with other custom stopwords for this zoning use case (e.g. letters, numbers, and 
        other commonly appearing words in zoning documents).
    Returns: 
        custom_stopwords (Counter object): The combined stopwords, stored in a "Counter" object so that
        entries appearing in more than one source list (e.g. "a") are kept as a single key.
    '''
    
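    # Single letters a-z
    alphabet = list(string.ascii_lowercase)
    # Numbers 0-99 as strings, so they match string tokens
    numbers = [str(num) for num in range(100)]
    # Domain-specific terms; multi-word entries only match if the text is tokenized into phrases
    zoning_words = ['district', 'districts', 'zoning district', 'zoning districts', 'zoning']
    # Requires the nltk stopwords corpus, e.g. via nltk.download('stopwords')
    nltk_stopwords = stopwords.words('english')
    # Counter collapses duplicates into one key and supports fast membership checks
    custom_stopwords = Counter(alphabet + numbers + zoning_words + nltk_stopwords)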
        
    return custom_stopwords
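
Usage sketch: The snippet below is a minimal illustration of how a stopword Counter like the one returned by get_stopwords() might be applied to filter text. The helper remove_stopwords, the stand-in demo_stopwords list, and the sample sentence are hypothetical and not part of the original sample. The stand-in list is used so the snippet runs without the nltk corpus, and the simple whitespace tokenization is an assumption that a real pipeline would likely replace.

```python
from collections import Counter
import string

def remove_stopwords(text, stopword_counter):
    '''
    Hypothetical helper: lowercases the text, splits on whitespace,
    strips surrounding punctuation, and drops any token found in the stopword Counter.
    '''
    tokens = [tok.strip(string.punctuation) for tok in text.lower().split()]
    # Membership checks against a Counter look at its keys, like a dict
    return [tok for tok in tokens if tok and tok not in stopword_counter]

# Small stand-in for the output of get_stopwords(), so this runs without nltk data
demo_stopwords = Counter(['the', 'is', 'in', 'a', 'zoning', 'district', 'districts'])

sample = "The parcel is in a residential zoning district."
filtered = remove_stopwords(sample, demo_stopwords)
print(filtered)  # ['parcel', 'residential']
```

With single-word tokenization like this, multi-word entries such as 'zoning district' never match a token on their own. The text is still filtered correctly here only because 'zoning' and 'district' are also listed individually.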