from collections import Counter
from nltk.corpus import stopwords
import string


def get_stopwords():
    '''
    Function:
        Builds a custom list of stopwords, in this case for a zoning application.
    Context:
        Stopwords are words filtered out of a text analysis because they are extraneous or irrelevant.
        This function combines a base list of stopwords from the nltk package (e.g. common words like "the" or "and")
        with custom stopwords for this zoning use case (e.g. letters, numbers, and
        other words that appear frequently in zoning documents).
    Returns:
        custom_stopwords (Counter object): The combined stopwords, stored in a "Counter" object so that a word appearing in more than one source list is merged into a single key.
    '''
    # Requires the nltk stopwords corpus; run nltk.download('stopwords') once beforehand.
    alphabet = [let for let in string.ascii_lowercase]
    numbers = [str(num) for num in range(100)]
    zoning_words = ['district', 'districts', 'zoning district', 'zoning districts', 'zoning']
    nltk_stopwords = stopwords.words('english')
    custom_stopwords = Counter(alphabet + numbers + zoning_words + nltk_stopwords)
    return custom_stopwords

Build Custom Stopwords
Reminder
Please use all code samples responsibly. These are samples and will likely require adjustments to work correctly for your specific needs. Read through the documentation and comments to understand any caveats or limitations of the code and/or data, and follow up with the code author or Code Library admins if you have questions about how to adapt the sample to your specific use case.
Purpose: Build a custom list of stopwords (words to filter out) for use in text analysis.
Data: Any text data being used for NLP or text analysis
Author: Judah Axelrod (November 2022)
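Usage sketch: the example below is one possible way to apply a stopword Counter like the one returned by get_stopwords() to a piece of text. It is an illustration and not part of the original sample. The helper names build_stopwords and remove_stopwords are hypothetical, the short base_words list is a stand-in for stopwords.words('english') so the sketch runs without the nltk corpus, and the tokenization (lowercase, strip punctuation, split on whitespace) is an assumption that may not suit every dataset.

```python
from collections import Counter
import string


def build_stopwords(base_words, domain_words):
    """Hypothetical helper: combine letters, the numbers 0-99, domain words,
    and a base stopword list into a single Counter."""
    alphabet = list(string.ascii_lowercase)
    numbers = [str(num) for num in range(100)]
    return Counter(alphabet + numbers + list(domain_words) + list(base_words))


def remove_stopwords(text, stop_words):
    """Hypothetical helper: lowercase the text, strip punctuation,
    split on whitespace, and drop any token found in stop_words."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return [tok for tok in cleaned.split() if tok not in stop_words]


# Assumption: a tiny stand-in for stopwords.words('english').
base_words = ['the', 'a', 'in', 'is', 'and', 'of']
domain_words = ['district', 'districts', 'zoning']

stop_words = build_stopwords(base_words, domain_words)

# 'a' is in both the alphabet and the base list, so its count is 2,
# but it is still a single key for membership checks.
print(stop_words['a'])

tokens = remove_stopwords(
    'The R-1 zoning district is a residential district in Springfield.',
    stop_words,
)
print(tokens)
```

Membership checks against a Counter work the same way as against a dict, so `tok not in stop_words` is a constant-time lookup. With whitespace tokenization like this, a multi-word entry such as 'zoning district' would never equal a single token, so phrases of that kind would likely need to be removed from the raw text before splitting.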