Intro to Web Scraping
What is web scraping?
Web scraping is the use of programming to extract structured text or data from a website
It is generally used to automate tasks that would take too long (or be too error-prone) to feasibly do manually
There are two main categories of web scraping tasks: (1) Collecting text data from one or more web pages and (2) automating the download of a number of files from a website
How does Urban use web scraping?
Collecting thousands of community college course descriptions from the FLDOE website
Downloading hundreds of CSV files from the Centers for Medicare & Medicaid Services website that all required clicking different dropdowns from a menu of options
Collecting the contact info for all notaries in Mississippi by clicking through thousands of pages on the Secretary of State website
Pulling voting history information from the North Carolina State Board of Election website by searching for thousands of registered voters
What are some drawbacks of web scraping?
Not all sites can be legally or responsibly scraped
Repeated requests to a website can lead to rate limiting (i.e. capping the number of requests over a certain period of time)
Depending on the task and site layout, complexity can vary widely
Web scraping code can be brittle as websites change over time
Why is this web scraping bootcamp being taught in Python?
The Python ecosystem is more mature, more flexible, and better suited for dynamic web pages
Functionality in R is growing and evolving (e.g. the rvest package)
We may consider R tools for future versions of this workshop
What questions should I be asking at the outset?
Can I get the data without web scraping? (e.g. Is there an API or download option? Can you contact the site owner to request access?)
Am I legally allowed to scrape the website? Are there any site/rate limits or responsible web scraping considerations?
How many datasets or pieces of text need to be scraped?
Is the webpage layout consistent or unstandardized?
Are there CAPTCHAs, pop-ups, or ads blocking the content I want?
Does the webpage have slow or inconsistent load times?
What tools/packages are needed for the job? (We will learn this throughout the workshop!)
What are the variables that affect how difficult a web scraping task is?
How many different websites or pages are involved in the web scraping process?
Does the website have dynamic content or only static content?
Is it straightforward to extract the info we want once we reach the desired webpage?
1. Different Webpages
Intuitively, scraping information from one website is simpler than doing so from many websites
If the layouts of the sites are different, difficulty vastly increases
Rule of thumb: Think of this as a unique web scraping task for each uniquely structured website
Web crawlers such as scrapy exist to traverse many websites and grab all relevant information, but without easy ways to filter through that metadata, this can quickly become infeasible
For jobs that take a long time to run (e.g. more than a few hours), gracefully logging and handling issues can add complexity
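To make the idea concrete, here is a minimal sketch of a scrapy spider; the class name, start URL, and selectors are hypothetical placeholders rather than a site we will actually scrape:

import scrapy

class CatalogSpider(scrapy.Spider):
    # Hypothetical spider: the name, start URL, and selectors are placeholders
    name = "catalog"
    start_urls = ["https://example.com/catalog"]

    def parse(self, response):
        # Record this page's title, then follow every link and repeat
        yield {"url": response.url, "title": response.css("title::text").get()}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Run with scrapy runspider (e.g. scrapy runspider catalog_spider.py -o links.json); without further filtering, the output quickly grows to every link on every reachable page, which is why crawlers alone rarely solve a targeted scraping task.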
2. Static vs Dynamic Content
For a static page like a Wikipedia article, packages like BeautifulSoup or pandas can grab HTML text without too much complexity by parsing HTML tags
For pages with dynamic content like clickable buttons or dropdown menus, the Selenium package is needed and the complexity goes up
Rule of thumb: Would a human user need to take any actions (besides scrolling up or down) to navigate to the desired info, or is it immediately available on the webpage?
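As a rough sketch of the static case (the article URL below is just an illustrative example), BeautifulSoup can parse the HTML returned by a single requests call:

import requests
from bs4 import BeautifulSoup

# Illustrative static page; any article-style page works the same way
url = "https://en.wikipedia.org/wiki/Web_scraping"
response = requests.get(url)

# Parse the raw HTML and pull text out of specific tags
soup = BeautifulSoup(response.text, "lxml")
title = soup.find("h1").get_text()
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(title)
print(paragraphs[0])

We will walk through this pattern in detail next session.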
3. Identifying Desired Information
Possible future task: scraping area median income from HUD website
Upside: Only one webpage, and we can use Selenium to navigate the dropdowns
Downside: The numbers we want to grab can be in different places within each webpage
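A rough sketch of what the dropdown navigation could look like with Selenium (the URL and element ID below are hypothetical placeholders, not the actual HUD page structure):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager

# Launch a Chrome browser that the script controls
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://example.com/income-limits")  # hypothetical URL

# Choose an option from a dropdown, just as a human user would
dropdown = Select(driver.find_element(By.ID, "state"))  # hypothetical element ID
dropdown.select_by_visible_text("Florida")

# The updated page source can then be searched for the numbers we want
html = driver.page_source
driver.quit()

The hard part is the last step: because the desired numbers sit in different places on each page, the search over the page source has to handle several layouts.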
Responsible Web Scraping Guidelines
Check the robots.txt file - let’s look at an example: https://www.urban.org/robots.txt
Consult Urban’s Automated Data Collection Guidelines.
Use headers to identify yourself (we'll see this in action next week; a brief sketch follows this list)
headers = {'user-agent': 'Urban Institute Research Data Collector ([your_e-mail]@urban.org)'}
Use Site Monitor to ensure web scraping does not strain the website
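As a quick sketch of the robots.txt check and the user-agent header (the page path below is just an example), Python's built-in robotparser can read a robots.txt file, and a descriptive header can be attached to each request:

import requests
from urllib import robotparser

# Read the site's robots.txt and check whether a given path may be fetched
rp = robotparser.RobotFileParser()
rp.set_url("https://www.urban.org/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.urban.org/some-page"))  # example path

# Identify yourself with a descriptive user-agent on every request
headers = {'user-agent': 'Urban Institute Research Data Collector ([your_e-mail]@urban.org)'}
response = requests.get("https://www.urban.org/some-page", headers=headers)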
Site Monitor
A tool created by Urban to ensure responsible web scraping practices
The actual code for Site Monitor lives in this GitHub repository
Example code to test strain on a website
from site_monitor import *
import requests

# Set up the Site Monitor with a burn-in of 20 requests
sm = SiteMonitor(burn_in=20)

for i in range(100):
    print(i)
    url = "https://flscns.fldoe.org/PbInstituteCourseSearch.aspx"
    response = requests.get(url)
    delay = sm.track_request(response)

# Display the report of response times in graph format
sm.report('display')
Example Output from Site Monitor
[Figure: Site Monitor report of response times]
A note on ~AI~
We don’t expect you to understand 100% of the code throughout this bootcamp.
We want to emphasize the idea of concepts > syntax.
Urban is still testing its guidelines for the use of AI, and while things are murky, we want to focus on building Python and web scraping intuition.
This workshop will not use Copilot or ChatGPT, though we acknowledge that those tools can be useful when you approach them from a place of conceptual understanding.
Homework: Installations for Next Time
Install Python via Anaconda - see guidance from PUG’s Python Installation training here
Install the following Python packages: requests, beautifulsoup4, lxml, selenium, and webdriver-manager (one way to install them is sketched below this list)
Launch a new Jupyter Notebook if you’ve never done so before - see guidance from PUG’s Intro to Python training here
If you have any issues, please ask in the #python-users channel - we’d love to help. Someone else probably has the same question!
Sign up for GitHub using this guide if you haven’t already so that you can access these workshop materials!
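For reference, one way to install the homework packages from the Anaconda Prompt or terminal (assuming pip came with your Anaconda install):

pip install requests beautifulsoup4 lxml selenium webdriver-manager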
Next Session
How to scrape text from static webpages using BeautifulSoup
Diving into some Python code!