Intro to Web Scraping#

What is web scraping?#

  • Web scraping is the use of programming to extract structured text or data from a website

  • It is generally used to automate tasks that would take too long (or be too error-prone) to feasibly do manually

  • There are two main categories of web scraping tasks: (1) collecting text data from one or more web pages and (2) automating the download of many files from a website (a small sketch of the second category follows below)
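
To make the second category concrete, here is a minimal sketch of automating file downloads with the requests package; the URL pattern and file names are hypothetical.

import requests

# Hypothetical URL pattern for a set of annual CSV reports
base_url = "https://example.com/reports/report_{year}.csv"

for year in range(2018, 2021):
    url = base_url.format(year=year)
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop if the server returns an error status

    # Save the downloaded bytes to a local file
    with open(f"report_{year}.csv", "wb") as f:
        f.write(response.content)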

How does Urban use web scraping?#

What are some drawbacks of web scraping?#

  • Not all sites can be legally or responsibly scraped

  • Repeated requests to a website can lead to rate limiting (i.e. capping the number of requests allowed over a certain period of time; see the sketch after this list for one way to handle it)

  • Depending on the task and site layout, complexity can vary widely

  • Web scraping code can be brittle as websites change over time
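
One way to respond to rate limiting is to pace requests and respect the server's 429 ("Too Many Requests") responses. A minimal sketch, with hypothetical URLs:

import time
import requests

# Hypothetical list of pages you are allowed to request
urls = ["https://example.com/page-1", "https://example.com/page-2"]

for url in urls:
    response = requests.get(url, timeout=30)

    if response.status_code == 429:  # "Too Many Requests"
        # Retry-After is often a number of seconds; fall back to 60 if it is missing
        wait = int(response.headers.get("Retry-After", 60))
        time.sleep(wait)
        response = requests.get(url, timeout=30)

    time.sleep(1)  # pause between requests to stay well under any rate limit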

Why is this web scraping bootcamp being taught in Python?#

  • The Python ecosystem is more mature, more flexible, and better suited to scraping dynamic web pages

  • Functionality in R is growing and evolving (e.g. the rvest package)

  • We may consider R tools for future versions of this workshop

What questions should I be asking at the outset?#

  • Can I get the data without web scraping? (e.g. Is there an API or download option? Can I contact the site owner to request access?)

  • Am I legally allowed to scrape the website? Are there any site/rate limits or responsible web scraping considerations?

  • How many datasets or pieces of text need to be scraped?

  • Is the webpage layout consistent, or does it vary from page to page?

  • Are there CAPTCHAs, pop-ups, or ads blocking the content I want?

  • Does the webpage have slow or inconsistent load times?

  • What tools/packages are needed for the job? (We will learn this throughout the workshop!)

What are the variables that affect how difficult a web scraping task is?#

  1. How many different websites or pages are involved in the web scraping process?

  2. Does the website have dynamic content or only static content?

  3. Is it straightforward to extract the info we want once we reach the desired webpage?

1. Different Webpages#

  • Intuitively, scraping information from one website is simpler than doing so from many websites

  • If the layouts of the sites are different, difficulty vastly increases

    • Rule of thumb: Think of this as a unique web scraping task for each uniquely structured website

  • Web crawlers such as scrapy exist to traverse many websites and grab all relevant information, but without easy ways to filter through the collected data, this can quickly become infeasible

  • For jobs that take a long time to run (e.g. more than a few hours), gracefully logging and handling issues can add complexity (see the sketch below)
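
For long-running jobs, one common pattern is to log failures and keep going rather than letting a single bad page crash the whole run. A minimal sketch, with hypothetical URLs:

import logging
import requests

logging.basicConfig(filename="scrape.log", level=logging.INFO)

# Hypothetical list of pages; a real job might have thousands of these
urls = ["https://example.com/a", "https://example.com/b"]

results = {}
for url in urls:
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        results[url] = response.text
        logging.info("Scraped %s", url)
    except requests.RequestException as err:
        # Record the failure and move on instead of losing hours of completed work
        logging.warning("Failed on %s: %s", url, err)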

2. Static vs Dynamic Content#

  • For a static page like a Wikipedia article, packages like BeautifulSoup or pandas can grab HTML text without too much complexity by parsing HTML tags (see the first sketch after this list)

  • For pages with dynamic content like clickable buttons or dropdown menus, the Selenium package is needed and the complexity goes up (see the second sketch after this list)

    • Rule of thumb: Would a human user need to take any actions (besides scrolling up or down) to navigate to the desired info, or is it immediately available on the webpage?
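
To give a feel for the static case, here is a minimal BeautifulSoup sketch that pulls text out of a Wikipedia article by its HTML tags (the exact tags to target depend on the page):

import requests
from bs4 import BeautifulSoup

# A static page: all of the text arrives in the HTML the server returns
url = "https://en.wikipedia.org/wiki/Web_scraping"
response = requests.get(url, timeout=30)

soup = BeautifulSoup(response.text, "lxml")

# Pull pieces of the page out by their HTML tags
title = soup.find("h1").get_text()
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(title)

And a rough sketch of the dynamic case with Selenium, where a real browser loads the page and clicks through it before the text is read; the URL and element IDs below are hypothetical:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Launch a browser session (Selenium 4 style, using webdriver-manager)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://example.com/data-tool")  # hypothetical dynamic page

# Click a (hypothetical) button, then read the text that appears afterwards
driver.find_element(By.ID, "show-data").click()
text = driver.find_element(By.ID, "results").text

driver.quit()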

3. Identifying Desired Information#

  • Possible future task: scraping area median income from the HUD website

  • Upside: only one webpage, and we can use Selenium to navigate the dropdowns

  • Downside: the numbers we want to grab can be in different places within each webpage

Responsible Web Scraping Guidelines#

  1. Check the robots.txt file - let’s look at an example: https://www.urban.org/robots.txt (a short sketch using Python’s built-in robotparser follows this list)

  2. Consult Urban’s Automated Data Collection Guidelines.

  3. Use Headers (we’ll see this in action next week)

headers = {'user-agent': 'Urban Institute Research Data Collector ([your_e-mail]@urban.org)'}

  4. Use Site Monitor to ensure web scraping does not strain the website
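
As a preview of items 1 and 3, here is a minimal sketch that checks robots.txt with Python’s built-in robotparser before requesting a (hypothetical) page with the header above:

import requests
from urllib import robotparser

# Parse the site's robots.txt file
rp = robotparser.RobotFileParser()
rp.set_url("https://www.urban.org/robots.txt")
rp.read()

headers = {'user-agent': 'Urban Institute Research Data Collector ([your_e-mail]@urban.org)'}
url = "https://www.urban.org/some-page"  # hypothetical page to scrape

# Only request the page if robots.txt allows our user agent to fetch it
if rp.can_fetch(headers['user-agent'], url):
    response = requests.get(url, headers=headers, timeout=30)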

Site Monitor#

  • A tool created by Urban to ensure responsible web scraping practices

  • The actual code for Site Monitor lives in this GitHub repository

  • Example code to test strain on a website

from site_monitor import *
import requests

# Track response times while repeatedly requesting the same page
sm = SiteMonitor(burn_in=20)

for i in range(100):
    print(i)
    url = "https://flscns.fldoe.org/PbInstituteCourseSearch.aspx"
    response = requests.get(url)
    delay = sm.track_request(response)  # record this request so Site Monitor can report response times

# Display the report of response times in graph format
sm.report('display')

Example Output from Site Monitor#

[Figure: Site Monitor report of response times generated by sm.report('display')]

A note on AI#

  • We don’t expect you to understand 100% of the code throughout this bootcamp.

  • We want to emphasize the idea of concepts > syntax.

  • Urban is still testing its guidelines for the use of AI; while that guidance remains unsettled, we want to focus on building Python and web scraping intuition.

  • This workshop will not use Copilot or ChatGPT, though we acknowledge that those tools are useful when you come to them with a solid conceptual understanding.

Homework: Installations for Next Time#

  • Install Python via Anaconda - see guidance from PUG’s Python Installation training here

  • Install the following Python packages: requests, beautifulsoup4, lxml, selenium, and webdriver-manager.

  • Launch a new Jupyter Notebook if you’ve never done so before - see guidance from PUG’s Intro to Python training here

  • If you have any issues, please ask in the #python-users channel; we’d love to help. Someone else probably has the same question!

  • Sign up for GitHub using this guide if you haven’t already, so that you can access these workshop materials!

Next Session#

  • How to scrape text from static webpages using BeautifulSoup

  • Diving into some Python code!