Web Scraping with Dynamic Pages#

Dynamic web pages require a separate set of tools, used either instead of or in addition to what we covered in the last lesson. We’ll have to automate the actions that a human user would take when navigating a web page, such as clicking a button, selecting dropdown options, or entering text. This lesson is an introduction to the selenium package in Python, which lets us interact with dynamic web pages flexibly and powerfully.

Overview#

Today we’ll be digging into how to get started with a web scraping task and how to structure your thinking about approaching the task. For the remainder of the boot camp we’ll be working on scraping state-level health insurance premium values from the KFF Health Insurance Marketplace Calculator. In this example, the project team needs the cost of the Silver Plan Premium (without financial help) for each county for people aged 14, 20, 40, and 60. The final output should look something like this:

State   County      Age 14   Age 20   Age 40   Age 60
AL      St. Clair      281      434      566     1824
AL      Jefferson      294      420      540     1830
AL      Shelby         273      451      589     1801
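One way to hold output shaped like this in Python is a list of dictionaries, one per county, written out with the standard csv module. The following is a minimal sketch: the helper name rows_to_csv and the exact column names are illustrative assumptions, and the figures are simply the sample values shown above. A pandas DataFrame would work equally well.

```python
import csv
import io

# Column order for the final output (names are illustrative assumptions).
COLUMNS = ["State", "County", "Age 14", "Age 20", "Age 40", "Age 60"]

# Sample rows copied from the example output above.
rows = [
    {"State": "AL", "County": "St. Clair", "Age 14": 281, "Age 20": 434, "Age 40": 566, "Age 60": 1824},
    {"State": "AL", "County": "Jefferson", "Age 14": 294, "Age 20": 420, "Age 40": 540, "Age 60": 1830},
    {"State": "AL", "County": "Shelby", "Age 14": 273, "Age 20": 451, "Age 40": 589, "Age 60": 1801},
]


def rows_to_csv(rows, columns=COLUMNS):
    """Return the rows as CSV text: a header line, then one line per county."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(rows))
```

Keeping each county as one dictionary means a later loop can append a new row as soon as its premiums have been scraped.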

Getting Started#

First, let’s import some packages that we’ll need throughout the lesson (which you may need to install from the command line using conda install package_name):

import os
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait

# Chrome driver setup
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

## NOTE: Some users may want to try a Firefox driver instead.
## To do so, comment out the two Chrome lines above and uncomment the two lines below
## (and use webdriver.Firefox in place of webdriver.Chrome when launching the driver).
# from selenium.webdriver.firefox.service import Service
# from webdriver_manager.firefox import GeckoDriverManager

The Driver#

To start, we’ll need to launch a web browser that will be controlled by our Python code; the object that controls it is called a driver. First, we have a line that either installs the driver executable (in this case for the Chrome browser) or reuses a locally cached version if it’s already installed.

Next, we need to specify the URL that we want the driver to navigate to. The following chunk of code specifies that we want to navigate to the Health Insurance Marketplace Calculator, and then opens a web browser and navigates to the page.

%%capture
# %%capture hides this cell's output. It must sit alone on the first line of the cell;
# a trailing comment on that line is parsed as arguments to the magic and raises a UsageError.
service = Service(executable_path=ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
url = "https://www.kff.org/interactive/subsidy-calculator/"
driver.get(url)

Quick refresher on functions#

A function is a block of code that executes a particular action. We use functions to avoid copying and pasting the same code every time we want to repeat that action. Functions can take in arguments from the user and can return a value.

def square_number(x):
    return x**2

square_number(3)
9
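The helper functions later in this lesson also rely on default argument values (for example, timeout=15). Here is a small sketch of how defaults and keyword arguments behave, using a hypothetical power function that is not part of the lesson's code:

```python
def power(x, exponent=2):
    """Raise x to exponent; exponent defaults to 2 when the caller omits it."""
    return x ** exponent


print(power(3))              # uses the default exponent
print(power(3, exponent=3))  # overrides the default by keyword
```

Passing an argument by name, as in exponent=3, keeps calls readable when a function has several optional settings.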

Looking through the website, there are dropdown menus, buttons, and text inputs that we’ll need to navigate. Following the rule of thumb “if a human needs to click something, use selenium,” we’ll need the selenium package. Luckily, this isn’t Urban’s first web scraping rodeo and we have sample functions for completing each of these types of actions.

Click Button#

Use this function when you need to click a button on the page.

def click_button(identifier, driver, by=By.XPATH, timeout=15):
    '''
    This function waits until a button is clickable and then clicks on it.

    Inputs:
        identifier (string): The Id, XPath, or other way of identifying the element to be clicked on
        driver (webdriver): The selenium webdriver controlling the browser
        by (By object): How to identify the identifier (Options include By.XPATH, By.ID, By.NAME and others).
            Make sure 'by' and 'identifier' correspond to one another as they are used as a tuple pair below.
        timeout (int): How long to wait, in seconds, for the object to be clickable

    Returns:
        None (just clicks on button)
    '''

    element_clickable = EC.element_to_be_clickable((by, identifier))
    element = WebDriverWait(driver, timeout=timeout).until(element_clickable)
    # Click via JavaScript rather than element.click()
    driver.execute_script("arguments[0].click();", element)

Select a Dropdown#

Use this function to select a value in a dropdown menu.

def select_dropdown(identifier, driver, by=By.XPATH, value=None, option=None, index=None):
    '''
    This function selects an option in a dropdown menu.
    It first waits (up to 15 seconds) until the dropdown element is clickable, then selects the requested option.
    If the element is not clickable within 15 seconds, the wait raises a TimeoutException.

    Inputs:
        identifier (string): The Id, XPath, or other way of identifying the dropdown menu,
            found through inspecting the web page.
        driver (webdriver): The selenium webdriver controlling the browser.
        by (By object): How to interpret the identifier (By.XPATH, By.ID, By.NAME and others).
        value (string): The HTML 'value' attribute of the option to select.
        option (string): The visible text of the option to select; used when value is None.
        index (int): The position of the option to select; used when both value and option are None.

    Returns:
        None (just selects the right item in the dropdown menu)
    '''
    element_clickable = EC.element_to_be_clickable((by, identifier))
    element = WebDriverWait(driver, timeout=15).until(element_clickable)
    if value is not None:
        Select(element).select_by_value(value)
    elif option is not None:
        Select(element).select_by_visible_text(option)
    elif index is not None:
        Select(element).select_by_index(index)
    else:
        raise ValueError("Provide one of value, option, or index.")

Enter Text#

Use this function to enter text in a text box. The enter_text function is accompanied by the is_textbox_empty function, which tests whether there is already a value in the text box. Later in the boot camp, when we start to loop through variables, in some cases we’ll want to skip over the text box if there’s already text; in others we’ll want to clear the value first before we enter something else.

def enter_text(identifier, text, driver, by=By.XPATH):
    '''
    This function waits until a text box is clickable, clears it, and types in the given text.
    '''
    element_clickable = EC.element_to_be_clickable((by, identifier))
    element = WebDriverWait(driver, timeout=15).until(element_clickable)
    # Clear the text from the text box (zip code wasn't overwriting)
    element.clear()
    element.send_keys(text)


def is_textbox_empty(driver, textbox_id):
    '''
    This function checks if a text box is empty.
    Use this for the income variable so that we don't rewrite it
    every loop.

    Note: despite its name, textbox_id is used as an XPath.
    '''
    textbox = driver.find_element(By.XPATH, textbox_id)
    textbox_value = textbox.get_attribute("value")

    return not bool(textbox_value)
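The last line of is_textbox_empty relies on Python truthiness: an empty string is falsy, so not bool("") is True. The sketch below isolates that check in a hypothetical helper, is_empty_value, so it can be tried without a browser. It assumes the attribute comes back as a string or None; note that a box containing only a space counts as non-empty under this check.

```python
def is_empty_value(value):
    """Mirror the check in is_textbox_empty: True when value is '' or None."""
    return not bool(value)


# Try a few values a text box might plausibly hold.
for sample in ["", None, "100000", " "]:
    print(repr(sample), is_empty_value(sample))
```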

Sometimes to move forward, you have to wait#

Let’s dive into the click_button function so we have some intuition as to what’s going on. The body is only three lines long, which speaks to how powerful the selenium package is.

def click_button(identifier, driver, by=By.XPATH, timeout=15):   
    
    element_clickable = EC.element_to_be_clickable((by, identifier))
    element = WebDriverWait(driver, timeout=timeout).until(element_clickable)
    driver.execute_script("arguments[0].click();", element)

Function Arguments#

First, let’s look at what we pass into the function. The driver is just the webdriver we specified earlier, which will never change. The identifier is some unique way to identify the button we want to click, and the by argument specifies how we identify that button. Let’s take a look at different ways we could identify the “SUBMIT” button on the page we launched before:

  • XPATH is the default and is probably the easiest way to identify a button, though it may be slightly slower computationally.

  • Finding an object by its ID is faster, but not all objects (including this one) have an ID.

Using XPATH in this case, we can copy and paste the XPATH of the button from the page source and feed it in as the identifier argument.
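To get a feel for the difference between locating an element by XPath and by ID, the sketch below runs both kinds of lookup against a small made-up page using Python's standard xml.etree.ElementTree module. The markup is an assumption for illustration, not the real KFF page source, and ElementTree supports only a subset of XPath, whereas selenium hands the expression to the browser.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup for illustration; NOT the real KFF page source.
PAGE = """
<html>
  <body>
    <form>
      <select id="state-dd" name="state">
        <option value="al">Alabama</option>
      </select>
      <input type="submit" value="SUBMIT" />
    </form>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# Locate by attribute with an XPath-style expression (works without an id).
submit = root.find(".//input[@type='submit']")

# Locate by id, which is only possible when the element has one.
state = root.find(".//*[@id='state-dd']")

print(submit.get("value"))
print(state.tag)
```

In this made-up page the submit button has no id attribute, so only the attribute-based expression can reach it. That mirrors the situation described above for the “SUBMIT” button.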

Function Steps#

  1. The first line builds a condition describing the element we want to click, based on the identifier and by arguments we discussed before. Nothing is looked up on the page yet.

  2. The second line uses what selenium calls an “explicit wait” (WebDriverWait), a hugely powerful part of the package. The driver waits until the element we want is “clickable” on the webpage, which is crucial for dynamic web pages where elements might take time to load, especially when we repeatedly call the same page. What’s nice about this kind of wait is that it lasts only as long as it needs to, or until a timeout is reached (in this case 15 seconds; if an element takes longer than that to load, something is probably wrong). By contrast, a fixed pause such as time.sleep stops the code for a user-specified amount of time regardless of whether the page is ready. (Selenium’s “implicit wait”, set with driver.implicitly_wait, is a different feature: a driver-wide timeout applied when locating elements.) We generally prefer explicit waits because they’re more efficient, though we’ll see how both waits and fixed pauses have their place.

  3. Finally, once the element is clickable, the third line clicks it by running a small piece of JavaScript in the browser.
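The “wait only as long as needed, up to a timeout” behavior in step 2 can be pictured as a polling loop. The sketch below is a simplified stand-in written in plain Python, not selenium's actual implementation: wait_until, element_ready, and the poll_interval default are all assumptions made for illustration.

```python
import time


def wait_until(condition, timeout=15, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value as soon as it appears; raises TimeoutError if the
    timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulate an element that only becomes "clickable" on the third check.
checks = {"count": 0}


def element_ready():
    checks["count"] += 1
    return "button" if checks["count"] >= 3 else None


start = time.monotonic()
element = wait_until(element_ready, timeout=5, poll_interval=0.1)
elapsed = time.monotonic() - start
print(element, checks["count"], f"{elapsed:.1f}s")
```

The loop returns after roughly two short polls rather than the full five-second timeout, which is why this style of waiting is usually faster than pausing for a fixed time.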

A note on dropdowns and text boxes:#

Unlike clicking a button, the other functions take additional arguments. For entering text, you have to specify the text you want to enter. Dropdowns are slightly more complicated, because you have to indicate which option you want to select, which you can do by its “value”, its visible text (“option”), or its “index”, all of which you can find by inspecting the dropdown.
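To see what the value and index of a dropdown option mean in practice, the sketch below parses a small made-up select element with Python's standard html.parser module and lists the index, value, and visible text of each option. The markup and the OptionLister class are assumptions for illustration; the real dropdowns on the calculator page will differ.

```python
from html.parser import HTMLParser

# Made-up dropdown markup for illustration; NOT copied from the KFF page.
DROPDOWN_HTML = """
<select name="state">
  <option value="">Select a state</option>
  <option value="al">Alabama</option>
  <option value="il">Illinois</option>
</select>
"""


class OptionLister(HTMLParser):
    """Collect (index, value, visible text) for every <option> element."""

    def __init__(self):
        super().__init__()
        self.options = []
        self._in_option = False
        self._current_value = None
        self._text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._in_option = True
            self._current_value = dict(attrs).get("value")
            self._text = ""

    def handle_data(self, data):
        if self._in_option:
            self._text += data

    def handle_endtag(self, tag):
        if tag == "option" and self._in_option:
            index = len(self.options)  # options are numbered from 0
            self.options.append((index, self._current_value, self._text.strip()))
            self._in_option = False


parser = OptionLister()
parser.feed(DROPDOWN_HTML)
for index, value, text in parser.options:
    print(index, repr(value), text)
```

In this made-up menu, Illinois could be chosen by value ("il"), by visible text ("Illinois"), or by index (2), which correspond to the select_by_value, select_by_visible_text, and select_by_index branches of select_dropdown.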

WORKSHOP#

TASK 1#

Let’s make a list of the actions we’d need to take on the webpage in order to navigate to the health insurance premium values. We might call this “pseudocode” - no Python code needed here; just a list of steps that we want to convert to code.

TASK 1 - SOLUTION#

Steps:

  1. Select dropdown - state

  2. Enter text - zip code

  3. Some zip codes cross counties, so in these cases, we also need to select the county dropdown

  4. Enter text - yearly household income

  5. Click button - whether coverage is available from your or your spouse's job

  6. Select dropdown - number of people in family

  7. Select dropdown - number of adults enrolling in coverage

  8. Select dropdown - age

  9. Select dropdown - number of children enrolling

  10. Click button - "SUBMIT"
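Pseudocode like this can also be written down as plain data before any selenium code exists. The sketch below records each step as an (action, target) pair, where the action names match the helper functions defined earlier and the targets are plain-English labels rather than real page identifiers. The STEPS name and the labels are assumptions for illustration.

```python
from collections import Counter

# Each step pairs a helper function name with a description of its target.
# The targets are plain-English labels, not real XPaths or IDs.
STEPS = [
    ("select_dropdown", "state"),
    ("enter_text", "zip code"),
    ("select_dropdown", "county (only when the zip code crosses counties)"),
    ("enter_text", "yearly household income"),
    ("click_button", "coverage available from your or your spouse's job"),
    ("select_dropdown", "number of people in family"),
    ("select_dropdown", "number of adults enrolling in coverage"),
    ("select_dropdown", "age"),
    ("select_dropdown", "number of children enrolling"),
    ("click_button", "SUBMIT"),
]

# Count how often each helper is needed.
action_counts = Counter(action for action, _ in STEPS)

for number, (action, target) in enumerate(STEPS, start=1):
    print(f"{number:>2}. {action}: {target}")
print(dict(action_counts))
```

Listing the steps this way makes it easy to check that every action on the page maps to one of the three helper functions.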

TASK 2#

Let’s use the functions defined earlier to execute the ten steps above. You can do this one step at a time, checking the browser opened by your driver to see whether it clicked the correct thing. If you make a mistake, you can either close the driver and relaunch it using the code above, or you can manually click back to the beginning and try again.

Let’s try this for the following values:

  • State: Illinois

  • Zip Code: 62401

  • County: Shelby

  • Yearly Household Income: $100,000

  • No coverage available

  • 1 person in family

  • 1 adult enrolling

  • Age of 60

  • No children enrolling

TASK 2 SOLUTION#

# Solution here:

And now we’ll see that we’ve successfully navigated to the page with the premium values we want to scrape, setting us up to use the code from last week’s BeautifulSoup workshop to actually get these values!

Looking Ahead#

Next week, we’ll see how to put this code into a bigger loop so that we can repeat it many times for various menu options. Some concepts that will be important next week:

  • for loops in Python

  • functions that prevent us from copy/pasting things 10 million times

  • Implicit and Explicit waits

  • Dictionaries and lists for storing values

Helpful Review: