Web Scraping with Static Pages#

Today we will be learning how to scrape text from a static webpage. By static, we mean that the webpage does not change its content based on user input (e.g., clicks or text boxes). We will cover the following concepts today:

  • Inspecting a webpage

  • What are HTML tags and why are they important?

  • How to use the requests library to get the HTML content of a webpage

  • How to use the BeautifulSoup library to parse the HTML content and extract just the parts we want

  • Getting the final output into a workable format using Pandas

Example#

Our example web page will be the Wikipedia page listing National Parks in the United States: https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States. We’ll use this example to showcase a few different approaches to scraping text from a static webpage. Let’s say we wanted to generate a list of National Parks and their state/territory, which would look something like this:

Park                State/Territory
Acadia              Maine
American Samoa      American Samoa
Arches              Utah

Inspecting a Web Page to learn more about it#

By right-clicking on a web page and selecting “Inspect” or “Inspect Element” you can see the HTML and CSS that make up the page. You can also right-click on the specific text or data element that you want to extract and select “Inspect” to see the HTML and CSS behind that specific element. Let’s start by right-clicking on “Acadia” and clicking “Inspect”. We will see several HTML tags (indicated with <> symbols), including the one corresponding to where the name appears:

<th scope="row">
    <a href="/wiki/Acadia_National_Park" title="Acadia National Park">Acadia</a>
</th>

Similarly, for its state, we have:

<td>
    <a href="/wiki/Maine" title="Maine">Maine</a>
    <br>
[...]

So what does any of this mean?

HTML Tags#

Websites are built using HTML tags. Tags are used to create the structure of a website and indicate different headings, paragraphs, lists, links, images, and more.

Tags are signified with an opening tag, like <h1>, and a closing tag, like </h1>. The closing tag is the same as the opening tag, but with a forward slash / before the tag name. The text between the opening and closing tags is the content of the tag. A few common HTML tags are listed below:

  • <h1>, <h2>, <h3>, <h4>, <h5>, <h6>: Headings in decreasing order of size

  • <p>: Paragraph

  • <a>: Link

  • <ul>, <ol>, <li>: Unordered list, ordered list, and list item

  • <table>, <tr>, <th>, <td>: Table, table row, table header, table cell

  • <img>: Image

So turning back to our example above, we see that the national park names are enclosed in <th> tags and the corresponding state/territory names seem to be enclosed in <td> tags. We can also see that both of these tags are nested within a <tr> tag (table row), which is itself nested within a <table> tag. This is a common structure for tables in HTML and helpful to know when we start extracting data.
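For illustration only, here is a stripped-down sketch of that nesting (not the actual Wikipedia markup, which has many more rows, cells, and attributes):

<table>
  <tr>
    <th scope="col">Name</th>
    <th scope="col">Location</th>
  </tr>
  <tr>
    <th scope="row"><a href="/wiki/Acadia_National_Park">Acadia</a></th>
    <td><a href="/wiki/Maine">Maine</a></td>
  </tr>
</table>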

Making a request to a webpage#

To get the HTML content from this page, we can use the requests library in Python. The requests.get() function will return a Response object, which contains the HTML content of the webpage.

When we print the response object, the status code will tell us whether the request was successful. See this link for more detailed information on the possible codes, but generally any response in the 200s means the request was successful.

# Import the requests package
import requests 
# Set headers to let website know who we are
headers = {'user-agent': 'Urban Institute Research Data Collector (jaxelrod@urban.org)'}
# Save our URL name
url = 'https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States'
# Send a GET request to the URL and save the response
response = requests.get(url, headers=headers)
# A response of 200 means that the page was downloaded successfully; other codes (e.g. 403 or 404) indicate a problem
print(response)
<Response [200]>
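If you’d rather have a script check the code automatically instead of eyeballing the printed object, the Response object also exposes it directly. A minimal sketch using the same response as above:

# The numeric status code is available as an attribute
print(response.status_code)
# raise_for_status() raises an HTTPError for any 4xx/5xx response,
# which is handy when a script shouldn't continue after a failed request
response.raise_for_status()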

Extracting the text#

Now that we have the page’s HTML stored in a response object, we can use the BeautifulSoup library to parse it and extract only the text we want. Since we know that the park names are enclosed in <th> tags, we can use the find_all() function to extract all instances of these tags. Let’s start here to see what happens:

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'lxml')
table_headers = soup.find_all('th')
print(table_headers)
[<th scope="col">Name
</th>, <th class="unsortable" scope="col">Image
</th>, <th scope="col">Location
</th>, <th scope="col">Date established as park<sup class="reference" id="cite_ref-12"><a href="#cite_note-12">[12]</a></sup>
</th>, <th scope="col">Area (2023)<sup class="reference" id="cite_ref-acreage_report_8-1"><a href="#cite_note-acreage_report-8">[8]</a></sup>
</th>, <th scope="col">Recreation visitors (2022)<sup class="reference" id="cite_ref-:0_11-2"><a href="#cite_note-:0-11">[11]</a></sup>
</th>, <th class="unsortable" scope="col">Description
</th>, <th scope="row"><a href="/wiki/Acadia_National_Park" title="Acadia National Park">Acadia</a>
</th>, <th scope="row"><a href="/wiki/National_Park_of_American_Samoa" title="National Park of American Samoa">American Samoa</a>
</th>, <th scope="row"><a href="/wiki/Arches_National_Park" title="Arches National Park">Arches</a>
</th>, <th scope="row"><a href="/wiki/Badlands_National_Park" title="Badlands National Park">Badlands</a>
</th>, <th scope="row"><a href="/wiki/Biscayne_National_Park" title="Biscayne National Park">Biscayne</a>
</th>, <th scope="row"><a href="/wiki/Black_Canyon_of_the_Gunnison_National_Park" title="Black Canyon of the Gunnison National Park">Black Canyon of the Gunnison</a>
</th>, <th scope="row"><a href="/wiki/Bryce_Canyon_National_Park" title="Bryce Canyon National Park">Bryce Canyon</a>
</th>, <th scope="row"><a href="/wiki/Canyonlands_National_Park" title="Canyonlands National Park">Canyonlands</a>
</th>, <th scope="row"><a href="/wiki/Capitol_Reef_National_Park" title="Capitol Reef National Park">Capitol Reef</a>
</th>, <th scope="row"><a href="/wiki/Crater_Lake_National_Park" title="Crater Lake National Park">Crater Lake</a>
</th>, <th scope="row"><a href="/wiki/Cuyahoga_Valley_National_Park" title="Cuyahoga Valley National Park">Cuyahoga Valley</a>
</th>, <th scope="row"><a href="/wiki/Gates_of_the_Arctic_National_Park_and_Preserve" title="Gates of the Arctic National Park and Preserve">Gates of the Arctic</a>
</th>, <th scope="row"><a href="/wiki/Gateway_Arch_National_Park" title="Gateway Arch National Park">Gateway Arch</a>
</th>, <th scope="row"><a href="/wiki/Great_Basin_National_Park" title="Great Basin National Park">Great Basin</a>
</th>, <th scope="row"><a href="/wiki/Great_Sand_Dunes_National_Park_and_Preserve" title="Great Sand Dunes National Park and Preserve">Great Sand Dunes</a>
</th>, <th scope="row"><a href="/wiki/Guadalupe_Mountains_National_Park" title="Guadalupe Mountains National Park">Guadalupe Mountains</a>
</th>, <th scope="row"><a href="/wiki/Hot_Springs_National_Park" title="Hot Springs National Park">Hot Springs</a>
</th>, <th scope="row"><a href="/wiki/Indiana_Dunes_National_Park" title="Indiana Dunes National Park">Indiana Dunes</a>
</th>, <th scope="row"><a href="/wiki/Katmai_National_Park_and_Preserve" title="Katmai National Park and Preserve">Katmai</a>
</th>, <th scope="row"><a href="/wiki/Kenai_Fjords_National_Park" title="Kenai Fjords National Park">Kenai Fjords</a>
</th>, <th scope="row"><a href="/wiki/Kobuk_Valley_National_Park" title="Kobuk Valley National Park">Kobuk Valley</a>
</th>, <th scope="row"><a href="/wiki/Lake_Clark_National_Park_and_Preserve" title="Lake Clark National Park and Preserve">Lake Clark</a>
</th>, <th scope="row"><a href="/wiki/Lassen_Volcanic_National_Park" title="Lassen Volcanic National Park">Lassen Volcanic</a>
</th>, <th scope="row"><a href="/wiki/Mount_Rainier_National_Park" title="Mount Rainier National Park">Mount Rainier</a>
</th>, <th scope="row"><a href="/wiki/New_River_Gorge_National_Park_and_Preserve" title="New River Gorge National Park and Preserve">New River Gorge</a>
</th>, <th scope="row"><a href="/wiki/North_Cascades_National_Park" title="North Cascades National Park">North Cascades</a>
</th>, <th scope="row"><a href="/wiki/Petrified_Forest_National_Park" title="Petrified Forest National Park">Petrified Forest</a>
</th>, <th scope="row"><a href="/wiki/Pinnacles_National_Park" title="Pinnacles National Park">Pinnacles</a>
</th>, <th scope="row"><a href="/wiki/Saguaro_National_Park" title="Saguaro National Park">Saguaro</a>
</th>, <th scope="row"><a href="/wiki/Shenandoah_National_Park" title="Shenandoah National Park">Shenandoah</a>
</th>, <th scope="row"><a href="/wiki/Theodore_Roosevelt_National_Park" title="Theodore Roosevelt National Park">Theodore Roosevelt</a>
</th>, <th scope="row"><a href="/wiki/Virgin_Islands_National_Park" title="Virgin Islands National Park">Virgin Islands</a>
</th>, <th scope="row"><a href="/wiki/Voyageurs_National_Park" title="Voyageurs National Park">Voyageurs</a>
</th>, <th scope="row"><a href="/wiki/White_Sands_National_Park" title="White Sands National Park">White Sands</a>
</th>, <th scope="row"><a href="/wiki/Wind_Cave_National_Park" title="Wind Cave National Park">Wind Cave</a>
</th>, <th scope="row"><a href="/wiki/Zion_National_Park" title="Zion National Park">Zion</a>
</th>, <th>State</th>, <th>Total parks</th>, <th>Exclusive parks</th>, <th>Shared parks
</th>, <th class="navbox-title" colspan="3" scope="col" style="background:#bbeb85;;background:#abdb75;"><link href="mw-data:TemplateStyles:r1129693374" rel="mw-deduplicated-inline-style"/><style data-mw-deduplicate="TemplateStyles:r1063604349">.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}</style><div class="navbar plainlinks hlist navbar-mini"><ul><li class="nv-view"><a href="/wiki/Template:National_parks_of_the_United_States" title="Template:National parks of the United States"><abbr style="background:#bbeb85;;background:#abdb75;;background:none transparent;border:none;box-shadow:none;padding:0;" title="View this template">v</abbr></a></li><li class="nv-talk"><a href="/wiki/Template_talk:National_parks_of_the_United_States" title="Template talk:National parks of the United States"><abbr style="background:#bbeb85;;background:#abdb75;;background:none transparent;border:none;box-shadow:none;padding:0;" title="Discuss this template">t</abbr></a></li><li class="nv-edit"><a href="/wiki/Special:EditPage/Template:National_parks_of_the_United_States" title="Special:EditPage/Template:National parks of the United States"><abbr style="background:#bbeb85;;background:#abdb75;;background:none transparent;border:none;box-shadow:none;padding:0;" title="Edit this template">e</abbr></a></li></ul></div><div id="National_parks_of_the_United_States" style="font-size:114%;margin:0 4em"><a class="mw-selflink selflink">National parks of the United States</a></div></th>, <th class="navbox-title" colspan="2" scope="col" style="background:#bbeb85;;background:#abdb75;"><link href="mw-data:TemplateStyles:r1129693374" rel="mw-deduplicated-inline-style"/><link href="mw-data:TemplateStyles:r1063604349" rel="mw-deduplicated-inline-style"/><div class="navbar plainlinks hlist navbar-mini"><ul><li class="nv-view"><a href="/wiki/Template:US_Protected_Areas" title="Template:US Protected Areas"><abbr style="background:#bbeb85;;background:#abdb75;;background:none transparent;border:none;box-shadow:none;padding:0;" title="View this template">v</abbr></a></li><li class="nv-talk"><a href="/wiki/Template_talk:US_Protected_Areas" title="Template talk:US Protected Areas"><abbr style="background:#bbeb85;;background:#abdb75;;background:none transparent;border:none;box-shadow:none;padding:0;" title="Discuss this template">t</abbr></a></li><li class="nv-edit"><a href="/wiki/Special:EditPage/Template:US_Protected_Areas" title="Special:EditPage/Template:US Protected Areas"><abbr style="background:#bbeb85;;background:#abdb75;;background:none transparent;border:none;box-shadow:none;padding:0;" title="Edit this template">e</abbr></a></li></ul></div><div id="Federal_protected_areas_in_the_United_States" style="font-size:114%;margin:0 4em"><a href="/wiki/Protected_areas_of_the_United_States" 
title="Protected areas of the United States">Federal protected areas in the United States</a></div></th>]

We can see that the output definitely contains the national park names, but also plenty of other content we don’t want. One useful trick is to narrow your search to just the table or object that contains the information you want; in this case, note that the table we want has a caption. We can search for that caption and then use the find_parent() function to find the table that contains it. Then, within just that table, we can once again search for the table headers. Let’s try this now:

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'lxml')
# Find the caption
caption = soup.find('caption', string="List of U.S. national parks\n")
# Find its parent table
table = caption.find_parent('table')
# Find the table headers that make up that table
table_headers = table.find_all('th')
print(table_headers)
[<th scope="col">Name
</th>, <th class="unsortable" scope="col">Image
</th>, <th scope="col">Location
</th>, <th scope="col">Date established as park<sup class="reference" id="cite_ref-12"><a href="#cite_note-12">[12]</a></sup>
</th>, <th scope="col">Area (2023)<sup class="reference" id="cite_ref-acreage_report_8-1"><a href="#cite_note-acreage_report-8">[8]</a></sup>
</th>, <th scope="col">Recreation visitors (2022)<sup class="reference" id="cite_ref-:0_11-2"><a href="#cite_note-:0-11">[11]</a></sup>
</th>, <th class="unsortable" scope="col">Description
</th>, <th scope="row"><a href="/wiki/Acadia_National_Park" title="Acadia National Park">Acadia</a>
</th>, <th scope="row"><a href="/wiki/National_Park_of_American_Samoa" title="National Park of American Samoa">American Samoa</a>
</th>, <th scope="row"><a href="/wiki/Arches_National_Park" title="Arches National Park">Arches</a>
</th>, <th scope="row"><a href="/wiki/Badlands_National_Park" title="Badlands National Park">Badlands</a>
</th>, <th scope="row"><a href="/wiki/Biscayne_National_Park" title="Biscayne National Park">Biscayne</a>
</th>, <th scope="row"><a href="/wiki/Black_Canyon_of_the_Gunnison_National_Park" title="Black Canyon of the Gunnison National Park">Black Canyon of the Gunnison</a>
</th>, <th scope="row"><a href="/wiki/Bryce_Canyon_National_Park" title="Bryce Canyon National Park">Bryce Canyon</a>
</th>, <th scope="row"><a href="/wiki/Canyonlands_National_Park" title="Canyonlands National Park">Canyonlands</a>
</th>, <th scope="row"><a href="/wiki/Capitol_Reef_National_Park" title="Capitol Reef National Park">Capitol Reef</a>
</th>, <th scope="row"><a href="/wiki/Crater_Lake_National_Park" title="Crater Lake National Park">Crater Lake</a>
</th>, <th scope="row"><a href="/wiki/Cuyahoga_Valley_National_Park" title="Cuyahoga Valley National Park">Cuyahoga Valley</a>
</th>, <th scope="row"><a href="/wiki/Gates_of_the_Arctic_National_Park_and_Preserve" title="Gates of the Arctic National Park and Preserve">Gates of the Arctic</a>
</th>, <th scope="row"><a href="/wiki/Gateway_Arch_National_Park" title="Gateway Arch National Park">Gateway Arch</a>
</th>, <th scope="row"><a href="/wiki/Great_Basin_National_Park" title="Great Basin National Park">Great Basin</a>
</th>, <th scope="row"><a href="/wiki/Great_Sand_Dunes_National_Park_and_Preserve" title="Great Sand Dunes National Park and Preserve">Great Sand Dunes</a>
</th>, <th scope="row"><a href="/wiki/Guadalupe_Mountains_National_Park" title="Guadalupe Mountains National Park">Guadalupe Mountains</a>
</th>, <th scope="row"><a href="/wiki/Hot_Springs_National_Park" title="Hot Springs National Park">Hot Springs</a>
</th>, <th scope="row"><a href="/wiki/Indiana_Dunes_National_Park" title="Indiana Dunes National Park">Indiana Dunes</a>
</th>, <th scope="row"><a href="/wiki/Katmai_National_Park_and_Preserve" title="Katmai National Park and Preserve">Katmai</a>
</th>, <th scope="row"><a href="/wiki/Kenai_Fjords_National_Park" title="Kenai Fjords National Park">Kenai Fjords</a>
</th>, <th scope="row"><a href="/wiki/Kobuk_Valley_National_Park" title="Kobuk Valley National Park">Kobuk Valley</a>
</th>, <th scope="row"><a href="/wiki/Lake_Clark_National_Park_and_Preserve" title="Lake Clark National Park and Preserve">Lake Clark</a>
</th>, <th scope="row"><a href="/wiki/Lassen_Volcanic_National_Park" title="Lassen Volcanic National Park">Lassen Volcanic</a>
</th>, <th scope="row"><a href="/wiki/Mount_Rainier_National_Park" title="Mount Rainier National Park">Mount Rainier</a>
</th>, <th scope="row"><a href="/wiki/New_River_Gorge_National_Park_and_Preserve" title="New River Gorge National Park and Preserve">New River Gorge</a>
</th>, <th scope="row"><a href="/wiki/North_Cascades_National_Park" title="North Cascades National Park">North Cascades</a>
</th>, <th scope="row"><a href="/wiki/Petrified_Forest_National_Park" title="Petrified Forest National Park">Petrified Forest</a>
</th>, <th scope="row"><a href="/wiki/Pinnacles_National_Park" title="Pinnacles National Park">Pinnacles</a>
</th>, <th scope="row"><a href="/wiki/Saguaro_National_Park" title="Saguaro National Park">Saguaro</a>
</th>, <th scope="row"><a href="/wiki/Shenandoah_National_Park" title="Shenandoah National Park">Shenandoah</a>
</th>, <th scope="row"><a href="/wiki/Theodore_Roosevelt_National_Park" title="Theodore Roosevelt National Park">Theodore Roosevelt</a>
</th>, <th scope="row"><a href="/wiki/Virgin_Islands_National_Park" title="Virgin Islands National Park">Virgin Islands</a>
</th>, <th scope="row"><a href="/wiki/Voyageurs_National_Park" title="Voyageurs National Park">Voyageurs</a>
</th>, <th scope="row"><a href="/wiki/White_Sands_National_Park" title="White Sands National Park">White Sands</a>
</th>, <th scope="row"><a href="/wiki/Wind_Cave_National_Park" title="Wind Cave National Park">Wind Cave</a>
</th>, <th scope="row"><a href="/wiki/Zion_National_Park" title="Zion National Park">Zion</a>
</th>]

Okay, now we’ve narrowed things down nicely. But how do we actually pull the text from this word salad? We can do this using the get_text() function and a simple Python for loop:

# Create an empty list that we will store text in
table_header_text = []
# Loop over all of the table headers BeautifulSoup has found for us
for header in table_headers:
    # Add the text of the header to our list using get_text()
    table_header_text.append(header.get_text())

print(table_header_text)
['Name\n', 'Image\n', 'Location\n', 'Date established as park[12]\n', 'Area (2023)[8]\n', 'Recreation visitors (2022)[11]\n', 'Description\n', 'Acadia\n', 'American Samoa\n', 'Arches\n', 'Badlands\n', 'Biscayne\n', 'Black Canyon of the Gunnison\n', 'Bryce Canyon\n', 'Canyonlands\n', 'Capitol Reef\n', 'Crater Lake\n', 'Cuyahoga Valley\n', 'Gates of the Arctic\n', 'Gateway Arch\n', 'Great Basin\n', 'Great Sand Dunes\n', 'Guadalupe Mountains\n', 'Hot Springs\n', 'Indiana Dunes\n', 'Katmai\n', 'Kenai Fjords\n', 'Kobuk Valley\n', 'Lake Clark\n', 'Lassen Volcanic\n', 'Mount Rainier\n', 'New River Gorge\n', 'North Cascades\n', 'Petrified Forest\n', 'Pinnacles\n', 'Saguaro\n', 'Shenandoah\n', 'Theodore Roosevelt\n', 'Virgin Islands\n', 'Voyageurs\n', 'White Sands\n', 'Wind Cave\n', 'Zion\n']
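As a quick aside, the loop above can also be written as a one-line list comprehension that does exactly the same thing:

# Equivalent list comprehension
table_header_text = [header.get_text() for header in table_headers]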

Nearly done! Two more quick things to clean this up. First, we remove the seven column headers (Name, Image, Location, etc.) at the beginning of our list. Second, we remove the newline character from the end of each string.

(Remember, Python starts indexing at 0, and we want to exclude the first 7 items.)

# Get items in our list starting from the 8th item
national_park_names = table_header_text[7:]
# Strip whitespace from the beginning and end of each item
national_park_names = [park.strip() for park in national_park_names]
print(national_park_names)
print(f'{len(national_park_names)} entries')
['Acadia', 'American Samoa', 'Arches', 'Badlands', 'Biscayne', 'Black Canyon of the Gunnison', 'Bryce Canyon', 'Canyonlands', 'Capitol Reef', 'Crater Lake', 'Cuyahoga Valley', 'Gates of the Arctic', 'Gateway Arch', 'Great Basin', 'Great Sand Dunes', 'Guadalupe Mountains', 'Hot Springs', 'Indiana Dunes', 'Katmai', 'Kenai Fjords', 'Kobuk Valley', 'Lake Clark', 'Lassen Volcanic', 'Mount Rainier', 'New River Gorge', 'North Cascades', 'Petrified Forest', 'Pinnacles', 'Saguaro', 'Shenandoah', 'Theodore Roosevelt', 'Virgin Islands', 'Voyageurs', 'White Sands', 'Wind Cave', 'Zion']
36 entries

Woohoo! Now let’s move on to the state/territory names, which we recall live inside <td> tags. Conveniently, we don’t have to start from the soup object that contains all of the webpage’s HTML; we can instead start from the table object we created above, which contains just the table we want. Let’s try this now:

table_cells = table.find_all('td')
# print(table_cells)

The print statement is commented out because the output returned is so long, but try it yourself to see! This is not uncommon for very HTML-rich pages.
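If you’d rather not flood your screen, you can check how many cells came back and preview just a handful. For example:

# Count the <td> cells and preview only the first few
print(len(table_cells))
print(table_cells[:3])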

We can see that the returned HTML definitely contains the state/territory names, but also has a lot of other extraneous text. Again, this is a useful real-life example, because in practice there are often lots of other elements that share a tag type with the data you want. How can we get more specific?

Yet again, we can use the find_parent trick. Note that beneath each state/territory name is a set of coordinates wrapped in a <small> tag, which the other text in <td> tags doesn’t have. So we can:

  1. Search for the small tags

  2. Find all td tags that are parents of these small tags

  3. Extract just the location from each td tag

#1. Get small tags
small_tags = table.find_all('small')
td_tags = []
#2. Loop over small tags and get their parent td tags
for small_tag in small_tags:
    td_tags.append(small_tag.find_parent('td'))

#3. Extract just the title text from the td tags
states_and_territories = []
for tag in td_tags:
    # The .a['title'] is needed to get just the text under the 'title' attribute;
    # otherwise this would include the coordinates too!
    states_and_territories.append(tag.a['title'])

print(states_and_territories)
print(f'{len(states_and_territories)} entries')
['Maine', 'American Samoa', 'Utah', 'South Dakota', 'Texas', 'Florida', 'Colorado', 'Utah', 'Utah', 'Utah', 'New Mexico', 'California', 'South Carolina', 'Oregon', 'Ohio', 'California', 'Alaska', 'Florida', 'Florida', 'Alaska', 'Missouri', 'Montana', 'Alaska', 'Arizona', 'Wyoming', 'Nevada', 'Colorado', 'North Carolina', 'Texas', 'Hawaii', 'Hawaii', 'Arkansas', 'Indiana', 'Michigan', 'California', 'Alaska', 'Alaska', 'California', 'Alaska', 'Alaska', 'California', 'Kentucky', 'Colorado', 'Washington (state)', 'West Virginia', 'Washington (state)', 'Washington (state)', 'Arizona', 'California', 'California', 'Colorado', 'Arizona', 'California', 'Virginia', 'North Dakota', 'United States Virgin Islands', 'Minnesota', 'New Mexico', 'South Dakota', 'Alaska', 'Wyoming', 'California', 'Utah']
63 entries

And we’re 100% fini…OH NO! We have a problem! Why do we have 63 locations but only 36 parks? Well, it looks like some of the national parks are missing, specifically those that have symbols next to them on Wikipedia. Upon inspection, these are actually in <td> tags, not <th> tags. However, it appears they all contain scope="row". This lets us use one more nifty feature of Beautiful Soup - the ability to filter on custom attributes that fit the quirks of our use case. Here’s the syntax for how we do it:

national_parks_extra_tags = table.find_all('td', attrs={'scope': 'row'})
national_parks_extra = []
for park in national_parks_extra_tags:
    national_parks_extra.append(park.get_text().strip())
print(national_parks_extra)
['Big Bend †', 'Carlsbad Caverns *', 'Channel Islands †', 'Congaree †', 'Death Valley †', 'Denali †', 'Dry Tortugas †', 'Everglades ‡', 'Glacier ‡', 'Glacier Bay ‡', 'Grand Canyon *', 'Grand Teton †', 'Great Smoky Mountains ‡', 'Haleakalā †', 'Hawaiʻi Volcanoes ‡', 'Isle Royale †', 'Joshua Tree †', 'Kings Canyon †', 'Mammoth Cave ‡', 'Mesa Verde *', 'Olympic ‡', 'Redwood *', 'Rocky Mountain †', 'Sequoia †', 'Wrangell–St.\xa0Elias *', 'Yellowstone ‡', 'Yosemite *']
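You may have noticed the daggers and asterisks that Wikipedia uses as footnote markers tagging along with these names. We’ll leave them in so our output matches the page, but if you ever wanted to strip them, a quick regular-expression sketch (not applied in the rest of this example) could look like:

import re
# Drop a trailing footnote symbol (†, ‡, or *) plus any surrounding whitespace
cleaned = [re.sub(r'\s*[†‡*]\s*$', '', park) for park in national_parks_extra]
print(cleaned[:3])  # ['Big Bend', 'Carlsbad Caverns', 'Channel Islands']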

Now, let’s combine the two lists of national parks and re-sort them in alphabetical order. With that, both of our lists have 63 entries.

national_parks_combined = sorted(national_park_names + national_parks_extra)
print(national_parks_combined)
print(f'{len(national_parks_combined)} entries')
['Acadia', 'American Samoa', 'Arches', 'Badlands', 'Big Bend †', 'Biscayne', 'Black Canyon of the Gunnison', 'Bryce Canyon', 'Canyonlands', 'Capitol Reef', 'Carlsbad Caverns *', 'Channel Islands †', 'Congaree †', 'Crater Lake', 'Cuyahoga Valley', 'Death Valley †', 'Denali †', 'Dry Tortugas †', 'Everglades ‡', 'Gates of the Arctic', 'Gateway Arch', 'Glacier Bay ‡', 'Glacier ‡', 'Grand Canyon *', 'Grand Teton †', 'Great Basin', 'Great Sand Dunes', 'Great Smoky Mountains ‡', 'Guadalupe Mountains', 'Haleakalā †', 'Hawaiʻi Volcanoes ‡', 'Hot Springs', 'Indiana Dunes', 'Isle Royale †', 'Joshua Tree †', 'Katmai', 'Kenai Fjords', 'Kings Canyon †', 'Kobuk Valley', 'Lake Clark', 'Lassen Volcanic', 'Mammoth Cave ‡', 'Mesa Verde *', 'Mount Rainier', 'New River Gorge', 'North Cascades', 'Olympic ‡', 'Petrified Forest', 'Pinnacles', 'Redwood *', 'Rocky Mountain †', 'Saguaro', 'Sequoia †', 'Shenandoah', 'Theodore Roosevelt', 'Virgin Islands', 'Voyageurs', 'White Sands', 'Wind Cave', 'Wrangell–St.\xa0Elias *', 'Yellowstone ‡', 'Yosemite *', 'Zion']
63 entries
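One thing worth noting: pairing these park names with the state/territory list only works because the Wikipedia table is itself sorted alphabetically, so re-sorting the combined names keeps them aligned with the states, which are still in row order. A quick sanity check before we build a dataframe (using the variables defined above):

# Both lists should be the same length before we pair them up
assert len(national_parks_combined) == len(states_and_territories)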

Manipulating the text into a useful format#

As a final step, we will convert these two lists we’ve scraped into a pandas dataframe (for R users, analogous to a tibble). This example will be light on the pandas-specific code, but this is often an important part of web scraping. BeautifulSoup is a powerful tool, but it sometimes outputs data in a format that is not immediately useful, even with some of the tricks we applied above. Pandas can help us clean and manipulate this data into a more useful format.

Note: Pandas even has a built-in function to read HTML tables directly from a webpage, which can be a nice starting point for certain examples (like this one, believe it or not). BeautifulSoup is your workhorse for static webpage scraping, but this is worth knowing about if you’re a pandas user. You’d be surprised how far the few lines below will get you:

import pandas as pd
# read_html returns a list of dataframes, one per table found on the page
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States')
tables[0][['Name', 'Location']]

Web scraping with pandas is outside the scope of our lesson here, but worth exploring. Back to our example:

# Convert this into a pandas dataframe
import pandas as pd
df = pd.DataFrame({
    'National Park': national_parks_combined,
    'State': states_and_territories
})
print(df)
           National Park           State
0                 Acadia           Maine
1         American Samoa  American Samoa
2                 Arches            Utah
3               Badlands    South Dakota
4             Big Bend †           Texas
..                   ...             ...
58             Wind Cave    South Dakota
59  Wrangell–St. Elias *          Alaska
60         Yellowstone ‡         Wyoming
61            Yosemite *      California
62                  Zion            Utah

[63 rows x 2 columns]
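From here the usual pandas toolkit applies. For example, if you wanted to save the result to disk (the filename is just a placeholder):

# Write the scraped table out as a CSV file
df.to_csv('national_parks.csv', index=False)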

BONUS:#

Some of our more adventurous participants may have caught early on that national parks can be in multiple states and territories.

For instance, the Great Smoky Mountains are in both North Carolina and Tennessee. Let’s tweak the above code to handle that, starting at step 3 (extracting the locations from the td tags).

df.iloc[27]
National Park    Great Smoky Mountains ‡
State                     North Carolina
Name: 27, dtype: object
states_and_territories = {}
# Enumerate allows us to loop over a list and get the index (in this case, "park_number") of the item as well
for park_number, tag in enumerate(td_tags):
    # Before, we were just getting the first 'a' tag, now let's get all of them for a given table cell
    a_tags = tag.find_all('a')
    # Get the text of each location link as a list, dropping the last one (the coordinates link)
    locations = [a.text for a in a_tags][:-1]
    # Create a dictionary where the key is the national park name and the value is the list of locations
    dict_key = national_parks_combined[park_number]
    states_and_territories[dict_key] = locations

print(states_and_territories)
print(pd.DataFrame.from_dict(states_and_territories, 
                             orient='index'))
{'Acadia': ['Maine'], 'American Samoa': ['American Samoa'], 'Arches': ['Utah'], 'Badlands': ['South Dakota'], 'Big Bend †': ['Texas'], 'Biscayne': ['Florida'], 'Black Canyon of the Gunnison': ['Colorado'], 'Bryce Canyon': ['Utah'], 'Canyonlands': ['Utah'], 'Capitol Reef': ['Utah'], 'Carlsbad Caverns *': ['New Mexico'], 'Channel Islands †': ['California'], 'Congaree †': ['South Carolina'], 'Crater Lake': ['Oregon'], 'Cuyahoga Valley': ['Ohio'], 'Death Valley †': ['California', 'Nevada'], 'Denali †': ['Alaska'], 'Dry Tortugas †': ['Florida'], 'Everglades ‡': ['Florida'], 'Gates of the Arctic': ['Alaska'], 'Gateway Arch': ['Missouri'], 'Glacier Bay ‡': ['Montana'], 'Glacier ‡': ['Alaska'], 'Grand Canyon *': ['Arizona'], 'Grand Teton †': ['Wyoming'], 'Great Basin': ['Nevada'], 'Great Sand Dunes': ['Colorado'], 'Great Smoky Mountains ‡': ['North Carolina', 'Tennessee'], 'Guadalupe Mountains': ['Texas'], 'Haleakalā †': ['Hawaii'], 'Hawaiʻi Volcanoes ‡': ['Hawaii'], 'Hot Springs': ['Arkansas'], 'Indiana Dunes': ['Indiana'], 'Isle Royale †': ['Michigan'], 'Joshua Tree †': ['California'], 'Katmai': ['Alaska'], 'Kenai Fjords': ['Alaska'], 'Kings Canyon †': ['California'], 'Kobuk Valley': ['Alaska'], 'Lake Clark': ['Alaska'], 'Lassen Volcanic': ['California'], 'Mammoth Cave ‡': ['Kentucky'], 'Mesa Verde *': ['Colorado'], 'Mount Rainier': ['Washington'], 'New River Gorge': ['West Virginia'], 'North Cascades': ['Washington'], 'Olympic ‡': ['Washington'], 'Petrified Forest': ['Arizona'], 'Pinnacles': ['California'], 'Redwood *': ['California'], 'Rocky Mountain †': ['Colorado'], 'Saguaro': ['Arizona'], 'Sequoia †': ['California'], 'Shenandoah': ['Virginia'], 'Theodore Roosevelt': ['North Dakota'], 'Virgin Islands': ['U.S. Virgin Islands'], 'Voyageurs': ['Minnesota'], 'White Sands': ['New Mexico'], 'Wind Cave': ['South Dakota'], 'Wrangell–St.\xa0Elias *': ['Alaska'], 'Yellowstone ‡': ['Wyoming', 'Montana', 'Idaho'], 'Yosemite *': ['California'], 'Zion': ['Utah']}
                                   0        1      2
Acadia                         Maine     None   None
American Samoa        American Samoa     None   None
Arches                          Utah     None   None
Badlands                South Dakota     None   None
Big Bend †                     Texas     None   None
...                              ...      ...    ...
Wind Cave               South Dakota     None   None
Wrangell–St. Elias *          Alaska     None   None
Yellowstone ‡                Wyoming  Montana  Idaho
Yosemite *                California     None   None
Zion                            Utah     None   None

[63 rows x 3 columns]
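If you’d rather end up with one row per park/state pair instead of a dictionary of lists, one option (a sketch building on the states_and_territories dictionary above) is pandas’ explode():

# Turn the {park: [states]} dictionary into a long dataframe,
# with one row per park/state combination
long_df = (
    pd.Series(states_and_territories, name='State')
      .explode()                      # one row per state in each list
      .rename_axis('National Park')   # name the index
      .reset_index()                  # turn the index into a column
)
print(long_df.head())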

Looking Ahead#

Next week we’ll introduce Selenium for scraping dynamic content. We’ll be scraping this website, so a quick perusal to familiarize yourself could be helpful: https://www.kff.org/interactive/subsidy-calculator/