Web Scraping with Static Pages#
Today we will be learning how to scrape text from a static webpage. By static, we mean that the webpage does not change its content based on user input (e.g. clicks, textboxes, etc.). We will cover the following concepts today:
- Inspecting a webpage
- What are HTML tags and why are they important?
- How to use the requests library to get the HTML content of a webpage
- How to use the BeautifulSoup library to parse the HTML content and extract just the parts we want
- Getting the final output into a workable format using Pandas
Example#
Our example web page will be the Wikipedia page listing National Parks in the United States: https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States. We’ll use this example to showcase a few different approaches to scraping text from a static webpage. Let’s say we wanted to generate a list of National Parks and their state/territory, which would look something like this:
| Park | State/Territory |
|---|---|
| Acadia | Maine |
| American Samoa | American Samoa |
| Arches | Utah |
| … | … |
Inspecting a Web Page to learn more about it#
By right clicking on a web page and selecting “Inspect” or “Inspect Element” you can see the HTML and CSS that make up the page. You can also right click on the specific text or data element you want to extract and select “Inspect” to jump straight to the markup for that element. Let’s start by right clicking on “Acadia” and clicking “Inspect”. We will see several HTML tags (indicated with <> symbols), including the one corresponding to where the name appears:
<th scope="row">
<a href="/wiki/Acadia_National_Park" title="Acadia National Park">Acadia</a>
</th>
Similarly, for its state, we have:
<td>
<a href="/wiki/Maine" title="Maine">Maine</a>
<br>
[...]
So what does any of this mean? In HTML, a th tag marks a table header cell, a td tag marks a regular table data cell, and an a tag is a hyperlink. In other words, each park’s name sits in the header cell of its table row (as a link to the park’s own page), while its state sits in an ordinary cell of that same row. These tags are the handles we’ll use to pull out exactly the text we want.
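To preview the kind of parsing we’ll do below, here is a minimal sketch that feeds the Acadia snippet above into BeautifulSoup (the library we introduce properly in a moment, here with Python’s built-in html.parser) and pulls out the pieces we care about:

from bs4 import BeautifulSoup

# The <th> snippet we saw when inspecting "Acadia", copied from above
snippet = '<th scope="row"><a href="/wiki/Acadia_National_Park" title="Acadia National Park">Acadia</a></th>'

cell = BeautifulSoup(snippet, 'html.parser').th
print(cell.get_text())   # the visible text: Acadia
print(cell.a['href'])    # the link target: /wiki/Acadia_National_Park
print(cell.a['title'])   # the link's title attribute: Acadia National Park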
Making a request to a webpage#
To get the HTML content from this page, we can use the requests library in Python. The requests.get() function returns a Response object, which contains the HTML content of the webpage.
When we print the response object, the number shown is the HTTP status code, which tells us whether the request was successful. There are many possible status codes, but generally any response in the 200s means the request was successful.
# Import the requests package
import requests
# Set headers to let website know who we are
headers = {'user-agent': 'Urban Institute Research Data Collector (jaxelrod@urban.org)'}
# Save our URL name
url = 'https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States'
# Send a GET request to the URL and save the response
response = requests.get(url, headers=headers)
# A response of 200 means that the page was downloaded successfully; other codes (e.g. 404) indicate a problem
print(response)
<Response [200]>
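If you want your code to check the result itself rather than eyeballing a printout, here is a minimal sketch using two standard features of the requests library:

# The numeric status code is available directly on the response
print(response.status_code)

# Or have requests raise an exception automatically for any 4xx/5xx response
response.raise_for_status()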
Extracting the text#
Now that we have the page’s HTML stored in a response object, we can use the BeautifulSoup library to parse through it and extract only the text we want. Since we know that the park names are enclosed in <th> tags, we can use the find_all() function to extract all instances of these tags. Let’s start here to see what happens:
from bs4 import BeautifulSoup

# Parse the raw HTML with the lxml parser
soup = BeautifulSoup(response.content, 'lxml')
# Find every <th> (table header) tag on the page
table_headers = soup.find_all('th')
print(table_headers)
[<th scope="col">Name
</th>, <th class="unsortable" scope="col">Image
</th>, <th scope="col">Location
</th>, <th scope="col">Date established as park<sup class="reference" id="cite_ref-12"><a href="#cite_note-12"><span class="cite-bracket">[</span>12<span class="cite-bracket">]</span></a></sup>
</th>, <th scope="col">Area (2023)<sup class="reference" id="cite_ref-acreage_report_8-1"><a href="#cite_note-acreage_report-8"><span class="cite-bracket">[</span>8<span class="cite-bracket">]</span></a></sup>
</th>, <th scope="col">Recreation visitors (2022)<sup class="reference" id="cite_ref-:0_11-2"><a href="#cite_note-:0-11"><span class="cite-bracket">[</span>11<span class="cite-bracket">]</span></a></sup>
</th>, <th class="unsortable" scope="col">Description
</th>, <th scope="row"><a href="/wiki/Acadia_National_Park" title="Acadia National Park">Acadia</a>
</th>, <th scope="row"><a href="/wiki/National_Park_of_American_Samoa" title="National Park of American Samoa">American Samoa</a>
</th>, <th scope="row"><a href="/wiki/Arches_National_Park" title="Arches National Park">Arches</a>
</th>, <th scope="row"><a href="/wiki/Badlands_National_Park" title="Badlands National Park">Badlands</a>
</th>, <th scope="row"><a href="/wiki/Biscayne_National_Park" title="Biscayne National Park">Biscayne</a>
</th>, <th scope="row"><a href="/wiki/Black_Canyon_of_the_Gunnison_National_Park" title="Black Canyon of the Gunnison National Park">Black Canyon of the Gunnison</a>
</th>, <th scope="row"><a href="/wiki/Bryce_Canyon_National_Park" title="Bryce Canyon National Park">Bryce Canyon</a>
</th>, <th scope="row"><a href="/wiki/Canyonlands_National_Park" title="Canyonlands National Park">Canyonlands</a>
</th>, <th scope="row"><a href="/wiki/Capitol_Reef_National_Park" title="Capitol Reef National Park">Capitol Reef</a>
</th>, <th scope="row"><a href="/wiki/Crater_Lake_National_Park" title="Crater Lake National Park">Crater Lake</a>
</th>, <th scope="row"><a href="/wiki/Cuyahoga_Valley_National_Park" title="Cuyahoga Valley National Park">Cuyahoga Valley</a>
</th>, <th scope="row"><a href="/wiki/Gates_of_the_Arctic_National_Park_and_Preserve" title="Gates of the Arctic National Park and Preserve">Gates of the Arctic</a>
</th>, <th scope="row"><a href="/wiki/Gateway_Arch_National_Park" title="Gateway Arch National Park">Gateway Arch</a>
</th>, <th scope="row"><a href="/wiki/Great_Basin_National_Park" title="Great Basin National Park">Great Basin</a>
</th>, <th scope="row"><a href="/wiki/Great_Sand_Dunes_National_Park_and_Preserve" title="Great Sand Dunes National Park and Preserve">Great Sand Dunes</a>
</th>, <th scope="row"><a href="/wiki/Guadalupe_Mountains_National_Park" title="Guadalupe Mountains National Park">Guadalupe Mountains</a>
</th>, <th scope="row"><a href="/wiki/Hot_Springs_National_Park" title="Hot Springs National Park">Hot Springs</a>
</th>, <th scope="row"><a href="/wiki/Indiana_Dunes_National_Park" title="Indiana Dunes National Park">Indiana Dunes</a>
</th>, <th scope="row"><a href="/wiki/Katmai_National_Park_and_Preserve" title="Katmai National Park and Preserve">Katmai</a>
</th>, <th scope="row"><a href="/wiki/Kenai_Fjords_National_Park" title="Kenai Fjords National Park">Kenai Fjords</a>
</th>, <th scope="row"><a href="/wiki/Kobuk_Valley_National_Park" title="Kobuk Valley National Park">Kobuk Valley</a>
</th>, <th scope="row"><a href="/wiki/Lake_Clark_National_Park_and_Preserve" title="Lake Clark National Park and Preserve">Lake Clark</a>
</th>, <th scope="row"><a href="/wiki/Lassen_Volcanic_National_Park" title="Lassen Volcanic National Park">Lassen Volcanic</a>
</th>, <th scope="row"><a href="/wiki/Mount_Rainier_National_Park" title="Mount Rainier National Park">Mount Rainier</a>
</th>, <th scope="row"><a href="/wiki/New_River_Gorge_National_Park_and_Preserve" title="New River Gorge National Park and Preserve">New River Gorge</a>
</th>, <th scope="row"><a href="/wiki/North_Cascades_National_Park" title="North Cascades National Park">North Cascades</a>
</th>, <th scope="row"><a href="/wiki/Petrified_Forest_National_Park" title="Petrified Forest National Park">Petrified Forest</a>
</th>, <th scope="row"><a href="/wiki/Pinnacles_National_Park" title="Pinnacles National Park">Pinnacles</a>
</th>, <th scope="row"><a href="/wiki/Saguaro_National_Park" title="Saguaro National Park">Saguaro</a>
</th>, <th scope="row"><a href="/wiki/Shenandoah_National_Park" title="Shenandoah National Park">Shenandoah</a>
</th>, <th scope="row"><a href="/wiki/Theodore_Roosevelt_National_Park" title="Theodore Roosevelt National Park">Theodore Roosevelt</a>
</th>, <th scope="row"><a href="/wiki/Virgin_Islands_National_Park" title="Virgin Islands National Park">Virgin Islands</a>
</th>, <th scope="row"><a href="/wiki/Voyageurs_National_Park" title="Voyageurs National Park">Voyageurs</a>
</th>, <th scope="row"><a href="/wiki/White_Sands_National_Park" title="White Sands National Park">White Sands</a>
</th>, <th scope="row"><a href="/wiki/Wind_Cave_National_Park" title="Wind Cave National Park">Wind Cave</a>
</th>, <th scope="row"><a href="/wiki/Zion_National_Park" title="Zion National Park">Zion</a>
</th>, <th>State</th>, <th>Total parks</th>, <th>Exclusive parks</th>, <th>Shared parks
</th>, <th class="navbox-title" colspan="3" scope="col" style="background:#bbeb85;;background:#abdb75;"><link href="mw-data:TemplateStyles:r1129693374" rel="mw-deduplicated-inline-style"/><style data-mw-deduplicate="TemplateStyles:r1239400231">.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}html.skin-theme-clientpref-night .mw-parser-output .navbar li a abbr{color:var(--color-base)!important}@media(prefers-color-scheme:dark){html.skin-theme-clientpref-os .mw-parser-output .navbar li a abbr{color:var(--color-base)!important}}@media print{.mw-parser-output .navbar{display:none!important}}</style><div class="navbar plainlinks hlist navbar-mini"><ul><li class="nv-view"><a href="/wiki/Template:National_parks_of_the_United_States" title="Template:National parks of the United States"><abbr title="View this template">v</abbr></a></li><li class="nv-talk"><a href="/wiki/Template_talk:National_parks_of_the_United_States" title="Template talk:National parks of the United States"><abbr title="Discuss this template">t</abbr></a></li><li class="nv-edit"><a href="/wiki/Special:EditPage/Template:National_parks_of_the_United_States" title="Special:EditPage/Template:National parks of the United States"><abbr title="Edit this template">e</abbr></a></li></ul></div><div id="National_parks_of_the_United_States" style="font-size:114%;margin:0 4em"><a class="mw-selflink selflink">National parks of the United States</a></div></th>, <th class="navbox-title" colspan="2" scope="col" style="background:#bbeb85;;background:#abdb75;"><link href="mw-data:TemplateStyles:r1129693374" rel="mw-deduplicated-inline-style"/><link href="mw-data:TemplateStyles:r1239400231" rel="mw-deduplicated-inline-style"/><div class="navbar plainlinks hlist navbar-mini"><ul><li class="nv-view"><a href="/wiki/Template:US_Protected_Areas" title="Template:US Protected Areas"><abbr title="View this template">v</abbr></a></li><li class="nv-talk"><a href="/wiki/Template_talk:US_Protected_Areas" title="Template talk:US Protected Areas"><abbr title="Discuss this template">t</abbr></a></li><li class="nv-edit"><a href="/wiki/Special:EditPage/Template:US_Protected_Areas" title="Special:EditPage/Template:US Protected Areas"><abbr title="Edit this template">e</abbr></a></li></ul></div><div id="Federal_protected_areas_in_the_United_States" style="font-size:114%;margin:0 4em"><a href="/wiki/Protected_areas_of_the_United_States" title="Protected areas of the United States">Federal protected areas in the United States</a></div></th>]
We can see that the output definitely contains the national park names, but also some other stuff we don’t want. One useful trick is to narrow your search to just the table or object that contains the info you want; in this case, note that the table we want has a caption. We can search for that caption and then use the find_parent function to find the table that contains it. Then, within just that table, we can once again search for the table headers. Let’s try this now:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'lxml')
# Find the caption of the table we want ('string' matches the caption's exact text, including the trailing newline)
caption = soup.find('caption', string="List of U.S. national parks\n")
# Find its parent table
table = caption.find_parent('table')
# Find the table headers that make up that table
table_headers = table.find_all('th')
print(table_headers)
[<th scope="col">Name
</th>, <th class="unsortable" scope="col">Image
</th>, <th scope="col">Location
</th>, <th scope="col">Date established as park<sup class="reference" id="cite_ref-12"><a href="#cite_note-12"><span class="cite-bracket">[</span>12<span class="cite-bracket">]</span></a></sup>
</th>, <th scope="col">Area (2023)<sup class="reference" id="cite_ref-acreage_report_8-1"><a href="#cite_note-acreage_report-8"><span class="cite-bracket">[</span>8<span class="cite-bracket">]</span></a></sup>
</th>, <th scope="col">Recreation visitors (2022)<sup class="reference" id="cite_ref-:0_11-2"><a href="#cite_note-:0-11"><span class="cite-bracket">[</span>11<span class="cite-bracket">]</span></a></sup>
</th>, <th class="unsortable" scope="col">Description
</th>, <th scope="row"><a href="/wiki/Acadia_National_Park" title="Acadia National Park">Acadia</a>
</th>, <th scope="row"><a href="/wiki/National_Park_of_American_Samoa" title="National Park of American Samoa">American Samoa</a>
</th>, <th scope="row"><a href="/wiki/Arches_National_Park" title="Arches National Park">Arches</a>
</th>, <th scope="row"><a href="/wiki/Badlands_National_Park" title="Badlands National Park">Badlands</a>
</th>, <th scope="row"><a href="/wiki/Biscayne_National_Park" title="Biscayne National Park">Biscayne</a>
</th>, <th scope="row"><a href="/wiki/Black_Canyon_of_the_Gunnison_National_Park" title="Black Canyon of the Gunnison National Park">Black Canyon of the Gunnison</a>
</th>, <th scope="row"><a href="/wiki/Bryce_Canyon_National_Park" title="Bryce Canyon National Park">Bryce Canyon</a>
</th>, <th scope="row"><a href="/wiki/Canyonlands_National_Park" title="Canyonlands National Park">Canyonlands</a>
</th>, <th scope="row"><a href="/wiki/Capitol_Reef_National_Park" title="Capitol Reef National Park">Capitol Reef</a>
</th>, <th scope="row"><a href="/wiki/Crater_Lake_National_Park" title="Crater Lake National Park">Crater Lake</a>
</th>, <th scope="row"><a href="/wiki/Cuyahoga_Valley_National_Park" title="Cuyahoga Valley National Park">Cuyahoga Valley</a>
</th>, <th scope="row"><a href="/wiki/Gates_of_the_Arctic_National_Park_and_Preserve" title="Gates of the Arctic National Park and Preserve">Gates of the Arctic</a>
</th>, <th scope="row"><a href="/wiki/Gateway_Arch_National_Park" title="Gateway Arch National Park">Gateway Arch</a>
</th>, <th scope="row"><a href="/wiki/Great_Basin_National_Park" title="Great Basin National Park">Great Basin</a>
</th>, <th scope="row"><a href="/wiki/Great_Sand_Dunes_National_Park_and_Preserve" title="Great Sand Dunes National Park and Preserve">Great Sand Dunes</a>
</th>, <th scope="row"><a href="/wiki/Guadalupe_Mountains_National_Park" title="Guadalupe Mountains National Park">Guadalupe Mountains</a>
</th>, <th scope="row"><a href="/wiki/Hot_Springs_National_Park" title="Hot Springs National Park">Hot Springs</a>
</th>, <th scope="row"><a href="/wiki/Indiana_Dunes_National_Park" title="Indiana Dunes National Park">Indiana Dunes</a>
</th>, <th scope="row"><a href="/wiki/Katmai_National_Park_and_Preserve" title="Katmai National Park and Preserve">Katmai</a>
</th>, <th scope="row"><a href="/wiki/Kenai_Fjords_National_Park" title="Kenai Fjords National Park">Kenai Fjords</a>
</th>, <th scope="row"><a href="/wiki/Kobuk_Valley_National_Park" title="Kobuk Valley National Park">Kobuk Valley</a>
</th>, <th scope="row"><a href="/wiki/Lake_Clark_National_Park_and_Preserve" title="Lake Clark National Park and Preserve">Lake Clark</a>
</th>, <th scope="row"><a href="/wiki/Lassen_Volcanic_National_Park" title="Lassen Volcanic National Park">Lassen Volcanic</a>
</th>, <th scope="row"><a href="/wiki/Mount_Rainier_National_Park" title="Mount Rainier National Park">Mount Rainier</a>
</th>, <th scope="row"><a href="/wiki/New_River_Gorge_National_Park_and_Preserve" title="New River Gorge National Park and Preserve">New River Gorge</a>
</th>, <th scope="row"><a href="/wiki/North_Cascades_National_Park" title="North Cascades National Park">North Cascades</a>
</th>, <th scope="row"><a href="/wiki/Petrified_Forest_National_Park" title="Petrified Forest National Park">Petrified Forest</a>
</th>, <th scope="row"><a href="/wiki/Pinnacles_National_Park" title="Pinnacles National Park">Pinnacles</a>
</th>, <th scope="row"><a href="/wiki/Saguaro_National_Park" title="Saguaro National Park">Saguaro</a>
</th>, <th scope="row"><a href="/wiki/Shenandoah_National_Park" title="Shenandoah National Park">Shenandoah</a>
</th>, <th scope="row"><a href="/wiki/Theodore_Roosevelt_National_Park" title="Theodore Roosevelt National Park">Theodore Roosevelt</a>
</th>, <th scope="row"><a href="/wiki/Virgin_Islands_National_Park" title="Virgin Islands National Park">Virgin Islands</a>
</th>, <th scope="row"><a href="/wiki/Voyageurs_National_Park" title="Voyageurs National Park">Voyageurs</a>
</th>, <th scope="row"><a href="/wiki/White_Sands_National_Park" title="White Sands National Park">White Sands</a>
</th>, <th scope="row"><a href="/wiki/Wind_Cave_National_Park" title="Wind Cave National Park">Wind Cave</a>
</th>, <th scope="row"><a href="/wiki/Zion_National_Park" title="Zion National Park">Zion</a>
</th>]
Okay, now we’ve narrowed things down nicely. But how do we actually pull the text from this word salad? We can do this using the get_text() function and a simple Python for loop:
# Create an empty list that we will store text in
table_header_text = []

# Loop over all of the table headers BeautifulSoup has found for us
for header in table_headers:
    # Add the text of the header to our list using get_text()
    table_header_text.append(header.get_text())
print(table_header_text)
['Name\n', 'Image\n', 'Location\n', 'Date established as park[12]\n', 'Area (2023)[8]\n', 'Recreation visitors (2022)[11]\n', 'Description\n', 'Acadia\n', 'American Samoa\n', 'Arches\n', 'Badlands\n', 'Biscayne\n', 'Black Canyon of the Gunnison\n', 'Bryce Canyon\n', 'Canyonlands\n', 'Capitol Reef\n', 'Crater Lake\n', 'Cuyahoga Valley\n', 'Gates of the Arctic\n', 'Gateway Arch\n', 'Great Basin\n', 'Great Sand Dunes\n', 'Guadalupe Mountains\n', 'Hot Springs\n', 'Indiana Dunes\n', 'Katmai\n', 'Kenai Fjords\n', 'Kobuk Valley\n', 'Lake Clark\n', 'Lassen Volcanic\n', 'Mount Rainier\n', 'New River Gorge\n', 'North Cascades\n', 'Petrified Forest\n', 'Pinnacles\n', 'Saguaro\n', 'Shenandoah\n', 'Theodore Roosevelt\n', 'Virgin Islands\n', 'Voyageurs\n', 'White Sands\n', 'Wind Cave\n', 'Zion\n']
Nearly done! Two more quick things to clean this up. First, we remove all of the extraneous matches at the beginning of our list. Second, we remove the newline character from the end of each string.
(Remember, Python starts indexing at 0, and we want to exclude the first 7 items.)
# Get items in our list starting from the 8th item
national_park_names = table_header_text[7:]
# Strip whitespace from the beginning and end of each item
national_park_names = [park.strip() for park in national_park_names]
print(national_park_names)
print(f'{len(national_park_names)} entries')
['Acadia', 'American Samoa', 'Arches', 'Badlands', 'Biscayne', 'Black Canyon of the Gunnison', 'Bryce Canyon', 'Canyonlands', 'Capitol Reef', 'Crater Lake', 'Cuyahoga Valley', 'Gates of the Arctic', 'Gateway Arch', 'Great Basin', 'Great Sand Dunes', 'Guadalupe Mountains', 'Hot Springs', 'Indiana Dunes', 'Katmai', 'Kenai Fjords', 'Kobuk Valley', 'Lake Clark', 'Lassen Volcanic', 'Mount Rainier', 'New River Gorge', 'North Cascades', 'Petrified Forest', 'Pinnacles', 'Saguaro', 'Shenandoah', 'Theodore Roosevelt', 'Virgin Islands', 'Voyageurs', 'White Sands', 'Wind Cave', 'Zion']
36 entries
Woohoo! Now let’s move on to the state/territory names, which we recall live inside <td> tags. Conveniently, we don’t have to start from the soup object that contains all the webpage’s HTML; we can instead start from the table object we created above, which contains just the table we want. Let’s try this now:
table_cells = table.find_all('td')
# print(table_cells)
The print statement is commented out because the output returned is so long, but try it yourself to see! This is not uncommon for very HTML-rich pages.
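If you’d like a peek without flooding your screen, one option (just a sketch) is to print the number of matches and a short preview of the first one:

# How many <td> cells did we match in this table?
print(len(table_cells))

# Peek at just the first few hundred characters of the first match
print(table_cells[0].prettify()[:300])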
We can see that the returned HTML definitely contains the state/territory names, but also has a lot of other extraneous text. Again, this is a useful real-life example: in practice, lots of other elements on a page often share the same tag type as the data you want. How can we get more specific?
Yet again, we can use the find_parent trick. Note that beneath each state/territory name is a set of coordinates wrapped in a <small> tag, unlike the other text in <td> tags. So we can:

1. Search for the <small> tags
2. Find all <td> tags that are parents of these <small> tags
3. Extract just the location from each <td> tag
# 1. Get the <small> tags
small_tags = table.find_all('small')

# 2. Loop over the <small> tags and get their parent <td> tags
td_tags = []
for small_tag in small_tags:
    td_tags.append(small_tag.find_parent('td'))

# 3. Extract just the title text from the <td> tags
states_and_territories = []
for tag in td_tags:
    # The .a['title'] gets just the text under the link's 'title' attribute; otherwise this would include the coordinates too!
    states_and_territories.append(tag.a['title'])

print(states_and_territories)
print(f'{len(states_and_territories)} entries')
['Maine', 'American Samoa', 'Utah', 'South Dakota', 'Texas', 'Florida', 'Colorado', 'Utah', 'Utah', 'Utah', 'New Mexico', 'California', 'South Carolina', 'Oregon', 'Ohio', 'California', 'Alaska', 'Florida', 'Florida', 'Alaska', 'Missouri', 'Montana', 'Alaska', 'Arizona', 'Wyoming', 'Nevada', 'Colorado', 'North Carolina', 'Texas', 'Hawaii', 'Hawaii', 'Arkansas', 'Indiana', 'Michigan', 'California', 'Alaska', 'Alaska', 'California', 'Alaska', 'Alaska', 'California', 'Kentucky', 'Colorado', 'Washington (state)', 'West Virginia', 'Washington (state)', 'Washington (state)', 'Arizona', 'California', 'California', 'Colorado', 'Arizona', 'California', 'Virginia', 'North Dakota', 'United States Virgin Islands', 'Minnesota', 'New Mexico', 'South Dakota', 'Alaska', 'Wyoming', 'California', 'Utah']
63 entries
And we’re 100% fini…OH NO! We have a problem! Why do we have 63 locations but only 36 parks? Well, it looks like some of the national parks are missing, specifically those that have symbols next to them on Wikipedia. Upon inspection, these names are actually in <td> tags, not <th> tags. However, it appears they all contain scope="row". This is one more nifty feature of Beautiful Soup: the ability to feed in custom attributes that fit the quirks of our use case. Here’s the syntax for how we do it:
# Find all <td> tags that carry scope="row"
national_parks_extra_tags = table.find_all('td', attrs={'scope': 'row'})

national_parks_extra = []
for park in national_parks_extra_tags:
    # Strip whitespace/newlines from each park name as we collect it
    national_parks_extra.append(park.get_text().strip())
print(national_parks_extra)
['Big Bend †', 'Carlsbad Caverns *', 'Channel Islands †', 'Congaree †', 'Death Valley †', 'Denali †', 'Dry Tortugas †', 'Everglades ‡', 'Glacier ‡', 'Glacier Bay ‡', 'Grand Canyon *', 'Grand Teton †', 'Great Smoky Mountains ‡', 'Haleakalā †', 'Hawaiʻi Volcanoes ‡', 'Isle Royale †', 'Joshua Tree †', 'Kings Canyon †', 'Mammoth Cave ‡', 'Mesa Verde *', 'Olympic ‡', 'Redwood *', 'Rocky Mountain †', 'Sequoia †', 'Wrangell–St.\xa0Elias *', 'Yellowstone ‡', 'Yosemite *']
Now, let’s combine the two lists of national parks and re-sort them in alphabetical order. With that, both of our lists have 63 entries.
national_parks_combined = sorted(national_park_names + national_parks_extra)
print(national_parks_combined)
print(f'{len(national_parks_combined)} entries')
['Acadia', 'American Samoa', 'Arches', 'Badlands', 'Big Bend †', 'Biscayne', 'Black Canyon of the Gunnison', 'Bryce Canyon', 'Canyonlands', 'Capitol Reef', 'Carlsbad Caverns *', 'Channel Islands †', 'Congaree †', 'Crater Lake', 'Cuyahoga Valley', 'Death Valley †', 'Denali †', 'Dry Tortugas †', 'Everglades ‡', 'Gates of the Arctic', 'Gateway Arch', 'Glacier Bay ‡', 'Glacier ‡', 'Grand Canyon *', 'Grand Teton †', 'Great Basin', 'Great Sand Dunes', 'Great Smoky Mountains ‡', 'Guadalupe Mountains', 'Haleakalā †', 'Hawaiʻi Volcanoes ‡', 'Hot Springs', 'Indiana Dunes', 'Isle Royale †', 'Joshua Tree †', 'Katmai', 'Kenai Fjords', 'Kings Canyon †', 'Kobuk Valley', 'Lake Clark', 'Lassen Volcanic', 'Mammoth Cave ‡', 'Mesa Verde *', 'Mount Rainier', 'New River Gorge', 'North Cascades', 'Olympic ‡', 'Petrified Forest', 'Pinnacles', 'Redwood *', 'Rocky Mountain †', 'Saguaro', 'Sequoia †', 'Shenandoah', 'Theodore Roosevelt', 'Virgin Islands', 'Voyageurs', 'White Sands', 'Wind Cave', 'Wrangell–St.\xa0Elias *', 'Yellowstone ‡', 'Yosemite *', 'Zion']
63 entries
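Since we built the park names and the locations in two separate passes, it’s worth a quick sanity check that the two lists line up row for row. Here is a minimal sketch that prints the first few pairs:

# Spot-check that the sorted park names line up with the locations list
for park, state in list(zip(national_parks_combined, states_and_territories))[:5]:
    print(park, '-', state)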
Manipulating the text into a useful format#
As a final step, we will convert these two lists we’ve scraped into a pandas dataframe (for R users, analogous to a tibble). This example will be light on the pandas-specific code, but this is often an important part of web scraping. BeautifulSoup is a powerful tool, but it sometimes outputs data in a format that is not immediately useful, even with some of the tricks we applied above. Pandas can help us clean and manipulate this data into a more useful format.
Note: Pandas even has a built-in function, read_html, to read HTML tables directly from a webpage, which can be a nice starting point for certain examples (like this one, believe it or not). BeautifulSoup is your workhorse for static webpage scraping, but this is worth knowing about if you’re a pandas user. You’d be surprised how far the few lines below will get you:
import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States')
df[0][['Name', 'Location']]
Web scraping with pandas is outside the scope of our lesson here, but worth exploring. Back to our example:
# Convert this into a pandas dataframe
import pandas as pd
df = pd.DataFrame({
'National Park': national_parks_combined,
'State': states_and_territories
})
print(df)
National Park State
0 Acadia Maine
1 American Samoa American Samoa
2 Arches Utah
3 Badlands South Dakota
4 Big Bend † Texas
.. ... ...
58 Wind Cave South Dakota
59 Wrangell–St. Elias * Alaska
60 Yellowstone ‡ Wyoming
61 Yosemite * California
62 Zion Utah
[63 rows x 2 columns]
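From here you can hand the dataframe off to whatever workflow you like; for example, a one-line sketch (the file name is just an example) writes it out as a CSV:

# Save the scraped table to a CSV file (example file name)
df.to_csv('national_parks.csv', index=False)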
BONUS:#
Some of our more adventurous participants may have caught early on that national parks can be in multiple states and territories.
For instance, the Great Smoky Mountains are in both North Carolina and Tennessee. Let’s tweak the above code to handle that, starting at step #3.
df.iloc[27]
National Park Great Smoky Mountains ‡
State North Carolina
Name: 27, dtype: object
states_and_territories = {}

# Enumerate allows us to loop over a list and get the index (in this case, "park_number") of the item as well
for park_number, tag in enumerate(td_tags):
    # Before, we were just getting the first 'a' tag; now let's get all of them for a given table cell
    a_tags = tag.find_all('a')
    # Get the text of each location in a list, except for the last one, which is extraneous info
    locations = [a.text for a in a_tags][:-1]
    # Create a dictionary where the key is the national park name and the value is the list of locations
    dict_key = national_parks_combined[park_number]
    states_and_territories[dict_key] = locations

print(states_and_territories)
print(pd.DataFrame.from_dict(states_and_territories,
orient='index'))
{'Acadia': ['Maine'], 'American Samoa': ['American Samoa'], 'Arches': ['Utah'], 'Badlands': ['South Dakota'], 'Big Bend †': ['Texas'], 'Biscayne': ['Florida'], 'Black Canyon of the Gunnison': ['Colorado'], 'Bryce Canyon': ['Utah'], 'Canyonlands': ['Utah'], 'Capitol Reef': ['Utah'], 'Carlsbad Caverns *': ['New Mexico'], 'Channel Islands †': ['California'], 'Congaree †': ['South Carolina'], 'Crater Lake': ['Oregon'], 'Cuyahoga Valley': ['Ohio'], 'Death Valley †': ['California', 'Nevada'], 'Denali †': ['Alaska'], 'Dry Tortugas †': ['Florida'], 'Everglades ‡': ['Florida'], 'Gates of the Arctic': ['Alaska'], 'Gateway Arch': ['Missouri'], 'Glacier Bay ‡': ['Montana'], 'Glacier ‡': ['Alaska'], 'Grand Canyon *': ['Arizona'], 'Grand Teton †': ['Wyoming'], 'Great Basin': ['Nevada'], 'Great Sand Dunes': ['Colorado'], 'Great Smoky Mountains ‡': ['North Carolina', 'Tennessee'], 'Guadalupe Mountains': ['Texas'], 'Haleakalā †': ['Hawaii'], 'Hawaiʻi Volcanoes ‡': ['Hawaii'], 'Hot Springs': ['Arkansas'], 'Indiana Dunes': ['Indiana'], 'Isle Royale †': ['Michigan'], 'Joshua Tree †': ['California'], 'Katmai': ['Alaska'], 'Kenai Fjords': ['Alaska'], 'Kings Canyon †': ['California'], 'Kobuk Valley': ['Alaska'], 'Lake Clark': ['Alaska'], 'Lassen Volcanic': ['California'], 'Mammoth Cave ‡': ['Kentucky'], 'Mesa Verde *': ['Colorado'], 'Mount Rainier': ['Washington'], 'New River Gorge': ['West Virginia'], 'North Cascades': ['Washington'], 'Olympic ‡': ['Washington'], 'Petrified Forest': ['Arizona'], 'Pinnacles': ['California'], 'Redwood *': ['California'], 'Rocky Mountain †': ['Colorado'], 'Saguaro': ['Arizona'], 'Sequoia †': ['California'], 'Shenandoah': ['Virginia'], 'Theodore Roosevelt': ['North Dakota'], 'Virgin Islands': ['U.S. Virgin Islands'], 'Voyageurs': ['Minnesota'], 'White Sands': ['New Mexico'], 'Wind Cave': ['South Dakota'], 'Wrangell–St.\xa0Elias *': ['Alaska'], 'Yellowstone ‡': ['Wyoming', 'Montana', 'Idaho'], 'Yosemite *': ['California'], 'Zion': ['Utah']}
0 1 2
Acadia Maine None None
American Samoa American Samoa None None
Arches Utah None None
Badlands South Dakota None None
Big Bend † Texas None None
... ... ... ...
Wind Cave South Dakota None None
Wrangell–St. Elias * Alaska None None
Yellowstone ‡ Wyoming Montana Idaho
Yosemite * California None None
Zion Utah None None
[63 rows x 3 columns]
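If you’d prefer one readable column of states per park instead of the separate 0/1/2 columns above, a small sketch (the df_multi name is just for illustration) collapses each park’s list into a comma-separated string:

# Collapse each park's list of locations into a single comma-separated string
df_multi = pd.DataFrame({
    'National Park': list(states_and_territories.keys()),
    'States/Territories': [', '.join(locations) for locations in states_and_territories.values()]
})
print(df_multi.head())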
Looking Ahead#
Next week we’ll introduce Selenium for scraping dynamic content. We’ll be scraping this website, so a quick perusal to familiarize yourself could be helpful: https://www.kff.org/interactive/subsidy-calculator/