import requests
import pandas as pd
import re
import lxml
from bs4 import BeautifulSoup
# Beautiful Soup setup using the desired URL
url = 'https://en.wikipedia.org/wiki/COPSS_Presidents%27_Award'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')  # we use the 'lxml' parser to scrape this page; it is very fast
Web Scraping with Beautiful Soup
Example: Web scraping the COPSS Awards Recipients
Goal: Get information regarding statisticians who were awarded the COPSS Presidents’ Award. Build a dataframe with 3 columns (Year, Name, Institute). For example: (1981, Peter J. Bickel, University of California, Berkeley). Set Year to be the index, with Name and Institute as column names.
This is the page we will be scraping: https://en.wikipedia.org/wiki/COPSS_Presidents%27_Award
Step 1: Go to the website and use Google Chrome > More Tools > Developer Tools to inspect the HTML code. We will find all of this information inside <li>...</li> tags. These are HTML tags indicating items that appear in a list.
Step 2: Find all these tags using BeautifulSoup. We want to select a subset of them.
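To see what find_all('li') returns before running it on the real page, here is a minimal sketch on a made-up HTML fragment (the snippet below is invented for illustration; the actual page has many more list items):

```python
from bs4 import BeautifulSoup

# A made-up miniature of the award page's HTML, for illustration only
html = """
<ul>
  <li>1981: Peter J. Bickel, University of California, Berkeley</li>
  <li>External links</li>
</ul>
"""
# 'html.parser' is used here so the sketch runs even without lxml installed;
# the notebook itself uses the faster 'lxml' parser
soup_demo = BeautifulSoup(html, 'html.parser')
items = [li.text for li in soup_demo.find_all('li')]
# Only the first item follows the "Year: Name, Institution" pattern;
# the second is the kind of navigation entry we will filter out below
```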
Step 3: Use the .split function to separate the year, name, and institute. .split takes an optional maxsplit argument k, which splits at most k times; everything after the k-th separator is left intact.
Here, we take advantage of the fact that our desired entries are similarly formatted: they appear in the form “Year: Name, Institution”. We can use text-parsing functions such as split to separate these parts into different dataframe columns, based on the appearance of punctuation like ':' and ','.
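Concretely, the two-stage split can be sketched on a hypothetical entry (the string below is an assumed example in the “Year: Name, Institution” format):

```python
# Assumed example entry in "Year: Name, Institution" form
entry = '1981: Peter J. Bickel, University of California, Berkeley'

year, rest = entry.split(': ')         # split on the colon: year vs. everything else
name, institute = rest.split(', ', 1)  # maxsplit=1 splits only at the first comma,
                                       # so commas inside the institution name survive
```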
When wrangling scraped data, it is common to look for structures like these that follow certain patterns. This often requires some creativity, both in the types of tags we search for (in this case <li>) and in how we manipulate the data that appears in those tags!
To better understand what BeautifulSoup is actually scraping from the URL, try uncommenting the print statements to check out the intermediate output.
lists = soup.find_all('li')  # find all HTML tags for list items on the page
# print(lists)

records = []
for li in lists:
    # print(li.text)
    char_element = li.text.split(': ')
    if len(char_element) == 2:
        # print(char_element)
        char = char_element[1].split(', ', 1)
        # print([char, len(char)])
        if len(char) > 1:
            records.append([char_element[0], char[0], char[1]])

copss_recipients = pd.DataFrame(records, columns=['Year', 'Name', 'Institute'])
copss_recipients
For each of these statisticians, find their year of birth. Create another column in your dataframe showing this information.
Step 1: Go to the website https://en.wikipedia.org/wiki/Peter_J._Bickel, use Google Chrome > More Tools > Developer Tools to inspect the HTML code. This is a good way to figure out which tags contain the relevant information and how we can best scrape them.
Step 2: Find the tag containing the birth-year information and extract it using BeautifulSoup. We can use soup.find(text='Born') and .findNext('td'), which will look for data appearing in HTML tables. To get the birth year from a given string, we can split it on '19' (since they are all born in the 20th century) and tack '19' back onto the two characters that immediately follow it.
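The split-on-'19' trick can be sketched on a hypothetical infobox string (the text below is an invented stand-in for what .findNext('td').text might return on a real page):

```python
# Invented stand-in for the text of a 'Born' infobox cell, for illustration only
born_text = '(1940-09-21) September 21, 1940 (age 82)'

parts = born_text.split('19')     # the text is cut at every occurrence of '19'
birth_year = '19' + parts[1][:2]  # re-attach '19' to the two digits that followed it
```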
Step 3: To repeat this procedure for each statistician in your dataframe, build each URL as 'https://en.wikipedia.org/wiki/' + name. Not every page contains a year of birth; if there is no such information for a particular statistician, you can set its value to NA.
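Before running the full loop, the URL construction and the regex fallback can be tried in isolation (the biography sentence below is invented for illustration):

```python
import re

# Wikipedia article titles replace spaces with underscores
name = 'C. F. Jeff Wu'
url = 'https://en.wikipedia.org/wiki/' + name.replace(' ', '_')

# Invented first sentence of a biography page, for illustration only
intro = 'C. F. Jeff Wu (born 1949) is a statistician.'
m = re.search(r'born\s(\d{4})', intro)  # four digits immediately after the word "born"
birth_year = m.group(1) if m is not None else 'NA'
```

Note that this fallback only matches when the year directly follows "born"; phrasings like "born September 21, 1940" will not match and fall through to NA.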
Birth = []
for name in copss_recipients['Name']:
    url = 'https://en.wikipedia.org/wiki/' + name.replace(' ', '_')
    # print(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    if soup.find(text='Born') is not None:
        tmp = soup.find(text='Born').findNext('td').text.split('19')
        Birth.append('19' + tmp[1][:2])
    else:
        tmp = soup.find_all('p')[0].text
        tmp1 = re.search(r'born\s(\d{4})', tmp)  # regular expression searching for 4 digits in a row after the word "born"
        if tmp1 is not None:
            Birth.append(tmp1.group(1))
        else:
            Birth.append('NA')

copss_recipients['Birth year'] = Birth
copss_recipients

url = 'https://en.wikipedia.org/wiki/C._F._Jeff_Wu'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
tmp = re.search(r'born\s(\d{4})', soup.find_all('p')[0].text)
tmp.group(1)