Web Scraping with Beautiful Soup

Example: Web scraping the COPSS Awards Recipients

Goal: Get information on the statisticians who have been awarded the COPSS Presidents’ Award. Build a dataframe with three columns (Year, Name, Institute); for example, (1981, Peter J. Bickel, University of California, Berkeley). Set Year to be the index and Name and Institute as the column names.

This is the page we will be scraping: https://en.wikipedia.org/wiki/COPSS_Presidents%27_Award

Step 1: Go to the website and use Google Chrome > More Tools > Developer Tools to inspect the HTML code. We will find all of this information inside <li>...</li> tags, the HTML tags indicating items that appear in a list.

Step 2: Find all these tags using BeautifulSoup. We want to select a subset of them.
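To see what .find_all returns, here is a minimal sketch on a toy HTML string (formatted like the entries described below, not the actual Wikipedia markup):

from bs4 import BeautifulSoup
toy_html = '<ul><li>1981: Peter J. Bickel, University of California, Berkeley</li></ul>'
toy_soup = BeautifulSoup(toy_html, 'lxml')
for li in toy_soup.find_all('li'):  # each result is a Tag object; .text gives the visible text
    print(li.text)  # 1981: Peter J. Bickel, University of California, Berkeley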

Step 3: Use the .split function to separate the year, name, and institute. The optional second argument of .split (maxsplit) limits how many splits are performed, so .split(', ', 1) splits only at the first occurrence of ', '.
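For a single entry, the two splits look like this (a small sketch using the example row from the goal above, not live page output):

entry = '1981: Peter J. Bickel, University of California, Berkeley'
year, rest = entry.split(': ')         # '1981' and 'Peter J. Bickel, University of California, Berkeley'
name, institute = rest.split(', ', 1)  # maxsplit=1 keeps 'University of California, Berkeley' in one piece
print(year, name, institute)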

import requests
import pandas as pd
import re
import lxml
from bs4 import BeautifulSoup 

# Beautiful Soup setup using the desired URL
url = 'https://en.wikipedia.org/wiki/COPSS_Presidents%27_Award'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')  # the 'lxml' parser is a fast choice for scraping this page

Here, we take advantage of the fact that our desired entries are similarly formatted. They will appear in the form “Year: Name, Institution”. We can use text parsing functions such as split to split up these parts into different dataframe columns, based on the appearance of punctuation like : and ,.

When wrangling scraped data, it is common to look for structures like these that follow certain patterns. This often requires some creativity, both in the types of tags we search for (in this case <li>) and in how we manipulate the data that appears in those tags!

To better understand what BeautifulSoup is actually scraping from the URL, try uncommenting the print statements to check out the intermediate output.

lists = soup.find_all('li')  # find all HTML tags for list items on the page
records = []
# print(lists)
for li in lists:
    # print(li.text)
    char_element = li.text.split(': ')
    if len(char_element) == 2:
        # print(char_element)
        char = char_element[1].split(', ', 1)
        # print([char, len(char)])
        if len(char) > 1:
            records.append([char_element[0], char[0], char[1]])
records
copss_recipients = pd.DataFrame(records, columns=['Year', 'Name', 'Institute'])
copss_recipients
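The goal above also asks for Year to be the index; pandas' set_index handles that in one (optional) extra line:

copss_recipients = copss_recipients.set_index('Year')  # use the Year column as the row index
copss_recipients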

For each of these statisticians, find their year of birth. Create another column in your dataframe showing this information.

Step 1: Go to the website https://en.wikipedia.org/wiki/Peter_J._Bickel and use Google Chrome > More Tools > Developer Tools to inspect the HTML code. This is a good way to figure out which tags contain the relevant information and how we can best scrape them.

Step 2: Find the tag containing the birth-year information and extract it using BeautifulSoup. We can use soup.find(text='Born') and .findNext('td'), which looks for data appearing in HTML tables (such as the page's infobox). To get the birth year from the resulting string, we can split it on '19' (since these statisticians were all born in the 20th century) and prepend '19' to the two characters that immediately follow the split.
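For intuition, here is that splitting trick applied to a made-up string (the exact text in a recipient's infobox will differ):

born_text = 'January 1, 1940 (age 83)'  # hypothetical infobox text, not taken from any real page
parts = born_text.split('19')           # ['January 1, ', '40 (age 83)']
birth_year = '19' + parts[1][:2]        # '1940'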

Step 3: To repeat this procedure for each statistician in the dataframe, build the URL as 'https://en.wikipedia.org/wiki/' + name (with spaces replaced by underscores). Not every page lists a year of birth; if there is no such information for a particular statistician, set the value to NA.

Birth = []
for name in copss_recipients['Name']:
    url = 'https://en.wikipedia.org/wiki/' + name.replace(' ', '_')
    # print(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    if soup.find(text='Born') is not None:
        tmp = soup.find(text='Born').findNext('td').text.split('19')
        Birth.append('19' + tmp[1][:2])
    else:
        tmp = soup.find_all('p')[0].text
        tmp1 = re.search(r'born\s(\d{4})', tmp)  # regular expression: four digits in a row right after the word "born"
        if tmp1 is not None:
            Birth.append(tmp1.group(1))
        else:
            Birth.append('NA')

copss_recipients['Birth year'] = Birth
copss_recipients
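The snippet below applies the same regular-expression fallback to one page by hand, which is a convenient way to check the intermediate output for a single recipient: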
url = 'https://en.wikipedia.org/wiki/C._F._Jeff_Wu'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
tmp = re.search(r'born\s(\d{4})', soup.find_all('p')[0].text)
tmp.group(1)