Chapter 13 - Relational data

Load the libraries needed for these exercises.

library(nycflights13)
library(tidyverse)

13.2 - nycflights13

Problem 1

Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine?

Latitude and longitude from the airports table and origin and destination from the flights table would be needed.

Problem 2

I forgot to draw the relationship between weather and airports. What is the relationship and how should it appear in the diagram?

The relationship is between origin from weather and faa from airports. It should be drawn as an arrow around the the flights table.

Problem 3

weather only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define with flights?

Destination.

Problem 4

We know that some days of the year are “special”, and fewer people than usual fly on them. How might you represent that data as a data frame? What would be the primary keys of that table? How would it connect to the existing tables?

It would create a table of observations for days that are special that could relate to the flights table through day, month, and year.

13.3 - Keys

Problem 1

Add a surrogate key to flights.

flights %>%
  mutate(surrogate_key = row_number())
## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, surrogate_key <int>

Problem 2

Identify the keys in the following datasets

Lahman::Batting babynames::babynames nasaweather::atmos fueleconomy::vehicles ggplot2::diamonds (You might need to install some packages and read some documentation.)

  • Lahman::Batting - playerID, yearID, and stint
library(Lahman)
Batting <- as_tibble(Batting)

Batting %>%
  count(playerID, yearID, stint) %>%
  filter(n > 1)
## # A tibble: 0 x 4
## # ... with 4 variables: playerID <chr>, yearID <int>, stint <int>, n <int>
  • babynames::babynames
library(babynames)
babynames <- as_tibble(babynames)

babynames %>%
  count(year, sex, name) %>%
  filter(nn > 1)
## # A tibble: 0 x 4
## # ... with 4 variables: year <dbl>, sex <chr>, name <chr>, nn <int>
  • nasaweather::atmos - lat, long, year, month
library(nasaweather)
## 
## Attaching package: 'nasaweather'
## The following object is masked from 'package:dplyr':
## 
##     storms
atmos <- as_tibble(atmos)

atmos %>%
  count(lat, long, year, month) %>%
  filter(n > 1)
## # A tibble: 0 x 5
## # ... with 5 variables: lat <dbl>, long <dbl>, year <int>, month <int>,
## #   n <int>
  • fueleconomy::vehicles -
library(fueleconomy)
vehicles <- as_tibble(vehicles)

vehicles %>%
  count(id) %>%
  filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: id <int>, n <int>
  • ggplot2::diamonds - none!
diamonds <- as_tibble(diamonds)

diamonds %>%
  count(carat, cut, color, clarity, depth, table, price, x, y, z) %>%
  filter(n > 1)
## # A tibble: 143 x 11
##    carat cut       color clarity depth table price     x     y     z     n
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
##  1  0.3  Good      J     VS1      63.4    57   394  4.23  4.26  2.69     2
##  2  0.3  Very Good G     VS2      63      55   526  4.29  4.31  2.71     2
##  3  0.3  Very Good J     VS1      63.4    57   506  4.26  4.23  2.69     2
##  4  0.3  Premium   D     SI1      62.2    58   709  4.31  4.28  2.67     2
##  5  0.3  Ideal     G     VS2      63      55   675  4.31  4.29  2.71     2
##  6  0.3  Ideal     G     IF       62.1    55   863  4.32  4.35  2.69     2
##  7  0.3  Ideal     H     SI1      62.2    57   450  4.26  4.29  2.66     2
##  8  0.3  Ideal     H     SI1      62.2    57   450  4.27  4.28  2.66     2
##  9  0.31 Good      D     SI1      63.5    56   571  4.29  4.31  2.73     2
## 10  0.31 Very Good D     SI1      63.5    56   732  4.31  4.29  2.73     2
## # ... with 133 more rows

Problem 3

Draw a diagram illustrating the connections between the Batting, Master, and Salaries tables in the Lahman package. Draw another diagram that shows the relationship between Master, Managers, AwardsManagers.

Problem 4

How would you characterise the relationship between the Batting, Pitching, and Fielding tables?

13.4 - Mutating joins

Problem 1

Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:

airports %>%
  semi_join(flights, c("faa" = "dest")) %>%
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point() +
    coord_quickmap()
(Don’t worry if you don’t understand what semi_join() does — you’ll learn about it next.)

You might want to use the size or colour of the points to display the average delay for each airport.

flights %>%
  group_by(dest) %>%
  summarize(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  left_join(airports, c("dest" = "faa")) %>%
  ggplot(aes(lon, lat, size = arr_delay), alpha = 0.5) +
    borders("state") +
    geom_point(alpha = 0.3, color = "blue") +
    coord_quickmap()
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
## Warning: Removed 5 rows containing missing values (geom_point).

Problem 2

Add the location of the origin and destination (i.e. the lat and lon) to flights.

airports2 <- airports %>%
  select(faa, lat, lon)

flights %>%
  left_join(airports2, c("origin" = "faa")) %>%
  rename(origin_lat = lat, origin_lon = lon) %>%
  left_join(airports2, c("dest" = "faa")) %>%
  rename(dest_lat = lat, dest_lon = lon)
## # A tibble: 336,776 x 23
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 16 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, origin_lat <dbl>, origin_lon <dbl>,
## #   dest_lat <dbl>, dest_lon <dbl>

Problem 3

Is there a relationship between the age of a plane and its delays?

flights %>%
  left_join(planes, "tailnum") %>%
  ggplot(aes(year.y, dep_delay)) +
    geom_point(alpha = 0.2) +
    geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 61980 rows containing non-finite values (stat_smooth).
## Warning: Removed 61980 rows containing missing values (geom_point).

flights %>%
  left_join(planes, "tailnum") %>%
  ggplot(aes(year.y, arr_delay)) +
    geom_point(alpha = 0.2) +
    geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 62923 rows containing non-finite values (stat_smooth).
## Warning: Removed 62923 rows containing missing values (geom_point).

Problem 4

What weather conditions make it more likely to see a delay?

flight_weather <- flights %>%
  left_join(weather, c("year", "month", "day", "hour", "origin")) %>%
  select(year, month, day, arr_delay, dep_delay, temp:visib, -wind_dir) %>%
  filter(complete.cases(.)) %>%
  mutate(arr_delay_categorical = cut_number(arr_delay, 5)) 

flight_weather %>%
  summarize_all(funs(sum(is.na(.))))
## # A tibble: 1 x 14
##    year month   day arr_delay dep_delay  temp  dewp humid wind_speed
##   <int> <int> <int>     <int>     <int> <int> <int> <int>      <int>
## 1     0     0     0         0         0     0     0     0          0
## # ... with 5 more variables: wind_gust <int>, precip <int>,
## #   pressure <int>, visib <int>, arr_delay_categorical <int>
flight_means <- flight_weather %>%
  select(arr_delay_categorical, temp:visib) %>%
  group_by(arr_delay_categorical) %>%
  summarize_all(funs(mean(.)))

flight_means %>%
  gather(key = weather_type, value = value, -arr_delay_categorical) %>%
  ggplot(aes(arr_delay_categorical, value)) +
    geom_bar(stat = "identity") +
    facet_wrap(~weather_type, scales = "free")

lm(arr_delay ~ dewp + humid + precip + pressure + temp + visib + wind_gust + wind_speed, data = flight_weather)
## 
## Call:
## lm(formula = arr_delay ~ dewp + humid + precip + pressure + temp + 
##     visib + wind_gust + wind_speed, data = flight_weather)
## 
## Coefficients:
## (Intercept)         dewp        humid       precip     pressure  
##   538.52374      0.85046     -0.44252     98.37679     -0.47439  
##        temp        visib    wind_gust   wind_speed  
##    -0.66531     -2.40736      0.08383           NA

Problem 5

What happened on June 13 2013? Display the spatial pattern of delays, and then use Google to cross-reference with the weather.

flight_weather <- flights %>%
  left_join(weather, c("year", "month", "day", "hour", "origin")) %>%
  select(year, month, day, arr_delay, dep_delay, temp:visib, -wind_dir) %>%
  filter(complete.cases(.)) %>%
  mutate(dep_delay_categorical = cut_number(dep_delay, 5)) 

airports2 <- airports %>%
  select(faa, lat, lon)

flight_airports <- flights %>%
  left_join(airports2, c("origin" = "faa")) %>%
  rename(origin_lat = lat, origin_lon = lon) %>%
  left_join(airports2, c("dest" = "faa")) %>%
  rename(dest_lat = lat, dest_lon = lon) %>%
  left_join(weather, c("year", "month", "day", "hour", "origin")) %>%
  filter(complete.cases(.)) %>%
  filter(month == 6 & day == 13)

flight_airports %>%
  ggplot(aes(dep_delay)) +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

flight_airports %>%
  ggplot(aes(arr_delay)) +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

flight_airports %>%
  ggplot(aes(dest_lon, dest_lat, size = dep_delay)) +
    borders("state") +
    geom_point(alpha = 0.2) +
    coord_quickmap()

Two derechos hit the United States East Coast on June 13, 2013. This caused significant departure delays at several airports.

13.5 - Filtering joins

Problem 1

What does it mean for a flight to have a missing tailnum? What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)

planes <- as_tibble(planes)
flights <- as_tibble(flights)

no_tailnum <- flights %>%
  anti_join(planes, "tailnum") 

count(no_tailnum, carrier)
## # A tibble: 10 x 2
##    carrier     n
##    <chr>   <int>
##  1 9E       1044
##  2 AA      22558
##  3 B6        830
##  4 DL        110
##  5 F9         50
##  6 FL        187
##  7 MQ      25397
##  8 UA       1693
##  9 US        699
## 10 WN         38
count(no_tailnum, origin)
## # A tibble: 3 x 2
##   origin     n
##   <chr>  <int>
## 1 EWR     5908
## 2 JFK    17137
## 3 LGA    29561

Departures from LaGuardia (LGA), JFK, and Newark (EWR) account for 100 percent of the planes without tail numbers. American Airlines (AA) and Envoy Airline (MQ) account for 90 percent of the planes without tail numbers.

Problem 2

Filter flights to only show flights with planes that have flown at least 100 flights.

flights100 <- flights %>%
  group_by(tailnum) %>%
  count(tailnum) %>%
  filter(n >= 100)

semi_join(flights, flights100, "tailnum")
## # A tibble: 230,902 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      544            545        -1     1004
##  4  2013     1     1      554            558        -4      740
##  5  2013     1     1      555            600        -5      913
##  6  2013     1     1      557            600        -3      709
##  7  2013     1     1      557            600        -3      838
##  8  2013     1     1      558            600        -2      849
##  9  2013     1     1      558            600        -2      853
## 10  2013     1     1      558            600        -2      923
## # ... with 230,892 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Problem 3

Combine fueleconomy::vehicles and fueleconomy::common to find only the records for the most common models.

library(fueleconomy)

semi_join(vehicles, common)
## Joining, by = c("make", "model")
## # A tibble: 14,531 x 12
##       id make  model  year class trans drive   cyl displ fuel    hwy   cty
##    <int> <chr> <chr> <int> <chr> <chr> <chr> <int> <dbl> <chr> <int> <int>
##  1  1833 Acura Inte…  1986 Subc… Auto… Fron…     4   1.6 Regu…    28    22
##  2  1834 Acura Inte…  1986 Subc… Manu… Fron…     4   1.6 Regu…    28    23
##  3  3037 Acura Inte…  1987 Subc… Auto… Fron…     4   1.6 Regu…    28    22
##  4  3038 Acura Inte…  1987 Subc… Manu… Fron…     4   1.6 Regu…    28    23
##  5  4183 Acura Inte…  1988 Subc… Auto… Fron…     4   1.6 Regu…    27    22
##  6  4184 Acura Inte…  1988 Subc… Manu… Fron…     4   1.6 Regu…    28    23
##  7  5303 Acura Inte…  1989 Subc… Auto… Fron…     4   1.6 Regu…    27    22
##  8  5304 Acura Inte…  1989 Subc… Manu… Fron…     4   1.6 Regu…    28    23
##  9  6442 Acura Inte…  1990 Subc… Auto… Fron…     4   1.8 Regu…    24    20
## 10  6443 Acura Inte…  1990 Subc… Manu… Fron…     4   1.8 Regu…    26    21
## # ... with 14,521 more rows

Problem 4

Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?

daily_delay <- flights %>%
  filter(arr_delay > 0) %>%
  group_by(year, month, day) %>%
  summarize(arr_delay = sum(arr_delay, na.rm = TRUE)) %>%
  mutate(yesterday = arr_delay + lag(arr_delay),
         tomorrow = arr_delay + lead(arr_delay)) %>%
  ungroup() %>%
  filter(min_rank(-tomorrow) == 1 | min_rank(-yesterday) == 1) 

Problem 5

What does anti_join(flights, airports, by = c(“dest” = “faa”)) tell you? What does anti_join(airports, flights, by = c(“faa” = “dest”)) tell you?

Problem 6

You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned above.