Chapter 5 - Data Transformation
Load the libraries needed for these exercises.
library(tidyverse)
library(nycflights13)
5.2 - Filter Rows with filter()
Problem 1
Find all flights that:
- Had an arrival delay of two or more hours
flights %>%
filter(arr_delay >= 120) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 10200
- Flew to Houston (IAH or HOU)
flights %>%
filter(dest %in% c('IAH', 'HOU')) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 9313
- Were operated by United, American, or Delta
flights %>%
filter(carrier %in% c('UA', 'AA', 'DL')) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 139504
- Departed in summer (July, August, and September)
flights %>%
filter(month %in% c(7, 8, 9)) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 86326
- Arrived more than two hours late, but didn’t leave late
flights %>%
filter(arr_delay >= 120, dep_delay <= 0) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 29
- Were delayed by at least an hour, but made up over 30 minutes in flight
flights %>%
filter(dep_delay >= 60, arr_delay <= dep_delay - 30) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 2074
- Departed between midnight and 6am (inclusive)
flights %>%
filter(dep_time >= 0, dep_time <= 600) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 9344
Problem 2
Another useful dplyr filtering helper is between()
. What does it do? Can you use it to simplify the code needed to answer the previous challenges?
between()
is a shortcut for x >= left & x <= right
. We can simplify the last answer to Problem 1 as:
flights %>%
filter(between(dep_time, 0, 600)) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 9344
Problem 3
How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
We use is.na()
to filter the flights with a missing departure time.
flights %>%
filter(is.na(dep_time)) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 8255
Using summary()
to see the breakout of the other variables, there appear to be flights that were cancelled.
flights %>%
filter(is.na(dep_time)) %>%
summary()
## year month day dep_time
## Min. :2013 Min. : 1.000 Min. : 1.0 Min. : NA
## 1st Qu.:2013 1st Qu.: 3.000 1st Qu.: 8.0 1st Qu.: NA
## Median :2013 Median : 6.000 Median :12.0 Median : NA
## Mean :2013 Mean : 5.927 Mean :14.6 Mean :NaN
## 3rd Qu.:2013 3rd Qu.: 8.000 3rd Qu.:23.0 3rd Qu.: NA
## Max. :2013 Max. :12.000 Max. :31.0 Max. : NA
## NA's :8255
## sched_dep_time dep_delay arr_time sched_arr_time
## Min. : 106 Min. : NA Min. : NA Min. : 1
## 1st Qu.:1159 1st Qu.: NA 1st Qu.: NA 1st Qu.:1330
## Median :1559 Median : NA Median : NA Median :1749
## Mean :1492 Mean :NaN Mean :NaN Mean :1669
## 3rd Qu.:1855 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.:2049
## Max. :2359 Max. : NA Max. : NA Max. :2359
## NA's :8255 NA's :8255
## arr_delay carrier flight tailnum
## Min. : NA Length:8255 Min. : 1 Length:8255
## 1st Qu.: NA Class :character 1st Qu.:1577 Class :character
## Median : NA Mode :character Median :3535 Mode :character
## Mean :NaN Mean :3063
## 3rd Qu.: NA 3rd Qu.:4373
## Max. : NA Max. :6177
## NA's :8255
## origin dest air_time distance
## Length:8255 Length:8255 Min. : NA Min. : 17.0
## Class :character Class :character 1st Qu.: NA 1st Qu.: 292.0
## Mode :character Mode :character Median : NA Median : 583.0
## Mean :NaN Mean : 695.4
## 3rd Qu.: NA 3rd Qu.: 872.0
## Max. : NA Max. :4963.0
## NA's :8255
## hour minute time_hour
## Min. : 1.00 Min. : 0.00 Min. :2013-01-01 06:00:00
## 1st Qu.:11.00 1st Qu.: 5.00 1st Qu.:2013-03-07 07:00:00
## Median :15.00 Median :27.00 Median :2013-06-12 18:00:00
## Mean :14.67 Mean :25.61 Mean :2013-06-13 06:42:11
## 3rd Qu.:18.00 3rd Qu.:42.00 3rd Qu.:2013-08-22 15:30:00
## Max. :23.00 Max. :59.00 Max. :2013-12-31 20:00:00
##
Problem 4
Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
Working through these examples: * Anything to the zero power is 1 * Anything OR TRUE
is TRUE
* Anything AND FALSE
is FALSE
These results apply no matter what the LHS side, and so will apply to NA as well.
NA ^ 0
## [1] 1
NA | TRUE
## [1] TRUE
NA & FALSE
## [1] FALSE
However operations on NA will return NA. NA * 0 is counter intuitive since you would think that anything multiplied by 0 would be 0.
NA * 0
## [1] NA
NA ^ 2
## [1] NA
NA + 1
## [1] NA
5.3 - Arrange Rows with arrange()
Problem 1
How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
We can sort missing values using the format:
flights %>%
arrange(desc(is.na(dep_time))) %>%
head()
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 NA 1630 NA NA
## 2 2013 1 1 NA 1935 NA NA
## 3 2013 1 1 NA 1500 NA NA
## 4 2013 1 1 NA 600 NA NA
## 5 2013 1 2 NA 1540 NA NA
## 6 2013 1 2 NA 1620 NA NA
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
Problem 2
Sort flights to find the most delayed flights. Find the flights that left earliest.
The most delayed flights (by arr_delay
) are:
flights %>%
arrange(desc(arr_delay)) %>%
head()
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
The flights that left earliest (by dep_delay
) are:
flights %>%
arrange(dep_delay) %>%
head()
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 12 7 2040 2123 -43 40
## 2 2013 2 3 2022 2055 -33 2240
## 3 2013 11 10 1408 1440 -32 1549
## 4 2013 1 11 1900 1930 -30 2233
## 5 2013 1 29 1703 1730 -27 1947
## 6 2013 8 9 729 755 -26 1002
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
Problem 3
Sort flights to find the fastest flights.
We first calculate average speed in MPH as distance / hours in the air, and sort on the calculated variable.
flights %>%
mutate(speed = distance / (air_time / 60)) %>%
arrange(desc(speed)) %>%
select(speed) %>%
head()
## # A tibble: 6 x 1
## speed
## <dbl>
## 1 703.
## 2 650.
## 3 648
## 4 641.
## 5 591.
## 6 564
Problem 4
Which flights traveled the longest? Which traveled the shortest?
The longest flights are:
flights %>%
arrange(desc(distance)) %>%
head()
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 857 900 -3 1516
## 2 2013 1 2 909 900 9 1525
## 3 2013 1 3 914 900 14 1504
## 4 2013 1 4 900 900 0 1516
## 5 2013 1 5 858 900 -2 1519
## 6 2013 1 6 1019 900 79 1558
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
The shortest flights are:
flights %>%
arrange(distance) %>%
head()
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 27 NA 106 NA NA
## 2 2013 1 3 2127 2129 -2 2222
## 3 2013 1 4 1240 1200 40 1333
## 4 2013 1 4 1829 1615 134 1937
## 5 2013 1 4 2128 2129 -1 2218
## 6 2013 1 5 1155 1200 -5 1241
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
5.4 - Select Columns with select()
Problem 1
Brainstorm as many ways as possible to select dep_time
, dep_delay
, arr_time
, and arr_delay
from flights.
We can put all the variables directly into our select()
statement:
flights %>%
select(dep_time, dep_delay, arr_time, arr_delay) %>%
head()
## # A tibble: 6 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
Or this would be a good place to try the starts_with()
function:
flights %>%
select(starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ... with 336,766 more rows
Or we can try a regex using matches()
flights %>%
select(matches("^dep"), matches("^arr")) %>%
head()
## # A tibble: 6 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
Problem 2
What happens if you include the name of a variable multiple times in a select() call?
Duplicating a variable within select()
will still just return that variable one time:
flights %>%
select(dep_time, dep_time) %>%
head()
## # A tibble: 6 x 1
## dep_time
## <int>
## 1 517
## 2 533
## 3 542
## 4 544
## 5 554
## 6 554
Problem 3
What does the one_of() function do? Why might it be helpful in conjunction with this vector?
one_of()
allows you select variables from within a character vector. We can pass vars
to select everything from the vector:
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>%
select(one_of(vars))
## # A tibble: 336,776 x 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # ... with 336,766 more rows
Problem 4
Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME")) %>%
head()
## # A tibble: 6 x 6
## dep_time sched_dep_time arr_time sched_arr_time air_time
## <int> <int> <int> <int> <dbl>
## 1 517 515 830 819 227
## 2 533 529 850 830 227
## 3 542 540 923 850 160
## 4 544 545 1004 1022 183
## 5 554 600 812 837 116
## 6 554 558 740 728 150
## # ... with 1 more variable: time_hour <dttm>
contains()
contains an argument ignore.case
which defaults to TRUE
, we can set this to FALSE
if needed:
select(flights, contains("TIME", ignore.case = FALSE)) %>%
head()
## # A tibble: 6 x 0
5.5 - Add New Variables with mutate()
Problem 1
Currently dep_time
and sched_dep_time
are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
Use the modular arithmetic operators to break the time into its hours and minute components:
flights %>%
select(dep_time, sched_dep_time) %>%
mutate(dep_time_cont = ((dep_time %/% 100) * 60 + (dep_time %% 100)),
sched_dep_time_cont = ((sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100))) %>%
head()
## # A tibble: 6 x 4
## dep_time sched_dep_time dep_time_cont sched_dep_time_cont
## <int> <int> <dbl> <dbl>
## 1 517 515 317 315
## 2 533 529 333 329
## 3 542 540 342 340
## 4 544 545 344 345
## 5 554 600 354 360
## 6 554 558 354 358
Problem 2
Compare air_time
with arr_time - dep_time
. What do you expect to see? What do you see? What do you need to do to fix it?
The issue is that air_time
is in minutes, while arr_time
and dep_time
are not:
flights %>%
mutate(air_time_derived = arr_time - dep_time) %>%
select(air_time, air_time_derived) %>%
head()
## # A tibble: 6 x 2
## air_time air_time_derived
## <dbl> <int>
## 1 227 313
## 2 227 317
## 3 160 381
## 4 183 460
## 5 116 258
## 6 150 186
Instead use the mutate()
statement from Problem 1, however the two differ:
flights %>%
mutate(dep_time_cont = ((dep_time %/% 100) * 60 + (dep_time %% 100)),
arr_time_cont = ((arr_time %/% 100) * 60 + (arr_time %% 100)),
air_time_derived = arr_time_cont - dep_time_cont) %>%
select(air_time, air_time_derived) %>%
head()
## # A tibble: 6 x 2
## air_time air_time_derived
## <dbl> <dbl>
## 1 227 193
## 2 227 197
## 3 160 221
## 4 183 260
## 5 116 138
## 6 150 106
Problem 3
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
We would expect dep_delay
to be the difference between the dep_time
and the sched_dep_time
. But be sure to convert from time to continuous first:
flights %>%
mutate(dep_time_cont = ((dep_time %/% 100) * 60 + (dep_time %% 100)),
sched_dep_time_cont = ((sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100)),
dep_delay_derived = dep_time_cont - sched_dep_time_cont) %>%
select(dep_delay, dep_delay_derived) %>%
head()
## # A tibble: 6 x 2
## dep_delay dep_delay_derived
## <dbl> <dbl>
## 1 2 2
## 2 4 4
## 3 2 2
## 4 -1 -1
## 5 -6 -6
## 6 -4 -4
Problem 4
Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank()
.
We’ll use min_rank()
to rank the flights by arr_delay
:
flights %>%
select(arr_delay) %>%
mutate(most_delayed = min_rank(-arr_delay)) %>%
filter(most_delayed <= 10) %>%
arrange(most_delayed)
## # A tibble: 10 x 2
## arr_delay most_delayed
## <dbl> <int>
## 1 1272 1
## 2 1127 2
## 3 1109 3
## 4 1007 4
## 5 989 5
## 6 931 6
## 7 915 7
## 8 895 8
## 9 878 9
## 10 875 10
Problem 5
What does 1:3 + 1:10
return? Why?
We get an error because 1:3
+ 1:10
are not multiples of each other:
1:3 + 1:10
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter
## object length
## [1] 2 4 6 5 7 9 8 10 12 11
Think through what is happening under the hood. This operation is recycling the shorter vector:
- 1 + 1
- 2 + 2
- 3 + 3
- 4 + 1
- 5 + 2
- 6 + 3
- 7 + 1
- 8 + 2
- 9 + 3
- 10 + 1 - error because
1:3
has not been fully cycled through
So the following will not return an error:
1:3 + 1:12
## [1] 2 4 6 5 7 9 8 10 12 11 13 15
Problem 6
What trigonometric functions does R provide?
R
has the following trig functions within base
:
- cos(x)
- sin(x)
tan(x)
- acos(x)
- asin(x)
- atan(x)
atan2(y, x)
- cospi(x)
- sinpi(x)
tanpi(x)
Note that angles are given in radians:
cos(pi * 0.25)
## [1] 0.7071068
# cospi(x) is equivalent to cos(pi * x)
cospi(0.25)
## [1] 0.7071068
5.6 - Grouped summaries with summarise()
Problem 1
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
- A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
flights %>%
group_by(flight) %>%
summarise(fifteen_early = mean(arr_delay <= -15, na.rm = TRUE),
fifteen_late = mean(arr_delay >= 15, na.rm = TRUE)) %>%
filter(fifteen_early == 0.50,
fifteen_late == 0.50)
## # A tibble: 21 x 3
## flight fifteen_early fifteen_late
## <int> <dbl> <dbl>
## 1 107 0.5 0.5
## 2 2072 0.5 0.5
## 3 2366 0.5 0.5
## 4 2500 0.5 0.5
## 5 2552 0.5 0.5
## 6 3495 0.5 0.5
## 7 3505 0.5 0.5
## 8 3518 0.5 0.5
## 9 3544 0.5 0.5
## 10 3651 0.5 0.5
## # ... with 11 more rows
- A flight is always 10 minutes late.
flights %>%
group_by(flight) %>%
summarise(ten_late = mean(arr_delay == 10, na.rm = TRUE)) %>%
filter(ten_late == 1.00)
## # A tibble: 5 x 2
## flight ten_late
## <int> <dbl>
## 1 2254 1
## 2 3656 1
## 3 3785 1
## 4 3880 1
## 5 5854 1
- A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
flights %>%
group_by(flight) %>%
summarise(thirty_early = mean(arr_delay <= -30, na.rm = TRUE),
thirty_late = mean(arr_delay >= 30, na.rm = TRUE)) %>%
filter(thirty_early == 0.50,
thirty_late == 0.50)
## # A tibble: 3 x 3
## flight thirty_early thirty_late
## <int> <dbl> <dbl>
## 1 3651 0.5 0.5
## 2 3916 0.5 0.5
## 3 3951 0.5 0.5
- 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
flights %>%
group_by(flight) %>%
summarise(on_time = mean(arr_delay <= 0, na.rm = TRUE),
late = mean(arr_delay >= 120, na.rm = TRUE)) %>%
filter(on_time == 0.99,
late == 0.01)
## # A tibble: 0 x 3
## # ... with 3 variables: flight <int>, on_time <dbl>, late <dbl>
Problem 2
Come up with another approach that will give you the same output as not_cancelled %>% count(dest)
and not_cancelled %>% count(tailnum, wt = distance)
(without using count()).
First setup the not_cancelled
data set.
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
The first chunk of code gives us the following output:
not_cancelled %>%
count(dest)
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
We can replicate this without using count()
by doing a group_by()
on dest
:
not_cancelled %>%
group_by(dest) %>%
summarise(n = n())
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
The second chunk of code gives us:
not_cancelled %>%
count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ... with 4,027 more rows
Again we can avoid using count
by doing a group_by()
on tailnum
. Since wt = distance
gives the total number of miles flown, we use sum()
instead:
not_cancelled %>%
group_by(tailnum) %>%
summarise(n = sum(distance))
## # A tibble: 4,037 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ... with 4,027 more rows
Problem 3
Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) )
is slightly suboptimal. Why? Which is the most important column?
arr_delay
is the more important of the two columns - filtering on arr_delay
alone will give the same subset:
flights %>%
summarise(suboptimal = sum(is.na(dep_delay) | is.na(arr_delay)),
optimal = sum(is.na(arr_delay)))
## # A tibble: 1 x 2
## suboptimal optimal
## <int> <int>
## 1 9430 9430
Problem 4
Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
There is a generally positive trend between average delay and the proportion of cancelled flights, with a couple of outliers.
flights %>%
mutate(date = lubridate::make_date(year, month, day)) %>%
group_by(date) %>%
summarise(cancelled = mean(is.na(arr_delay)),
avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
ggplot(aes(avg_delay, cancelled)) +
geom_point()
A lot of flights were cancelled on February 8th and 9th, although the average delays those days were not that large - a snowstorm hit the region that weekend, with a lot of flights preemptively cancelled.
flights %>%
mutate(date = lubridate::make_date(year, month, day)) %>%
group_by(date) %>%
summarise(cancelled = mean(is.na(arr_delay)),
avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(cancelled)) %>%
head()
## # A tibble: 6 x 3
## date cancelled avg_delay
## <date> <dbl> <dbl>
## 1 2013-02-09 0.575 6.64
## 2 2013-02-08 0.511 24.2
## 3 2013-05-23 0.235 62.0
## 4 2013-09-12 0.229 58.9
## 5 2013-12-10 0.216 44.5
## 6 2013-12-14 0.189 46.4
Problem 5
Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()
)
At first glance, Frontier and AirTran seem to have the worst delays.
flights %>%
group_by(carrier) %>%
summarise(median_delay = median(arr_delay, na.rm = TRUE)) %>%
arrange(desc(median_delay))
## # A tibble: 16 x 2
## carrier median_delay
## <chr> <dbl>
## 1 F9 6
## 2 FL 5
## 3 EV -1
## 4 MQ -1
## 5 YV -2
## 6 B6 -3
## 7 WN -3
## 8 UA -6
## 9 US -6
## 10 9E -7
## 11 OO -7
## 12 DL -8
## 13 AA -9
## 14 VX -9
## 15 HA -13
## 16 AS -17
We can try to get a better sense of bad airlines vs bad airports by grouping by both, but this will be thrown off by carrier
-dest
combinations that occur infrequently.
flights %>%
group_by(carrier, dest) %>%
summarise(median_delay = median(arr_delay, na.rm = TRUE)) %>%
ggplot(aes(carrier, median_delay)) +
geom_boxplot()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
Problem 6
What does the sort
argument to count()
do. When might you use it?
The sort
argument will arrange count()
in descending order. If we quickly wanted to find the most popular destinations, we could do:
flights %>%
group_by(dest) %>%
count(sort = TRUE) %>%
head()
## # A tibble: 6 x 2
## # Groups: dest [6]
## dest n
## <chr> <int>
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## 5 MCO 14082
## 6 CLT 14064
5.7 - Grouped Mutates (and Filters)
Problem 1
Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
Problem 2
Which plane (tailnum)
has the worst on-time record?
We do a familiar group_by()
-summarise()
to calculate the proportion of flights with an arr_delay
less than or equal to 0, and then apply a filter
to see the on-time performance of planes with more than twenty flights.
flights %>%
group_by(tailnum) %>%
summarise(flights = n(),
on_time = mean(arr_delay <= 0, na.rm = TRUE)) %>%
select(tailnum, flights, on_time) %>%
filter(flights > 20) %>%
arrange(on_time) %>%
head()
## # A tibble: 6 x 3
## tailnum flights on_time
## <chr> <int> <dbl>
## 1 N988AT 37 0.2
## 2 N983AT 32 0.25
## 3 N980AT 47 0.255
## 4 N969AT 34 0.265
## 5 N932AT 30 0.267
## 6 N149AT 22 0.273
Problem 3
What time of day should you fly if you want to avoid delays as much as possible?
It appears that flights later in day have a greater chance of being delayed than those early in the morning.
flights %>%
group_by(hour) %>%
filter(!is.na(dep_delay)) %>%
summarise(delayed = mean(dep_delay > 0, na.rm = TRUE)) %>%
ggplot(aes(x = hour, y = delayed)) +
geom_col()
Problem 4
For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
A grouped mutate comes in handy here, as we can first calculate the total minutes of delay for each destination and then use that value to compute the proportion of the total delay attributable to each flight.
flights %>%
group_by(dest) %>%
filter(arr_delay > 0) %>%
mutate(total_delay = sum(arr_delay),
prop_delay = arr_delay / sum(arr_delay)) %>%
select(dest, flight, total_delay, prop_delay) %>%
arrange(desc(total_delay)) %>%
head()
## # A tibble: 6 x 4
## # Groups: dest [1]
## dest flight total_delay prop_delay
## <chr> <int> <dbl> <dbl>
## 1 ATL 4650 300299 0.0000400
## 2 ATL 1547 300299 0.0000167
## 3 ATL 346 300299 0.0000566
## 4 ATL 4654 300299 0.0000200
## 5 ATL 347 300299 0.00000666
## 6 ATL 4876 300299 0.0000366
Problem 5
Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag() explore how the delay of a flight is related to the delay of the immediately preceding flight.
First group by origin
and the perform a mutate()
with lag()
to get the delay of the immediately preceding flight. We then use summarise
to get the correlation between the delay and lagged delay for each airport.
flights %>%
group_by(origin) %>%
filter(!is.na(dep_delay)) %>%
arrange(year, month, day, hour, minute) %>%
mutate(lag_delay = lag(dep_delay)) %>%
summarise(delay_correlation = cor(dep_delay, lag_delay, use = 'complete.obs'))
## # A tibble: 3 x 2
## origin delay_correlation
## <chr> <dbl>
## 1 EWR 0.265
## 2 JFK 0.242
## 3 LGA 0.302
Problem 6
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
A grouped mutate
is helpful here, as we can calculate the mean air time by destination and then immediately use that value to calculate a flight’s deviation from it. Note that we filtered out flights less than an hour long.
deviation <- flights %>%
group_by(dest) %>%
filter(!is.na(air_time)) %>%
mutate(mean_air_time = mean(air_time),
deviation = (air_time - mean_air_time) / mean_air_time) %>%
filter(mean_air_time > 60) %>%
arrange(deviation) %>%
select(air_time, mean_air_time, deviation, origin, dest)
deviation %>% head()
## # A tibble: 6 x 5
## # Groups: dest [6]
## air_time mean_air_time deviation origin dest
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 65 113. -0.424 LGA ATL
## 2 55 93.4 -0.411 EWR GSP
## 3 70 114. -0.388 EWR BNA
## 4 93 151. -0.382 EWR MSP
## 5 62 96.0 -0.354 EWR CVG
## 6 40 61.5 -0.349 LGA PIT
deviation %>% tail()
## # A tibble: 6 x 5
## # Groups: dest [5]
## air_time mean_air_time deviation origin dest
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 147 86.2 0.706 EWR CLT
## 2 198 116. 0.713 EWR ORD
## 3 185 106. 0.741 EWR SDF
## 4 138 70.9 0.947 JFK RDU
## 5 170 84.8 1.00 LGA DTW
## 6 170 84.8 1.00 JFK DTW
Problem 7
Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
We first use a group_by
with mutate()
and filter()
to subset the destinations serviced by at least 2 carriers, followed by a second group_by
to rank the carriers by total destinations served. We also merge data from the airlines
data set to get the full carrier names.
ExpressJet and Endeavor Air are regional airlines which operate as American Eagle, United Express, and Delta Connection.
flights %>%
group_by(dest) %>%
mutate(carriers = n_distinct(carrier)) %>%
filter(carriers >= 2) %>%
group_by(carrier) %>%
summarise(destinations = n_distinct(dest)) %>%
arrange(desc(destinations)) %>%
left_join(airlines) %>%
select(carrier, name, destinations) %>%
head()
## Joining, by = "carrier"
## # A tibble: 6 x 3
## carrier name destinations
## <chr> <chr> <int>
## 1 EV ExpressJet Airlines Inc. 51
## 2 9E Endeavor Air Inc. 48
## 3 UA United Air Lines Inc. 42
## 4 DL Delta Air Lines Inc. 39
## 5 B6 JetBlue Airways 35
## 6 AA American Airlines Inc. 19
Problem 8
For each plane, count the number of flights before the first delay of greater than 1 hour.
First group_by()
tailnum
and then apply a filter()
using the cumall
window function to subset everything before the first delay of more than 1 hour.
flights %>%
arrange(year, month, day, hour, minute) %>%
group_by(tailnum) %>%
filter(cumall(dep_delay <= 60)) %>%
count() %>%
arrange(desc(n)) %>%
head()
## # A tibble: 6 x 2
## # Groups: tailnum [6]
## tailnum n
## <chr> <int>
## 1 N954UW 206
## 2 N952UW 163
## 3 N957UW 142
## 4 N5FAAA 117
## 5 N38727 99
## 6 N3742C 98