Chapter 16 - Dates and times
Load the libraries needed for these exercises.
library(tidyverse)
library(lubridate)
library(nycflights13)
16.2 - Creating date/times
Problem 1
What happens if you parse a string that contains invalid dates?
ymd(c("2010-10-10", "bananas"))
## Warning: 1 failed to parse.
## [1] "2010-10-10" NA
The invalid string is parsed as NA, and lubridate warns that one element failed to parse.
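A minimal sketch of how one might locate the offending strings after a failed parse. The vector name raw and the choice of quiet = TRUE to silence the warning are illustrative and not part of the original solution.

```r
library(lubridate)

# Hypothetical input: one valid date and one string that cannot be parsed
raw <- c("2010-10-10", "bananas")

# quiet = TRUE suppresses the "failed to parse" warning; failures still become NA
parsed <- ymd(raw, quiet = TRUE)

# Strings that did not parse
failed <- raw[is.na(parsed)]
failed
```

Checking is.na() on the result is a simple way to audit a column of date strings before relying on it.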
Problem 2
What does the tzone argument to today() do? Why is it important?
tzone sets the time zone used to determine the current date. It defaults to the system time zone. It is important because the date is not the same everywhere at a given instant: every hour another time zone moves from today to tomorrow, so today() can differ by a day depending on the zone. When analyzing data recorded in another time zone, ignoring this can shift dates.
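A small sketch of the effect, assuming the system's time zone database includes the two zone names below. They were picked for illustration because their UTC offsets are more than 24 hours apart.

```r
library(lubridate)

# Two zones whose UTC offsets differ by more than 24 hours (illustrative picks)
east <- today(tzone = "Pacific/Kiritimati")  # UTC+14
west <- today(tzone = "Pacific/Pago_Pago")   # UTC-11

# The calendar date in Kiritimati is always ahead of the date in Pago Pago
east - west
```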
Problem 3
Use the appropriate lubridate function to parse the following dates:
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
mdy(d1)
## [1] "2010-01-01"
ymd(d2)
## [1] "2015-03-07"
dmy(d3)
## [1] "2017-06-06"
mdy(d4)
## [1] "2015-08-19" "2015-07-01"
mdy(d5)
## [1] "2014-12-30"
16.3 - Date-time components
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights %>%
  filter(!is.na(dep_time), !is.na(arr_time)) %>%
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>%
  select(origin, dest, ends_with("delay"), ends_with("time"))
Problem 1
How does the distribution of flight times within a day change over the course of the year?
flights_dt %>%
  mutate(dep_hour = update(dep_time, yday = 1),
         month = month(dep_time, label = TRUE)) %>%
  ggplot(aes(dep_hour, color = month)) +
  geom_freqpoly(binwidth = 900) +
  labs(title = "Distribution of Flight Times by Month")
Problem 2
Compare dep_time, sched_dep_time, and dep_delay. Are they consistent? Explain your findings.
dep_time, sched_dep_time, and dep_delay are mostly consistent. The only issue arises when a delay extends past midnight: dep_time is built from the year, month, and day columns, so its day does not advance when a flight is delayed beyond its scheduled day.
flights_dt %>%
  mutate(dep_time2 = sched_dep_time + dep_delay * 60) %>%
  filter(dep_time != dep_time2) %>%
  select(sched_dep_time, dep_time, dep_time2)
## # A tibble: 1,205 x 3
## sched_dep_time dep_time dep_time2
## <dttm> <dttm> <dttm>
## 1 2013-01-01 18:35:00 2013-01-01 08:48:00 2013-01-02 08:48:00
## 2 2013-01-02 23:59:00 2013-01-02 00:42:00 2013-01-03 00:42:00
## 3 2013-01-02 22:50:00 2013-01-02 01:26:00 2013-01-03 01:26:00
## 4 2013-01-03 23:59:00 2013-01-03 00:32:00 2013-01-04 00:32:00
## 5 2013-01-03 21:45:00 2013-01-03 00:50:00 2013-01-04 00:50:00
## 6 2013-01-03 23:59:00 2013-01-03 02:35:00 2013-01-04 02:35:00
## 7 2013-01-04 23:59:00 2013-01-04 00:25:00 2013-01-05 00:25:00
## 8 2013-01-04 22:45:00 2013-01-04 01:06:00 2013-01-05 01:06:00
## 9 2013-01-05 23:59:00 2013-01-05 00:14:00 2013-01-06 00:14:00
## 10 2013-01-05 22:30:00 2013-01-05 00:37:00 2013-01-06 00:37:00
## # ... with 1,195 more rows
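One possible correction, sketched on a toy tibble rather than the full data: when a delayed flight's departure appears earlier than its scheduled time, assume it left the following day and add one day. The column names mirror flights_dt. The two rows and the rule itself are illustrative assumptions.

```r
library(dplyr)
library(lubridate)

# Toy data shaped like flights_dt: row 1 is delayed past midnight, row 2 is not
toy <- tibble(
  sched_dep_time = ymd_hm(c("2013-01-02 23:59", "2013-01-02 09:00")),
  dep_time       = ymd_hm(c("2013-01-02 00:42", "2013-01-02 09:10")),
  dep_delay      = c(43, 10)
)

fixed <- toy %>%
  mutate(
    # A delayed flight whose departure looks earlier than scheduled rolled past midnight
    overnight = dep_delay > 0 & dep_time < sched_dep_time,
    dep_time = dep_time + days(overnight * 1),
    # Check: scheduled time plus the delay (in seconds) should now match
    consistent = dep_time == sched_dep_time + dep_delay * 60
  )
fixed
```

Requiring dep_delay > 0 keeps flights that simply left early from being pushed to the next day.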
Problem 3
Compare air_time with the duration between the departure and arrival. Explain your findings. (Hint: consider the location of the airport.)
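The two values rarely agree, and several effects are likely at work. dep_time and arr_time are recorded in local time, so for destinations outside the Eastern time zone the difference is off by the time zone offset (the point of the hint). The computed duration presumably also includes time on the ground, such as taxiing, which air_time excludes. Finally, arr_time is built from the departure date, so overnight arrivals appear to land before they departed.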
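flights_dt %>%
  # Request minutes explicitly; the default difftime unit is chosen automatically
  mutate(air_time_calc = as.numeric(difftime(arr_time, dep_time, units = "mins")),
         air_time_diff = air_time - air_time_calc) %>%
  select(origin, dest, air_time, air_time_calc, air_time_diff)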
## # A tibble: 328,063 x 5
## origin dest air_time air_time_calc air_time_diff
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 EWR IAH 227 193 34
## 2 LGA IAH 227 197 30
## 3 JFK MIA 160 221 -61
## 4 JFK BQN 183 260 -77
## 5 LGA ATL 116 138 -22
## 6 EWR ORD 150 106 44
## 7 EWR FLL 158 198 -40
## 8 LGA IAD 53 72 -19
## 9 JFK MCO 140 161 -21
## 10 LGA ORD 138 115 23
## # ... with 328,053 more rows
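The hint about airport location can be illustrated with a single made-up flight. The sketch below compares a naive difference of local clock times with a difference taken after attaching each airport's time zone. The times and the New York to Chicago route are hypothetical.

```r
library(lubridate)

# Hypothetical flight: departs New York 08:00 local, arrives Chicago 10:30 local
dep_local <- "2013-01-01 08:00"
arr_local <- "2013-01-01 10:30"

# Naive: treat both clock times as if they were in the same zone
naive <- as.numeric(difftime(ymd_hm(arr_local), ymd_hm(dep_local), units = "mins"))

# Zone-aware: attach each airport's own time zone before subtracting
aware <- as.numeric(difftime(
  ymd_hm(arr_local, tz = "America/Chicago"),
  ymd_hm(dep_local, tz = "America/New_York"),
  units = "mins"
))

c(naive = naive, aware = aware)
```

For the real data, the airports table in nycflights13 has a tzone column that could supply the zone name for each destination.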
Problem 4
How does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?
The average delay time increases slightly over the course of a day. This makes sense. Events that delay flights, like weather, mechanical issues, and pilot flight limits, accumulate over the course of the day and increase the probability of a flight being delayed.
sched_dep_time or dep_time could make sense. sched_dep_time is more useful if you’re planning on scheduling a flight and want to avoid delays!
flights_dt %>%
  mutate(sched_dep_time = update(sched_dep_time, yday = 1)) %>%
  ggplot(aes(sched_dep_time, dep_delay)) +
  geom_point(alpha = 0.05) +
  geom_smooth()
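As a numeric complement to the plot, the following sketch averages the departure delay for each scheduled departure hour. It works on the raw flights table, where sched_dep_time is stored as an HHMM integer. The names delay_by_hour and sched_hour are illustrative.

```r
library(dplyr)
library(nycflights13)

# Mean departure delay (minutes) for each scheduled departure hour
delay_by_hour <- flights %>%
  filter(!is.na(dep_delay)) %>%
  mutate(sched_hour = sched_dep_time %/% 100) %>%
  group_by(sched_hour) %>%
  summarize(avg_delay = mean(dep_delay), n = n())

delay_by_hour
```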
Problem 5
On what day of the week should you leave if you want to minimize the chance of a delay?
Saturday boasts the lowest percentage of flights that have delayed departures and delayed arrivals.
flights_dt %>%
  mutate(day_of_week = wday(dep_time, label = TRUE),
         delayed = ifelse(dep_delay > 0, 1, 0)) %>%
  group_by(day_of_week) %>%
  summarize(delay_prob = mean(delayed)) %>%
  ggplot(aes(day_of_week, delay_prob)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = scales::percent(delay_prob)), vjust = -0.25, size = 3) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.6)) +
  labs(title = "Percentage of Flight Departures Delayed By Day of the Week",
       subtitle = "Flights are Delayed if They Depart >= 1 Minute Behind Schedule",
       x = "Day of the Week",
       y = "Percentage of Flights Delayed")
flights_dt %>%
  mutate(day_of_week = wday(arr_time, label = TRUE),
         delayed = ifelse(arr_delay > 0, 1, 0)) %>%
  group_by(day_of_week) %>%
  summarize(delay_prob = mean(delayed, na.rm = TRUE)) %>%
  ggplot(aes(day_of_week, delay_prob)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = scales::percent(delay_prob)), vjust = -0.25, size = 3) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.6)) +
  labs(title = "Percentage of Flight Arrivals Delayed By Day of the Week",
       subtitle = "Flights are Delayed if They Arrive >= 1 Minute Behind Schedule",
       x = "Day of the Week",
       y = "Percentage of Flights Delayed")
Problem 6
What makes the distribution of diamonds$carat and flights_dep_time similar?
Humans round. In the case of the diamonds, they always round up!
# Histogram of carat: the y-axis is a count, not a price
ggplot(data = diamonds, mapping = aes(x = carat)) +
  geom_histogram(bins = 100) +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Distribution of Diamond Sizes",
       x = "Carat",
       y = "Count")
flights_dt %>%
  mutate(dep_time = update(dep_time, yday = 1)) %>%
  ggplot(aes(dep_time)) +
  geom_histogram(bins = 100)
Problem 7
Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells whether or not the flight was delayed.
Early departures of scheduled flights in minutes 20-30 and 50-60 are a contributing factor to the non-uniform distribution of average delay times shown on page 245.
flights_dt %>%
  mutate(Minute = minute(dep_time),
         dep_delay_dummy = ifelse(dep_delay > 0, 1, 0),
         dep_delay_dummy = factor(dep_delay_dummy, labels = c("On Time", "Delayed"))) %>%
  ggplot(aes(Minute, color = dep_delay_dummy)) +
  geom_freqpoly() +
  labs(title = "Distribution of Minutes of Departure Times by Delay Status",
       y = "Count")
16.4 - Time spans
Problem 1
Why is there months() but no dmonths()?
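Unlike hours, days, and weeks, a month has no fixed length: it can last 28, 29, 30, or 31 days. A duration is an exact number of seconds, so no single value could represent "one month". months() works because it is a period, which counts in calendar units rather than seconds.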
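A short sketch of why a month has no fixed length in seconds. The dates are arbitrary illustrative picks from 2015, a non-leap year.

```r
library(lubridate)

# Month lengths differ, so "one month" has no fixed number of seconds
feb <- unname(days_in_month(ymd("2015-02-01")))
jul <- unname(days_in_month(ymd("2015-07-01")))

# months() is a period: it advances the calendar month, whatever its length
jan_span <- as.numeric((ymd("2015-01-01") + months(1)) - ymd("2015-01-01"))
feb_span <- as.numeric((ymd("2015-02-01") + months(1)) - ymd("2015-02-01"))

c(feb = feb, jul = jul, jan_span = jan_span, feb_span = feb_span)
```

Adding months(1) moves the date forward by 31 days in January but only 28 days in February.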
Problem 2
Explain days(overnight * 1) to someone who has just started learning R. How does it work?
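overnight is a logical vector, and in arithmetic TRUE counts as 1 and FALSE counts as 0. Multiplying by 1 therefore converts it to a numeric vector of ones and zeros, which days() turns into periods of one day or zero days. If it’s an overnight flight, adding that period moves the time to the next calendar day (23, 24, or 25 hours later, depending on the day of the year); other flights are left unchanged.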
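The steps can be traced on a two-element example. The overnight flags and arrival times below are made up for illustration.

```r
library(lubridate)

# Hypothetical flags: the first flight arrives the next day, the second does not
overnight <- c(TRUE, FALSE)

# Multiplying by 1 turns TRUE/FALSE into 1/0
overnight * 1

# days() builds a period of 1 day or 0 days for each flight
arr_time <- ymd_hm(c("2013-01-01 00:30", "2013-01-01 14:00"))
shifted <- arr_time + days(overnight * 1)
shifted
```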
Problem 3
Create a vector of dates giving the first day of every month in 2015. Create a vector of dates giving the first day of every month in the current year.
ymd("2015-01-01") + months(0:11)
## [1] "2015-01-01" "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01"
## [6] "2015-06-01" "2015-07-01" "2015-08-01" "2015-09-01" "2015-10-01"
## [11] "2015-11-01" "2015-12-01"
floor_date(today(), unit = "year") + months(0:11)
## [1] "2018-01-01" "2018-02-01" "2018-03-01" "2018-04-01" "2018-05-01"
## [6] "2018-06-01" "2018-07-01" "2018-08-01" "2018-09-01" "2018-10-01"
## [11] "2018-11-01" "2018-12-01"
Problem 4
Write a function that, given your birthday (as a date), returns how old you are in years.
age <- function(birthday) {
  (birthday %--% today()) / dyears(1)
}
age(ymd("1992-03-14"))
## [1] 26.2274
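If a whole number of completed years is wanted instead of a fractional value, one alternative is integer division of the interval by a period of one year. The function name age_whole_years and the on argument are hypothetical additions for illustration. Passing a fixed reference date makes the result reproducible.

```r
library(lubridate)

# Hypothetical variant: completed years, with the reference date as an argument
age_whole_years <- function(birthday, on = today()) {
  # Integer division of an interval by a period counts full calendar years
  (birthday %--% on) %/% years(1)
}

age_whole_years(ymd("1992-03-14"), on = ymd("2018-06-05"))
```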
Problem 5
Why can’t (today() %--% (today() + years(1)) / months(1) work?
The parentheses are unbalanced: the expression has one more opening parenthesis than closing parentheses, so R cannot parse it.
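For comparison, a balanced version of the expression is sketched below. A fixed start date replaces today() so the result is reproducible; that substitution and the variable names are illustrative.

```r
library(lubridate)

# Fixed start date (an illustrative choice) so the result is reproducible
start <- ymd("2015-01-01")

# Balanced version: the interval is fully enclosed before dividing by months(1)
n_months <- (start %--% (start + years(1))) / months(1)
n_months
```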