Chapter 24 - Model building
library(tidyverse)
library(modelr)
24.2 - Why are low quality diamonds more expensive?
Problem 1
In the plot of lcarat
vs. lprice
, there are some bright vertical strips. What do they represent?
The carats of the diamonds bunch at key values like 0.5 carats and 1 carat. This bunching shows up as bright vertical stripes when using geom_hex()
.
Problem 2
If log(price) = a_0 + a_1 * log(carat)
, what does that say about the relationship between price
and carat
?
A one log-unit increase in carat increases log-price by a_1
. A zero-carat diamond has log-price of a_0
.
Problem 3
Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are the particularly bad or good, or do you think these are pricing errors?
Most of the errors are predictions that are lower than the sale price. There could be some unobserved characteristic that explains why these diamonds are more valuable than the model suggests.
diamonds2 <- diamonds %>%
filter(carat <= 2.5) %>%
mutate(lprice = log2(price), lcarat = log2(carat))
mod_diamond2 <- lm(lprice ~ lcarat + color + cut + clarity, data = diamonds2)
diamonds2 <- diamonds2 %>%
add_residuals(mod_diamond2, "lresid2")
diamonds2 %>%
filter(abs(lresid2) > 1) %>%
add_predictions(mod_diamond2) %>%
mutate(pred = round(2 ^ pred)) %>%
select(price, pred, carat:table, x:z) %>%
arrange(price) %>%
mutate(error = ifelse(price > pred, "too low", "too high"))
## # A tibble: 16 x 12
## price pred carat cut color clarity depth table x y z
## <int> <dbl> <dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1013 264 0.25 Fair F SI2 54.4 64 4.3 4.23 2.32
## 2 1186 284 0.25 Premium G SI2 59 60 5.33 5.28 3.12
## 3 1186 284 0.25 Premium G SI2 58.8 60 5.33 5.28 3.12
## 4 1262 2644 1.03 Fair E I1 78.2 54 5.72 5.59 4.42
## 5 1415 639 0.35 Fair G VS2 65.9 54 5.57 5.53 3.66
## 6 1415 639 0.35 Fair G VS2 65.9 54 5.57 5.53 3.66
## 7 1715 576 0.32 Fair F VS2 59.6 60 4.42 4.34 2.61
## 8 1776 412 0.290 Fair F SI1 55.8 60 4.48 4.41 2.48
## 9 2160 314 0.34 Fair F I1 55.8 62 4.72 4.6 2.6
## 10 2366 774 0.3 Very Good D VVS2 60.6 58 4.33 4.35 2.63
## 11 3360 1373 0.51 Premium F SI1 62.7 62 5.09 4.96 3.15
## 12 3807 1540 0.61 Good F SI2 62.5 65 5.36 5.29 3.33
## 13 3920 1705 0.51 Fair F VVS2 65.4 60 4.98 4.9 3.23
## 14 4368 1705 0.51 Fair F VVS2 60.7 66 5.21 5.11 3.13
## 15 10011 4048 1.01 Fair D SI2 64.6 58 6.25 6.2 4.02
## 16 10470 23622 2.46 Premium E SI2 59.7 59 8.82 8.76 5.25
## # ... with 1 more variable: error <chr>
Problem 4
Does the final model, mod_diamonds2
, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond?
mod_diamonds2
was estimated on a large data set with substantial variation and does a good job capturing the variance in the data with a relatively parsimonious specification. The residuals from this model are heteroskedasticitic which suggests there isn’t a significant confounding variable missing from the model. I would trust it to to get me in the ball-park but wouldn’t trust it the dollar.
24.3 - What affects the number of daily flights?
Problem 1
Use your Google sleuthing skills to brainstorm why there were fewer than expected flights on Jan 20, May 26, and Sep 1. (Hint: they all have the same explanation.) How would these days generalise to another year?
January 20th was the Saturday before Martin Luther King Jr. day. May 26th was the Saturday before Memorial Day. September 1st was the Saturday before Labor Day. This could be generalized to other years as the Saturdays before MLK Jr. day, Memorial Day, and Labor Day.
Problem 2
What do the three days with high positive residuals represent? How would these days generalise to another year?
November 30th was the Saturday after Thanksgiving, December 1st was the Sunday after Thanksgiving, and December 28th was the Saturday after Christmas. These could be generalized to another year as the Saturday and Sunday after Thanksgiving and the Saturday after Christmas.
Problem 3
Create a new variable that splits the wday variable into terms, but only for Saturdays, i.e. it should have Thurs, Fri, but Sat-summer, Sat-spring, Sat-fall. How does this model compare with the model with every combination of wday and term?
The new model is less accurate than the old model in the spring and summer.
The old model has 20 predictors and the new model has 8 predictors.
library(nycflights13)
library(lubridate)
term <- function(date) {
cut(date,
breaks = ymd(20130101, 20130605, 20130825, 20140101),
labels = c("spring", "summer", "fall")
)
}
daily <- flights %>%
mutate(date = make_date(year, month, day)) %>%
group_by(date) %>%
summarise(n = n()) %>%
mutate(wday = wday(date, label = TRUE)) %>%
mutate(term = term(date)) %>%
mutate(date_combined = case_when(
wday == "Sat" & term == "spring" ~ "Sat-spring",
wday == "Sat" & term == "summer" ~ "Sat-summer",
wday == "Sat" & term == "fall" ~ "Sat-fall",
TRUE ~ as.character(wday)
)) %>%
mutate(date_combined = as.factor(date_combined))
old_model <- lm(n ~ wday * term, data = daily)
new_model <- lm(n ~ date_combined, data = daily)
daily %>%
gather_residuals(old_model = old_model, new_model = new_model) %>%
ggplot(aes(date, resid, colour = model)) +
geom_line(alpha = 0.75)
Problem 4
Create a new wday variable that combines the day of week, term (for Saturdays), and public holidays. What do the residuals of that model look like?