Chapter 23 - Model basics
library(tidyverse)
23.2 A simple model
Problem 1
One downside of the linear model is that it is sensitive to unusual values because the distance incorporates a squared term. Fit a linear model to the simulated data below, and visualise the results. Rerun a few times to generate different simulated datasets. What do you notice about the model?
sim1a <- tibble(
  x = rep(1:10, each = 3),
  y = x * 1.5 + 6 + rt(length(x), df = 2)
)
set.seed(20180315)
sim1a <- tibble(
    x = rep(rep(1:10, each = 3), 4),
    y = x * 1.5 + 6 + rt(length(x), df = 2),
    model = rep(c("model1", "model2", "model3", "model4"), each = length(x) / 4)
)
sim1a %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~model)
The fitted line is surprisingly stable across the simulated datasets, but it doesn't account well for points that are far from it: the errors come from a t-distribution with 2 degrees of freedom, which has much heavier tails than the normal distribution, so extreme values appear that pull on the least-squares fit.
Problem 2
One way to make linear models more robust is to use a different distance measure. For example, instead of root-mean-squared distance, you could use mean-absolute distance:
measure_distance <- function(mod, data) {
  diff <- data$y - make_prediction(mod, data)
  mean(abs(diff))
}
Use optim() to fit this model to the simulated data above and compare it to the linear model.
library(modelr)
model1 <- function(a, data) {
  a[1] + data$x * a[2]
}
measure_distance <- function(mod, data) {
  diff <- data$y - model1(mod, data)
  mean(abs(diff))
}
best <- optim(c(0, 0), measure_distance, data = sim1a)
ggplot(sim1a, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(intercept = best$par[1], slope = best$par[2], color = "red")
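The problem also asks for a comparison with the linear model. A minimal sketch of that comparison (not part of the original answer; it assumes the best and sim1a objects defined above) overlays the least-squares fit from lm():
fit_lm <- lm(y ~ x, data = sim1a)  # ordinary least-squares fit, for comparison
ggplot(sim1a, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  # red: mean-absolute-distance fit from optim(); blue: least-squares fit from lm()
  geom_abline(intercept = best$par[1], slope = best$par[2], colour = "red") +
  geom_abline(intercept = coef(fit_lm)[1], slope = coef(fit_lm)[2], colour = "blue")
The blue least-squares line is typically pulled further towards the heavy-tailed outliers than the red mean-absolute-distance line.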
Problem 3
One challenge with performing numerical optimisation is that it’s only guaranteed to find one local optimum. What’s the problem with optimising a three parameter model like this?
model1 <- function(a, data) {
  a[1] + data$x * a[2] + a[3]
}
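The problem is that a[1] and a[3] only ever enter the model through their sum a[1] + a[3], so the model is not identifiable: infinitely many parameter vectors give exactly the same fitted values, and optim() can converge to a different one depending on the starting point. A quick sketch that illustrates this (the model3 and measure_distance3 names are made up here; it assumes sim1a from above):
model3 <- function(a, data) {
  a[1] + data$x * a[2] + a[3]   # a[1] and a[3] are redundant: only their sum matters
}
measure_distance3 <- function(mod, data) {
  mean(abs(data$y - model3(mod, data)))
}
# Different starting values can return different splits between a[1] and a[3],
# while the sum a[1] + a[3] and the slope a[2] stay essentially the same.
optim(c(0, 0, 0), measure_distance3, data = sim1a)$par
optim(c(10, 1, -10), measure_distance3, data = sim1a)$par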
23.3 Visualising models
Problem 1
Instead of using lm() to fit a straight line, you can use loess() to fit a smooth curve. Repeat the process of model fitting, grid generation, predictions, and visualisation on sim1 using loess() instead of lm(). How does the result compare to geom_smooth()?
sim1_loess <- loess(y ~ x, data = sim1)
grid <- sim1 %>% 
  data_grid(x) %>% 
  add_predictions(sim1_loess) 
ggplot(sim1, aes(x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = pred), data = grid, colour = "red", size = 1) +
  geom_smooth(aes(y = y))
The loess() predictions trace out exactly the same curve as geom_smooth(), because geom_smooth() also uses method = "loess" by default for a dataset this small; the only difference is that geom_smooth() adds a standard-error ribbon.
Problem 2
add_predictions() is paired with gather_predictions() and spread_predictions(). How do these three functions differ?
gather_predictions() and spread_predictions() work with multiple models, whereas add_predictions() only works with a single model. gather_predictions() adds the predictions in long form: the data are repeated once per model, with a model column giving the model name and a pred column giving the prediction. spread_predictions() adds the predictions in wide form: one new column of predictions per model.
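A minimal sketch of the difference (the sim1_lm and sim1_loess names are illustrative; it assumes the lm() and loess() fits to sim1 from the previous exercise):
sim1_lm    <- lm(y ~ x, data = sim1)
sim1_loess <- loess(y ~ x, data = sim1)
sim1 %>% add_predictions(sim1_lm)                  # one model: a single pred column
sim1 %>% gather_predictions(sim1_lm, sim1_loess)   # long form: model and pred columns
sim1 %>% spread_predictions(sim1_lm, sim1_loess)   # wide form: one column per model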
Problem 3
What does geom_ref_line() do? What package does it come from? Why is displaying a reference line in plots showing residuals useful and important?
geom_ref_line(), from the modelr package, adds a horizontal or vertical reference line to a ggplot. In a plot of residuals, a reference line at zero is useful and important because the residuals from a well-fitting model should be scattered evenly around zero; the line makes any systematic pattern or drift away from zero immediately visible.
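For example, a residual plot for the sim1 model with a zero reference line might look like the sketch below (sim1_mod is a hypothetical name; the model is refitted here so the chunk is self-contained):
sim1_mod <- lm(y ~ x, data = sim1)
sim1 %>%
  add_residuals(sim1_mod) %>%
  ggplot(aes(x, resid)) +
  geom_ref_line(h = 0) +   # residuals from a well-fitting model should straddle this line
  geom_point()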
Problem 4
Why might you want to look at a frequency polygon of absolute residuals? What are the pros and cons compared to looking at the raw residuals?
A frequency polygon of absolute residuals is useful for examining the typical magnitude of a model’s errors. The drawback is that taking absolute values discards the sign, so it can obscure real differences between residuals above and residuals below the fitted model (for example, a model that systematically over- or under-predicts).
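A sketch of such a plot, assuming the same sim1_mod as in the previous sketch:
sim1 %>%
  add_residuals(sim1_mod) %>%
  ggplot(aes(abs(resid))) +
  geom_freqpoly(binwidth = 0.5)   # shows the distribution of residual magnitudes only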
23.4 Formulas and model families
Problem 1
What happens if you repeat the analysis of sim2 using a model without an intercept? What happens to the model equation? What happens to the predictions?
mod2 <- lm(y ~ x - 1, data = sim2)
grid <- sim2 %>% 
  data_grid(x) %>% 
  add_predictions(mod2)
  
ggplot(sim2, aes(x)) + 
  geom_point(aes(y = y)) +
  geom_point(data = grid, aes(y = pred), colour = "red", size = 4)
The model equation changes from y = a_0 + a_1 * xb + a_2 * xc + a_3 * xd to y = a_1 * xa + a_2 * xb + a_3 * xc + a_4 * xd. In other words, with an intercept the first level (xa) is absorbed into the intercept a_0; without an intercept, each of the four levels of the categorical variable gets its own coefficient. The predictions do not change, because both parameterisations can represent the same four group means.
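As a quick check (not part of the original answer; mod1 here is the with-intercept model, refitted so the chunk stands on its own), the two sets of predictions can be put side by side:
mod1 <- lm(y ~ x, data = sim2)       # with intercept, as in the chapter
sim2 %>%
  data_grid(x) %>%
  spread_predictions(mod1, mod2)     # the mod1 and mod2 prediction columns are identical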
Problem 2
Use model_matrix() to explore the equations generated for the models I fit to sim3 and sim4. Why is * a good shorthand for interaction?
mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)
model_matrix(y ~ x1 * x2, data = sim3)
## # A tibble: 120 x 8
##    `(Intercept)`    x1   x2b   x2c   x2d `x1:x2b` `x1:x2c` `x1:x2d`
##            <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl>
##  1             1     1     0     0     0        0        0        0
##  2             1     1     0     0     0        0        0        0
##  3             1     1     0     0     0        0        0        0
##  4             1     1     1     0     0        1        0        0
##  5             1     1     1     0     0        1        0        0
##  6             1     1     1     0     0        1        0        0
##  7             1     1     0     1     0        0        1        0
##  8             1     1     0     1     0        0        1        0
##  9             1     1     0     1     0        0        1        0
## 10             1     1     0     0     1        0        0        1
## # ... with 110 more rows
mod1 <- lm(y ~ x1 + x2, data = sim4)
mod2 <- lm(y ~ x1 * x2, data = sim4)