Chapter 23 - Model basics
library(tidyverse)
23.2 - A simple model
Problem 1
One downside of the linear model is that it is sensitive to unusual values because the distance incorporates a squared term. Fit a linear model to the simulated data below, and visualise the results. Rerun a few times to generate different simulated datasets. What do you notice about the model?
sim1a <- tibble(
  x = rep(1:10, each = 3),
  y = x * 1.5 + 6 + rt(length(x), df = 2)
)

set.seed(20180315)
sim1a <- tibble(
  x = rep(rep(1:10, each = 3), 4),
  y = x * 1.5 + 6 + rt(length(x), df = 2),
  model = rep(c("model1", "model2", "model3", "model4"), each = length(x) / 4)
)

sim1a %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~model)
The fitted line is surprisingly stable across simulations, but it doesn't do a good job of accounting for values that fall far from it: the t-distributed noise with df = 2 has heavy tails, so extreme values appear regularly and pull on the least-squares fit.
Problem 2
One way to make linear models more robust is to use a different distance measure. For example, instead of root-mean-squared distance, you could use mean-absolute distance:
measure_distance <- function(mod, data) {
  diff <- data$y - make_prediction(mod, data)
  mean(abs(diff))
}
Use optim() to fit this model to the simulated data above and compare it to the linear model.
library(modelr)

model1 <- function(a, data) {
  a[1] + data$x * a[2]
}

measure_distance <- function(mod, data) {
  diff <- data$y - model1(mod, data)
  mean(abs(diff))
}

best <- optim(c(0, 0), measure_distance, data = sim1a)

ggplot(sim1a, aes(x, y)) +
  geom_point(size = 2, colour = "grey30") +
  geom_abline(intercept = best$par[1], slope = best$par[2], color = "red")
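To make the comparison the exercise asks for, one option is to fit the ordinary least-squares model with lm() and overlay both lines. This is just a sketch, reusing best from above; sim1a_lm is an illustrative name rather than part of the original solution.
# Least-squares fit for comparison (illustrative name)
sim1a_lm <- lm(y ~ x, data = sim1a)

ggplot(sim1a, aes(x, y)) +
  geom_point(size = 2, colour = "grey30") +
  # mean-absolute-distance fit found by optim()
  geom_abline(intercept = best$par[1], slope = best$par[2], colour = "red") +
  # least-squares fit
  geom_abline(
    intercept = coef(sim1a_lm)[1], slope = coef(sim1a_lm)[2],
    colour = "blue", linetype = "dashed"
  )
Because absolute distance penalises large errors less severely than squared distance, the red line tends to be less affected by the heavy-tailed outliers than the blue least-squares line.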
Problem 3
One challenge with performing numerical optimisation is that it’s only guaranteed to find one local optima. What’s the problem with optimising a three parameter model like this?
model1 <- function(a, data) {
  a[1] + data$x * a[2] + a[3]
}
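The problem is that a[1] and a[3] only ever enter the model as the sum a[1] + a[3], so the parameters are not identifiable: infinitely many (a[1], a[3]) pairs give exactly the same predictions, and optim() will report different ones depending on the starting values. A quick sketch to see this, with an illustrative distance function and arbitrary starting points:
measure_distance3 <- function(mod, data) {
  pred <- mod[1] + data$x * mod[2] + mod[3]
  mean(abs(data$y - pred))
}

# Different starts converge to different splits between mod[1] and mod[3],
# but their sum and the slope mod[2] are essentially unchanged.
optim(c(0, 1, 0), measure_distance3, data = sim1a)$par
optim(c(10, 1, -10), measure_distance3, data = sim1a)$par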
23.3 Visualizing models
Problem 1
Instead of using lm() to fit a straight line, you can use loess() to fit a smooth curve. Repeat the process of model fitting, grid generation, predictions, and visualisation on sim1 using loess() instead of lm(). How does the result compare to geom_smooth()?
sim1_loess <- loess(y ~ x, data = sim1)

grid <- sim1 %>%
  data_grid(x) %>%
  add_predictions(sim1_loess)

ggplot(sim1, aes(x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = pred), data = grid, colour = "red", size = 1) +
  geom_smooth(aes(y = y))
The loess() predictions trace the same curve as geom_smooth(): for a dataset this small, geom_smooth() uses loess() by default, so the only visible difference is that geom_smooth() also draws the standard-error ribbon.
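One way to check this directly is to ask geom_smooth() for loess explicitly and overlay the manual predictions; a sketch, reusing grid from the chunk above:
ggplot(sim1, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE, colour = "blue") +
  geom_line(aes(y = pred), data = grid, colour = "red", linetype = "dashed")
The red and blue lines should lie on top of one another.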
Problem 2
add_predictions() is paired with gather_predictions() and spread_predictions(). How do these three functions differ?
add_predictions() works with a single model, while gather_predictions() and spread_predictions() accept several. gather_predictions() returns the predictions in long form: one row per model per grid point, with a model column identifying the model and a single pred column. spread_predictions() returns them in wide form: one new prediction column per model.
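A small illustration; this is just a sketch, and the two models below (sim1_mod_lm and sim1_mod_loess) are placeholder fits to sim1 rather than models used elsewhere in the chapter:
sim1_mod_lm    <- lm(y ~ x, data = sim1)
sim1_mod_loess <- loess(y ~ x, data = sim1)

# one model, one `pred` column
sim1 %>% data_grid(x) %>% add_predictions(sim1_mod_lm)

# long form: a `model` column plus a single `pred` column
sim1 %>% data_grid(x) %>% gather_predictions(sim1_mod_lm, sim1_mod_loess)

# wide form: one prediction column per model
sim1 %>% data_grid(x) %>% spread_predictions(sim1_mod_lm, sim1_mod_loess)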
Problem 3
What does geom_ref_line() do? What package does it come from? Why is displaying a reference line in plots showing residuals useful and important?
geom_ref_line(), from the modelr package, adds a horizontal or vertical reference line to a plot. In a residual plot, a reference line at zero makes it easy to see whether the residuals are centred on zero and whether any systematic pattern remains above or below the line, which would suggest the model has missed structure in the data.
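A minimal sketch of the typical use, assuming a simple lm() fit to sim1 (sim1_mod is an illustrative name):
sim1_mod <- lm(y ~ x, data = sim1)

sim1 %>%
  add_residuals(sim1_mod) %>%
  ggplot(aes(x, resid)) +
  geom_ref_line(h = 0) +  # horizontal reference line at zero
  geom_point()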
Problem 4
Why might you want to look at a frequency polygon of absolute residuals? What are the pros and cons compared to looking at the raw residuals?
A frequency polygon of absolute residuals is useful for examining the typical magnitude of a model's errors. The drawback is that it throws away the sign of each residual, so it can obscure real differences between residuals above and below the fitted values, i.e. whether the model systematically over- or under-predicts in places.
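For concreteness, a sketch of the two views, again assuming a simple lm() fit to sim1; the binwidth is arbitrary:
sim1_resid <- sim1 %>% add_residuals(lm(y ~ x, data = sim1))

# raw residuals: keeps the sign, shows over- vs under-prediction
ggplot(sim1_resid, aes(resid)) +
  geom_freqpoly(binwidth = 0.5)

# absolute residuals: emphasises the typical size of the errors
ggplot(sim1_resid, aes(abs(resid))) +
  geom_freqpoly(binwidth = 0.5)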
23.4 Formulas and model families
Problem 1
What happens if you repeat the analysis of sim2 using a model without an intercept? What happens to the model equation? What happens to the predictions?
mod2 <- lm(y ~ x - 1, data = sim2)

grid <- sim2 %>%
  data_grid(x) %>%
  add_predictions(mod2)

ggplot(sim2, aes(x)) +
  geom_point(aes(y = y)) +
  geom_point(data = grid, aes(y = pred), colour = "red", size = 4)
The model equation changes from y = a_0 + a_1 * xb + a_2 * xc + a_3 * xd to y = a_1 * xa + a_2 * xb + a_3 * xc + a_4 * xd. In the first model the intercept a_0 stands in for the reference level xa; in the second model there is no intercept, so each of the four levels of the discrete variable gets its own coefficient. The predictions don't change.
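One way to confirm the predictions match is to compute them side by side with spread_predictions(); a sketch, where mod2a is an illustrative name for the with-intercept model:
mod2a <- lm(y ~ x, data = sim2)      # with intercept
mod2  <- lm(y ~ x - 1, data = sim2)  # without intercept

# the two prediction columns should be identical
sim2 %>%
  data_grid(x) %>%
  spread_predictions(mod2a, mod2)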
Problem 2
Use model_matrix()
to explore the equations generated for the models I fit to sim3
and sim4
. Why is *
a good shorthand for interaction?
mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)
model_matrix(y ~ x1 * x2, data = sim3)
## # A tibble: 120 x 8
##    `(Intercept)`    x1   x2b   x2c   x2d `x1:x2b` `x1:x2c` `x1:x2d`
##            <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl>
##  1             1     1     0     0     0        0        0        0
##  2             1     1     0     0     0        0        0        0
##  3             1     1     0     0     0        0        0        0
##  4             1     1     1     0     0        1        0        0
##  5             1     1     1     0     0        1        0        0
##  6             1     1     1     0     0        1        0        0
##  7             1     1     0     1     0        0        1        0
##  8             1     1     0     1     0        0        1        0
##  9             1     1     0     1     0        0        1        0
## 10             1     1     0     0     1        0        0        1
## # ... with 110 more rows
mod1 <- lm(y ~ x1 + x2, data = sim4)
mod2 <- lm(y ~ x1 * x2, data = sim4)
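On the last question: in a model formula, x1 * x2 expands to x1 + x2 + x1:x2, i.e. both main effects plus their interaction, which is what you almost always want when modelling an interaction, so * is a convenient shorthand. The same exploration works for sim4, where both predictors are continuous and the interaction column is just their product; a sketch:
model_matrix(y ~ x1 + x2, data = sim4)
model_matrix(y ~ x1 * x2, data = sim4)  # adds an `x1:x2` column equal to x1 * x2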