Aaron R. Williams - Data Scientist (IBP)
# please ensure that these packages are installed
library(tidyverse)
library(tidymodels)
Machine Learning: The use of computer algorithms to parse data and estimate models that can be used for prediction, categorization, or dimension reduction.
Predictive Modeling: “The process of developing a mathematical tool or model that generates an accurate prediction.” ~ Max Kuhn and Kjell Johnson in Applied Predictive Modeling
There are many applications of machine learning for applied econometrics and causal inference. This will not be a focus of this training. Many useful links are available here.
Supervised Learning: Predictive modeling with a “target”, “response”, or “outcome” variable.
Outcome variable: The dependent variable in a predictive model.
Regression: Supervised learning with a continuous numeric outcome.
Classification: Supervised learning with a categorical outcome. The output of these models can be predicted classes of a categorical variable or predicted probabilities (e.g. 0.75 for “A” and 0.25 for “B”).
Predictor: Independent variables or features in a predictive model.
Unsupervised Learning: A process of summarizing data without a “target”, “response”, or “outcome” variable.
Clustering: Grouping observations into homogeneous groups.
Dimension reduction: Reducing the number of variables in a data set while maintaining the statistical properties of the data.
The second reason to build a statistical model is for inference. Here, the goal is to test a set of formal hypotheses with \(H_0\) as the null hypothesis and \(H_a\) as the alternative hypothesis. The hypotheses usually focus on the coefficients (e.g. \(\beta_1 \ne 0\)). Care should be taken with inference to develop hypotheses based on theory and to limit the number of tests conducted on a given set of data. Assumptions are also tremendously important. For example, simple linear regression assumes:
Linearity: The conditional mean of the outcome is a linear function of the predictors.
Independence: The errors are independent of one another.
Constant variance: The errors have the same variance (homoscedasticity).
Normality: The errors are normally distributed.
If these assumptions are approximately met, then test statistics can be developed from known sampling distributions. For coefficients, this is the \(t\)-distribution with \(n - p\) degrees of freedom.
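For example, a fitted linear regression in R reports these test statistics. Here is a minimal sketch using the built-in mtcars data (not part of the original example):
# fit a simple linear regression of fuel efficiency on weight
inference_model <- lm(mpg ~ wt, data = mtcars)
# summary() reports coefficient estimates, standard errors, t-statistics, and p-values
summary(inference_model)
# confidence intervals for the coefficients based on the t-distribution
confint(inference_model)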
The final motivation for building a statistical model is prediction. Here, the goal is to make informed and accurate guesses about the value or level of a variable given a set of predictor variables. Unlike inference, which usually focuses on the coefficients of predictor variables, the focus here is on the dependent variable.
Prediction will be the focus of supervised machine learning.
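As a minimal sketch of this shift in focus, the same kind of lm() fit can generate predictions for new observations; the new weights below are hypothetical:
# fit the model on the built-in mtcars data
prediction_model <- lm(mpg ~ wt, data = mtcars)
# predict fuel efficiency for hypothetical cars weighing 2,500 and 3,500 pounds
predict(prediction_model, newdata = data.frame(wt = c(2.5, 3.5)))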
\[\cdot\cdot\cdot\]
The same statistical model can summarize, be used for inference, and make valid and accurate predictions. However, the optimal model for one motivation is rarely best for all three motivations. Thus, it is important to clearly articulate the motivation for a statistical model before picking which tools and diagnostics to use.
Switching from an inferential framework to a predictive framework has two important implications.
Interpretability is less important if the sole objective is to make accurate predictions. Accordingly, predictive modeling considers a much wider range of parametric and nonparametric models, some of which are difficult or nearly impossible to interpret.
It is easy to get caught up in all of the algorithms, but it is far more important to understand the process of predictive modeling. We will focus on two new algorithms:
Unlike algorithms used for inference, many of the algorithms used for predictive modeling do not have distributional assumptions. This means that many diagnostics based on distributional assumptions are no longer available for model evaluation: the \(F\)-test, standard errors of coefficients, \(t\)-tests, prediction intervals, and confidence intervals for the regression line.
Instead, an error metric is chosen, and a key objective becomes estimating the out-of-sample value of that metric using the available data. The true out-of-sample error rate is rarely known, so it must be estimated.
GOAL: Minimize the prediction error of the estimated regression model on new data (out-of-sample).
There are several metrics for estimating error. The most popular for regression is Root Mean Square Error (RMSE):
\[RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^n (Y_i - \hat{Y}_i)^2}\]
where \(n\) is the number of observations, \(Y_i\) is the observed value, and \(\hat{Y}_i\) is the predicted value.
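As an illustration, RMSE can be calculated directly from the definition or with rmse() from the yardstick package (loaded with tidymodels); the observed and predicted values below are hypothetical:
# hypothetical observed and predicted values
results <- tibble(
  observed = c(10, 20, 30, 40),
  predicted = c(12, 18, 33, 41)
)
# RMSE computed directly from the definition
sqrt(mean((results$observed - results$predicted)^2))
# the same calculation with yardstick::rmse()
rmse(results, truth = observed, estimate = predicted)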
We will discuss error metrics for classification later.
Generalizability: How well a model makes predictions on unseen data relative to how well it makes predictions on the data used to estimate the model. (like external validity)
In-sample error: The predictive error of a model measured on the data used to estimate the predictive model.
Out-of-sample error: The predictive error of a model measured on the data not used to estimate the predictive model. Out-of-sample error is generally greater than the in-sample error.
Training set: A subset of data used to develop a predictive model. The share of data committed to a training set depends on the number of observations, the number of predictors in a model, and heterogeneity in the data. 0.8 is a common share.
Testing set: A subset of data used to estimate model performance. The testing set usually includes all observations not included in the training set. Do not look at these data until the very end, and only estimate the out-of-sample error rate on the testing set once. If the error rate is estimated more than once on the testing data, it will underestimate the error rate. (A sketch of creating training and testing sets follows these definitions.)
Data leakage: When information that won’t be available when the model makes out-of-sample predictions is used when estimating a model. Looking at data from the testing set creates data leakage. Data leakage leads to an underestimate of out-of-sample error.
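One way to create training and testing sets with tidymodels is initial_split() from the rsample package. Here is a minimal sketch using the built-in mtcars data; the object names are illustrative:
# set a seed so the split is reproducible
set.seed(20200301)
# reserve 80% of the observations for training and 20% for testing
cars_split <- initial_split(mtcars, prop = 0.8)
cars_train <- training(cars_split)
cars_test <- testing(cars_split)
# only use cars_test once, at the very end, to estimate the out-of-sample error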
\(K\)-Nearest Neighbors (KNN) is an algorithm that makes predictions based on the average (regression) or majority vote (classification) of the \(k\) most similar observations. Similarity is measured by the distance between the predictors of the observations in the training data and the predictors of the observation for which a prediction is being made.
Consider the following five observations:
Suppose a new observation has a known value of \(x_{new}\) and an unknown value of \(y\), and the goal is to predict \(\hat{y}\). KNN finds the \(k\) values of \(x\) closest to \(x_{new}\) and predicts \(\hat{y}\) as the mean of the \(y\) values for those \(k\) observations.
If \(x_{new} = 3\) and \(k = 1\), then \(\hat{y} = 9\).
If \(x_{new} = 3\) and \(k = 3\), then \(\hat{y} = \frac{3 + 9 + 8}{3} \approx 6.67\).
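As a sketch, the same calculation can be written with dplyr; the x and y values below are hypothetical and chosen only to be consistent with the worked example above:
# hypothetical training data consistent with the worked example
knn_data <- tibble(
  x = c(0, 2, 3, 4, 8),
  y = c(5, 3, 9, 8, 12)
)
x_new <- 3
# find the k = 3 nearest neighbors of x_new and average their y values
knn_data %>%
  mutate(distance = abs(x - x_new)) %>%
  arrange(distance) %>%
  slice_head(n = 3) %>%
  summarize(y_hat = mean(y))
# y_hat = (9 + 3 + 8) / 3, or about 6.67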
Finding the closest observation is trivial for a 1-dimensional predictor. Of course, most applications have more than one predictor. In these applications, Euclidean distance is a common way of measuring the distances between \(\mathbf{X}\) and \(\vec{x}_{new}\):
\[\text{Euclidean Distance} = \sqrt{\sum_{p = 1}^P (x_{ip} - x_{jp})^2}\]
where \(P\) is the number of predictors and \(i\) and \(j\) index the two observations being compared.
Consider the following four observations with two predictors, \(x_1\) and \(x_2\): \(a = (3, 3)\), \(b = (-3, -4)\), \(c = (-2, 5)\), and \(d = (6, -6)\). Suppose the new observation is at the origin, \((0, 0)\).
\[dist_{a} = \sqrt{(3 - 0) ^ 2 + (3 - 0)^2} = \sqrt{9 + 9} = \sqrt{18}\]
\[dist_{b} = \sqrt{(-3 - 0) ^ 2 + (-4 - 0)^2} = \sqrt{9 + 16} = \sqrt{25}\]
\[dist_{c} = \sqrt{(-2 - 0) ^ 2 + (5 - 0)^2} = \sqrt{4 + 25} = \sqrt{29}\]
\[dist_{d} = \sqrt{(6 - 0) ^ 2 + (-6 - 0)^2} = \sqrt{36 + 36} = \sqrt{72}\]
When \(k = 3\), the closest three observations are \(\{a, b, c\}\).
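As a sketch, these distances can be computed with dplyr using the values from the equations above; the object names are illustrative:
# the four observations and the new observation at the origin
neighbors <- tibble(
  observation = c("a", "b", "c", "d"),
  x1 = c(3, -3, -2, 6),
  x2 = c(3, -4, 5, -6)
)
x_new <- c(0, 0)
# Euclidean distance from each observation to the new observation
neighbors %>%
  mutate(distance = sqrt((x1 - x_new[1])^2 + (x2 - x_new[2])^2)) %>%
  arrange(distance)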
Note: the scale of the predictors is an important consideration when using KNN because variables measured on larger scales will dominate the distance calculation.
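When predictors are measured on very different scales, one common approach with tidymodels is to center and scale them in a recipe before fitting KNN. Here is a minimal sketch with hypothetical data and variable names:
# hypothetical training data where the predictors are on very different scales
knn_training <- tibble(
  y = c(10, 20, 30, 40),
  income = c(20000, 50000, 80000, 110000),
  age = c(25, 35, 45, 55)
)
# center and scale the numeric predictors so no single variable dominates
# the distance calculation
knn_recipe <- recipe(y ~ income + age, data = knn_training) %>%
  step_normalize(all_numeric_predictors())
# prep() estimates the means and standard deviations; bake() applies them
bake(prep(knn_recipe), new_data = NULL)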