Econometrics Puzzler #1: To Instrument or Not?
Welcome to the first installment of the Econometrics Puzzler, a new series of shorter posts that will test and strengthen your econometric intuition. Here’s the format: I’ll pose a question that requires only introductory econometrics knowledge but has an unexpected answer. The idea is for you to ponder the question before reading my solution. Many of these questions are based on common misconceptions that come up year after year in my econometrics teaching. I hope you’ll find them both challenging and enlightening. Today we’ll revisit everyone’s favorite example: Angrist & Krueger’s 1991 paper on the returns to education.1
To Instrument or Not to Instrument?
Suppose I want to predict someone’s wage as accurately as possible using a linear model; that is, I want my predictions to be as close as they can be to the actual wages. (In fact, we’ll predict the log of wage.) I observe a representative sample of workers that includes their log wage \(Y_i\) and years of schooling \(X_i\). I could use an OLS regression of \(Y\) on \(X\) to make my predictions, but years of schooling is the classic example of an endogenous regressor: it’s correlated with myriad unobserved causes of wages, like “ability” and family background. Fortunately, I also have a valid and relevant instrument: quarter of birth \(Z_i\) is correlated with years of schooling and (supposedly) uncorrelated with unobserved causes of wage.2
So here’s the question: to get the best possible predictions of wage from the information I have, should I run OLS or IV? More specifically, let’s use mean squared error (MSE) as our measure of “best”. To borrow a term from Grant Sanderson, “pause and ponder” before reading further.
Taking it to the Data
The Angrist & Krueger (1991) dataset is available from Michal Kolesár’s ManyIV R package.3
Here I’ll restrict attention to people born in the first or fourth quarter of the year. The instrument is a dummy variable for being born in the fourth quarter, relative to being born in the first quarter:
# remotes::install_github("kolesarm/ManyIV") # if needed
library(ManyIV) # Contains Angrist & Krueger (1991) dataset
# For information about the dataset, see the package documentation:
# ?ManyIV::ak80
library(dplyr)
dat <- ak80 |>
  as_tibble() |>
  filter(qob %in% c('Q1', 'Q4')) |>
  mutate(z = (qob == 'Q4')) |>
  select(x = education, y = lwage, z)
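Before going any further, it’s worth a quick check that the instrument really is relevant in this subsample. Here’s a minimal sketch (output not shown; in the Angrist & Krueger setting, people born later in the year tend to get slightly more schooling, so the coefficient on the dummy should be positive and precisely estimated):
first_stage <- lm(x ~ z, data = dat) # Regress schooling on the Q4 dummy
summary(first_stage) # Coefficient on zTRUE estimates the Q4-vs-Q1 schooling gap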
To test how well OLS and IV perform as predictors, we’ll carry out a “pseudo-out-of-sample” experiment. First we’ll randomly split dat into a “training” sample containing 80% of the observations and a “test” sample containing the remaining 20%:
set.seed(1693) # For reproducibility
n_total <- nrow(dat)
n_train <- round(0.8 * n_total)
n_test <- n_total - n_train
train_indices <- sample(n_total, n_train, replace = FALSE)
dat_train <- dat[train_indices, ]
dat_test <- dat[-train_indices, ]
Now we’ll use dat_train to fit OLS and IV:
ols_fit <- lm(y ~ x, data = dat_train)
ols_coefs <- coef(ols_fit)
library(ivreg) # install with `install.packages("ivreg")` if needed
iv_fit <- ivreg(y ~ x | z, data = dat_train)
iv_coefs <- coef(iv_fit)
rbind(OLS = ols_coefs, IV = iv_coefs)
## (Intercept) x
## OLS 5.004283 0.07008633
## IV 4.749959 0.09000644
Now we’re ready to make our predictive comparison! We’ll “pretend” that we don’t know the wages of the people in our test sample and use the OLS and IV coefficients from above to predict the “missing” wages:
dat_test <- dat_test |>
  mutate(ols_pred = ols_coefs[1] + ols_coefs[2] * x,
         iv_pred = iv_coefs[1] + iv_coefs[2] * x)
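As an aside, we could have obtained the same predictions from R’s predict() generic, which both lm and ivreg objects support via a newdata argument. A quick sketch, assuming the objects defined above:
ols_pred2 <- predict(ols_fit, newdata = dat_test)
iv_pred2 <- predict(iv_fit, newdata = dat_test)
all.equal(unname(ols_pred2), dat_test$ols_pred) # should be TRUE
all.equal(unname(iv_pred2), dat_test$iv_pred)   # should be TRUE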
Of course we actually do know the wages of everyone in dat_test; this is the column y. So we can now compare our predictions against the truth.4 Recall that mean squared error (MSE) is the average squared difference between the truth and our predictions: \(\frac{1}{n}\sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2\). Because it squares each error, MSE penalizes larger mistakes more heavily than smaller ones. There are other ways to measure prediction error, but MSE is a common choice and one that will play a key role in the rest of this post. And the winner is … OLS! Because it has a lower MSE, the predictions from the OLS model are, on average, closer to the true wages than the predictions from the IV model:
dat_test |>
  summarize(ols_mse = mean((y - ols_pred)^2),
            iv_mse = mean((y - iv_pred)^2))
## # A tibble: 1 × 2
## ols_mse iv_mse
## <dbl> <dbl>
## 1 0.407 0.411
OLS beats IV by a small but appreciable margin. (The difference is modest here because the IV and OLS estimates are fairly similar in this example.) It turns out that this isn’t a fluke: the same will be true in any example. Unless the instrument is perfectly correlated with the endogenous regressor, in which case the IV and OLS estimands coincide, OLS will always have a lower predictive MSE than IV in the population, and hence in any sufficiently large test sample.
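If you’d rather not take my word for it, here’s a minimal simulation sketch. Every number below (the sample size, the coefficients, the variances) is made up purely for illustration, and the variable names are my own; the only essential features are that unobserved ability confounds schooling and wages, and that the instrument shifts schooling without affecting wages directly:
set.seed(54321)
n <- 100000
ability <- rnorm(n)                           # unobserved confounder
z_sim <- rbinom(n, 1, 0.5)                    # valid, relevant binary instrument
x_sim <- 1 + 0.5 * z_sim + ability + rnorm(n) # endogenous regressor
y_sim <- 2 + 0.3 * x_sim + ability + rnorm(n) # true causal effect of x is 0.3
sim <- data.frame(y = y_sim, x = x_sim, z = z_sim)
sim_train <- sim[1:(n / 2), ]
sim_test <- sim[(n / 2 + 1):n, ]
ols_sim <- lm(y ~ x, data = sim_train)
iv_sim <- ivreg(y ~ x | z, data = sim_train)
test_mse <- function(fit) mean((sim_test$y - predict(fit, newdata = sim_test))^2)
c(ols = test_mse(ols_sim), iv = test_mse(iv_sim)) # OLS should come out lower
Rerun this with a different seed or different parameter values: as long as the instrument isn’t perfectly correlated with the regressor, OLS should keep winning.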
What’s really going on here?
I ask this question of my introductory econometrics students every year, and most of them are surprised by the answer. If we have an endogenous regressor, OLS is biased and inconsistent; why would we ever pass up the opportunity to use a valid and relevant instrument? The answer is surprisingly simple: by definition, the OLS estimand gives the best linear predictor of \(Y\), the one that solves \(\min_{a,b} \mathbb{E}[\{Y - (a + b X)\}^2]\). This is true regardless of whether \(X\) is endogenous. Indeed, from a predictive perspective, endogeneity is a feature, not a bug! The fact that years of schooling “smuggles in” information about ability and family background is exactly why it gives better predictions than IV. Remember: the whole point of IV is to remove the part of \(X\) that is related to unobserved causes of \(Y\). This is exactly what we want if our goal is to understand cause-and-effect, but it’s the opposite of what makes sense in a prediction problem, where we’d like to use as much information as possible.
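To see the contrast in formulas, compare the two population estimands (these are the standard textbook expressions, nothing specific to this example):
\[
b_{\text{OLS}} = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}, \qquad b_{\text{IV}} = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, X)}.
\]
Solving the first-order conditions of the minimization problem above yields \(b_{\text{OLS}}\), along with the intercept \(a_{\text{OLS}} = \mathbb{E}[Y] - b_{\text{OLS}} \mathbb{E}[X]\). The IV estimand instead makes the prediction error uncorrelated with \(Z\) rather than with \(X\). Whenever the two estimands differ, IV is solving the wrong problem for prediction, so its population MSE is strictly larger.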
A Red Herring: The Bias-Variance Tradeoff
Students sometimes answer this question by invoking the bias-variance tradeoff, pointing out that “OLS is biased but has a lower variance than IV, so it could have a lower MSE.” This is correct but misses the deeper point. They’re thinking about bias in estimating the causal parameter.5 But, again, this isn’t relevant when prediction is our goal. When ML researchers discuss the bias-variance tradeoff in predictive settings, they mean something entirely different: the bias of a linear predictive model relative to the true conditional mean function. OLS gives the best linear approximation to \(\mathbb{E}[Y|X]\), so it’s exactly what we want here, since I stipulated at the outset that we’d be working with linear models.
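For concreteness, here’s the decomposition the students have in mind, stated for an estimator \(\hat{\beta}\) of the causal slope \(\beta\) (a standard identity, not anything specific to OLS or IV):
\[
\mathbb{E}[(\hat{\beta} - \beta)^2] = \left(\mathbb{E}[\hat{\beta}] - \beta\right)^2 + \text{Var}(\hat{\beta}).
\]
This is a statement about how far an estimated coefficient falls from the true causal effect, not about how far predicted wages fall from actual wages, which is why it doesn’t settle the prediction question posed above.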
Take Home Message
Causal inference and prediction are different goals. Causality is about counterfactuals: what would happen if we intervened to change someone’s years of education? Prediction answers a different question: if I observe that someone has eight years of schooling, what is my best guess of their wage? If you want to predict, use OLS; if you want to estimate a causal effect, use IV.
I’m sick of this example too, but the point of this puzzler is to get you thinking about instrumental variables; using an example that most people know will get us to the punch line faster.↩︎
If you’re unfamiliar with this example, check out my video overview, including some discussion of why quarter of birth might not really be exogenous after all!↩︎
You can install this package using the remotes package, which is a convenient way to install packages from GitHub.↩︎
It’s crucial that we used one dataset to estimate our models and a different one to evaluate their predictive performance, to avoid a problem called “overfitting”. This issue calls for a post of its own, but if you want a preview, check out this blog post on Goodhart’s law.↩︎
When our goal is to learn the causal parameter, this bias-variance tradeoff becomes relevant. I even wrote a paper!↩︎