econometrics.blog

Not Quite the James-Stein Estimator

Sat, 10 Aug 2024 00:00:00 +0000

If you study enough econometrics or statistics, you’ll eventually hear someone mention “Stein’s Paradox” or the “James-Stein Estimator”. You’ve probably learned in your introductory econometrics course that ordinary least squares (OLS) is the best linear unbiased estimator (BLUE) in a linear regression model under the Gauss-Markov assumptions. The stipulations “linear” and “unbiased” are crucial here. If we remove them, it’s possible to do better–maybe even much better–than OLS.¹ Stein’s paradox is a famous example of this phenomenon, one that created much consternation among statisticians and fellow-travelers when it was first pointed out by Charles Stein in the mid-1950s. The example is interesting in its own right, but also has deep connections to ideas in Bayesian inference and machine learning, making it much more than a mere curiosity.

The supposed paradox is most simply stated by considering a special case of linear regression–that of estimating multiple unknown means. Efron & Morris (1977) introduce the basic idea as follows:

A baseball player who gets seven hits in 20 official times at bat is said to have a batting average of .350. In computing this statistic we are forming an estimate of the player’s true batting ability in terms of his observed average rate of success. Asked how well the player will do in his next 100 times at bat, we would probably predict 35 more hits. In traditional statistical theory it can be proved that no other estimation rule is uniformly better than the observed average. The paradoxical element in Stein’s result is that it sometimes contradicts this elementary law of statistical theory. If we have three or more baseball players, and if we are interested in predicting future batting averages for each of them, then there is a procedure that is better than simply extrapolating from the three separate averages. Here “better” has a strong meaning. The statistician who employs Stein’s method can expect to predict the future averages more accurately no matter what the true batting abilities of the players may be.

I first encountered Stein’s Paradox in an offhand remark by my PhD supervisor. I dutifully looked it up in an attempt to better understand the point he had been making, but lacked sufficient understanding of decision theory at the time to see what the fuss was all about. The second time I encountered it, after I knew a bit more, it seemed astounding: almost like magic. I decided to include the topic in my Econ 722 course at Penn, but struggled to make it accessible to my students. A big problem, in my view, is that the proof–see lecture 1 or section 7.3–is ultimately a bit of a let-down: algebra, followed by repeated integration by parts, and then a fact about the existence of moments for an inverse-chi-squared random variable. It seems like a sterile technical exercise when in fact the result itself is deep, surprising, and important. As if a benign deity were keen on making my point for me, the Wikipedia article on the James-Stein Estimator is flagged as “may be too technical for readers to understand” at the time of this writing!

After six months of pondering, this post is my attempt to explain the James-Stein Estimator in a way that is accessible to a broad audience. The assumed background is minimal: just an introductory course in probability and statistics. I’ll show how we can arrive at something that is very nearly the James-Stein estimator by following some very simple and natural intuition. After you understand my “not quite James-Stein” estimator, it’s a short step to the real thing. So the “let-down” proof I mentioned before becomes merely a technical justification for a slight modification of a formula that is already intuitively compelling. As far as possible, I’ve tried to keep this post self-contained by introducing, or at least reviewing, key background material as we go along. The cost of this approach, unfortunately, is that the post is pretty long! I hope you’ll soldier on to the end and that you’ll find the payoff worth your time and effort.

As far as I know, the precise way that I motivate the James-Stein estimator in this post is new, but there are many other papers that aim to make sense of the supposed paradox in an intuitive way. In keeping with my injunction that you should always consider reading something else instead, here are a few references that you may find helpful. Efron & Morris (1977) is a classic article aimed at the general reader without a background in statistics. Stigler (1988) is a more technical but still accessible discussion of the topic, while Casella (1985) is a very readable paper that discusses the James-Stein estimator in the context of empirical Bayes. A less well-known paper that I found helpful is Ijiri & Leitch (1980), who consider the James-Stein estimator in a real-world setting, namely “Audit Sampling” in accounting. They discuss several interesting practical and philosophical issues including the distinction between “composite” and “individual” risk that I’ll pick up on below.

Warm-up Exercise

This section provides some important background that we’ll need to understand Stein’s Paradox later in the post, reviewing the ideas of bias, variance and mean-squared error along with introducing a very simple shrinkage estimator. To make these ideas as transparent as possible we’ll start with a ridiculously simple problem. Suppose that you observe $X \sim \text{Normal}(\mu, 1)$, a single draw from a normal distribution with variance one and unknown mean $\mu$. Your task is to estimate $\mu$. This may strike you as a very silly problem: it only involves a single datapoint and we assume the variance of $X$ is one! But in fact there’s nothing special about $n = 1$ and a variance of one: these merely make the notation simpler. If you prefer, you can think of $X$ as the sample mean of $n$ iid draws from a population with unknown mean $\mu$ where we’ve rescaled everything to have variance one. So how should we estimate $\mu$? A natural and reasonable idea is to use the sample mean, in this case $X$ itself. This is in fact the maximum likelihood estimator for $\mu$, so I’ll define $\hat{\mu}_{\text{ML}} = X$. But is this estimator any good? And can we find something better?

Review of Bias, Variance and MSE

The concepts of bias and variance are key ideas that we typically reach for when considering the quality of an estimator. To refresh your memory, bias is the difference between an estimator’s expected value and the true value of the parameter being estimated, while variance is the expected squared difference between an estimator and its expected value. So if $\hat{\theta}$ is an estimator of some unknown parameter $\theta$, then $\text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$ while $\text{Var}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2]$. A bias of zero means that an estimator is correctly centered: its expectation equals the truth. We say that such an estimator is unbiased.² A small variance means that an estimator is precise: it doesn’t “jump around” too much. Ideally we’d like an estimator that is correctly centered and precise. But it turns out that there is generally a trade-off between bias and variance: if you want to reduce one of them, you have to accept an increase in the other.

A common way of trading off bias and variance relies on a concept called mean-squared error (MSE) defined as the sum of the squared bias and the variance.³ In particular: $\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2$. Equivalently, we can write $\text{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2]$.⁴ To borrow some terminology from introductory microeconomics, you can think of MSE as the negative of a utility function over bias and variance. Both bias and variance are “bads” in that we’d rather have less rather than more of each. This formula expresses our preferences in terms of how much of one we’d be willing to accept in exchange for less of the other. Slightly foreshadowing something that will come later in this post, we can think of MSE as the average squared distance that an archer’s arrows land from the bulls-eye. Smaller values of MSE are better: variance measures how closely the arrows cluster together while bias measures how far the center of the cluster is from the bulls-eye, as in the following diagram:

A Shrinkage Estimator

Returning to our maximum likelihood estimator: it’s unbiased, $\text{Bias}(\hat{\mu}_{\text{ML}}) = 0$, so $\text{MSE}(\hat{\mu}_{\text{ML}}) = \text{Var}(\hat{\mu}_{\text{ML}}) = 1$. Suppose that low MSE is what we’re after. Is there any way to improve on the ML estimator? In other words, can we achieve an MSE that’s lower than one? The answer turns out to be yes. Here’s the idea. Suppose we had some reason to believe that the true mean $\mu$ isn’t very large. Then perhaps we could try to adjust our maximum likelihood estimate by shrinking slightly towards zero. One way to do this would be by taking a weighted average of the ML estimator and zero: \[ \hat{\mu}(\lambda) = (1 - \lambda) \times \hat{\mu}_{\text{ML}} + \lambda \times 0 = (1 - \lambda)X \] for $0 \leq \lambda \leq 1$. The constant $(1 - \lambda)$ is called the “shrinkage factor” and controls how the ML estimator gets pulled towards zero.⁵ We get a different estimator for every value of $\lambda$. If $\lambda = 0$ then we get the ML estimator back. If $\lambda = 1$ then we get a very silly estimator that ignores the data and simply reports zero no matter what! So let’s see how the MSE depends on our choice of $\lambda$. Substituting the definition of $\hat{\mu}(\lambda)$ into the formulas for bias and variance gives: \[ \begin{align*} \text{Bias}[\hat{\mu}(\lambda)]&= \mathbb{E}[(1 - \lambda)\hat{\mu}_\text{ML}] - \mu = (1 - \lambda)\mathbb{E}[\hat{\mu}_\text{ML}] - \mu = (1 - \lambda)\mu - \mu = -\lambda\mu\\ \\ \text{Var}[\hat{\mu}(\lambda)]&= \text{Var}[(1 - \lambda)\hat{\mu}_\text{ML}] = (1 - \lambda)^2\text{Var}[\hat{\mu}_\text{ML}] = (1 - \lambda)^2\\ \\ \text{MSE}[\hat{\mu}(\lambda)]&= \text{Var}[\hat{\mu}(\lambda)] + \text{Bias}[\hat{\mu}(\lambda)]^2 = (1 - \lambda)^2 + \lambda^2\mu^2 \end{align*} \] Unless $\lambda = 0$, the shrinkage estimator is biased. 
And while the MSE of the ML estimator is always one, regardless of the true value of $\mu$, the MSE of the shrinkage estimator depends on the unknown parameter $\mu$.
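A quick way to convince yourself of these three formulas is to simulate. The sketch below is only an illustration: the values of `mu_true` and `lam`, the seed, and the number of replications are arbitrary choices, and the variable names are mine rather than anything used elsewhere in the post.

```r
# Simulation check of the bias, variance, and MSE formulas for the
# shrinkage estimator (1 - lambda) * X. The values of mu_true and lam, the
# seed, and the number of replications are arbitrary illustrative choices.
set.seed(1234)
mu_true <- 1.5
lam <- 0.3
n_sims <- 1e5
x <- mu_true + rnorm(n_sims)      # draws of X ~ Normal(mu, 1)
mu_shrink <- (1 - lam) * x        # shrinkage estimate for each draw

sim <- c(bias = mean(mu_shrink) - mu_true,
         var  = var(mu_shrink),
         mse  = mean((mu_shrink - mu_true)^2))
theory <- c(bias = -lam * mu_true,
            var  = (1 - lam)^2,
            mse  = (1 - lam)^2 + lam^2 * mu_true^2)
round(rbind(sim, theory), 3)
```

The two rows of the resulting table should agree up to simulation error, and the simulated MSE should be close to the sum of the simulated variance and squared bias.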

So why should we use a biased estimator? The answer is that by tolerating a small amount of bias we may be able to achieve a larger reduction in variance, resulting in a lower MSE compared to the higher variance but unbiased ML estimator. A quick plot shows us that the shrinkage estimator can indeed have a lower MSE than the ML estimator depending on the value of $\lambda$ and the true value of $\mu$:

# Range of values for the unknown parameter mu
mu <- seq(-4, 4, length = 100)
# Try three different values of lambda
lambda1 <- 0.1
lambda2 <- 0.2
lambda3 <- 0.3
# Plot the MSE of the shrinkage estimator as a function of mu for all
# three values of lambda at once
matplot(mu, cbind((1 - lambda1)^2 + lambda1^2 * mu^2,
(1 - lambda2)^2 + lambda2^2 * mu^2,
(1 - lambda3)^2 + lambda3^2 * mu^2),
type = 'l', lty = 1, lwd = 2,
col = c('red', 'blue', 'green'),
xlab = expression(mu), ylab = 'MSE',
main = 'MSE of Shrinkage Estimator')
# Add legend
legend('topright', legend = c(expression(lambda == 0.1),
expression(lambda == 0.2),
expression(lambda == 0.3)),
col = c('red', 'blue', 'green'), lty = 1, lwd = 2)
# Add dashed line for MSE of ML estimator
abline(h = 1, lty = 2, lwd = 2)

Some Algebra

It’s time for some algebra. If you’re tempted to skip this please don’t: this section is a warm-up for our main event. If you thoroughly understand the mechanics of shrinkage in this simple example, everything that follows below will seem much more natural.

As seen from the plot above, the MSE of our shrinkage estimator (the solid lines) is lower than that of the ML estimator (the dashed line) provided that our chosen value of $\lambda$ isn’t too large relative to the true value of $\mu$. With a bit of algebra, we can work out precisely how large $\lambda$ can be to make shrinkage worthwhile. Since $\text{MSE}[\hat{\mu}_\text{ML}]= 1$, by expanding and simplifying the expression for $\text{MSE}[\hat{\mu}(\lambda)]$ we see that $\text{MSE}[\hat{\mu}(\lambda)] < \text{MSE}[\hat{\mu}_\text{ML}]$ if and only if \[ \begin{align*} (1 - \lambda)^2 + \lambda^2\mu^2 &< 1 \\ 1 - 2\lambda + \lambda^2 + \lambda^2\mu^2 &< 1 \\ \lambda^2 (1 + \mu^2) -2 \lambda &< 0 \\ \lambda [\lambda (1 + \mu^2) - 2] &< 0. \end{align*} \] Since $\lambda \geq 0$, the final inequality can only hold if the factor inside the square brackets is negative, i.e. \[ \begin{align*} \lambda (1 + \mu^2) - 2 &< 0 \\ \lambda &< \frac{2}{1 + \mu^2}. \end{align*} \] This shows that any choice of $\lambda$ between $0$ and $2 / (1 + \mu^2)$ will give us a shrinkage estimator with an MSE less than one. To check our algebra, we can change the inequality to an equality and solve for $\mu$ to obtain the boundary of the region where shrinkage is better than ML: \[ \begin{align*} \lambda (1 + \mu^2) - 2 &= 0 \\ 1 + \mu^2 &= 2/\lambda \\ \mu &= \pm \sqrt{2/\lambda - 1}. \end{align*} \] Adding these boundaries to a simplified version of our previous plot with only $\lambda = 0.3$ we see that everything works out correctly: the dashed red lines intersect the blue curve at the points where the MSE of the shrinkage estimator equals that of the ML estimator.

# Plot the MSE of the shrinkage estimator as a function of mu for lambda = 0.3
lambda <- 0.3
plot(mu, (1 - lambda)^2 + lambda^2 * mu^2, type = 'l', lty = 1, lwd = 2,
col = 'blue', xlab = expression(mu), ylab = 'MSE',
main = 'Boundary of Region Where Shrinkage is Better than ML')
# Add dashed line for MSE of ML estimator
abline(h = 1, lty = 2, lwd = 2)
# Add boundaries of region where shrinkage is better than ML estimator
abline(v = c(sqrt(2/lambda - 1), -sqrt(2/lambda - 1)), lty = 3, lwd = 2,
col = 'red')

But there’s still more to learn! Suppose we wanted to take things one step further and find the optimal value of $\lambda$ for any given value of $\mu$. In other words, suppose we wanted the value of $\lambda$ that minimizes the MSE of our shrinkage estimator given a particular assumed value for $\mu$. Since $\text{MSE}[\hat{\mu}(\lambda)]$ is a quadratic function of $\lambda$, as shown above, this turns out to be a fairly straightforward calculation. Differentiating, \[ \begin{align*} \frac{d}{d\lambda}\text{MSE}[\hat{\mu}(\lambda)] &= \frac{d}{d\lambda}[(1 - \lambda)^2 + \lambda^2 \mu^2] \\ &= -2(1 - \lambda) + 2\lambda \mu^2 \\ &= 2 [\lambda (1 + \mu^2) - 1]\\ \\ \frac{d^2}{d\lambda^2}\text{MSE}[\hat{\mu}(\lambda)] &= 2(1 + \mu^2) > 0 \end{align*} \] so there is a unique global minimum at $\lambda^* \equiv 1/(1 + \mu^2)$. This gives the optimal shrinkage factor in the sense that it minimizes the MSE of the shrinkage estimator. Substituting $\lambda^*$ into the expression for $\text{MSE}[\hat{\mu}(\lambda)]$ gives: \[ \begin{align*} \text{MSE}[\hat{\mu}(\lambda^*)] &= \left(1 - \frac{1}{1 + \mu^2} \right)^2 + \left(\frac{1}{1 + \mu^2}\right)^2 \mu^2 \\ &= \left( \frac{\mu^2}{1 + \mu^2}\right)^2 + \left(\frac{1}{1 + \mu^2}\right)^2 \mu^2 \\ &= \left( \frac{1}{1 + \mu^2}\right)^2 (\mu^4 + \mu^2) \\ &= \left( \frac{1}{1 + \mu^2}\right)^2 \mu^2(1 + \mu^2) \\ &= \frac{\mu^2}{1 + \mu^2} < 1. \end{align*} \]
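If you would rather not trust the calculus, the same conclusions can be checked numerically. This is a minimal sketch for a single assumed value of $\mu$; the choice `mu_true = 2` and the helper name `mse_shrink` are mine, introduced purely for illustration.

```r
# Numerical check of the threshold and the optimal lambda for one assumed
# value of mu (mu_true = 2 is an arbitrary choice for illustration).
mse_shrink <- function(lam, mu) (1 - lam)^2 + lam^2 * mu^2
mu_true <- 2

# Minimize the MSE over lambda in [0, 1] numerically
opt <- optimize(mse_shrink, interval = c(0, 1), mu = mu_true)
lam_star <- 1 / (1 + mu_true^2)            # closed form derived above

c(numerical = opt$minimum, closed_form = lam_star)
c(numerical = opt$objective, closed_form = mu_true^2 / (1 + mu_true^2))

# At the threshold lambda = 2 / (1 + mu^2) the MSE equals one, the MSE of ML
mse_shrink(2 / (1 + mu_true^2), mu_true)
```

The numerical minimizer should match $1/(1 + \mu^2)$ up to the tolerance of `optimize()`, and the last line shows that shrinkage stops paying off exactly at $\lambda = 2/(1 + \mu^2)$.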

Stein’s Paradox

Recap

We’re moments away from having all the ingredients we need to introduce Stein’s Paradox! But first let’s review what we’ve uncovered thus far. We’ve seen that the shrinkage estimator can improve on the ML estimator in terms of MSE provided that $\lambda$ is chosen judiciously: it needs to be between zero and $2/(1 + \mu^2)$. The optimal choice of $\lambda$, namely $\lambda^* = 1 / (1 + \mu^2)$, gives an MSE of $\mu^2/(1 + \mu^2)$. This is always lower than one, the MSE of the ML estimator.

There’s just one massive problem we’ve ignored this whole time: we don’t know the value of $\mu$! As seen from the figure plotted above, the MSE curves for different values of $\lambda$ cross each other: the best one to use depends on the true value of $\mu$. This doesn’t mean that all is lost. Perhaps in practice we have some outside information about the likely value of $\mu$ that could help guide our choice of $\lambda$. What it does mean is that there’s no “one-size-fits-all” value.

Admissibility

It’s time to introduce a bit of technical vocabulary. We say that an estimator $\tilde{\theta}$ dominates another estimator $\hat{\theta}$ if $\text{MSE}[\tilde{\theta}] \leq \text{MSE}[\hat{\theta}]$ for all possible values of the parameter $\theta$ being estimated and $\text{MSE}[\tilde{\theta}] < \text{MSE}[\hat{\theta}]$ for at least one possible value of $\theta$.⁶ In words, this means that it never makes sense to use $\hat{\theta}$ in preference to $\tilde{\theta}$. No matter what the true parameter value is, you can’t do worse with $\tilde{\theta}$ and you might do better. An estimator that is not dominated by any other estimator is called admissible; an estimator that is dominated by some other estimator is called inadmissible. The concept of admissibility in decision theory is a bit like the concept of Pareto efficiency in microeconomics. An admissible estimator is only “good” in the sense that it doesn’t leave any money on the table: there’s no way to do better for one parameter value without doing worse for another. In a similar way, a Pareto efficient allocation in economics is one in which no individual can be made better off without making another person worse off.

It’s quite challenging to prove, but in fact the ML estimator $\hat{\mu}_\text{ML} = X$ turns out to be admissible in our little example. So while we could potentially do better by using shrinkage, it’s not a slam-dunk case. If we really have no idea of how large $\mu$ is likely to be, the ML estimator is a reasonable choice. Because it’s admissible, at the very least we know that there’s no free lunch!
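To make the definition of dominance concrete, here is a small sketch that compares the ML estimator with a fixed-$\lambda$ shrinkage estimator using the MSE formulas from the warm-up. The helper `dominates_on_grid` is hypothetical, and a grid check only looks at finitely many parameter values: it can rule dominance out, but it cannot prove it. It also says nothing about admissibility, which concerns every possible competitor rather than this one family.

```r
# Hypothetical helper: does the estimator with MSE values mse_a dominate the
# one with MSE values mse_b, judged only on a finite grid of parameter values?
dominates_on_grid <- function(mse_a, mse_b) {
  all(mse_a <= mse_b) && any(mse_a < mse_b)
}

mu_grid <- seq(-4, 4, length.out = 161)
lam <- 0.3                                         # arbitrary fixed lambda
mse_ml <- rep(1, length(mu_grid))                  # MSE of ML is always one
mse_shrink <- (1 - lam)^2 + lam^2 * mu_grid^2      # formula derived earlier

dominates_on_grid(mse_shrink, mse_ml)   # FALSE: shrinkage loses for large |mu|
dominates_on_grid(mse_ml, mse_shrink)   # FALSE: ML loses for small |mu|
```

Neither estimator dominates the other, which is exactly the “no one-size-fits-all” situation described in the recap.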

A More General Example

Now let’s make things a bit more interesting. For the rest of this post, suppose that we observe not a single draw $X$ from a $\text{Normal}(\mu, 1)$ distribution but a collection of $p$ independent draws from $p$ different normal distributions: \[ X_1, X_2, ..., X_p \sim \text{independent Normal}(\mu_j, 1), \quad j = 1, ..., p. \] You can think of this as $p$ copies of our original problem: we observe $X_j \sim \text{Normal}(\mu_j, 1)$ and our task is to estimate $\mu_j$. The observations are all independent, and each comes from a distribution with a potentially different mean. At first glance it seems like these $p$ separate problems should have absolutely nothing to do with each other. And indeed the maximum likelihood estimator for the collection of $p$ means is simply $\hat{\mu}^{(j)}_\text{ML} = X_j$. As above in our example with $p=1$, the question is: how good is the ML estimator, and can we do any better?

Composite MSE

But first things first: how can we evaluate the quality of $p$ estimators for $p$ different parameters at the same time? A common approach, and the one we will follow here, is to take the sum of the individual MSEs of each estimator, yielding a quantity called composite MSE. If $\hat{\mu}_1, \hat{\mu}_2, \dots, \hat{\mu}_p$ is a collection of estimators for each of the individual unknown means, then the composite MSE is defined as \[ \text{Composite MSE} \equiv \sum_{j=1}^p \text{MSE}(\hat{\mu}_j) = \sum_{j=1}^p \left[ \text{Bias}(\hat{\mu}_j)^2 + \text{Var}(\hat{\mu}_j)\right] = \sum_{j=1}^p \mathbb{E}[(\hat{\mu}_j - \mu_j)^2]. \] Adopting composite MSE as our measure of good performance means that we view each of the $p$ estimation problems as in some way “interchangeable”–we’re happy to accept a trade in which we do a slightly worse job estimating $\mu_j$ in exchange for doing a much better job estimating $\mu_k$. At the end of the post I’ll say a few more words about this idea and when it may or may not be reasonable. But for the rest of the post, we will assume that our goal is to minimize the composite MSE. The concept of composite MSE will be crucial in understanding why the James-Stein estimator works the way it does.
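Composite MSE is easy to approximate by simulation, which can help make the definition feel less abstract. The helper `composite_mse_sim` below is a hypothetical function written for this illustration, and the vector of means is an arbitrary choice. Using the single-mean formulas from the warm-up, the two calls should return roughly 4 and 2.77 respectively.

```r
# Hypothetical helper: approximate the composite MSE of an estimator by
# simulation. `estimator` maps a vector of p observations to p estimates.
composite_mse_sim <- function(estimator, mu, n_sims = 1e4) {
  p <- length(mu)
  sq_err <- replicate(n_sims, {
    x <- mu + rnorm(p)                  # X_j ~ independent Normal(mu_j, 1)
    sum((estimator(x) - mu)^2)          # squared errors summed over j
  })
  mean(sq_err)
}

set.seed(42)
mu_vec <- c(-1, 0, 0.5, 2)              # arbitrary means, for illustration
cmse_raw <- composite_mse_sim(function(x) x, mu_vec)           # use X_j itself
cmse_shrunk <- composite_mse_sim(function(x) 0.8 * x, mu_vec)  # shrink by 20%
c(raw = cmse_raw, shrunk = cmse_shrunk)
```

Notice that the shrunken estimator does worse than the raw one for the coordinate with $\mu_j = 2$ but better for the others; composite MSE only keeps track of the total.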

Stein’s Paradox

Putting our new idea into practice, we see that the composite MSE of the ML estimator is $p$ regardless of the true values of the individual means $\mu_1, \dots, \mu_p$ since \[ \sum_{j=1}^p \text{MSE}\left[\hat{\mu}^{(j)}_\text{ML}\right] = \sum_{j=1}^p \text{MSE}(X_j) = \sum_{j=1}^p \text{Var}(X_j) = p. \] If the ML estimator is admissible, then there should be no other estimator that always has an MSE less than or equal to $p$ and sometimes has an MSE strictly less than $p$. I’ve already told you that this is true when $p = 1$. When $p = 2$ it’s still true: the ML estimator remains admissible. But when $p \geq 3$ something very unexpected happens: it becomes possible to construct an estimator that dominates the ML estimator by using information from all of the $(X_1, ..., X_p)$ observations to estimate $\mu_j$. This is in spite of the fact that there is no obvious connection between the observations. Again: they are all independent and come from distributions with different means!

The estimator that does the trick is the so-called “James-Stein Estimator” (JS), defined according to \[ \hat{\mu}^{(j)}_\text{JS} = \left(1 - \frac{p - 2}{\sum_{k=1}^p X_k^2}\right)X_j. \] This estimator dominates the ML estimator when $p \geq 3$ in that
\[ \sum_{j=1}^p \text{MSE}\left[\hat{\mu}^{(j)}_\text{JS}\right] \leq \sum_{j=1}^p \text{MSE}\left[\hat{\mu}^{(j)}_\text{ML}\right]= p \] for all possible values of the $p$ unknown means $\mu_j$ with strict inequality for at least some values. Taking a closer look at the formula, we see that the James-Stein estimator is just a shrinkage estimator applied to each of the $p$ means, namely \[ \hat{\mu}^{(j)}_\text{JS} = (1 - \hat{\lambda}_\text{JS})X_j, \quad \hat{\lambda}_\text{JS} \equiv \frac{p - 2}{\sum_{k=1}^p X_k^2}. \] The shrinkage factor in the James-Stein estimator depends on the number of means we’re estimating, $p$, along with the overall sum of the squared observations. All else equal, the more parameters we need to estimate, the more we shrink each of them towards zero. And the farther the observations are from zero overall, the less we shrink each of them towards zero.

Just like our simple shrinkage estimator from above, the James-Stein estimator achieves a lower MSE by tolerating a small bias in exchange for a larger reduction in variance, compared to the higher-variance but unbiased ML estimator. Unlike our simple shrinkage estimator, the James-Stein estimator uses the data to determine the shrinkage factor. And as long as $p \geq 3$ it is always at least as good as the ML estimator and sometimes much better. The paradox is that this seems impossible: how can information from all of the observations be useful when they come from different distributions with no obvious connection?
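A simulation can’t prove dominance, since it only examines one configuration of means at a time, but it does make the claim tangible. The sketch below compares the composite MSE of ML and James-Stein for $p = 5$; the particular means, the seed, and the number of replications are arbitrary choices made for illustration.

```r
# Simulation sketch: composite MSE of ML versus James-Stein.
# The means, the seed, and the number of replications are arbitrary choices.
set.seed(2024)
p <- 5
mu_vec <- c(0.5, -1, 0, 1.5, 0.3)
n_sims <- 1e5

# Each row of x is one replication of (X_1, ..., X_p)
x <- matrix(rnorm(n_sims * p), nrow = n_sims, ncol = p)
x <- sweep(x, 2, mu_vec, "+")                 # add mu_j to column j

lambda_js <- (p - 2) / rowSums(x^2)           # one shrinkage weight per row
mu_js <- (1 - lambda_js) * x                  # weight i multiplies row i

err_ml <- rowSums(sweep(x, 2, mu_vec)^2)      # sweep subtracts by default
err_js <- rowSums(sweep(mu_js, 2, mu_vec)^2)
c(ML = mean(err_ml), JS = mean(err_js))
```

The ML figure should be close to $p = 5$ and the James-Stein figure noticeably smaller. Changing `mu_vec` to other values, including very large ones, is a good way to see how the size of the improvement varies.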

The rest of this post will not prove that the James-Stein estimator dominates the ML estimator. Instead it will try to convince you that there is some very good intuition for why the formula for the James-Stein estimator takes the form that it does. By the end, I hope you’ll feel that, far from seeming paradoxical, using all of the observations to determine the shrinkage factor for one particular $\mu_j$ makes perfect sense.

Where does the James-Stein Estimator Come From?

An Infeasible Estimator When $p = 2$

To start the ball rolling, let’s assume a can-opener: suppose that we don’t know any of the individual means $\mu_j$ but for some strange reason a benevolent deity has told us the value of their sum of squares: \[ c^2 \equiv \sum_{j=1}^p \mu_j^2. \] It turns out that this is enough information to construct a shrinkage estimator that always has a lower composite MSE than the ML estimator. Let’s see why this is the case. If $p = 1$, then telling you $c^2$ is the same as telling you $\mu^2$. Granted, knowledge of $\mu^2$ isn’t as informative as knowledge of $\mu$. For example, if I told you that $\mu^2 = 9$ you couldn’t tell whether $\mu = 3$ or $\mu = -3$. But, as we showed above, the optimal shrinkage estimator when $p=1$ sets $\lambda^* = 1/(1 + \mu^2)$ and yields an MSE of $\mu^2/(1 + \mu^2) < 1$. Since $\lambda^*$ only depends on $\mu$ through $\mu^2$, we’ve already shown that knowledge of $c^2$ allows us to construct a shrinkage estimator that dominates the ML estimator when $p = 1$.

So what if $p$ equals 2? In this case, knowledge of $c^2 = \mu_1^2 + \mu_2^2$ is equivalent to knowing the radius of a circle centered at the origin in the $(\mu_1, \mu_2)$ plane where the two unknown means must lie. For example, if I told you that $c^2 = 1$ you would know that $(\mu_1, \mu_2)$ lies somewhere on a circle of radius one centered at the origin. As illustrated in the following plot, the points $(x_1, x_2)$ and $(y_1, y_2)$ would then be potential values of $(\mu_1, \mu_2)$ as would all other points on the blue circle.

So how can we construct a shrinkage estimator of $(\mu_1, \mu_2)$ with lower composite MSE than the ML estimator if $c^2$ is known? While there are other possibilities, the simplest would be to use the same shrinkage factor for each of the two coordinates. In other words, our estimator would be \[ \hat{\mu}_1(\lambda) = (1 - \lambda)X_1, \quad \hat{\mu}_2(\lambda) = (1 - \lambda)X_2 \] for some $\lambda$ between zero and one. The composite MSE of this estimator is just the sum of the MSE of each individual component, so we can re-use our algebra from above to obtain \[ \begin{align*} \text{MSE}[\hat{\mu}_1(\lambda)] + \text{MSE}[\hat{\mu}_2(\lambda)] &= [(1 - \lambda)^2 + \lambda^2\mu_1^2] + [(1 - \lambda)^2 + \lambda^2\mu_2^2] \\ &= 2(1 - \lambda)^2 + \lambda^2(\mu_1^2 + \mu_2^2) \\ &= 2(1 - \lambda)^2 + \lambda^2c^2. \end{align*} \] Notice that the composite MSE only depends on $(\mu_1, \mu_2)$ through their sum of squares, $c^2$. Differentiating with respect to $\lambda$, just as we did above in the $p=1$ case, \[ \begin{align*} \frac{d}{d\lambda}\left[2(1 - \lambda)^2 + \lambda^2c^2\right] &= -4(1 - \lambda) + 2\lambda c^2 \\ &= 2 \left[\lambda (2 + c^2) - 2\right]\\ \\ \frac{d^2}{d\lambda^2}\left[2(1 - \lambda)^2 + \lambda^2c^2\right] &= 2(2 + c^2) > 0 \end{align*} \] so there is a unique global minimum at $\lambda^* = 2/(2 + c^2)$. Substituting this value of $\lambda$ into the expression for the composite MSE, a few lines of algebra give \[ \begin{align*} \text{MSE}[\hat{\mu}_1(\lambda^*)] + \text{MSE}[\hat{\mu}_2(\lambda^*)] &= 2\left(1 - \frac{2}{2 + c^2}\right)^2 + \left(\frac{2}{2 + c^2}\right)^2c^2 \\ &= 2\left(\frac{c^2}{2 + c^2}\right). \end{align*} \] Since $c^2/(2 + c^2) < 1$ for all $c^2 > 0$, the optimal shrinkage estimator always has a composite MSE less than $2$, the composite MSE of the ML estimator. Strictly speaking this estimator is infeasible since we don’t know $c^2$. But it’s a crucial step on our journey to make the leap from applying shrinkage to an estimator for a single unknown mean, to using the same idea for more than one unknown mean.
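The claim that only $c^2$ matters is easy to check directly. In the sketch below, the function `composite_mse2` simply encodes the composite MSE formula from this section; the value $c^2 = 4$ and the two points chosen on the corresponding circle are arbitrary, and all names are mine.

```r
# Composite MSE of the common-lambda shrinkage estimator when p = 2,
# using the formula derived above. The value of c^2 and the two points
# on the circle are arbitrary illustrative choices.
composite_mse2 <- function(lam, mu1, mu2) {
  ((1 - lam)^2 + lam^2 * mu1^2) + ((1 - lam)^2 + lam^2 * mu2^2)
}

csq_demo <- 4
lam_star <- 2 / (2 + csq_demo)

# Two different points on the circle mu1^2 + mu2^2 = 4
at_axis <- composite_mse2(lam_star, mu1 = 2, mu2 = 0)
at_diag <- composite_mse2(lam_star, mu1 = sqrt(2), mu2 = -sqrt(2))
closed_form <- 2 * csq_demo / (2 + csq_demo)
c(at_axis = at_axis, at_diag = at_diag, closed_form = closed_form)

# Numerical minimization over lambda agrees with lam_star
opt2 <- optimize(composite_mse2, interval = c(0, 1), mu1 = 2, mu2 = 0)
c(numerical = opt2$minimum, closed_form = lam_star)
```

Both points on the circle give the same composite MSE, and that common value matches $2c^2/(2 + c^2)$.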

A Simulation Experiment for $p = 2$

You may have already noticed that it’s easy to generalize this argument to $p>2$. But before we consider the general case, let’s take a moment to understand the geometry of shrinkage estimation for $p=2$ a bit more deeply. The nice thing about two-dimensional problems is that they’re easy to plot. So here’s a graphical representation of both the ML estimator and our infeasible optimum shrinkage estimator when $p = 2$. I’ve set the true, unknown, values of $\mu_1$ and $\mu_2$ to one so the true value of $c^2$ is $2$ and the optimal choice of $\lambda$ is $\lambda^* = 2/(2 + c^2) = 2/4 = 0.5$. The following R code simulates our estimators and visualizes their performance, helping us see the shrinkage effect in action.

set.seed(1983)
nreps <- 50
mu1 <- mu2 <- 1
x1 <- mu1 + rnorm(nreps)
x2 <- mu2 + rnorm(nreps)
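csq <- mu1^2 + mu2^2
# Note: `lambda` here is the multiplier applied to the data, 1 - lambda*,
# which equals csq / (2 + csq) because lambda* = 2 / (2 + csq).
# With csq = 2 both lambda* and 1 - lambda* equal 0.5.
lambda <- csq / (2 + csq)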
par(mfrow = c(1, 2))
# Left panel: ML Estimator
plot(x1, x2, main = 'MLE', pch = 20, col = 'black', cex = 2,
xlab = expression(mu[1]), ylab = expression(mu[2]))
abline(v = mu1, lty = 1, col = 'red', lwd = 2)
abline(h = mu2, lty = 1, col = 'red', lwd = 2)
# Add MSE to the plot
text(x = 2, y = 3, labels = paste("MSE =",
round(mean((x1 - mu1)^2 + (x2 - mu2)^2), 2)))
# Right panel: Shrinkage Estimator
plot(x1, x2, main = 'Shrinkage', xlab = expression(mu[1]),
ylab = expression(mu[2]))
points(lambda * x1, lambda * x2, pch = 20, col = 'blue', cex = 2)
segments(x0 = x1, y0 = x2, x1 = lambda * x1, y1 = lambda * x2, lty = 2)
abline(v = mu1, lty = 1, col = 'red', lwd = 2)
abline(h = mu2, lty = 1, col = 'red', lwd = 2)
abline(v = 0, lty = 1, lwd = 2)
abline(h = 0, lty = 1, lwd = 2)
# Add MSE to the plot
text(x = 2, y = 3, labels = paste("MSE =",
round(mean((lambda * x1 - mu1)^2 +
(lambda * x2 - mu2)^2), 2)))

My plot has two panels. The left panel shows the raw data. Each black point is a pair $(X_1, X_2)$ of independent normal draws with means $(\mu_1 = 1, \mu_2 = 1)$ and variances $(1, 1)$. As such, each point is also the ML estimate (MLE) of $(\mu_1, \mu_2)$ based on $(X_1, X_2)$. The red cross shows the location of the true values of $(\mu_1, \mu_2)$, namely $(1, 1)$. There are 50 points in the plot, representing 50 replications of the simulation, each independent of the rest and with the same parameter values. This allows us to measure how close the ML estimator is to the true value of $(\mu_1, \mu_2)$ in repeated sampling, approximating the composite MSE.

The right panel is more complicated. This shows both the ML estimates (unfilled black circles) and the corresponding shrinkage estimates (filled blue circles) along with dashed lines connecting them. Each shrinkage estimate is constructed by “pulling” the corresponding MLE towards the origin by a factor of $\lambda = 0.5$. Thus, if a given unfilled black circle is located at $(X_1, X_2)$, the corresponding filled blue circle is located at $(0.5X_1, 0.5X_2)$. As in the left panel, the red cross in the right panel shows the true values of $(\mu_1, \mu_2)$, namely $(1, 1)$. The black cross, on the other hand, shows the point towards which the shrinkage estimator pulls the ML estimator, namely $(0, 0)$.

We see immediately that the ML estimator is unbiased: the black filled dots in the left panel (along with the unfilled ones in the right) are centered at $(1, 1)$. But the ML estimator is also high-variance: the black dots are quite spread out around $(1, 1)$. We can approximate the composite MSE of the ML estimator by computing the average squared Euclidean distance between the black points and the red cross.⁷ And in keeping with our theoretical calculations, the simulation gives a composite MSE of almost exactly 2 for the ML estimator.

In contrast, the optimal shrinkage estimator is biased: the filled blue dots in the right panel are centered somewhere between the red cross (the true means) and the origin. But the shrinkage estimator also has a lower variance: the filled blue dots are much closer together than the black ones. Even more importantly they are on average closer to $(\mu_1, \mu_2)$, as indicated by the red cross and as measured by composite MSE. Our theoretical calculations showed that the composite MSE of the optimal shrinkage estimator equals $2c^2/(2 + c^2)$. When $c^2 = 2$, as in this case, we obtain $2\times 2/(2 + 2) = 1$. Again, this is almost exactly what we see in the simulation.

If we had used more than 50 simulation replications, the composite MSE values would have been even closer to our theoretical predictions, at the cost of making the plot much harder to read! But I hope the key point is still clear: shrinkage pulls the MLE towards the origin, and can give a much lower composite MSE.
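If you’d like to see this for yourself, here is a sketch of the same experiment with many more replications and no plotting. The seed and the number of replications are arbitrary, and I’ve used fresh variable names so that nothing from the plotting code above gets overwritten.

```r
# Same design as the plots above, but with many more replications and no
# plotting. The seed and the number of replications are arbitrary choices.
set.seed(1983)
n_big <- 1e5
m1 <- m2 <- 1                               # true means, as in the plots
z1 <- m1 + rnorm(n_big)
z2 <- m2 + rnorm(n_big)
lam_opt <- 2 / (2 + m1^2 + m2^2)            # optimal lambda, here 0.5

big_sim <- c(ML = mean((z1 - m1)^2 + (z2 - m2)^2),
             shrinkage = mean(((1 - lam_opt) * z1 - m1)^2 +
                              ((1 - lam_opt) * z2 - m2)^2))
big_sim
```

The two numbers should land very close to the theoretical values of 2 and 1.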

An Infeasible Estimator: The General Case

Now that we understand the case of $p=2$, the general case is a snap. Our shrinkage estimator of each $\mu_j$ will take the form \[ \hat{\mu}_j(\lambda) = (1 - \lambda) X_j, \quad j = 1, \dots, p \] for some $\lambda$ between zero and one. To find the optimal choice of $\lambda$, we minimize \[ \sum_{j=1}^p\text{MSE}\left[\hat{\mu}_j(\lambda) \right] = \sum_{j=1}^p \left[(1 - \lambda)^2 + \lambda^2 \mu_j^2\right] = p(1 - \lambda)^2 + \lambda^2 c^2 \] with respect to $\lambda$. Again, the key is that the composite MSE only depends on the unknown means through $c^2$. Using almost exactly the same calculations as above for the case of $p = 2$, we find that \[ \lambda^* = \frac{p}{p + c^2}, \quad \sum_{j=1}^p \text{MSE}\left[\hat{\mu}_j(\lambda^*) \right] = p\left(\frac{c^2}{p + c^2}\right). \] Since $c^2/(p + c^2) < 1$ for all $c^2 > 0$, the optimal shrinkage estimator always has a composite MSE less than $p$, the composite MSE of the ML estimator.
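In case it helps to see those calculations written out for general $p$, here is a sketch that uses only the notation above. Setting the derivative of the composite MSE with respect to $\lambda$ equal to zero gives
\[ -2p(1 - \lambda) + 2\lambda c^2 = 0 \quad \implies \quad \lambda^* = \frac{p}{p + c^2}, \]
and the second derivative, $2p + 2c^2$, is positive so this is indeed a minimum. Substituting $1 - \lambda^* = c^2/(p + c^2)$ back into the objective,
\[ p(1 - \lambda^*)^2 + (\lambda^*)^2 c^2 = \frac{p c^4 + p^2 c^2}{(p + c^2)^2} = \frac{p c^2 (p + c^2)}{(p + c^2)^2} = p\left(\frac{c^2}{p + c^2}\right). \]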

Not Quite the James-Stein Estimator

The end is in sight! We’ve shown that if we knew the sum of squares of the unknown means, $c^2$, we could construct a shrinkage estimator that always has a lower composite MSE than the ML estimator. But we don’t know $c^2$. So what can we do? To start off, re-write $\lambda^*$ as follows \[ \lambda^* = \frac{p}{p + c^2} = \frac{1}{1 + c^2/p}. \] This way of writing things makes it clear that it’s not $c^2$ per se that matters but rather $c^2/p$. And this quantity is simply the average of the unknown squared means: \[ \frac{c^2}{p} = \frac{1}{p}\sum_{j=1}^p \mu_j^2. \] So how could we learn $c^2/p$? An idea that immediately suggests itself is to estimate this quantity by replacing each unobserved $\mu_j$ with the corresponding observation $X_j$, in other words \[ \frac{1}{p}\sum_{j=1}^p X_j^2. \] This is a good starting point, but we can do better. Since $X_j \sim \text{Normal}(\mu_j, 1)$, we see that \[ \mathbb{E}\left[\frac{1}{p} \sum_{j=1}^p X_j^2 \right] = \frac{1}{p} \sum_{j=1}^p \mathbb{E}[X_j^2] = \frac{1}{p} \sum_{j=1}^p [\text{Var}(X_j) + \mathbb{E}(X_j)^2] = \frac{1}{p} \sum_{j=1}^p (1 + \mu_j^2) = 1 + \frac{c^2}{p}. \] This means that $(\sum_{j=1}^p X_j^2)/p$ will on average overestimate $c^2/p$ by one. But that’s a problem that’s easy to fix: simply subtract one! This is a rare situation in which there is no bias-variance tradeoff. Subtracting a constant, in this case one, doesn’t contribute any additional variation while completely removing the bias. Plugging into our formula for $\lambda^*$, this suggests using the estimator \[ \hat{\lambda} \equiv \frac{1}{1 + \left[\left(\frac{1}{p}\sum_{j=1}^p X_j^2 \right) - 1\right]} = \frac{1}{\frac{1}{p}\sum_{j=1}^p X_j^2} = \frac{p}{\sum_{j=1}^p X_j^2} \] as our stand-in for the unknown $\lambda^*$, yielding a shrinkage estimator that I’ll call “NQ” for “not quite” for reasons that will become apparent in a moment: \[ \hat{\mu}^{(j)}_\text{NQ} = \left(1 - \frac{p}{\sum_{k=1}^p X_k^2}\right)X_j. 
\] Notice what’s happening here: our optimal shrinkage estimator depends on $c^2/p$, something we can’t observe. But we’ve constructed an unbiased estimator of this quantity by using all of the observations $X_j$. This is the resolution of the paradox discussed above: all of the observations contain information about $c^2$ since this is simply the sum of the squared means. And because we’ve chosen to minimize composite MSE, the optimal shrinkage factor only depends on the individual $\mu_j$ parameters through $c^2$! This is the sense in which it’s possible to learn something useful about, say, $\mu_1$ from $X_2$ in spite of the fact that $\mathbb{E}[X_2] = \mu_2$ may bear no relationship to $\mu_1$.

But wait a minute! This looks suspiciously familiar. Recall that the James-Stein estimator is given by \[ \hat{\mu}^{(j)}_\text{JS} = \left(1 - \frac{p - 2}{\sum_{k=1}^p X_k^2}\right)X_j. \] Just like the JS estimator, my NQ estimator shrinks each of the $p$ means towards zero by a factor that depends on the number of means we’re estimating, $p$, and the overall sum of the squared observations. The key difference between JS and NQ is that JS uses $p - 2$ in the numerator instead of $p$. This means that NQ is a more “aggressive” shrinkage estimator than JS: it pulls the means towards zero by a larger amount than JS. This difference turns out to be crucial for proving that the JS estimator dominates the ML estimator. But when it comes to understanding why the JS estimator has the form that it does, I would argue that the difference is minor. If you want all the gory details of where that extra $-2$ comes from, along with the closely related issue of why $p\geq 3$ is crucial for JS to dominate the ML estimator, see lecture 1 or section 7.3 from my Econ 722 teaching materials.
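To get a feel for how NQ and JS compare in practice, here is a small simulation sketch in base R. The choices $p = 10$ and $\mu_j = 1$ for every $j$ are hypothetical values picked purely for illustration, as are the seed and the helper functions `shrink()` and `composite_mse()`. The ranking it produces is specific to this configuration and shouldn't be read as a general result.

```r
set.seed(722)                      # arbitrary seed
p <- 10                            # hypothetical number of means
mu <- rep(1, p)                    # hypothetical true means, so c^2 = 10
n_reps <- 1e5

# Each row is one replication of (X_1, ..., X_p) with X_j ~ Normal(mu_j, 1)
X <- sweep(matrix(rnorm(n_reps * p), nrow = n_reps, ncol = p), 2, mu, "+")
sum_sq <- rowSums(X^2)             # sum of squared observations, one per row

# Shrink every row towards zero by 1 - k / sum_sq: k = p is NQ, k = p - 2 is JS
shrink <- function(k) (1 - k / sum_sq) * X

# Average squared Euclidean distance from the true means
composite_mse <- function(est) mean(rowSums(sweep(est, 2, mu)^2))

mse <- c(ML = composite_mse(X),
         NQ = composite_mse(shrink(p)),
         JS = composite_mse(shrink(p - 2)),
         oracle = composite_mse((1 - p / (p + sum(mu^2))) * X))
mse
```

The last entry uses the infeasible $\lambda^* = p/(p + c^2)$ as a benchmark; by the formula above its composite MSE should be close to $p c^2/(p + c^2) = 5$ here. In this particular setup both feasible estimators improve on the MLE, with JS doing slightly better than NQ.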

Conclusion

Before we conclude, there’s one important caveat to bear in mind. In addition to the qualifications that NQ isn’t quite JS, and that JS only dominates the MLE when $p \geq 3$, there’s one more fundamental issue that could be easily missed. Our decision to minimize composite MSE is absolutely crucial to the reasoning given above. The magic of shrinkage depends on our willingness to accept a trade-off in which we do a worse job estimating one mean in exchange for doing a better job estimating another, as composite MSE imposes. Whether this makes sense in practice depends on the context.

If we’re searching for a lost submarine in the ocean (a 3-dimensional problem), it makes perfect sense to be willing to be farther from the submarine in one dimension in exchange for being closer in another. That’s because Euclidean distance is obviously what we’re after here. But if instead we’re estimating teacher value-added and the results of our estimation exercise will be used to determine which teachers lose their jobs, it’s less clear that we should be willing to be farther from one teacher in exchange for being closer to another. Certainly that would be no consolation to someone who had been wrongly dismissed! If we were merely using this information to identify teachers who might need extra help, it’s another story. But the point I’m trying to make here is that our choice of which criterion to minimize necessarily encodes our values in a particular problem.

But with that said, I hope you’re satisfied that this extremely long post was worth the effort. Without using any fancy mathematics or statistical theory, we’ve managed to invent something that is nearly identical to the James-Stein estimator and thus to resolve Stein’s paradox. We started by pretending that we knew $c^2$ and showed that this would allow us to derive a shrinkage estimator with a lower composite MSE than the ML estimator. Then we simply plugged in an unbiased estimator of the key unknown quantity: $c^2/p$. Because all the observations contain information about $c^2$, it makes sense that we should decide how much to shrink one component $X_j$ by using all of the others. At this point, I hope that the James-Stein estimator seems not only plausible but practically obvious, excepting of course that pesky $-2$ in the numerator.

If I ruled the universe, the Gauss-Markov Theorem would be demoted to a much less exalted status in econometrics teaching!↩︎
Don’t let words do your thinking for you: “bias” sounds like a very bad thing, like kicking puppies. But that’s because the word “bias” has a negative connotation in English. In statistics, it’s just a technical term for “not centered”. An estimator can be biased and still be very good. Indeed the punchline of this post is that the James-Stein estimator is biased but can be much better than the obvious alternative!↩︎
Why squared bias and not simply bias itself? The answer is units: bias is measured in the same units as the parameter being estimated while the variance is in squared units. It doesn’t make sense to add things with different units, so we either have to square the bias or take the square root of the variance, i.e. replace it with the standard deviation. But bias can be negative, and we wouldn’t want a large negative bias to cancel out a large standard deviation so MSE squares the bias instead.↩︎
See if you can prove this as a homework exercise!↩︎
In Bayesian terms, we could view this “shrinkage” idea as calculating the posterior mean of $\mu$ conditional on our data $X$ under a normal prior. In this case $\lambda$ would equal $\tau/(1 + \tau)$ where $\tau$ is the prior precision, i.e. the reciprocal of the prior variance. But for this post we’ll mainly stick to the Frequentist perspective.↩︎
Strictly speaking all of this pre-supposes that we’re working with squared-error loss so that MSE is the right thing to minimize. There are other loss functions we could have used instead and these would lead to different risk functions. But for the purposes of this post, I prefer to keep things simple. See lecture 1 of my Econ 722 slides for more detail.↩︎
Remember that there are two equivalent definitions of MSE: bias squared plus variance on the one hand and expected squared distance from the truth on the other hand.↩︎

How to Do Regression Adjustment

Fri, 02 Aug 2024 00:00:00 +0000

By the end of a typical introductory econometrics course students have become accustomed to the idea of “controlling” for covariates by adding them to the end of a linear regression model. But this familiarity can sometimes cause confusion when students later encounter regression adjustment, a widely-used approach to causal inference under the selection-on-observables assumption. While regression adjustment is simple in theory, the finer points of how and when to apply it in practice are much more subtle. One of these finer points is how to tell whether a particular covariate is a “good control” that will help us learn the causal effect of interest or a “bad control” that will only make things worse.¹ Another, and the topic of today’s post, is how to actually implement regression adjustment after we’ve decided which covariates to adjust for.

The pre-requisites for this post are a basic understanding of selection-on-observables and regression adjustment. If you’re a bit rusty on these points, you might find it helpful to glance at the first half of my lecture slides along with this series of short videos. If you’re still hungry for more after this, you might also enjoy this earlier post from econometrics.blog on common misunderstandings about the selection-on-observables assumption.

A Quick Review

Consider a binary treatment $D$ and an observed outcome $Y$. Let $(Y_0, Y_1)$ be the potential outcomes corresponding to the treatment $D$. Our goal is to learn the average treatment effect $\text{ATE} \equiv \mathbb{E}(Y_1 - Y_0)$ but, unless $D$ is randomly assigned, using the difference of observed means $\mathbb{E}(Y|D=1) - \mathbb{E}(Y|D=0)$ to estimate the ATE in general won’t work. The idea of selection-on-observables is that $D$ might be “as good as randomly assigned” after we adjust for a collection of observed covariates $X$.

Regression adjustment relies on two assumptions: selection-on-observables and overlap. The selection-on-observables assumption says that learning $D$ provides no additional information about the average values of $Y_0$ and $Y_1$, provided that we already know $X$. This implies that we can learn the conditional average treatment effect (CATE) by comparing observed outcomes of the treated and untreated holding $X$ fixed: \[ \text{CATE}(x) \equiv \mathbb{E}[Y_1 - Y_0|X = x] = \mathbb{E}[Y|D=1, X = x] - \mathbb{E}[Y|D=0, X = x]. \] For example: older people might be more likely to take a new medication but also more likely to die without it. If so, perhaps by comparing average outcomes holding age fixed we can learn the causal effect of the medication. The overlap assumption says that, for any fixed value $x$ of the covariates, there are some treated and some untreated people. This allows us to learn $\text{CATE}(x)$ for every value of $x$ in the population and average it using the law of iterated expectations to recover the ATE:
\[ \text{ATE} = \mathbb{E}[\text{CATE}(X)] = \mathbb{E}[\mathbb{E}(Y|D=1, X) - \mathbb{E}(Y|D=0, X)]. \] In the medication example, this would correspond to computing the difference of means for each age group separately, and then averaging them using the share of people in each age group. Notice that this is only possible if there are some people who took the medication and some who didn’t in each age group. That’s exactly what the overlap assumption buys us. For example, if there were no senior citizens who didn’t take the medication, we wouldn’t be able to learn the effect of the medication for senior citizens.

Which regression should we run?

So suppose that we’ve found a set of covariates $X$ that satisfy the required assumptions. How should we actually carry out regression adjustment? To answer this question, let’s start by making things a bit simpler. Suppose that $X$ is a single binary covariate. At the end of the post, we’ll return to the general case. Since $X$ and $D$ are both binary, we can write the conditional mean function of $Y$ given $(D, X)$ as \[ \mathbb{E}(Y|D, X) = \beta_0 + \beta_1 D + \beta_2 X + \beta_3 DX. \] Since the true conditional mean function is linear, a linear regression of $Y$ on $D$, $X$, $DX$ and an intercept will recover $(\beta_0, \beta_1, \beta_2, \beta_3)$. But what on earth do these coefficients actually mean?! Substituting all possible values of $(D, X)$, \[ \begin{align*} \mathbb{E}(Y|D=0, X=0) &= \beta_0 \\ \mathbb{E}(Y|D=1, X=0) &= \beta_0 + \beta_1 \\ \mathbb{E}(Y|D=0, X=1) &= \beta_0 + \beta_2 \\ \mathbb{E}(Y|D=1, X=1) &= \beta_0 + \beta_1 + \beta_2 + \beta_3. \end{align*} \] And so, after a bit of re-arranging, \[ \begin{align*} \beta_0 &= \mathbb{E}(Y|D=0, X=0)\\ \beta_1 &= \mathbb{E}(Y|D=1, X=0) - \mathbb{E}(Y|D=0, X=0)\\ \beta_2 &= \mathbb{E}(Y|D=0, X=1) - \mathbb{E}(Y|D=0, X=0)\\ \beta_3 &= \mathbb{E}(Y|D=1, X=1) - \mathbb{E}(Y|D=1, X=0) - \mathbb{E}(Y|D=0, X=1) + \mathbb{E}(Y|D=0, X=0). \end{align*} \] What a mess! Alas, we’ll need a few more steps of algebra to figure out how these relate to the ATE. Notice that $\beta_1$ equals the CATE when $X=0$ since \[ \begin{align*} \text{CATE}(0) &\equiv \mathbb{E}(Y|D=1, X=0) - \mathbb{E}(Y|D=0, X=0)\\ &= (\beta_0 + \beta_1) - \beta_0\\ & = \beta_1 \end{align*} \] Proceeding similarly for the CATE when $X = 1$, we find that \[ \begin{align*} \text{CATE}(1) &\equiv \mathbb{E}(Y|D=1, X=1) - \mathbb{E}(Y|D=0, X=1) \\ &= (\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_2) \\ &= \beta_1 + \beta_3. 
\end{align*} \] Now that we have expressions for each of the two conditional average treatment effects, corresponding to each of the values that $X$ can take, we’re finally ready to compute the ATE: \[ \begin{align*} \text{ATE} &= \mathbb{E}[\text{CATE}(X)] \\ &= \text{CATE}(0) \times \mathbb{P}(X = 0) + \text{CATE}(1) \times \mathbb{P}(X = 1) \\ &= \beta_1 \left[1 - \mathbb{P}(X = 1)\right] + (\beta_1 + \beta_3) \mathbb{P}(X = 1) \\ &= \beta_1 + \beta_3 p \end{align*} \] where we define the shorthand $p \equiv \mathbb{P}(X=1)$. So to compute the ATE, we need to know the coefficients $\beta_1$ and $\beta_3$ from the regression of $Y$ on $D$, $X$, and $DX$, in addition to the share of people with $X = 1$. Needless to say, your favorite regression package will not spit out the ATE for you if you run the regression from above. And it certainly won’t spit out the standard error! So what can we do besides computing everything by hand?
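Before turning to the alternatives, here is a sketch of what “computing everything by hand” involves, using simulated data in base R. The data-generating process, including the coefficients $(1, 2, 3, -1.5)$ and the treatment probabilities, is entirely hypothetical and chosen only so that the true ATE is known: $2 - 1.5 \times 0.3 = 1.55$.

```r
set.seed(2024)                                # arbitrary seed
n <- 5000
X <- rbinom(n, 1, 0.3)                        # binary covariate with P(X = 1) = 0.3
D <- rbinom(n, 1, ifelse(X == 1, 0.7, 0.2))   # treatment is more likely when X = 1
# Hypothetical coefficients: CATE(0) = 2 and CATE(1) = 2 - 1.5 = 0.5
Y <- 1 + 2 * D + 3 * X - 1.5 * D * X + rnorm(n)

fit <- lm(Y ~ D * X)                # regression of Y on D, X, DX and an intercept
b <- coef(fit)
p_hat <- mean(X)                    # sample share with X = 1
ATE_coef <- unname(b["D"] + b["D:X"] * p_hat)   # beta_1 + beta_3 * p

# The same quantity the long way: average the two CATEs using the shares of X
cate <- sapply(0:1, function(x) mean(Y[D == 1 & X == x]) - mean(Y[D == 0 & X == x]))
ATE_hand <- sum(cate * c(1 - p_hat, p_hat))

c(from_coefficients = ATE_coef, by_hand = ATE_hand)
```

Because the regression is saturated, the two calculations agree up to floating-point error. Neither of them comes with a standard error attached.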

Two Simple Alternatives

It turns out that there are two simple ways to get your favorite software package to spit out the ATE and its associated standard error for you. Each involves a slight re-parameterization of the conditional mean expression from above. The first one replaces $DX$ with $D\tilde{X}$ where $\tilde{X} \equiv X - p$ and $p \equiv \mathbb{P}(X=1)$. To see why this works, notice that \[ \begin{align*} \mathbb{E}(Y|D, X) &= \beta_0 + \beta_1 D + \beta_2 X + \beta_3 DX \\ &= \beta_0 + \beta_1 D + \beta_2 X + \beta_3 D(X - p) + \beta_3 pD\\ &= \beta_0 + (\beta_1 + \beta_3 p) D + \beta_2 X + \beta_3 D\tilde{X}\\ &= \beta_0 + \text{ATE}\times D + \beta_2 X + \beta_3 D\tilde{X}. \end{align*} \] This works perfectly well, but there’s something about it that offends my sense of order: why subtract the mean from $X$ in one place but not in another? If you share my aesthetic sensibilities, then you can feel free to replace that offending $X$ with another $\tilde{X}$ since \[ \begin{align*} \mathbb{E}(Y|D, X) &= \beta_0 + \text{ATE}\times D + \beta_2 X + \beta_3 D\tilde{X}\\ &= \beta_0 + \text{ATE}\times D + \beta_2 (X-p) + p \beta_2 + \beta_3 D\tilde{X}\\ &= (\beta_0 + p \beta_2) + \text{ATE}\times D + \beta_2 \tilde{X} + \beta_3 D\tilde{X}\\ &= \tilde{\beta}_0 + \text{ATE}\times D + \beta_2 \tilde{X} + \beta_3 D\tilde{X} \end{align*} \] where we define $\tilde{\beta}_0 \equiv \beta_0 + p \beta_2$. Notice that the only coefficient that changes is the intercept, and we’re typically not interested in this anyway!
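One way to convince yourself that these are re-parameterizations rather than different models is to check that all three regressions produce identical fitted values. The sketch below does this on simulated data in base R; the data-generating process is hypothetical, and the sample mean of $X$ stands in for the unknown $p$.

```r
set.seed(2024)                                # arbitrary seed
n <- 5000
X <- rbinom(n, 1, 0.3)
D <- rbinom(n, 1, ifelse(X == 1, 0.7, 0.2))
Y <- 1 + 2 * D + 3 * X - 1.5 * D * X + rnorm(n)   # hypothetical coefficients
Xtilde <- X - mean(X)                         # sample mean stands in for p

fit0 <- lm(Y ~ D * X)                 # original parameterization
fit1 <- lm(Y ~ D + X + D:Xtilde)      # re-center X in the interaction only
fit2 <- lm(Y ~ D * Xtilde)            # re-center X everywhere

b <- coef(fit0)
target <- unname(b["D"] + b["D:X"] * mean(X))    # beta_1 + beta_3 * p

c(target = target,
  first = unname(coef(fit1)["D"]),
  second = unname(coef(fit2)["D"]))

# Same fit in all three cases: only the labelling of the coefficients changes
all.equal(fitted(fit0), fitted(fit1))
all.equal(fitted(fit0), fitted(fit2))
```

The coefficient on `D` in both re-centered regressions matches $\beta_1 + \beta_3 p$ computed from the original one, and the intercept of the last regression equals $\beta_0 + p\beta_2$.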

What if we ignore the interaction?

Wait a minute, you may be ready to object, when researchers claim to be “adjusting” or “controlling” for $X$ in practice, they very rarely include an interaction term between $D$ and $X$ in their regression! Instead, they just regress $Y$ on $D$ and $X$. What can we say about this approach? To answer this question, let’s continue with our example from above and define the following population linear regression model: \[ Y = \alpha_0 + \alpha_1 D + \alpha_2 X + U \] where $U$ is the population linear regression error term so that, by construction, $\mathbb{E}(U) = \mathbb{E}(DU) = \mathbb{E}(XU) = 0$. Notice that I’ve called the coefficients in this regression $\alpha$ rather than $\beta$. That’s because they will not in general coincide with the coefficients of the conditional mean function from above, namely $\mathbb{E}(Y|D, X) = \beta_0 + \beta_1 D + \beta_2 X + \beta_3 DX$. In particular, the regression of $Y$ on $D$ and $X$ without an interaction will only coincide with the true conditional mean function if $\beta_3 = 0$.

So what, if anything, can we say about $\alpha_1$ in relation to the ATE? By Yule’s Rule² we have \[ \alpha_1 = \frac{\text{Cov}(Y, \tilde{D})}{\text{Var}(\tilde{D})}, \quad D = \gamma_0 + \gamma_1 X + \tilde{D}, \quad \mathbb{E}(\tilde{D}) = \mathbb{E}(X\tilde{D}) = 0 \] where $\tilde{D}$ is the error term from a population linear regression of $D$ on $X$. In words, the way that a regression of $Y$ on $D$ and $X$ “adjusts” for $X$ is by first regressing $D$ on $X$, taking the part of $D$ that is not correlated with $X$, namely $\tilde{D}$, and regressing $Y$ on this alone.³ As shown in the appendix to this post, \[ \frac{\text{Cov}(Y,\tilde{D})}{\text{Var}(\tilde{D})} = \frac{\mathbb{E}[\text{Var}(D|X)(\beta_1 + \beta_3 X)]}{\mathbb{E}[\text{Var}(D|X)]} \] in this example. And since $\text{CATE}(X) = \beta_1 + \beta_3 X$ it follows that \[ \alpha_1 = \frac{\mathbb{E}[\text{Var}(D|X) \cdot \text{CATE}(X)]}{\mathbb{E}[\text{Var}(D|X)]}. \] The only thing that’s random in this expression is $X$. Both expectations involve averaging over its distribution. To make this clearer, define the propensity score $\pi(x) \equiv \mathbb{P}(D=1|X=x)$. Using this notation, \[ \begin{align*} \text{Var}(D|X) &= \mathbb{E}(D^2|X) - \mathbb{E}(D|X)^2 = \mathbb{E}(D|X) - \mathbb{E}(D|X)^2\\ &= \pi(X) - \pi(X)^2 = \pi(X)[1 - \pi(X)] \end{align*} \] since $D$ is binary. Defining $p(x) \equiv \mathbb{P}(X = x)$, we see that \[ \begin{align*} \alpha_1 &= \frac{\mathbb{E}[\pi(X)\{1 - \pi(X)\}\cdot \text{CATE}(X)]}{\mathbb{E}[\pi(X)\{1 - \pi(X)\}]}\\ \\ &= \frac{p(0) \cdot \pi(0)[1 - \pi(0)]\cdot \text{CATE}(0) + p(1) \cdot \pi(1)[1 - \pi(1)]\cdot \text{CATE}(1)}{p(0) \cdot \pi(0)[1 - \pi(0)] + p(1) \cdot \pi(1)[1 - \pi(1)]}\\ \\ &= w(0) \cdot \text{CATE}(0) + w(1) \cdot \text{CATE}(1) \end{align*} \] where we introduce the shorthand \[ w(x) \equiv \frac{p(x) \cdot \pi(x)[1 - \pi(x)]}{\sum_{\text{all } k} p(k) \cdot \pi(k)[1 - \pi(k)]}. 
\] In other words, the coefficient on $D$ in a regression of $Y$ on $D$ and $X$ excluding the interaction term $DX$ gives a weighted average of the conditional average treatment effects for the different values of $X$. The weights are between zero and one and sum to one. Because $w(x)$ is increasing in $p(x)$, values of $X$ that are more common are given more weight just as they are in the ATE. But since $w(x)$ is also increasing in $\pi(x)[1 - \pi(x)]$, values of $X$ for which $\pi(x)$ is closer to 0.5 are given more weight, unlike in the ATE. As such, we could describe $\alpha_1$ as a variance-weighted average of the conditional average treatment effects.
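The chain of equalities above can be checked numerically. The following base R sketch uses simulated data from a hypothetical data-generating process and computes the coefficient on $D$ three ways: from the regression without an interaction, from Yule’s Rule, and as the variance-weighted average of sample CATEs.

```r
set.seed(42)                                  # arbitrary seed
n <- 20000
X <- rbinom(n, 1, 0.4)
D <- rbinom(n, 1, ifelse(X == 1, 0.5, 0.1))   # hypothetical propensity scores
Y <- 1 + 2 * D + 3 * X - 1.5 * D * X + rnorm(n)

# (1) Coefficient on D in the regression that omits the interaction
alpha1 <- unname(coef(lm(Y ~ D + X))["D"])

# (2) Yule's Rule: residualize D on X, then use Cov(Y, Dtilde) / Var(Dtilde)
Dtilde <- resid(lm(D ~ X))
yule <- cov(Y, Dtilde) / var(Dtilde)

# (3) Variance-weighted average of the sample CATEs
p_x  <- c(mean(X == 0), mean(X == 1))          # p(0), p(1)
pi_x <- c(mean(D[X == 0]), mean(D[X == 1]))    # pi(0), pi(1)
cate <- sapply(0:1, function(x) mean(Y[D == 1 & X == x]) - mean(Y[D == 0 & X == x]))
w <- p_x * pi_x * (1 - pi_x) / sum(p_x * pi_x * (1 - pi_x))
weighted <- sum(w * cate)

c(alpha1 = alpha1, yule = yule, weighted = weighted)
```

With a single binary covariate the three numbers agree up to floating-point error, since the sample analogues of $p(x)$, $\pi(x)$ and $\text{CATE}(x)$ satisfy the same algebra as their population counterparts.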

In general, the weighted average $\alpha_1$ will not coincide with the ATE, although there are two special cases where it will. The first case is when $\text{CATE}(X)$ does not depend on $X$, i.e. treatment effects are homogeneous. In this case $\beta_3 = 0$ so there is no interaction term in the conditional mean function! The second is when $\pi(X)$ does not depend on $X$, in which case the probability of treatment does not depend on $X$, so we don’t need to adjust for $X$ in the first place!

What about the general case?

All of the above derivations assumed that $X$ is one-dimensional and binary. So how much of this still applies more generally? First, if $X$ is a vector of binary variables representing categories like sex, race etc., everything goes through exactly as above.⁴ All that changes is that $\beta_2$, $\beta_3$ and $p = \mathbb{E}(X)$ become vectors. The coefficient on $D$ in a regression of $Y$ on $D$, $X$ and the interaction $D \tilde{X}$ is still the ATE, and the coefficient on $D$ in a regression that excludes the interaction term is still a weighted average of CATEs that does not in general equal the ATE.

So whenever the covariates you need to adjust for are categorical, this post has you covered.⁵ But what if some of our covariates are continuous? In this case things are a bit more complicated, but all of the results from above still go through if we’re willing to assume that the conditional mean functions $\mathbb{E}(Y|D=0, X)$, $\mathbb{E}(Y|D=1,X)$ and $\mathbb{E}(D|X)$ are linear in $X$. This is undoubtedly a strong assumption, but not perhaps as strong as it seems. For example, $X$ could include logs, squares or other functions of some underlying continuous covariates, e.g. age or years of experience. In this case, the weighted average interpretation of the coefficient on $D$ in a regression that excludes the interaction term still holds but now involves an integral rather than a sum.
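Here is a sketch of the continuous case using simulated data in base R. Everything about the data-generating process is hypothetical: $X$ is uniform on $[0, 1]$, the propensity score $0.1 + 0.6X$ is linear in $X$ by construction, and $\text{CATE}(x) = 2 - 3x$ so that the true ATE is $2 - 3 \times 0.5 = 0.5$.

```r
set.seed(7)                             # arbitrary seed
n <- 1e5
X <- runif(n)                           # hypothetical continuous covariate on [0, 1]
D <- rbinom(n, 1, 0.1 + 0.6 * X)        # E(D | X) is linear in X by construction
# Both E(Y | D = 0, X) and E(Y | D = 1, X) are linear in X; CATE(x) = 2 - 3x
Y <- 1 + 2 * D + 3 * X - 3 * D * X + rnorm(n)

Xtilde <- X - mean(X)
ate_hat <- unname(coef(lm(Y ~ D * Xtilde))["D"])   # includes the interaction
alpha1  <- unname(coef(lm(Y ~ D + X))["D"])        # omits the interaction
c(with_interaction = ate_hat, without_interaction = alpha1)
```

In this setup the regression with the interaction should give a coefficient close to the true ATE of 0.5. The regression without it gives a smaller value, because $\pi(x)[1 - \pi(x)]$ is larger for large $x$, where the CATE is smaller.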

Does it really work? An Empirical Example

But perhaps you don’t trust my algebra.⁶ To assuage your fears, let’s take this to the data! The following example is based on Peisakhin & Rozenas (2018) - Electoral Effects of Biased Media: Russian Television in Ukraine. I’ve adapted it from Llaudet and Imai’s fantastic book Data Analysis for Social Science, the perfect holiday or birthday gift for the budding social scientist in your life.

Here’s a bit of background. In the lead-up to Ukraine’s 2014 parliamentary election, Russian state-controlled TV mounted a fierce media campaign against the Ukrainian government. Ukrainians who lived near the border with Russia could potentially receive Russian TV signals. Did receiving these signals cause them to support pro-Russia parties in the election? To answer this question, we’ll use a dataset called precincts that contains aggregate election results in precincts close to the Russian border:

library(tidyverse)
precincts <- read_csv('https://ditraglia.com/data/UA_precincts.csv')

Each row of precincts is an electoral precinct in Ukraine that is near the Russian border. The columns pro_russian and prior_pro_russian give the vote share (in percentage points) of pro-Russian parties in the 2014 and 2012 Ukrainian elections, respectively. Our outcome of interest will be the change in pro-Russian vote share between the two elections, so we first need to construct this:

precincts <- precincts |>
  mutate(change = pro_russian - prior_pro_russian) |>
  select(-pro_russian, -prior_pro_russian)
precincts

## # A tibble: 3,589 × 3
## russian_tv within_25km change
## <dbl> <dbl> <dbl>
## 1 0 1 -22.4
## 2 0 0 -34.5
## 3 1 1 -18.8
## 4 0 1 -12.2
## 5 0 0 -27.7
## 6 1 0 -44.2
## 7 0 0 -34.5
## 8 0 0 -29.5
## 9 0 0 -24.1
## 10 0 0 -25.4
## # ℹ 3,579 more rows

The column russian_tv equals 1 if the precinct has Russian TV reception. This is our treatment variable: $D$. But crucially, this is not randomly assigned. While it’s true that there is some natural variation in signal strength that is plausibly independent of other factors related to voting behavior, on average precincts closer to Russia are more likely to receive a signal. So suppose for the sake of argument that conditional on proximity to the Russian border, russian_tv is as good as randomly assigned. This is the selection on observables assumption. There’s no way to check this using our data alone. It’s something we need to justify based on our understanding of the world and the substantive problem at hand.

As our measure of proximity, we’ll use the dummy variable within_25km which equals 1 if the precinct is within 25km of the Russian border. This is our $X$-variable. The overlap assumption requires that there are some precincts with Russian TV reception and some without in each distance category. This is an assumption that we can check using the data, so let’s do so before proceeding:

precincts |>
  group_by(within_25km) |>
  summarize(`share with Russian tv` = mean(russian_tv))

## # A tibble: 2 × 2
## within_25km `share with Russian tv`
## <dbl> <dbl>
## 1 0 0.105
## 2 1 0.692

We see that just over 10% of precincts that are not within 25km of the border have Russian TV reception, while just under 70% of those within 25km do. Neither of these values is close to 0% or 100%, so this dataset comfortably satisfies the overlap assumption.

To avoid taxing your memory about which variable is which, for the rest of this exercise, I’ll create a new dataset that renames the columns of precincts to D, X, and Y for the treatment, covariate, and outcome, respectively.

dat <- precincts |>
  rename(D = russian_tv, X = within_25km, Y = change)

Computing the ATE the Hard Way

Now we’re ready to verify the calculations from above. First we’ll compute the ATE “the hard way”, in other words by computing each of the CATEs separately and averaging them. Warning: there’s a fair bit of dplyr to come!

# Step 1: compute the mean Y for each combination of (D, X)
means <- dat |>
  group_by(D, X) |>
  summarize(Ybar = mean(Y))
means # display the results

## # A tibble: 4 × 3
## # Groups: D [2]
## D X Ybar
## <dbl> <dbl> <dbl>
## 1 0 0 -24.6
## 2 0 1 -34.2
## 3 1 0 -13.0
## 4 1 1 -32.2

# Step 2: reshape so the means of Y|D=0,X and Y|D=1,X are in separate cols
means <- means |>
  pivot_wider(names_from = D,
              values_from = Ybar,
              names_prefix = 'Ybar')
means # display the results

## # A tibble: 2 × 3
## X Ybar0 Ybar1
## <dbl> <dbl> <dbl>
## 1 0 -24.6 -13.0
## 2 1 -34.2 -32.2

# Step 3: attach a column with the proportion of X = 0 and X = 1
regression_adjustment <- dat |>
  group_by(X) |>
  summarize(count = n()) |>
  mutate(p = count / sum(count)) |>
  select(-count) |>
  left_join(means) |>
  mutate(CATE = Ybar1 - Ybar0) # compute the CATEs
regression_adjustment # display the results

## # A tibble: 2 × 5
## X p Ybar0 Ybar1 CATE
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.849 -24.6 -13.0 11.6
## 2 1 0.151 -34.2 -32.2 2.01

# Step 4: at long last, compute the ATE!
ATE <- regression_adjustment |>
  mutate(out = (Ybar1 - Ybar0) * p) |>
  pull(out) |>
  sum()
ATE

## [1] 10.12062

Computing the ATE the Easy Way

And now the easy way, using the two regressions described above:⁷

# Construct Xtilde = X - mean(X)
dat <- dat |>
  mutate(Xtilde = X - mean(X))
# Regression of Y on D, X, and D*Xtilde
lm(Y ~ D + X + D:Xtilde, dat)

##
## Call:
## lm(formula = Y ~ D + X + D:Xtilde, data = dat)
##
## Coefficients:
## (Intercept) D X D:Xtilde
## -24.591 10.121 -9.604 -9.562

# Regression of Y on D, Xtilde, and D*Xtilde
lm(Y ~ D * Xtilde, dat)

##
## Call:
## lm(formula = Y ~ D * Xtilde, data = dat)
##
## Coefficients:
## (Intercept) D Xtilde D:Xtilde
## -26.045 10.121 -9.604 -9.562

Everything works as it should! The coefficient on D in each regression equals the ATE we computed by hand, namely 10.121, and the two regressions agree with each other with the exception of the intercept.

Standard Errors

The nice thing about computing the ATE by running a regression rather than computing it “by hand” is that we can easily obtain valid standard errors, confidence intervals, and p-values if desired. For example, if you wanted “robust” standard errors for the ATE, you could simply use lm_robust() from the estimatr package as follows

library(estimatr)
library(broom)
lm_robust(Y ~ D * Xtilde, dat) |>
  tidy() |>
  filter(term == 'D') |>
  select(-df, -outcome)

## term estimate std.error statistic p.value conf.low conf.high
## 1 D 10.12062 0.4838613 20.91636 9.315921e-92 9.171946 11.06929

Getting these “by hand” would have been much more work!

There is one subtle point that I should mention. I’ve heard it said on numerous occasions that the above standard error calculation is “not quite right” since we estimated the mean of X and used it to re-center X in the regression. Surely we should account for the sampling variability in $\bar{X}$ around its mean, the argument goes.

Perhaps I’m about to get blacklisted by the Econometrician’s alliance for saying this, but I’m not convinced. The usual way of thinking about inference for regression is conditional on the regressors, in this case $X$ and $D$. Viewed from this perspective, $\bar{X}$ isn’t random. Now, of course, if you prefer to see the world through finite-population design-based lenses, $D$ is definitely random. But in this case it’s the only thing that’s random. The design-based view situates randomness exclusively in the treatment assignment mechanism. Under this view, since the units in our dataset are not considered as having been drawn from a hypothetical super-population, any summary statistic of their covariates $X$ is fixed. So again, $\bar{X}$ isn’t random and doesn’t contribute any uncertainty.

Update: I initially concluded this section with “as far as I can see, it’s perfectly reasonable to use the sample mean of $X$ to re-center $X$ in the regression” but apoorva.lal pointed out that this elides an important distinction. The key is that whether $\bar{X}$ is random or not depends on the question you’re interested in. If you want inference for the ATE computed using the population values of $X$, then $\bar{X}$ is random and you should account for its variability. But if you’re interested in the ATE computed using the observed values of $X$ in the sample, then $\bar{X}$ is fixed and you shouldn’t:

Point about whether Xbar is random depends on whether you're interested in SATE v PATE right? In any case, it is surprisingly easy to propagate that uncertainty forward with (what else?) GMM (earlier posts in the thread discuss the recentering point) https://t.co/3GXfTeF9DW
— apoorva.lal (@Apoorva__Lal) August 2, 2024

This agrees with my logic about conditioning on $X$ and the design-based perspective, but it’s a much clearer way of making the relevant distinction so thanks for pointing it out!
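If you are in the first camp and want the uncertainty in $\bar{X}$ reflected in your standard error, one simple option is to re-center $X$ inside each bootstrap draw. To be clear, this is a nonparametric bootstrap swapped in for illustration, not the GMM approach mentioned in the thread above. The sketch uses simulated data from a hypothetical data-generating process, and the helper `ate_fit()` and the choice of 500 bootstrap draws are arbitrary.

```r
set.seed(99)                                  # arbitrary seed
n <- 2000
X <- rbinom(n, 1, 0.3)
D <- rbinom(n, 1, ifelse(X == 1, 0.7, 0.2))
Y <- 1 + 2 * D + 3 * X - 1.5 * D * X + rnorm(n)   # hypothetical: true ATE = 1.55
dat_sim <- data.frame(Y, D, X)

# Coefficient on D after re-centering X at the mean of whatever data is passed in
ate_fit <- function(d) {
  d$Xtilde <- d$X - mean(d$X)
  unname(coef(lm(Y ~ D * Xtilde, data = d))["D"])
}

n_boot <- 500
boot_ate <- replicate(n_boot, {
  idx <- sample.int(n, replace = TRUE)   # resample whole rows (Y, D, X) together
  ate_fit(dat_sim[idx, ])                # X-bar is recomputed in every draw
})

estimate <- ate_fit(dat_sim)
boot_se <- sd(boot_ate)
c(estimate = estimate, boot_se = boot_se)
```

Because each bootstrap sample gets its own $\bar{X}$, the spread of `boot_ate` reflects variability in the re-centering step as well as in the regression coefficients. Resampling rows treats the units as draws from a larger population, so this only makes sense if the population version of the ATE is what you’re after.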

Excluding the Interaction

Finally, we’ll verify the derivations from above for $\alpha_1$ in the regression that excludes an interaction term. First we’ll compute the “variance weighted average” of CATEs by hand and check that it does not agree with the ATE:

# Compute the propensity score pi(X)
pscore <- dat |>
  group_by(X) |>
  summarize(pi = mean(D))
# Compute the weights w
regression_adjustment <- left_join(regression_adjustment, pscore) |>
  mutate(w = p * pi * (1 - pi) / sum(p * pi * (1 - pi)))
regression_adjustment # display the results

## # A tibble: 2 × 7
##       X     p Ybar0 Ybar1  CATE    pi     w
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     0 0.849 -24.6 -13.0 11.6  0.105 0.713
## 2     1 0.151 -34.2 -32.2  2.01 0.692 0.287

# Compute the variance weighted average of the CATEs
wCATE <- regression_adjustment |>
  summarize(wCATE = sum(w * CATE)) |>
  pull(wCATE)
c(wCATE = wCATE, ATE = ATE)

##     wCATE       ATE
##  8.822285 10.120617

Now we’ll compare this hand calculation to the results of a regression of $Y$ on $D$ and $X$ without an interaction:

lm(Y ~ D + X, dat)

##
## Call:
## lm(formula = Y ~ D + X, data = dat)
##
## Coefficients:
## (Intercept)            D            X
##     -24.302        8.822      -14.614

As promised, the coefficient on $D$ equals the variance-weighted average of CATEs that we computed by hand, namely 8.822, which does not equal the ATE, 10.121. Here the CATE for $X=1$ receives more weight when the interaction term is omitted, pulling the coefficient on $D$ away from the ATE and towards the (smaller) CATE for $X=1$.

Conclusion

I hope this post has convinced you that regression adjustment isn’t simply a matter of tossing a collection of covariates into your regression! In general, the coefficient on $D$ in a regression of $Y$ on $X$ and $D$ will not equal the ATE of $D$. Instead it will be a weighted average of CATEs. To obtain the ATE we need to include an interaction between $X$ and $D$. The simplest way to get your favorite statistical software package to calculate this for you, along with an appropriate standard error, is by de-meaning $X$ before including the interaction. And don’t forget that causal inference always requires untestable assumptions, in this case the selection-on-observables assumption. While implementation details are important, getting them right won’t make any difference if you’re not adjusting for the right covariates in the first place.

Appendix: The Missing Algebra

This section provides the algebra needed to justify the expression for $\alpha_1$ from a regression that omits the interaction between $D$ and $X$. In particular, we will show that \[ \frac{\text{Cov}(Y,\tilde{D})}{\text{Var}(\tilde{D})} = \frac{\mathbb{E}[\text{Var}(D|X)(\beta_1 + \beta_3 X)]}{\mathbb{E}[\text{Var}(D|X)]}. \] where $\tilde{D}$ is the error term from a population linear regression of $D$ on $X$, namely $D = \gamma_0 + \gamma_1 X + \tilde{D}$ so that $\mathbb{E}(\tilde{D}) = \mathbb{E}(X\tilde{D}) = 0$ by construction. The proof isn’t too difficult, but it’s a bit tedious so I thought you might prefer to skip it on a first reading. Still here? Great! Let’s dive into the algebra.

We need to calculate $\text{Cov}(Y, \tilde{D})$ and $\text{Var}(\tilde{D})$. A nice way to carry out this calculation is by applying the law of total covariance. You may have heard of the law of total variance, but in my view the law of total covariance is more useful. Just as you can deduce all the properties of variance from the properties of covariance, using $\text{Cov}(W, W) = \text{Var}(W)$, you can deduce the law of total variance from the law of total covariance! In the present example, the law of total covariance allows us to write \[ \text{Cov}(Y, \tilde{D}) = \mathbb{E}[\text{Cov}(Y, \tilde{D}|X)] + \text{Cov}[\mathbb{E}(Y|X), \mathbb{E}(\tilde{D}|X)]. \] If this looks intimidating, don’t worry: we’ll break it down piece by piece. The second term on the RHS is a covariance between two random variables: $\mathbb{E}(Y|X)$ and $\mathbb{E}(\tilde{D}|X)$.⁸ We already have an equation for $\tilde{D}$, namely the population linear regression of $D$ on $X$, so let’s use it to simplify $\mathbb{E}(\tilde{D}|X)$: \[ \mathbb{E}(\tilde{D}|X) = \mathbb{E}(D - \gamma_0 - \gamma_1 X|X) = \mathbb{E}(D|X) - \gamma_0 - \gamma_1 X. \] Here’s the key thing to note: since $X$ is binary, the population linear regression of $D$ on $X$ is identical to the conditional mean of $D$ given $X$.⁹ This tells us that $\mathbb{E}(\tilde{D}|X)=0$. Since the covariance of anything with a constant is zero, the second term on the RHS of the law of total covariance drops out, leaving us with \[ \text{Cov}(Y, \tilde{D}) = \mathbb{E}[\text{Cov}(Y, \tilde{D}|X)] = \mathbb{E}[\text{Cov}(Y, D - \gamma_0 - \gamma_1 X | X)]. \] Now let’s deal with the conditional covariance inside the expectation. Remember that conditioning on $X$ is equivalent to saying “suppose that $X$ were known”. Anything that’s known is constant, not random. So we can treat $\gamma_0 + \gamma_1 X$ as a constant and apply the usual rules for covariance to obtain \[ \text{Cov}(Y, D - \gamma_0 - \gamma_1 X | X) = \text{Cov}(Y, D|X). \] Therefore, $\text{Cov}(Y, \tilde{D}) = \mathbb{E}[\text{Cov}(Y, D|X)]$. A very similar calculation using the law of total variance gives \[ \begin{align*} \text{Var}(\tilde{D}) &= \mathbb{E}[\text{Var}(\tilde{D}|X)] + \text{Var}[\mathbb{E}(\tilde{D}|X)] =\mathbb{E}[\text{Var}(\tilde{D}|X)]\\ &= \mathbb{E}[\text{Var}(D - \gamma_0 - \gamma_1 X| X)]\\ &= \mathbb{E}[\text{Var}(D|X)] \end{align*} \] since $\mathbb{E}(\tilde{D}|X) = 0$ and the variance of any constant is simply zero. So, with the help of the laws of total covariance and variance, we’ve established that
\[ \alpha_1 \equiv \frac{\text{Cov}(Y, \tilde{D})}{\text{Var}(\tilde{D})}= \frac{\mathbb{E}[\text{Cov}(Y, D|X)]}{\mathbb{E}[\text{Var}(D|X)]} \] in this example. Note that this does not hold in general: it relies on the fact that $\mathbb{E}(\tilde{D}|X)=0$, which holds in our example because $\mathbb{E}(D|X) = \gamma_0 + \gamma_1 X$ given that $X$ is binary.

We’re very nearly finished. All that remains is to simplify the numerator. To do this, we’ll use the equality \[ Y = \beta_0 + \beta_1 D + \beta_2 X + \beta_3 DX + U \] where $U \equiv Y - \mathbb{E}(Y|D, X)$ satisfies $\mathbb{E}(U|D,X) = 0$ by construction. This allows us to write \[ \begin{align*} \text{Cov}(Y, D|X) &= \text{Cov}(\beta_0 + \beta_1 D + \beta_2 X + \beta_3 DX + U, D|X)\\ &= \beta_1 \text{Cov}(D, D|X) + \beta_3 \text{Cov}(DX, D|X) + \text{Cov}(U,D|X)\\ &= \beta_1 \text{Var}(D|X) + \beta_3 X \cdot \text{Var}(D|X) + \text{Cov}(U,D|X)\\ &= \text{Var}(D|X)(\beta_1 + \beta_3 X) + \text{Cov}(U, D| X). \end{align*} \] So what about that pesky $\text{Cov}(U,D|X)$ term? By the law of iterated expectations this turns out to equal zero, since \[ \begin{align*} \text{Cov}(U,D|X) &= \mathbb{E}(DU|X) - \mathbb{E}(D|X) \mathbb{E}(U|X)\\ &= \mathbb{E}_{D|X}[D\mathbb{E}(U|D,X)] - \mathbb{E}(D|X) \mathbb{E}_{D|X}[\mathbb{E}(U|D,X)] \end{align*} \] and, again, $\mathbb{E}(U|D,X) = 0$ by construction. So we’re left with \[ \alpha_1 = \frac{\mathbb{E}[\text{Cov}(Y, D|X)]}{\mathbb{E}[\text{Var}(D|X)]} = \frac{\mathbb{E}[\text{Var}(D|X)(\beta_1 + \beta_3 X)]}{\mathbb{E}[\text{Var}(D|X)]}. \]
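If you'd rather see this result in action than take the algebra on faith, here is a rough simulation sketch in base R. Everything in it is made up for illustration (the sample size, the treatment probabilities, and the coefficients 1, 2, -1, 3); it is not the dataset used in the body of the post. It compares the coefficient on $D$ from a regression without the interaction to the sample analogue of the variance-weighted average of CATEs.

```r
# Simulation check of the variance-weighted formula (all numbers are hypothetical)
set.seed(1234)
n <- 10000
X <- rbinom(n, 1, 0.3)                       # binary covariate
D <- rbinom(n, 1, ifelse(X == 1, 0.7, 0.2))  # treatment probability depends on X
# Heterogeneous effects: the CATE is 2 when X = 0 and 5 when X = 1
Y <- 1 + 2 * D - 1 * X + 3 * D * X + rnorm(n)

# Coefficient on D in the regression that omits the interaction
alpha1_hat <- unname(coef(lm(Y ~ D + X))["D"])

# Sample analogues of p(x), pi(x) and the CATE within each cell of X
cells <- sort(unique(X))
p_hat <- sapply(cells, function(x) mean(X == x))
pi_hat <- sapply(cells, function(x) mean(D[X == x]))
cate_hat <- sapply(cells, function(x) {
  mean(Y[X == x & D == 1]) - mean(Y[X == x & D == 0])
})

# Weights proportional to p(x) * Var(D | X = x) = p(x) * pi(x) * (1 - pi(x))
w_hat <- p_hat * pi_hat * (1 - pi_hat)
w_hat <- w_hat / sum(w_hat)
weighted_cate <- sum(w_hat * cate_hat)

# The ATE weights each CATE by p(x) alone
ate_hat <- sum(p_hat * cate_hat)

c(alpha1 = alpha1_hat, wCATE = weighted_cate, ATE = ate_hat)
```

Because $X$ is binary, the first two entries should agree up to floating-point error, while the third should differ whenever the CATEs and the conditional variances of $D$ both vary with $X$.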

See this post for a prototypical example of a “bad control” and the second half of my slides for some general discussion of “bad controls.” These alternative slides from my core ERM course cover similar ground but make a more explicit connection to good and bad advice about bad controls that one encounters in introductory econometrics books.↩︎
Call it “Frisch-Waugh-Lovell” if you must, but I will continue trying to make fetch happen.↩︎
If you want the standard error of $\beta$ and not just the point estimate, then replace $Y$ with the residual from a regression of $Y$ on $X$.↩︎
This is a nice homework exercise to test your understanding of the post!↩︎
If you have a very large number of categories things are still fine in theory but can break down in practice, since you’ll typically have very few observations in each “cell” corresponding to the different values of the categorical variables. But this is a topic for another day!↩︎
I certainly don’t!↩︎
If you’re rusty on R’s formula syntax, you may find my cheat sheet helpful.↩︎
An unconditional expectation like $\mathbb{E}(Y)$ is a constant: it’s a probability-weighted average of all possible realizations of $Y$. In contrast, a conditional expectation like $\mathbb{E}(Y|X)$ is a random variable: it’s our “best guess” of $Y$ based on observing $X$, where “best” means “minimum mean-squared error”. See this video for some more details on conditional expectation.↩︎
In general, a population linear regression gives the best linear approximation of the conditional mean, but when the conditional mean is in fact linear, the two coincide. The reason these coincide in our example is that we can write $\mathbb{E}[D|X] = X \mathbb{E}(D|X=1) + (1 - X) \mathbb{E}(D|X=0)$. There are only two values that $X$ can take, and we are simply “picking out” the average value of $D$ in each case. But we can re-arrange this to take precisely the form $\gamma_0 + \gamma_1 X$ by defining $\gamma_0 = \mathbb{E}(D|X=0)$ and $\gamma_1 = \mathbb{E}(D|X=1) - \mathbb{E}(D|X=0)$.↩︎

Is it better to improve sensitivity or specificity?

Thu, 25 Jul 2024 00:00:00 +0000

Here’s a slightly unusual exercise on the topic of Bayes’ Theorem for those of you teaching or studying introductory probability. Imagine that you’re developing a diagnostic test for a disease. The test is very simple: it either comes back positive or negative. You have a choice between slightly increasing either your test’s sensitivity or its specificity. If your goal is to maximize the positive predictive value (PPV) of your test, i.e. the probability that a patient has the disease given that the test comes back positive, which test characteristic should you choose to improve?

An Open Invitation

If you’re still hungry for more Bayes’ Theorem after reading this post, then why not join the Summer of Bayes 2024 online reading group? If you’d like to be added to the mailing list, just send an email to bayes [at] user.sent.as. Recordings of past sessions along with slides and other materials are available to group members via the Summer of Bayes discussion board. And now back to your regularly-scheduled blog content…

Odds aren’t so odd!

While I give you a few minutes to pause and ponder this question, here’s a brief rant on the topic of odds. If you’re anything like me, the first time you encountered odds, you thought to yourself

What is this $*@%^!? Why would anyone want to spoil a perfectly good probability by dividing it by one minus itself?¹

But it’s time to take the red pill and see the world as it really is: the only reason you prefer to think in terms of probabilities rather than odds is because you’ve been brainwashed by the educational system. Of course I exaggerate slightly, but the point is that odds are just as natural as probabilities; we’re just not as accustomed to working with them. In many situations in probability, statistics, and econometrics, it turns out that working with odds (or their logarithm) makes life much simpler, as I will try to convince you with a simple example.

First we need to define odds. Consider some event $A$ with probability $p$ of occurring. Then we say that the odds of $A$ are $p/(1 - p)$. For example, if $p = 1/3$ then the event $A$ is equivalent to drawing a red ball from an urn that contains one red and two blue balls: the probability gives the ratio of red balls to total balls. The odds of $A$, on the other hand, equal $1/2$: odds give the ratio of red balls to blue balls. Since probabilities are between 0 and 1, odds are between 0 and $\infty$. Odds of 0 mean that the event is impossible, while odds of $\infty$ mean that the event is certain. Odds of 1 mean that the event is just as likely to occur as not to occur.
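For readers who like to check definitions numerically, here is a tiny sketch of the conversion in both directions. The function names `prob_to_odds` and `odds_to_prob` are my own labels for illustration, not anything standard.

```r
# Convert between probabilities and odds (helper names are hypothetical)
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

# Urn with one red and two blue balls: probability 1/3, odds 1/2
urn_odds <- prob_to_odds(1 / 3)
# An event as likely to occur as not: odds of 1
even_odds <- prob_to_odds(1 / 2)
# Converting the urn odds back recovers the probability
urn_prob <- odds_to_prob(urn_odds)

c(urn_odds = urn_odds, even_odds = even_odds, urn_prob = urn_prob)
```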

Now here’s an example that you’ve surely seen before:

One in a hundred women has breast cancer $(B)$. If you have breast cancer, there is a 95% chance that you will test positive $(+)$; if you do not have breast cancer $(B^C)$, there is a 2% chance that you will nonetheless test positive $(+)$. We know nothing about Alice other than the fact that she tested positive. How likely is it that she has breast cancer?

It’s easy enough to solve this problem using Bayes’ Theorem, as long as you have pen and paper handy: \[ \begin{aligned} P(B | +) &= \frac{P(+|B)P(B)}{P(+)} = \frac{P(+|B)P(B)}{P(+|B)P(B) + P(+|B^C)P(B^C)}\\ &= \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.02 \times 0.99} \approx 0.32. \end{aligned} \] But what if I asked you how the result would change if only one in a thousand women had breast cancer? What if I changed the sensitivity of the test from 95% to 99% or the specificity from 98% to 95%? If you’re anything like me, you would struggle to do these calculations in your head. That’s because $P(B|+)$ is a highly non-linear function of $P(B)$, $P(+|B)$, and $P(+|B^C)$.
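If you'd rather let the computer do the pen-and-paper work, the probability form of Bayes' Theorem can be wrapped in a small function. This is only a sketch: the function name `ppv` and its argument names are my own, and the third scenario changes the sensitivity and the specificity together, which is one of several ways to read the "what if" questions above.

```r
# P(B | +) from the probability form of Bayes' Theorem
ppv <- function(prevalence, sensitivity, specificity) {
  true_pos <- sensitivity * prevalence                # P(+|B) P(B)
  false_pos <- (1 - specificity) * (1 - prevalence)   # P(+|B^C) P(B^C)
  true_pos / (true_pos + false_pos)
}

# The original problem: 1% prevalence, 95% sensitivity, 98% specificity
baseline <- ppv(prevalence = 0.01, sensitivity = 0.95, specificity = 0.98)
# Only one in a thousand women has breast cancer
rare <- ppv(prevalence = 0.001, sensitivity = 0.95, specificity = 0.98)
# Sensitivity raised to 99% and specificity lowered to 95%
changed_test <- ppv(prevalence = 0.01, sensitivity = 0.99, specificity = 0.95)

round(c(baseline = baseline, rare = rare, changed_test = changed_test), 3)
```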

In contrast, working with odds makes this problem a snap. The key point is that $P(B|+)$ and $P(B^C|+)$ have the same denominator, namely $P(+)$: \[ P(B | +) = \frac{P(+|B)P(B)}{P(+)}, \quad P(B^C | +) = \frac{P(+|B^C)P(B^C)}{P(+)} \] Notice that $P(+)$ was the “complicated” term in $P(B|+)$; the numerator was simple. Since the odds of $B$ given $(+)$ are defined as the ratio of $P(B|+)$ to $P(B^C|+)$, the denominator cancels and we’re left with \[ \text{Odds}(B|+) \equiv \frac{P(B|+)}{P(B^C|+)} = \frac{P(+|B)}{P(+|B^C)} \times \frac{P(B)}{P(B^C)}. \] In other words, the posterior odds of $B$ equal the likelihood ratio, $P(+|B)/P(+|B^C)$, multiplied by the prior odds of $B$, $P(B)/P(B^C)$: \[ \text{Posterior Odds} = \text{(Likelihood Ratio)} \times \text{(Prior Odds)}. \] Now we can easily solve the original problem in our head. The prior odds are 1/99 while the likelihood ratio is 95/2. Rounding these to 0.01 and 50 respectively, we find that the posterior odds are around 1/2. This means that Alice’s chance of having breast cancer is roughly equivalent to the chance of drawing a red ball from an urn with one red and two blue balls. There’s no need to convert this back to a probability since we can already answer the question: it’s considerably more likely that Alice does not have breast cancer. But if you insist, odds of 1/2 give a probability of 1/3, so in spite of rounding and calculating in our heads we’re within about one percentage point of the exact answer of roughly 0.32!

Repeat after me: odds are on a multiplicative scale. This is their key virtue and the reason why they make it so easy to explore variations on the original problem. If one in a thousand women has breast cancer, the prior odds become 1/999 so we simply divide our previous result by 10, giving posterior odds of around 1/20. If we instead changed the sensitivity from 95% to 99% and the specificity from 98% to 95%, then the likelihood ratio would change from $95/2 \approx 50$ to $99/5 \approx 20$.
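To see the multiplicative scale at work without any rounding, here is a short sketch of the odds form of Bayes' Theorem. The function name `posterior_odds` is hypothetical; the three scenarios are the ones discussed above.

```r
# Posterior odds = likelihood ratio * prior odds
posterior_odds <- function(prevalence, sensitivity, specificity) {
  likelihood_ratio <- sensitivity / (1 - specificity)
  prior_odds <- prevalence / (1 - prevalence)
  likelihood_ratio * prior_odds
}

odds_baseline <- posterior_odds(0.01, 0.95, 0.98)   # (95/2) * (1/99)
odds_rare <- posterior_odds(0.001, 0.95, 0.98)      # (95/2) * (1/999)
odds_new_test <- posterior_odds(0.01, 0.99, 0.95)   # (99/5) * (1/99)

c(baseline = odds_baseline, rare = odds_rare, new_test = odds_new_test)

# Convert to a probability only at the very end, if at all
prob_baseline <- odds_baseline / (1 + odds_baseline)
prob_baseline
```

Notice that changing the prevalence only rescales the prior odds and changing the test only rescales the likelihood ratio, so each variation is a single multiplication.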

The Solution

Have I given you enough time to come up with your own solution? Fantastic! In case you hadn’t already guessed, that little digression about odds served an important purpose: my solution will use odds rather than probabilities. Our goal is to increase the positive predictive value (PPV) of the test, namely \[ \text{PPV} \equiv P(\text{Has Disease}|\text{Test Positive}), \] by as much as possible, either by improving the test’s sensitivity \[ \text{Sensitivity} \equiv P(\text{Test Positive} | \text{Has Disease}) \] or its specificity \[ \text{Specificity} \equiv P(\text{Test Negative} | \text{Doesn't Have Disease}). \] To answer this question, we’ll start by substituting these definitions into the odds form of Bayes’ Theorem introduced above, yielding \[ \text{Posterior Odds} = \frac{\text{PPV}}{1 - \text{PPV}} = \frac{\text{Sensitivity}}{1 - \text{Specificity}} \times \text{Prior Odds}. \] This expression makes it clear that increasing either the sensitivity or specificity of the test increases the posterior odds. And because the PPV is a strictly increasing function of the posterior odds, namely \[ \text{PPV} = \frac{\text{Posterior Odds}}{1 + \text{Posterior Odds}}, \] this also increases the PPV. So now the question is: which of these two possibilities gives us the most bang for our buck? A natural idea would be to compare the marginal effect of increasing sensitivity by a small amount to the marginal effect of increasing specificity by the same amount. We can do this by comparing the partial derivatives of the PPV with respect to sensitivity and specificity. But, again, the PPV is an increasing function of the posterior odds, so we can simplify our task by comparing the derivatives of the posterior odds with respect to sensitivity and specificity. By the chain rule, any claim about the relative magnitudes of these derivatives computed for the odds will also hold for the PPV.

But why stop with the odds? We can simplify our task even further by comparing the derivatives of the logarithm of the posterior odds with respect to sensitivity and specificity. This is because the logarithm is, again, an increasing transformation of the odds. Since \[ \log(\text{Posterior Odds}) = \log(\text{Sensitivity}) - \log(1 - \text{Specificity}) + \log(\text{Prior Odds}). \] our required derivatives are \[ \frac{\partial \log(\text{Posterior Odds})}{\partial \text{Sensitivity}} = \frac{1}{\text{Sensitivity}} \quad \text{and} \quad \frac{\partial \log(\text{Posterior Odds})}{\partial \text{Specificity}} = \frac{1}{1 - \text{Specificity}}. \] Now for the punchline: the ratio of the derivative with respect to specificity divided by that with respect to sensitivity is \[ \frac{\partial \log(\text{Posterior Odds})/\partial \text{Specificity}}{\partial \log(\text{Posterior Odds})/\partial \text{Sensitivity}} = \frac{1/(1 - \text{Specificity})}{1/\text{Sensitivity}} = \frac{\text{Sensitivity}}{1 - \text{Specificity}} \] and this is precisely the likelihood ratio from the odds form of Bayes’ Theorem! Hence, whenever the likelihood ratio is greater than one we’d prefer to increase the test’s specificity; whenever it’s less than one we’d prefer to increase the sensitivity. If the likelihood ratio is equal to one, then it doesn’t matter which we choose.

Case closed, right? Well not quite. We can say a bit more by thinking about what it means for the likelihood ratio to be greater than or less than one. Examining the odds form of Bayes’ Theorem from above, we see that a likelihood ratio less than one means that our posterior probability that a person is sick falls when she tests positive. In other words, this corresponds to a test that is worse than useless: it’s actually misleading. In contrast, a likelihood ratio greater than one means that the test is informative: a positive test result increases our belief that the person is sick. Any real-world diagnostic test will have a likelihood ratio greater than one. Indeed, if we had such an actively misleading test, we could easily convert it into an informative one by simply reversing the test’s outcome: if someone tests positive, we tell them they’re negative, and vice versa. This reversal would result in a likelihood ratio greater than one. Therefore, in all cases–whether we start with an informative test or reverse a misleading one–we should prefer to increase the test’s specificity.
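A quick numerical sanity check of the derivative argument never hurts. The sketch below uses the breast cancer numbers from earlier (95% sensitivity, 98% specificity, 1% prevalence) purely as an illustrative starting point; the step size `h` and the one-percentage-point improvement are arbitrary choices of mine.

```r
# PPV and log posterior odds as functions of the test characteristics
ppv <- function(sens, spec, prev) {
  odds <- sens / (1 - spec) * prev / (1 - prev)
  odds / (1 + odds)
}
log_post_odds <- function(sens, spec, prev) {
  log(sens) - log(1 - spec) + log(prev / (1 - prev))
}

sens <- 0.95
spec <- 0.98
prev <- 0.01
h <- 1e-6  # step size for central finite differences

# Numerical partial derivatives of the log posterior odds
d_sens <- (log_post_odds(sens + h, spec, prev) -
             log_post_odds(sens - h, spec, prev)) / (2 * h)
d_spec <- (log_post_odds(sens, spec + h, prev) -
             log_post_odds(sens, spec - h, prev)) / (2 * h)

# The ratio of derivatives should match the likelihood ratio
deriv_ratio <- d_spec / d_sens
likelihood_ratio <- sens / (1 - spec)
c(deriv_ratio = deriv_ratio, likelihood_ratio = likelihood_ratio)

# Gain in PPV from a one-percentage-point improvement in each characteristic
gain_sens <- ppv(sens + 0.01, spec, prev) - ppv(sens, spec, prev)
gain_spec <- ppv(sens, spec + 0.01, prev) - ppv(sens, spec, prev)
c(gain_sens = gain_sens, gain_spec = gain_spec)
```

With a likelihood ratio this far above one, the gain from improving specificity should dwarf the gain from improving sensitivity.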

Epilogue

Of course, this exercise is predicated upon the assumption that we want to maximize the PPV and that we can freely adjust both the test’s sensitivity and its specificity. In practice, one or both of these assumptions might not hold. Indeed, PPV is not the be all and end all of diagnostic testing. A full accounting would need to consider the relative costs of false positives and false negatives along with the prevalence of the disease. Still, I hope this exercise gives you a flavor of the power of odds for simplifying complex problems in probability and statistics.

I know first-hand that this sentiment is shared by at least one distinguished professor of probability theory, so at least I’m not completely alone in my earlier view of things!↩︎

How to Read an Econometrics Paper

Sat, 20 Jul 2024 00:00:00 +0000

Reading and understanding econometrics papers can be hard work. Most published articles, even review articles, are written by specialists for specialists. Unless you’re already familiar with the literature, it can be a real uphill battle to make it through a recent paper. In grad school I remember our professors repeatedly admonishing me and the rest of the cohort to “read the papers!” But when I did my best to follow this advice, I nearly always felt like I was banging my head against a wall.

Effective reading is a skill that can be learned, and the only way to learn is through practice. But you can learn the easy way or the hard way. The hard way is to keep trying and hope for the best; the easy way is to adjust your approach based on the experiences of others. With that in mind, this post offers some tips and tricks that I’ve picked up through the years for reading technical material efficiently and effectively. My target audience is PhD students in Economics, especially students in the Econometrics Reading Group at Oxford, but I hope that some of the following tips will be helpful for others as well.

If you have any tips of your own, or if you violently agree or disagree with any of mine, I hope to hear from you in the comments section below!

Read Something Else Instead

The first question to ask yourself is whether you should even be reading this paper in the first place. Just because White’s (1980) paper on heteroskedasticity-robust standard errors is a “classic” in econometrics, that doesn’t mean that you should read it. In fact, as a graduate student just starting out, you probably shouldn’t! The paper that introduces a new idea or procedure is rarely the paper that gives the clearest explanation. Reading a good textbook explanation is a much more effective way to get to grips with a new idea. You might, for example, try reading the relevant chapters in White’s textbook Asymptotic Theory for Econometricians instead.

But sometimes you have to read a particular paper. Maybe it’s the paper you’ve been assigned to present in a reading group, or maybe it’s highly relevant to your own research. In that case you may still want to start by reading something else. For example, there might be a more recent paper or review article that gives a good summary of the idea or method in question. Reading this paper first can make it much easier for you to tackle the original paper.

So to all those professors out there who keep telling their students to “read the papers!” I say: “read the papers, but only after you’ve read something else first!”

Don’t Assume You Have to Understand the Whole Thing

As a general rule you should not expect to understand everything when you read a paper. You may only get 10% on the first read, but that’s fine! Besides papers I’ve written myself, there are relatively few articles that I’ve checked line-by-line from start to finish. Even if you’ve been assigned to present a paper that doesn’t mean that you need to understand every detail of every lemma in the online technical appendix. Instead your goal should be to understand the key ideas and contributions of the paper. Like anything in life, there are diminishing returns to effort in reading a paper. When reading papers to support your own research, you can be even more selective. The key question becomes: “how is this relevant for what I’m doing?” It may be that you only need to understand a small part of the paper to get what you need.

Don’t Assume You’re Stupid

If you’re confused, don’t assume that it’s your fault. Notice your confusion and try to get to the bottom of it without taking things for granted or engaging in negative self-talk. The only way to learn is by getting confused and then unconfusing yourself!

You may be confused because the authors assume you know something that you don’t. They are likely experts in the field who have spent years thinking about this particular question. You, on the other hand, are just starting out. As you gain a bit more context, things may fall rapidly into place. (See my next tip below.)

You may be confused because the paper is confusingly written. Writing is hard, and technical writing is especially hard. The referee process can even make papers more confusing, since our present system for evaluating research involves multiple rounds of revisions in which the authors must try to satisfy referees with differing views. The result is that published papers often contain a substantial element of “cruft” that distracts from the main message.

You may even be confused because the paper is wrong! As a good Bayesian, you shouldn’t immediately jump to the conclusion that you, a newcomer to this field, have stumbled upon a crucial error that everyone else has missed. On the other hand, you definitely shouldn’t believe everything that you see in print! All papers are wrong in some way, and some papers are wrong in serious and important ways. If you’re confused, it’s worth considering whether the authors were confused too!

Spread Yourself Thin

Let’s say you really need to get to grips with paper X on topic Y. You’ve read the relevant textbook material, you’ve tried a review article, and you’re still struggling. What now? Strange though it may sound, one helpful answer is to read more papers on topic Y in an extremely shallow way. Skim the abstracts, introductions, and conclusions. Note any terms or concepts that keep appearing, especially ones that you don’t understand.

I can think of many occasions when I skimmed nine papers and didn’t understand any of them, but then read a tenth and suddenly everything clicked. The key here is context. When you’re new to topic Y, there will be lots of little things that you’ve never thought of before but that the literature takes for granted. Since most papers are written for specialists by specialists, crucial details are often left out or glossed over as if they were obvious. Just as fish don’t realize that they’re in water, specialists often fail to realize that they’re taking a lot of things for granted. The reason that reading many papers can help is that different specialists will leave out different details. The key that you need to understand paper X might be a seemingly throwaway comment in paper Z!

Explain It to Someone Else

The best way to understand something is by trying to explain it to someone else. This holds true even when the “someone else” in question is just a figment of your imagination. As you read, start by trying to explain the paper to yourself in your own words. I find it helpful to write in the margins of the paper as I go, summarizing the key ideas with less jargon and simpler terminology and notation. When you’re confused about something, try to put your confusion into words; make it concrete and write it down.

Talking to a real person can be even more helpful. If you’re in a reading group, try discussing the paper informally with one of your peers who has also read it. You may be surprised at how much two people, neither of whom understands something on their own, can learn from each other. In this brave new world of LLMs like Claude and GPT4o, you could even try uploading your paper and discussing it with an AI. You cannot assume that the AI will necessarily give you reliable information about the paper, but just like a peer who only partially understands it, an AI can be a useful sounding board for your own ideas and confusions. Noticing mistakes in the AI’s understanding, pointing them out and continuing the conversation can also be a great way to clarify your own thinking.

Head Straight for the Simulation / Empirical Example

Ideally every paper would have a fantastic introduction that makes it clear what the paper is about and why it’s important. In real life, introductions can be hit-or-miss. So after reading the introduction, you might consider heading straight for the simulation study and/or empirical example. Most econometrics papers propose a method that solves a particular problem. What is the problem, and why does the particular data generating process (DGP) in the simulation (or the real data in the empirical example) exhibit it? What parameters of the simulation DGP control the extent of the problem? What is the “old” method on which the paper improves? This is likely to be something familiar such as a “textbook” method. How exactly is the new method implemented? In other words, how exactly is it computed from real or simulated data? Try to write down all the steps in the implementation in a sufficiently precise way that you could code it yourself.

Once you know how to answer these questions you’re in a much better position to understand the rest of the paper. As you read through the assumptions and theorems, refer back to the simulation study. Why does the DGP satisfy the assumptions? Can you think of a different DGP in which the assumptions fail? Is there anything “fishy” about the simulation example? Does it seem like the authors have cooked the books in some way, e.g. by introducing a very “mild” version of the central problem, or something else that would be unrealistic in practice? Answering these questions will help you to evaluate the paper, understand its limitations and possibly think about how to improve upon it.

Make Things Simpler

Many econometrics papers present results at an extremely high level of generality. On the one hand this is a good thing. Much of the power of mathematics comes from abstraction and general results are more widely-applicable. But from an expositional standpoint, this is terrible. The history of mathematics is a history of concrete solutions to specific problems that were progressively generalized and expanded over time. The history of ideas mirrors the way that the average person learns most effectively: by starting with concrete examples and then generalizing.

With this in mind, try to simplify the theorems and examples in the paper. Getting rid of covariates often cuts down on both algebra and notation, so start with this. Try re-writing the assumptions and theorems in this simpler notation. Are some of the assumptions confusing? Try strengthening them or try to see if you can find a concrete example in which they hold, possibly taken from the simulation DGP.

Don’t Get Hung Up on Technicalities

Some parts of a paper are “core material” and some parts are “technicalities”. Keeping these separate in your mind will make it much easier to understand a paper. One helpful approach is to make a dependency tree of the assumptions, lemmas, and theorems before trying to understand them. Once you see how things fit together you may notice, for example, that the only role of Proposition 3 is to establish that an appropriate Central Limit Theorem holds and the only role of Assumptions 2-6 is to prove Proposition 3. Fantastic! In this case, just assume the conclusion of Proposition 3 and move on to see where this is needed in the core results. Even when you’re reading assumptions, lemmas, propositions, theorems, and proofs, you should be aiming to get the “big picture” rather than to assimilate every tiny detail.

Be Appropriately Skeptical of Asymptotics

Asymptotics are a crucial tool in econometrics but remember that it is finite sample properties that we actually care about. The “asymptotic distribution” of an estimator is just a thought experiment, not something you can take to the bank. An asymptotic argument is a kind of approximation that in effect supposes that certain things are “negligible.” This approximation could be fantastic or it could be terrible. It’s only through simulation studies that we can really know which is the case. Or, to quote van der Vaart (1998),

strictly speaking, most asymptotic results that are currently available are logically useless. This is because most asymptotic results are limit results, rather than approximations consisting of an approximating formula plus an accurate error bound … This is why there is good asymptotics and bad asymptotics and why two types of asymptotics sometimes lead to conflicting claims … Because it may be theoretically very hard to ascertain that approximation errors are small, one often takes recourse to simulation studies

For an example of “good” versus “bad” asymptotics applied to power analysis, see this post.
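As a minimal sketch of the kind of simulation check described above, consider the textbook large-sample 95% interval for a mean. The design below is purely illustrative and is an assumption of this sketch rather than something taken from a particular paper: iid Exponential(1) draws, whose true mean is one, at two hypothetical sample sizes.

```r
# Coverage of the large-sample interval xbar +/- 1.96 * s / sqrt(n)
# when the data are skewed. Assumed design: iid Exponential(1), true mean 1.
set.seed(1234)

coverage <- function(n, n_reps = 5000) {
  covered <- replicate(n_reps, {
    x <- rexp(n, rate = 1)
    half_width <- qnorm(0.975) * sd(x) / sqrt(n)
    abs(mean(x) - 1) <= half_width   # does the interval cover the true mean?
  })
  mean(covered)                      # share of replications that cover
}

cover_small <- coverage(n = 10)
cover_large <- coverage(n = 1000)
c(n_10 = cover_small, n_1000 = cover_large)
```

If the asymptotic approximation were equally good at both sample sizes, both numbers would sit close to 0.95. How far the small-sample figure falls short is something the limit theorem alone cannot tell us.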

Sims and Uhlig (1991) Replication

Mon, 15 Jul 2024 00:00:00 +0000

As a teaser for our upcoming (2024-07-23) virtual reading group session on Bayesian macro / time series econometrics, this post replicates a classic paper by Sims & Uhlig (1991) contrasting Bayesian and Frequentist inferences for a unit root. In the post I’ll focus on explaining and implementing the authors’ simulation design. In the reading group session (and possibly a future post) we’ll talk more about the paper’s implications for the Bayesian-Frequentist debate and relate it to more recent work by Mueller & Norets (2016). We’ll also be joined by special guest Frank Schorfheide who will help guide us through the recent literature on Bayesian approaches to VARs, including Giannone et al (2015) and (2019). If you’re an Oxford student or staff member, you can sign up for the reading group here. Otherwise, send me an email and I’ll add you manually.

A Simple Example

To set the stage for Sims & Uhlig (1991), consider the following simple example: $X_1, X_2, \dots, X_{100} \sim \text{iid Normal}(\mu, \sigma^2)$ where $\mu$ is unknown but $\sigma$ is known to equal $1$. Let $\bar{X} = \frac{1}{100} \sum_{i=1}^{100} X_i$ be the sample mean. Then $\bar{X} \pm 0.2$ is an approximate 95% Frequentist confidence interval for $\mu$, since $1.96 \times \sigma / \sqrt{100} \approx 0.2$. In words: in 95% of the possible datasets that we could potentially observe, the interval $\bar{X} \pm 0.2$ will cover the true, unknown value of $\mu$; in the remaining $5\%$ of datasets, the interval will not cover $\mu$.

The Frequentist interval conditions on $\mu$ and treats $\bar{X}$ as random. In contrast, a Bayesian credible interval conditions on $\bar{X}$ and treats $\mu$ as random. This doesn’t require us to believe that $\mu$ is “really” random. Bayesian reasoning simply uses the language of probability to express uncertainty about any quantity that we cannot observe. Let $\bar{x}$ be the observed value of $\bar{X}$. Under a vague prior for $\mu$, e.g. a Normal(0, 100) distribution, the 95% Bayesian highest posterior density interval for $\mu$ is approximately $\bar{x} \pm 0.2$. In words: given that we have observed $\bar{X} = \bar{x}$, there is a 95% probability that $\mu$ lies in the interval $\bar{x} \pm 0.2$.
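The two intervals described above can be computed side by side. This is a minimal sketch under stated assumptions: the true mean `mu_true` is a hypothetical value chosen only to generate data, and "Normal(0, 100)" is read as a prior with variance 100 (the text does not say whether 100 is a variance or a standard deviation). The posterior comes from standard normal-normal conjugate updating with known $\sigma = 1$.

```r
# Frequentist 95% interval versus Bayesian 95% interval for a normal mean
# with known sigma = 1 and n = 100.
set.seed(2024)
n <- 100
mu_true <- 0.3                      # hypothetical; unknown in practice
x <- rnorm(n, mean = mu_true, sd = 1)
x_bar <- mean(x)

# Frequentist: x_bar +/- 1.96 * sigma / sqrt(n)
freq_interval <- x_bar + c(-1, 1) * qnorm(0.975) / sqrt(n)

# Bayesian: Normal(0, 100) prior (assumed to mean variance 100) combined with
# a normal likelihood gives a normal posterior
prior_mean <- 0
prior_var <- 100
post_precision <- 1 / prior_var + n          # uses sigma^2 = 1
post_mean <- (prior_mean / prior_var + n * x_bar) / post_precision
post_sd <- sqrt(1 / post_precision)

# A normal posterior is symmetric and unimodal, so the highest posterior
# density interval coincides with the central interval
bayes_interval <- post_mean + c(-1, 1) * qnorm(0.975) * post_sd

rbind(frequentist = freq_interval, bayesian = bayes_interval)
```

With a prior this vague the prior precision of 1/100 is tiny relative to the data precision of 100, which is why the two rows should agree to several decimal places.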

The comforting thing about this example is that, regardless of whether we choose a Bayesian or Frequentist perspective, our inference remains the same: compute the sample mean, then add and subtract $0.2$. This means that the Frequentist interval inherits all the nice properties of Bayesian inferences, and the Bayesian interval has correct Frequentist coverage. This equivalence between Bayesian and Frequentist methods crops up in many simple examples, especially in situations where the sample size is large. But in more complex settings, the two approaches can give radically different answers. And to head off a common misunderstanding, this isn’t because Bayesians use priors. In the limit as we accumulate more and more data, the influence of the prior wanes. The key difference is that Bayesian inference adheres to the likelihood principle, whereas common Frequentist methods do not.¹

A Not-so-simple Example

Sims & Uhlig consider the AR(1) model \[ y_t = \rho y_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim \text{iid Normal}(0, 1) \] and the conditional maximum likelihood estimator given the initial $y_0$, namely \[ \widehat{\rho} = \frac{\sum_{t=1}^T y_{t-1} y_t}{\sum_{t=1}^T y_{t-1}^2}. \] Their simulation contrasts the Frequentist sampling distribution of $\widehat{\rho}|\rho$ with the Bayesian posterior distribution of $\rho|\widehat{\rho}$ under a flat prior on $\rho$. When $\rho$ is near one, these two distributions differ markedly: while the Bayesian posterior is always symmetric and centered at $\rho = \widehat{\rho}$, the Frequentist sampling distribution is highly skewed when $\rho$ is close to one. This shows that the Bayesian-Frequentist equivalence we found in our simple population mean example from above breaks down completely in this more complex example.
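One way to see where the symmetry on the Bayesian side comes from is to look at the likelihood itself. The sketch below is my own illustration, not part of the authors' simulation design. It conditions on the full sample rather than on $\widehat{\rho}$ alone and takes the error variance as known and equal to one. Under those assumptions the conditional log-likelihood is exactly quadratic in $\rho$, with its peak at $\widehat{\rho}$ and curvature $\sum_t y_{t-1}^2$, so a flat prior yields a Normal posterior centered at $\widehat{\rho}$. The seed and the grid are arbitrary choices.

```r
# Conditional log-likelihood of an AR(1) with N(0,1) errors, as a function of rho
set.seed(1991)
nT <- 100
rho_true <- 1                       # simulate a random walk
eps <- rnorm(nT)
y <- numeric(nT + 1)                # y[1] plays the role of y_0 = 0
for (t in 2:(nT + 1)) y[t] <- rho_true * y[t - 1] + eps[t - 1]

y_t <- y[-1]
y_lag <- y[-length(y)]
S <- sum(y_lag^2)
rho_hat <- sum(y_lag * y_t) / S

# Log-likelihood up to an additive constant that does not involve rho
loglik <- function(rho) -0.5 * sum((y_t - rho * y_lag)^2)

# Compare against the quadratic implied by a Normal(rho_hat, 1 / S) density
grid <- seq(rho_hat - 0.1, rho_hat + 0.1, length.out = 201)
ll <- sapply(grid, loglik)
quad <- loglik(rho_hat) - 0.5 * S * (grid - rho_hat)^2

max(abs(ll - quad))                 # zero up to floating-point error
```

Nothing in this calculation depends on whether the true $\rho$ is below, at, or above one. The sampling distribution of $\widehat{\rho}$, by contrast, changes shape as $\rho$ approaches one.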

Sims & Uhlig argue that the Bayesian posterior provides a much more sensible and useful characterization of the information contained in the data and after reading the paper, I’m inclined to agree. My replication code follows below, along with plots of the joint distribution of $(\rho, \widehat{\rho})$ under a uniform prior for $\rho$ and the conditional distributions $\widehat{\rho}|\rho=1$ (Frequentist Sampling Distribution) and $\rho|\widehat{\rho} = 1$ (Bayesian Posterior).²

The Replication

#-------------------------------------------------------------------------------
# Sims, C. A., & Uhlig, H. (1991). Understanding unit rooters: A helicopter tour
#
# (See also: Example 6.10.6 from Poirier "Intermediate Statistics and 'Metrics")
#-------------------------------------------------------------------------------
# In the next section we will proceed to construct, by Monte Carlo, an estimated
# joint pdf for \rho and \hat{\rho} under a uniform prior pdf on \rho. We choose
# 31 values of \rho, from 0.8 to 1.1 at intervals of 0.01. We draw 10000 100 x 1
# iid N(0,1) vectors of random variables to use as realizations of \epsilon. For
# each of the 10000 \epsilon vectors and each of the 31 \rho values, we
# construct a y vector with y(0) = 0, y(t) generated by equation (1).
#
# Equation (1): y(t) = \rho y(t-1) + \epsilon(t), t = 1, ..., T
#
# For each of these y vectors, we construct \hat{\rho}. Using as bins the
# intervals [-\infty, 0.795), [0.795, 0.805), [0.805, 0.815), etc. we construct
# a histogram that estimates the pdf of \hat{\rho} for each fixed \rho value.
# When these histograms are lined up side by side, they form a surface that is
# the joint pdf for \rho and \hat{\rho} under a flat prior on \rho.
#-------------------------------------------------------------------------------
set.seed(1693)
library(tidyverse)
library(tictoc)
library(patchwork)
draw_rho_hat <- function(rho) {
  # Carry out the simulation once for a fixed value of rho; return rho_hat
  nT <- 100
  y <- rep(0, nT + 1)
  for (t in 2:(nT + 1)) {
    y[t] <- rho * y[t - 1] + rnorm(1)
  }
  y_t <- y[-1]
  y_tminus1 <- y[-length(y)]
  sum(y_t * y_tminus1) / sum(y_tminus1^2)
}
# Function to run the simulation for a fixed value of rho (10000 times)
run_sim <- \(rho) map_dbl(1:1e4, \(i) draw_rho_hat(rho))
tic()
foo <- run_sim(0.9)
toc() # ~0.6 seconds on my machine

## 0.595 sec elapsed

# Full sequence of rho values from Sims & Uhlig (1991)
rho <- seq(from = 0.8, to = 1.1, by = 0.01)
tic()
results <- tibble(rho = rho,
                  rho_hat = map(rho, run_sim)) # List columns
toc() # ~17 seconds on my machine (1991 was a long time ago!)

## 16.814 sec elapsed

# The results tibble uses a list column for rho_hat. This is convenient for
# making histograms of the frequentist sampling distribution (rho fixed) but
# not for making histograms of the Bayesian posterior (rho_hat fixed). For the
# latter, we will use the unnest() function to "expand" the list column rho_hat
# into a regular column. This is the "joint" distribution of rho and rho_hat.
joint <- results |>
  unnest(rho_hat)
joint |>
  ggplot(aes(x = rho, y = rho_hat)) +
  geom_density2d_filled() +
  coord_cartesian(ylim = c(0.8, 1.1)) + # Restrict rho_hat axis
  labs(title = "Joint Distribution",
       x = expression(rho),
       y = expression(hat(rho))) +
  theme_minimal()

joint |>
  filter(rho_hat >= 0.995 & rho_hat < 1.005) |>
  ggplot(aes(x = rho)) +
  geom_histogram(binwidth = 0.01, fill = "skyblue", color = "black") +
  labs(title = expression(hat(rho) == 1),
       x = expression(rho),
       y = "Frequency") +
  theme_minimal()

joint |>
  filter(rho == 1) |>
  ggplot(aes(x = rho_hat)) +
  geom_histogram(binwidth = 0.01, fill = "skyblue", color = "black") +
  labs(title = expression(rho == 1),
       x = expression(hat(rho)),
       y = "Frequency") +
  theme_minimal()

# Function that makes the preceding two plots, puts them side-by-side and lets
# the user specify the value of rho/rho_hat that we condition on:
plot_Bayes_vs_Freq <- \(r) {
  p1 <- joint |>
    filter(rho_hat >= r - 0.005 & rho_hat < r + 0.005) |>
    ggplot(aes(x = rho)) +
    geom_histogram(aes(y = after_stat(density)),
                   binwidth = 0.01, fill = "skyblue", color = "black") +
    geom_vline(xintercept = r, color = "red", linetype = "dashed", linewidth = 1) +
    labs(title = bquote(hat(rho) == .(round(r, 3))),
         x = expression(rho)) +
    theme_minimal()
  p2 <- joint |>
    filter(rho >= r - 0.005 & rho < r + 0.005) |>
    ggplot(aes(x = rho_hat)) +
    geom_histogram(aes(y = after_stat(density)),
                   binwidth = 0.01, fill = "skyblue", color = "black") +
    geom_vline(xintercept = r, color = "red", linetype = "dashed", linewidth = 1) +
    labs(title = bquote(rho == .(round(r, 3))),
         x = expression(hat(rho))) +
    theme_minimal()
  p1 + p2
}
plot_Bayes_vs_Freq(0.98)

plot_Bayes_vs_Freq(0.99)

plot_Bayes_vs_Freq(1.0)

plot_Bayes_vs_Freq(1.01)

plot_Bayes_vs_Freq(1.02)

¹ A detailed discussion of the likelihood principle would require at least a whole post of its own. If you want to learn more, I highly recommend the classic monograph by Berger & Wolpert.
² For further discussion of Sims and Uhlig’s illuminating simulation experiment, see Chapter 6 of Poirier.

The Return of econometrics.blog!

Sun, 14 Jul 2024 00:00:00 +0000

After a year-long hiatus, I’m excited to return to regular blogging about econometrics! I have a long list of posts that I’m eager to write, and I hope you’ll find them interesting. To whet your appetite, here’s a preview of some of the topics I plan to cover in the coming weeks:

Bayesian versus Frequentist Approaches to Unit Roots
How Not To Do Regression Adjustment
Understanding the James-Stein Estimator

In the meantime, I have a few econometrics-related announcements:

I’ll be teaching a summer course on causal inference at Oxford this September. If you’re interested in attending, here are the registration details and here’s the course website.
I’m currently running a virtual summer reading group on Bayesian Econometrics that will continue at least until September and potentially beyond, depending on interest. If you have an email address that ends in .ox.ac.uk you can self-register here. If you don’t have an Oxford email address, send me an email and I’ll add you manually.
Martin Weidner and I recently kicked off an initiative to change the way that research in econometrics is assessed. To find out more, visit sqare.org.

I’m looking forward to getting back to regular posting. If you have any special requests, please add them in the comments below.

A Good Instrument is a Bad Control

Thu, 29 Jun 2023 00:00:00 +0000

Here’s a puzzle for you. What will happen if we regress some outcome of interest on both an endogenous regressor and a valid instrument for that regressor? I hadn’t thought about this question until 2018, when one of my undergraduate students asked it during class. If memory serves, my off-the-cuff answer left much to be desired.¹ Five years later I’m finally ready to give a fully satisfactory answer; better late than never I suppose!

The Model

We’ll start by being a bit more precise about the setup. Suppose that $Y$ is related to $X$ according to the following linear causal model \[ Y \leftarrow \alpha + \beta X + U \] where $\beta$ is the causal effect of interest and $U$ represents unobserved causes of $Y$ that may be related to $X$. Now, for any observed random variable $Z$, we can define \[ V \equiv X - (\pi_0 + \pi_1 Z), \quad \pi_0 \equiv \mathbb{E}[X] - \pi_1 \mathbb{E}[Z], \quad \pi_1 \equiv \frac{\text{Cov}(X,Z)}{\text{Var}(Z)}. \] This is the population linear regression of $X$ on $Z$. By construction it satisfies $\mathbb{E}[V] = \text{Cov}(Z,V) = 0$.² Thus we can write, \[ X = \pi_0 + \pi_1 Z + V, \quad \mathbb{E}[V] = \text{Cov}(Z,V) = 0 \] for any random variables $X$ and $Z$, simply by constructing $V$ as described above. If $\pi_1 \neq 0$, we say that $Z$ is relevant. If $\text{Cov}(Z,U) = 0$, we say that $Z$ is exogenous. If $Z$ is both relevant and exogenous, we say that it is a valid instrument for $X$.

As we’ve defined it above, $V$ is simply a regression residual. But if $Z$ is a valid instrument, it turns out that we can think of $V$ as the “endogenous part” of $X$. To see why, expand $\text{Cov}(X,U)$ as follows: \[ \text{Cov}(X,U) = \text{Cov}(\pi_0 + \pi_1 Z + V, \,U) = \pi_1 \text{Cov}(Z,U) + \text{Cov}(U,V) = \text{Cov}(U,V) \] since we have assumed that $\text{Cov}(Z,U) = 0$. In words, the endogeneity of $X$ is precisely the same thing as the covariance between $U$ and $V$.

Here’s a helpful way of thinking about this. If $Z$ is exogenous then our regression of $X$ on $Z$ partitions the overall variation in $X$ into two components: the “good” (exogenous) variation $\pi_1 Z$ is uncorrelated with $U$, while the “bad” (endogenous) variation $V$ is correlated with $U$. The logic of two-stage least squares is that regressing $Y$ on the “good” variation, $\pi_1 Z$, allows us to recover $\beta$, the causal effect of interest.³

A Simulation Example

Using the model and derivations from above, let’s run a little simulation. To simulate a valid instrument $Z$ and an endogenous regressor $X$ we can proceed as follows. First generate independent standard normal draws $\{Z_i\}_{i=1}^n$. Next independently generate pairs of correlated standard normal draws $\{(U_i, V_i)\}_{i=1}^n$ with $\text{Corr}(U_i, V_i) = \rho$. Finally, set \[ X_i = \pi_0 + \pi_1 Z_i + V_i \quad \text{and} \quad Y_i = \alpha + \beta X_i + U_i \] for each value of $i$ between $1$ and $n$.⁴ The following chunk of R code runs this simulation with $n = 5000$, $\rho = 0.5$, $\pi_0 = 0.5$, $\pi_1 = 0.8$, $\alpha = -0.3$ and $\beta = 1$:

set.seed(1234)
n <- 5000
z <- rnorm(n)
library(mvtnorm)
Rho <- matrix(c(1, 0.5,
0.5, 1), 2, 2, byrow = TRUE)
errors <- rmvnorm(n, sigma = Rho)
u <- errors[, 1]
v <- errors[, 2]
x <- 0.5 + 0.8 * z + v
y <- -0.3 + x + u

In the simulation $Z$ is a valid instrument, $X$ is an endogenous regressor, and the true causal effect of interest equals one. Using our simulation data, let’s test out three possible estimators:

$\widehat{\beta}_\text{OLS}\equiv$ the slope coefficient from an OLS regression of $Y$ on $X$.
$\widehat{\beta}_\text{IV}\equiv$ the slope coefficient from an IV regression of $Y$ on $X$ with $Z$ as an instrument.
$\widehat{\beta}_{X.Z}\equiv$ the coefficient on $X$ in an OLS regression of $Y$ on $X$ and $Z$.

c(truth = 1,
b_OLS = cov(x, y) / var(x),
b_IV = cov(z, y) / cov(z, x),
b_x.z = unname(coef(lm(y ~ x + z))[2])) |> # unname() makes the names prettier!
round(2)

## truth b_OLS b_IV b_x.z
## 1.00 1.31 1.01 1.49

As expected, OLS is far from the truth while IV pretty much nails it. Interestingly, the regression of y on x and z gives the worst performance of all! Is this just a fluke? Perhaps it’s an artifact of the simulation parameters I chose, or just bad luck arising from some unusual simulation draws. To find out, we’ll need a bit more algebra. But stay with me: the payoff is worth it, and there’s not too much extra math required!

The General Result

Regression of $Y$ on $X$ and $Z$

The coefficient on $X$ in a population linear regression of $Y$ on $X$ and $Z$ is given by \[ \beta_{X.Z} = \frac{\text{Cov}(\tilde{X}, Y)}{\text{Var}(\tilde{X})} \] where $\tilde{X}$ is defined as the residual in another population linear regression: the regression of $X$ on $Z$.⁵ But wait a minute: we’ve already seen this residual! Above we called it $V$ and used it to write $X = \pi_0 + \pi_1 Z + V$. Using this equation, along with the linear causal model relating $Y$ to $X$ and $U$, we can re-express $\beta_{X.Z}$ as \[ \begin{aligned} \beta_{X.Z} &= \frac{\text{Cov}(V, Y)}{\text{Var}(V)} = \frac{\text{Cov}(V, \alpha + \beta X + U)}{\text{Var}(V)}\\ &= \frac{\text{Cov}(U,V) + \beta\text{Cov}(V, \pi_0 + \pi_1 Z + V)}{\text{Var}(V)}\\ &= \beta + \frac{\text{Cov}(U,V)}{\text{Var}(V)} \end{aligned} \] since $\text{Cov}(Z, V) = 0$ by construction, so that $\text{Cov}(V, \pi_0 + \pi_1 Z + V) = \text{Var}(V)$. We have some simulation data at our disposal, so let’s check this calculation. In the simulation $\beta = 1$ and \[ \frac{\text{Cov}(U, V)}{\text{Var}(V)} = 0.5 \] since $\text{Var}(U) = \text{Var}(V) = 1$ and $\text{Cov}(U, V) = 0.5$. Therefore $\beta_{X.Z} = 1.5$. And, indeed, this is almost exactly the value of our estimate from our simulation above.
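The residual-regression formula for $\beta_{X.Z}$ also holds exactly in sample, which is easy to check numerically. The sketch below regenerates data from the same design as the simulation above, with one substitution that is an assumption of this sketch: the correlated errors are built in base R as `v = 0.5 * u + sqrt(1 - 0.5^2) * e` rather than drawn with `mvtnorm`, so the draws and estimates will not match the earlier output digit for digit.

```r
set.seed(1234)
n <- 5000
z <- rnorm(n)
u <- rnorm(n)
v <- 0.5 * u + sqrt(1 - 0.5^2) * rnorm(n)  # Var(v) = 1 and Cov(u, v) = 0.5
x <- 0.5 + 0.8 * z + v
y <- -0.3 + x + u

# Step 1: residual from regressing x on z (the sample analogue of V)
v_hat <- resid(lm(x ~ z))

# Step 2: slope from regressing y on that residual
b_fwl <- cov(v_hat, y) / var(v_hat)

# Compare with the coefficient on x in the "long" regression of y on x and z
b_xz <- unname(coef(lm(y ~ x + z))["x"])

c(b_fwl = b_fwl, b_x.z = b_xz)
```

The two numbers should agree up to floating-point error, and both should be close to the population value of 1.5 derived above.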

Regression of $Y$ on $X$ Only

So far so good. Now what about the “usual” OLS estimator? A quick calculation gives \[ \beta_{\text{OLS}} = \beta + \frac{\text{Cov}(X,U)}{\text{Var}(X)} = \beta + \frac{\text{Cov}(V,U)}{\text{Var}(X)} \] using the fact that $\text{Cov}(X,U) = \text{Cov}(U,V)$, as explained above. Again, we can check this against our simulation results. We know that $\text{Cov}(V,U) = 0.5$ and \[ \text{Var}(X) = \text{Var}(\pi_0 + \pi_1 Z + V) = \pi_1^2 \text{Var}(Z) + \text{Var}(V) = (0.8)^2 + 1 = 41/25 \] since $Z$ and $V$ are uncorrelated by construction, $\text{Var}(Z) = \text{Var}(V) = 1$ and $\pi_1 = 0.8$ in the simulation design. Hence, $\beta_{\text{OLS}} = 1 + 25/82 \approx 1.305$. Again, this agrees almost perfectly with our simulation.

Comparing the Results

To summarize, we have shown that \[ \beta_{X.Z} = \beta + \frac{\text{Cov}(U,V)}{\text{Var}(V)}, \quad \text{while} \quad \beta_{\text{OLS}} = \beta + \frac{\text{Cov}(U,V)}{\text{Var}(X)}. \] There is only one difference between these two expressions: $\beta_{X.Z}$ has $\text{Var}(V)$ where $\beta_{\text{OLS}}$ has $\text{Var}(X)$. Returning to our expression for $\text{Var}(X)$ from above, \[ \text{Var}(X) = \pi_1^2 \text{Var}(Z) + \text{Var}(V) > \text{Var}(V) \] as long as $\pi_1 \neq 0$ and $\text{Var}(Z) \neq 0$. In other words, there is always more variation in $X$ than there is in $V$, since $V$ is the “leftover” part of $X$ after regressing on $Z$. Because the variances of $X$ and $V$ appear in the denominators of our expressions from above, it follows that, whenever $X$ is endogenous so that $\text{Cov}(U,V) \neq 0$, \[ \left| \text{Cov}(U,V)/\text{Var}(V)\right| > \left| \text{Cov}(U,V)/\text{Var}(X)\right|. \] In other words, $\beta_{X.Z}$ is always farther from the truth than $\beta_{\text{OLS}}$, exactly as we found in our simulation.
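To get a feel for how the two population biases compare, we can evaluate the expressions above for a few instrument strengths. This is a small sketch: the helper `bias_table()` and the particular values of $\pi_1$ are hypothetical choices for illustration, with the remaining parameters set to match the simulation design ($\text{Cov}(U,V) = 0.5$ and $\text{Var}(V) = \text{Var}(Z) = 1$).

```r
# Population bias of OLS (y on x) versus the regression of y on x and z
bias_table <- function(pi1, cov_uv = 0.5, var_v = 1, var_z = 1) {
  var_x <- pi1^2 * var_z + var_v      # Var(X) = pi1^2 Var(Z) + Var(V)
  data.frame(pi1 = pi1,
             bias_OLS = cov_uv / var_x,
             bias_x.z = cov_uv / var_v)
}

biases <- bias_table(pi1 = c(0.1, 0.8, 2))
biases
```

The bias from controlling for $Z$ does not depend on $\pi_1$ at all, while the OLS bias shrinks as the instrument gets stronger. So the stronger the instrument, the larger the gap between the two.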

Some Intuition

In our simulation, $\widehat{\beta}_{X.Z}$ gave a worse estimate of $\beta$ than $\widehat{\beta}_\text{OLS}$. The derivations from above show that this wasn’t a fluke: adding a valid instrument $Z$ as an additional control regressor only makes the bias in our estimated causal effect worse than it was to begin with. This holds for any valid instrument and any endogenous regressor in a linear causal model. I hope you found the derivations from above convincing. But even so, you may be wondering if there’s an intuitive explanation for this phenomenon. I am pleased to inform you that the answer is yes!

In an earlier post I described the control function approach to instrumental variables regression. That post showed that the coefficient on $X$ in a regression of $Y$ on $X$ and $V$ gives the correct causal effect. We don’t know $V$, but we can estimate it by regressing $X$ on $Z$ and saving the residuals. The logic of multiple regression shows that including $V$ as a control regressor “soaks up” the portion of $X$ that is explained by $V$. Because $V$ represents the “bad” (endogenous) variation in $X$, this solves our endogeneity problem. In effect, $V$ captures the unobserved “omitted variables” that play havoc with a naive regression of $Y$ on $X$.
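The control function recipe in the preceding paragraph can be sketched in a few lines. As an assumption of this sketch, the data are regenerated from the same design as the simulation above using base R only (the correlated errors are built by hand rather than with `mvtnorm`), so the draws differ from those used earlier in the post.

```r
set.seed(1234)
n <- 5000
z <- rnorm(n)
u <- rnorm(n)
v <- 0.5 * u + sqrt(1 - 0.5^2) * rnorm(n)  # Var(v) = 1 and Cov(u, v) = 0.5
x <- 0.5 + 0.8 * z + v
y <- -0.3 + x + u

# First stage: regress x on z and save the residuals as an estimate of V
v_hat <- resid(lm(x ~ z))

# Control function: regress y on x and the first-stage residuals
b_cf <- unname(coef(lm(y ~ x + v_hat))["x"])

# Simple IV estimator for comparison
b_iv <- cov(z, y) / cov(z, x)

c(truth = 1, b_cf = b_cf, b_IV = b_iv)
```

In this just-identified linear setting the control function coefficient on `x` is numerically identical to the IV estimate, which is another way of seeing that controlling for the estimated $V$ leaves only the “good” variation $\widehat{\pi}_1 Z$ to identify $\beta$.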

Now, contrast this with a regression of $Y$ on $X$ and $Z$. In this case, we soak up the variation in $X$ that is explained by $Z$. But $Z$ represents the good (exogenous) variation in $X$! Soaking up this variation leaves only the bad variation behind, making our endogeneity problem worse than it was to begin with. In this example, $Z$ is what is known as a bad control, a control regressor that makes things worse rather than better. A common piece of advice for avoiding bad controls is to only include control regressors that are correlated with $X$ and $Y$ but are not themselves caused by $X$. The example in this post shows that this advice is wrong. Here $Z$ is not caused by $X$, and is correlated with both $X$ and $Y$. Nevertheless, it is a bad control. In short, a valid instrument provides a powerful way to carry out causal inference from observational data, but only if you use it in the right way. A good instrument is a bad control!

¹ I seem to recall saying something like “this won’t in general give us the causal effect we’re interested in, but I don’t think it’s possible to say anything more without extra assumptions.” Fortunately my lackluster response didn’t derail the student who asked the question: he’s currently pursuing a PhD in Economics at UChicago!
² Check if you don’t believe me: substitute the expressions for $\pi_0$ and $\pi_1$, take expectations / covariances, and simplify.
³ See this blog post for more discussion.
⁴ We don’t necessarily need $Z_i$ to be normally distributed, as long as it’s independent of $(U_i, V_i)$, so you could use e.g. uniform draws if you prefer. Generating $(U_i, V_i)$ from a bivariate normal distribution isn’t necessary either, but it’s a simple way of controlling the endogeneity in $X$.
⁵ This is a special case of the so-called FWL Theorem, although I’d argue that we should call it “Yule’s Rule” since George Udny Yule was arguably the first person to popularize it, decades before F, W, or L.

The R Formula Cheatsheet

Wed, 19 Apr 2023 00:00:00 +0000

R’s formula syntax is extremely powerful but can be confusing for beginners.¹ This post is a quick reference covering all of the symbols that have a “special” meaning inside of an R formula: ~, +, ., -, 1, :, *, ^, and I(). You may never use some of these in practice, but it’s nice to know that they exist. It was many years before I realized that I could simply type y ~ x * z instead of the lengthier y ~ x + z + x:z, for example. While R formulas crop up in a variety of places, they are probably most familiar as the first argument of lm(). For this reason, my verbal explanations assume a simple linear regression setting in which we hope to predict y using a number of regressors x, z, and w.

Symbol	Purpose	Example	In Words
`~`	separate LHS and RHS of formula	`y ~ x`	regress `y` on `x`
`+`	add variable to a formula	`y ~ x + z`	regress `y` on `x` and `z`
`.`	denotes “everything else”	`y ~ .`	regress `y` on all other variables in a data frame
`-`	remove variable from a formula	`y ~ . - x`	regress `y` on all other variables except `x`
`1`	denotes intercept	`y ~ x - 1`	regress `y` on `x` without an intercept
`:`	construct interaction term	`y ~ x + z + x:z`	regress `y` on `x`, `z`, and the product `x` times `z`
`*`	shorthand for levels plus interaction	`y ~ x * z`	regress `y` on `x`, `z`, and the product `x` times `z`
`^`	higher order interactions	`y ~ (x + z + w)^3`	regress `y` on `x`, `z`, `w`, all two-way interactions, and the three-way interaction
`I()`	“as-is” - override special meanings of other symbols from this table	`y ~ x + I(x^2)`	regress `y` on `x` and `x` squared
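A quick way to check what a formula expands to, without fitting anything, is to inspect its terms. The sketch below uses `terms()` from base R; the helper `term_labels()` and the toy data frame `d` are hypothetical names introduced only for this illustration.

```r
# Toy data frame, needed only so that `.` has something to expand to
d <- data.frame(y = c(1, 3, 2, 5, 4),
                x = 1:5,
                z = c(2, 1, 4, 3, 5),
                w = c(5, 3, 1, 2, 4))

# Helper: the right-hand-side terms that a formula expands to
term_labels <- function(f, data = NULL) {
  attr(terms(f, data = data), "term.labels")
}

term_labels(y ~ x * z)               # same terms as the next line
term_labels(y ~ x + z + x:z)
term_labels(y ~ (x + z + w)^3)       # main effects plus all interactions
term_labels(y ~ . - x, data = d)     # everything in d except y and x
term_labels(y ~ x + I(x^2))          # I() protects the arithmetic meaning of ^
term_labels(y ~ x + x^2)             # without I(), x^2 collapses to x
attr(terms(y ~ x - 1), "intercept")  # 0 means "no intercept"
```

Comparing the last two `term_labels()` calls shows why `I()` matters: inside a formula `^` means "cross terms up to this order", so `x^2` on its own adds nothing beyond `x`.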

¹ Fun fact: R’s formula syntax originated in this 1973 paper by Wilkinson and Rogers.

Random Variables Cheatsheet

Sat, 07 Jan 2023 00:00:00 +0000

To do well in an econometrics or statistics course at any level, you need to have a large number of simple properties of random variables at your fingertips. Some years back I made a handout containing the most important properties for my undergraduate students at the University of Pennsylvania. In the hopes that this might be of use to others, I’ve released an updated pdf on github. You can fork the repository here. If you spot any errors or want to suggest any additions, feel free to raise an issue or send me a pull request.

Why Econometrics is Confusing Part II: The Independence Zoo

Sun, 01 Jan 2023 00:00:00 +0000

In econometrics it’s absolutely crucial to keep track of which things are dependent and which are independent. To make this as confusing as possible for students, a typical introductory econometrics course moves back and forth between different notions of dependence, stopping occasionally to mention that they’re not equivalent but never fully explaining why, on the premise that “you’ve certainly already learned this in your introductory probability and statistics course.” I remember finding this extremely frustrating as a student, but only recently managed to translate this frustration into meaningful changes in my own teaching.¹ Building on some of my recent teaching materials, this post is a field guide to the menagerie–or at least petting zoo–of “dependence” notions that appear regularly in econometrics. We’ll examine each property on its own along with the relationships between them, using two simple examples to build your intuition. Since a picture is worth a thousand words, here’s one that summarizes the entire post:

Figure 1: Different notions of dependence in econometrics and their relationships. A directed double arrow indicates that one property implies another.

Prerequisites

While written at an introductory level, this post assumes basic familiarity with calculations involving discrete and continuous random variables. In particular, I assume that:

You know the definitions of expected value, variance, covariance, and correlation.
You are comfortable working with joint, marginal, and conditional distributions of a pair of discrete random variables.
You understand the uniform distribution and how to compute its moments (mean, variance, etc.).
You’ve encountered the notion of conditional expectation and the law of iterated expectations.

If you’re a bit rusty on this material, lectures 7-11 from these slides should be helpful. For bivariate, discrete distributions I also suggest watching this video from 1:07:00 to the end and this other video from 0:00:00 up to the one hour mark.

Two Examples

Example #1 - Discrete RVs $(X,Y)$

My first example involves two discrete random variables $X$ and $Y$ with joint probability mass function $p_{XY}(x,y)$ given by

	$Y=0$	$Y=1$
$X = -1$	$1/3$	$0$
$X = 0$	$0$	$1/3$
$X= 1$	$1/3$	$0$

Even without doing any math, we see that knowing $X$ conveys information about $Y$, and vice-versa. For example, if $X = -1$ then we know that $Y$ must equal zero. Similarly, if $Y=1$ then $X$ must equal zero. Spend a bit of time thinking about this joint distribution before reading further. We’ll have plenty of time for mathematics below, but it’s always worth seeing where our intuition takes us before calculating everything.

To streamline our discussion below, it will be helpful to work out a few basic results about $X$ and $Y$. A quick calculation with $p_{XY}$ shows that \[ \mathbb{E}(XY) \equiv \sum_{\text{all } x} \sum_{\text{all } y} x y \cdot p_{XY}(x,y) = 0. \] Calculating the marginal pmf of $X$ we see that \[ p_X(-1) = p_X(0) = p_X(1) = 1/3 \implies \mathbb{E}(X) \equiv \sum_{\text{all } x} x \cdot p_X(x) = 0. \] Similarly, calculating the marginal pmf of $Y$, we obtain \[ p_Y(0) = 2/3,\, p_Y(1) = 1/3 \implies \mathbb{E}(Y) \equiv \sum_{\text{all } y} y \cdot p_Y(y) = 1/3. \] We’ll use these results as ingredients below as we explain and relate three key notions of dependence: correlation, conditional mean independence, and statistical independence.
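These calculations are short enough to do by hand, but they can also be checked mechanically. Here is a minimal sketch that stores the joint pmf from the table above as a matrix and computes the same three expectations. The object names are my own.

```r
# Joint pmf of Example #1: rows are x = -1, 0, 1; columns are y = 0, 1
x_vals <- c(-1, 0, 1)
y_vals <- c(0, 1)
p_xy <- matrix(c(1/3, 0,
                 0,   1/3,
                 1/3, 0),
               nrow = 3, byrow = TRUE)

# Marginal pmfs: sum the joint pmf over the other variable
p_x <- rowSums(p_xy)
p_y <- colSums(p_xy)

# Expectations as probability-weighted sums
E_X <- sum(x_vals * p_x)
E_Y <- sum(y_vals * p_y)
E_XY <- sum(outer(x_vals, y_vals) * p_xy)  # outer() gives every product x * y

c(E_X = E_X, E_Y = E_Y, E_XY = E_XY)
```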

Example #2 - Continuous RVs $(W,Z)$

My second example concerns two continuous random variables $W$ and $Z$, where $W \sim \text{Uniform}(-1, 1)$ and $Z = W^2$. In this example, $W$ and $Z$ are very strongly related: if I tell you that the realization of $W$ is $w$, then you know for sure that the realization of $Z$ must be $w^2$. Again, keep this intuition in mind as we work through the mathematics below.

In the remainder of the post, we’ll find it helpful to refer to a few properties of $W$ and $Z$, namely \[ \begin{aligned} \mathbb{E}[W] &\equiv \int_{-\infty}^\infty w\cdot f_W(w)\, dw = \int_{-1}^1 w\cdot \frac{1}{2}\,dw = \left. \frac{w^2}{4}\right|_{-1}^1 = 0\\ \mathbb{E}[Z] &\equiv \mathbb{E}[W^2] = \int_{-\infty}^{\infty} w^2 \cdot f_W(w)\, dw = \int_{-1}^1 w^2 \cdot \frac{1}{2} \, dw = \left. \frac{w^3}{6}\right|_{-1}^1 = \frac{1}{3}\\ \mathbb{E}[WZ] &= \mathbb{E}[W^3] \equiv \int_{-\infty}^\infty w^3 \cdot f_W(w)\, dw =\int_{-1}^1 w^3 \cdot \frac{1}{2}\, dw = \left. \frac{w^4}{8} \right|_{-1}^1 = 0. \end{aligned} \] Since $W$ is uniform on the interval $[-1,1]$, its pdf is simply $1/2$ on this interval, and zero otherwise. All else equal, I prefer easy integration problems!
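If you’d rather not integrate by hand, the same three moments can be checked numerically. This sketch uses base R’s `integrate()` together with `dunif()` for the Uniform(-1, 1) density. The object names are my own.

```r
# Density of W ~ Uniform(-1, 1): equal to 1/2 on [-1, 1] and zero elsewhere
f_W <- function(w) dunif(w, min = -1, max = 1)

# Numerical versions of the three integrals above
E_W  <- integrate(function(w) w   * f_W(w), lower = -1, upper = 1)$value
E_Z  <- integrate(function(w) w^2 * f_W(w), lower = -1, upper = 1)$value
E_WZ <- integrate(function(w) w^3 * f_W(w), lower = -1, upper = 1)$value

c(E_W = E_W, E_Z = E_Z, E_WZ = E_WZ)
```

Up to numerical error, these should reproduce the values $0$, $1/3$, and $0$ derived above.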

Uncorrelatedness

Recall that the correlation between two random variables $X$ and $Y$ is defined as \[ \text{Corr}(X,Y) \equiv \frac{\text{Cov}(X,Y)}{\text{SD}(X)\text{SD}(Y)} = \frac{\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]}{\sqrt{\mathbb{E}[(X - \mu_X)^2]\mathbb{E}[(Y - \mu_Y)^2]}} \] where $\mu_X \equiv \mathbb{E}(X)$ and $\mu_Y \equiv \mathbb{E}(Y)$. We say that $X$ and $Y$ are uncorrelated if $\text{Corr}(X,Y)= 0$. As long as neither $X$ nor $Y$ is a constant, both of their variances are positive. This means that the denominator of our expression for $\text{Corr}(X,Y)$ is likewise positive. It follows that zero correlation is the same thing as zero covariance. Correlation is simply covariance rescaled so that the units of $X$ and $Y$ cancel out and the result always lies between $-1$ and $1$.

Correlation and covariance are both measures of linear dependence. If $X$ is, on average, above its mean when $Y$ is above its mean, then $\text{Corr}(X,Y)$ and $\text{Cov}(X,Y)$ are both positive. If $X$ is, on average, below its mean when $Y$ is above its mean, then $\text{Corr}(X,Y)$ and $\text{Cov}(X,Y)$ are both negative. If there is, on average, no linear relationship between $X$ and $Y$, then both the correlation and covariance between them are zero. Using the “shortcut formula” for covariance, namely \[ \text{Cov}(X,Y) \equiv \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y], \] it follows that uncorrelatedness is equivalent to \[ \mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]. \] Rendering this in English rather than mathematics,

Two random variables $X$ and $Y$ are uncorrelated if and only if the expectation of their product equals the product of their expectations.

Example #1: $X$ and $Y$ are uncorrelated.

In Example #1 from above, $\mathbb{E}[XY]=0$ and $\mathbb{E}(X)\mathbb{E}(Y) = 0 \times 1/3 = 0$ so $X$ and $Y$ are uncorrelated. Lack of correlation is one possible way in which two random variables can be thought of as “unrelated.” But it is a relatively weak property. Indeed, $X$ and $Y$ are in fact highly dependent in Example #1. For example, if $X=-1$ then we know for sure that $Y=0$. I simply cooked up the numbers to ensure that $\mathbb{E}[XY]=\mathbb{E}[X]\mathbb{E}[Y]$ in spite of this.

Example #2: $W$ and $Z$ are uncorrelated.

Because Example #1 is discrete, it can be a bit tricky to think about what it would mean for a dependence relationship to be nonlinear. Here Example #2 can help. As mentioned above, there is clearly a relationship between $Z$ and $W$. But this relationship is nonlinear in that $Z$ is a quadratic function of $W$. Since $\mathbb{E}(WZ) = 0$ and $\mathbb{E}(W) \times \mathbb{E}(Z) = 0 \times \mathbb{E}(Z) = 0$, we see that $W$ and $Z$ are uncorrelated. Another way to see this is by simulating some data with the same properties as Example #2:

set.seed(1983)
n_sims <- 250
w <- runif(n_sims, -1, 1)
z <- w^2
cor(w, z)

## [1] 0.008379101

plot(w, z)
abline(lm(z ~ w))

The regression line is flat despite there being an obvious relationship between $W$ and $Z$. When $W$ is positive, there is a positive relationship between the two RVs; but when $W$ is negative the picture is reversed. The line of best fit “averages out” the increasing and decreasing relationships on either side of zero to give an overall slope of zero.²

Conditional Mean Independence

We say that $Y$ is mean independent of $X$ if $\mathbb{E}(Y|X) = \mathbb{E}(Y)$. In words,

$Y$ is mean independent of $X$ if the conditional mean of $Y$ given $X$ equals the unconditional mean of $Y$.

Just to make things confusing, this property is sometimes called “conditional mean independence” and sometimes called simply “mean independence.” The terms are completely interchangeable. Reversing the roles of $X$ and $Y$, we say that $X$ is mean independent of $Y$ if the conditional mean of $X$ given $Y$ is the same as the unconditional mean of $X$. Spoiler alert: it is possible for $X$ to be mean independent of $Y$ while $Y$ is not mean independent of $X$. We’ll discuss this further below.

To better understand the concept of mean independence, let’s quickly review the difference between an unconditional mean and a conditional mean. The unconditional mean $\mathbb{E}(Y)$, also known as the “expected value” or “expectation” of $Y$, is a constant number.³ If $Y$ is discrete, this is simply the probability-weighted average of all possible realizations of $Y$, namely \[ \mathbb{E}(Y) = \sum_{\text{all } y} y \cdot p_Y(y). \] If $Y$ is continuous, it’s the same idea but with an integral replacing the sum and a probability density $f_Y(y)$ multiplied by $dy$ replacing the probability mass function $p_Y(y)$. Either way, we’re simply multiplying numbers together and adding up the result. Despite the similarity in notation, the conditional expectation $\mathbb{E}(Y|X)$ is a function of $X$ that tells us how the mean of $Y$ varies with $X$. Since $X$ is a random variable, so is $\mathbb{E}(Y|X)$. If $Y$ is conditionally mean independent of $X$ then $\mathbb{E}(Y|X)$ equals $\mathbb{E}(Y)$. In words, the mean of $Y$ does not vary with $X$. Regardless of the value that $X$ takes on, the mean of $Y$ is the same: $\mathbb{E}(Y)$.

There’s another way to think about this property in terms of prediction. With a bit of calculus, we can show that $\mathbb{E}(Y)$ solves the following optimization problem: \[ \min_{\text{all constants } c} \mathbb{E}[(Y - c)^2]. \] In other words, $\mathbb{E}(Y)$ is the constant number that is as close as possible to $Y$ on average, where “close” is measured by squared Euclidean distance. In this sense, we can think of $\mathbb{E}(Y)$ as our “best guess” of the value that $Y$ will take. Again using a bit of calculus, it turns out that $\mathbb{E}(Y|X)$ solves the following optimization problem: \[ \min_{\text{all functions } g} \mathbb{E}[\{Y - g(X) \}^2]. \] (See this video for a proof.) Thus, $\mathbb{E}(Y|X)$ is the function of $X$ that is as close as possible to $Y$ on average, where “close” is again measured by squared Euclidean distance. In this sense, $\mathbb{E}(Y|X)$ is our “best guess” of $Y$ after observing $X$. We have seen that $\mathbb{E}(Y)$ and $\mathbb{E}(Y|X)$ are the solutions to two related but distinct optimization problems; the former is a constant number that doesn’t depend on the realization of $X$ whereas the latter is a function of $X$. Mean independence is the special case in which the solutions to the two optimization problems coincide: $\mathbb{E}(Y|X) = \mathbb{E}(Y)$. Therefore,

$Y$ is mean independent of $X$ if our best guess of $Y$ taking $X$ into account is the same as our best guess of $Y$ ignoring $X$, where “best” is defined by “minimizes average squared distance to $Y$.”
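If you’d rather see the “best constant guess” property than take it on faith, here is a minimal numerical sketch. It assumes a hypothetical binary random variable with $\mathbb{P}(Y=1) = 1/3$, and simply asks R to minimize the mean squared error over constants; the function name `mse` and the search interval are my own choices.

```r
# Hypothetical discrete RV: y in {0, 1} with P(Y = 1) = 1/3, so E(Y) = 1/3
y_vals <- c(0, 1)
p_y    <- c(2/3, 1/3)

# Mean squared error from guessing the constant `guess` for Y
mse <- function(guess) sum(p_y * (y_vals - guess)^2)

# Numerically search for the constant that minimizes the MSE
best_guess <- optimize(mse, interval = c(-1, 2))$minimum

c(best_guess = best_guess, mean_of_y = sum(y_vals * p_y))
```

Up to the numerical tolerance of `optimize()`, the minimizer should coincide with the mean of $Y$.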

Example #1: $X$ is mean independent of $Y$.

Using the table of joint probabilities for Example #1 above, we found that $\mathbb{E}(X) = 0$. To determine whether $X$ is mean independent of $Y$, we need to calculate $\mathbb{E}(X|Y=y)$, which we can accomplish as follows: \[ \begin{aligned} \mathbb{E}(X|Y=0) &= \sum_{\text{all } x} x \cdot \mathbb{P}(X=x|Y=0) = \sum_{\text{all } x} x \cdot \frac{\mathbb{P}(X=x,Y=0)}{\mathbb{P}(Y=0)}\\ \\ \mathbb{E}(X|Y=1) &= \sum_{\text{all } x} x \cdot \mathbb{P}(X=x|Y=1) = \sum_{\text{all } x} x \cdot \frac{\mathbb{P}(X=x,Y=1)}{\mathbb{P}(Y=1)}. \end{aligned} \] Substituting the joint and marginal probabilities from the table above, we find that \[ \mathbb{E}(X|Y=0) = 0, \quad \mathbb{E}(X|Y=1) = 0. \] Thus $\mathbb{E}(X|Y=y)$ simply equals zero, regardless of the realization $y$ of $Y$. Since $\mathbb{E}(X) = 0$ we have shown that $X$ is conditionally mean independent of $Y$.

Example #1: $Y$ is NOT mean independent of $X$.

To determine whether $Y$ is mean independent of $X$ we need to calculate $\mathbb{E}(Y|X)$. But this is easy. From the table we see that $Y$ is known with certainty after we observe $X$: if $X = -1$ then $Y = 0$, if $X = 0$ then $Y = 1$, and if $X = 1$ then $Y = 0$. Thus, without doing any math at all we find that \[ \mathbb{E}(Y|X=-1) = 0, \quad \mathbb{E}(Y|X=0) = 1, \quad \mathbb{E}(Y|X=1) = 0. \] (If you don’t believe me, work through the arithmetic yourself!) This clearly depends on $X$, so $Y$ is not mean independent of $X$.
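For anyone who does want to work through the arithmetic, here is one way to do it in R. As a labeled assumption, the joint pmf below is the one implied by the probabilities quoted in this post ($X$ uniform on $\{-1, 0, 1\}$, with $Y = 1$ exactly when $X = 0$); the variable names are illustrative only.

```r
# Joint pmf for Example #1: rows are x in {-1, 0, 1}, columns are y in {0, 1}
x_vals <- c(-1, 0, 1)
y_vals <- c(0, 1)
p_xy <- matrix(c(1/3, 0,
                 0,   1/3,
                 1/3, 0),
               nrow = 3, byrow = TRUE)

# Conditional pmfs via the definition of conditional probability
p_x_given_y <- sweep(p_xy, 2, colSums(p_xy), "/")  # column j holds P(X = x | Y = y_j)
p_y_given_x <- sweep(p_xy, 1, rowSums(p_xy), "/")  # row i holds P(Y = y | X = x_i)

# Conditional means
E_x_given_y <- colSums(x_vals * p_x_given_y)      # one value for each y
E_y_given_x <- as.vector(p_y_given_x %*% y_vals)  # one value for each x

E_x_given_y  # does not vary with y
E_y_given_x  # varies with x
```

The first vector is constant across values of $y$ while the second changes with $x$, which is the asymmetry between the two directions of mean independence in this example.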

Example #2: $Z$ is NOT mean independent of $W$.

Above we calculated that $\mathbb{E}(Z) = \mathbb{E}(W^2) = 1/3$. But the conditional expectation is \[ \mathbb{E}(Z|W) = \mathbb{E}(W^2|W) = W^2 \] using the “taking out what is known” property: conditional on $W$, we know $W^2$ and can hence treat it as though it were a constant in an unconditional expectation, pulling it in front of the $\mathbb{E}$ operator. We see that $\mathbb{E}(Z|W)$ does not equal $1/3$: its value depends on $W$. Therefore $Z$ is not mean independent of $W$.

Example #2: $W$ is mean independent of $Z$.

This one is trickier. To keep this post at an elementary level, my explanation won’t be completely rigorous. For more details see here. We need to calculate $\mathbb{E}(W|Z)$. Since $Z \equiv W^2$ this is the same thing as $\mathbb{E}(W|W^2)$. Let’s start with an example. Suppose we observe $Z = 1$. This means that $W^2 = 1$ so $W$ either equals $1$ or $-1$. How likely is each of these possible realizations of $W$ given that $W^2 = 1$? Because the density of $W$ is symmetric about zero, $f_W(-1) = f_W(1)$. So given that $W^2 = 1$, it is just as likely that $W = 1$ as it is that $W = -1$. Therefore, \[ \mathbb{E}(W|W^2 = 1) = 0.5 \times 1 + 0.5 \times (-1) = 0. \] Generalizing this idea, if we observe $Z = z$ then $W = \sqrt{z}$ or $-\sqrt{z}$. But since $f_W(\cdot)$ is symmetric about zero, these possibilities are equally likely. Therefore, \[ \mathbb{E}(W|Z=z) = 0.5 \times \sqrt{z} - 0.5 \times \sqrt{z} = 0. \] Above we calculated that $\mathbb{E}(W) = 0$. Therefore, $W$ is mean independent of $Z$.
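A simulation gives a rough sanity check on both Example #2 results. The sketch below is not a proof: it approximates each conditional mean by averaging within five equal-width bins, using many more draws than the plot above so that the bin averages are stable. The bin count and object names are arbitrary choices of mine.

```r
set.seed(1983)
n_sims <- 1e5
w <- runif(n_sims, -1, 1)
z <- w^2

# Approximate E(W | Z): average W within bins of Z
z_bins <- cut(z, breaks = seq(0, 1, length.out = 6), include.lowest = TRUE)
mean_w_by_z <- tapply(w, z_bins, mean)

# Approximate E(Z | W): average Z within bins of W
w_bins <- cut(w, breaks = seq(-1, 1, length.out = 6), include.lowest = TRUE)
mean_z_by_w <- tapply(z, w_bins, mean)

round(mean_w_by_z, 3)  # all close to E(W) = 0
round(mean_z_by_w, 3)  # large in the outer bins, small in the middle bin
```

The average of $W$ barely moves across bins of $Z$, whereas the average of $Z$ traces out a parabola across bins of $W$.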

Statistical Independence

When you see the word “independent” without any qualification, this means “statistically independent.” In keeping with this usage, I often write “independent” rather than “statistically independent.” Whichever terminology you prefer, there are three equivalent ways of defining this idea:

$X$ and $Y$ are statistically independent if and only if:

their joint distribution equals the product of their marginals, or

the conditional distribution of $Y|X$ equals the unconditional distribution of $Y$, or

the conditional distribution of $X|Y$ equals the unconditional distribution of $X$.

The link between these three alternatives is the definition of conditional probability. Suppose that $X$ and $Y$ are discrete random variables with joint pmf $p_{XY}$, marginal pmfs $p_X$ and $p_Y$, and conditional pmfs $p_{X|Y}$ and $p_{Y|X}$. Version 1 requires that $p_{XY}(x,y) = p_X(x) p_Y(y)$ for all realizations $x,y$. But by the definition of conditional probability, \[ p_{X|Y}(x|y) \equiv \frac{p_{XY}(x,y)}{p_Y(y)}, \quad p_{Y|X}(y|x) \equiv \frac{p_{XY}(x,y)}{p_X(x)}. \] If $p_{XY} = p_X p_Y$, these expressions simplify to \[ p_{X|Y}(x|y) \equiv \frac{p_{X}(x)p_Y(y)}{p_Y(y)} = p_X(x), \quad p_{Y|X}(y|x) \equiv \frac{p_{X}(x)p_Y(y)}{p_X(x)} = p_Y(y) \] so 1 implies 2 and 3. Similarly, if $p_{X|Y}=p_X$ then by the definition of conditional probability \[ p_{X|Y}(x|y) \equiv \frac{p_{XY}(x,y)}{p_Y(y)} = p_X(x). \] Re-arranging, this shows that $p_{XY} = p_X p_Y$, so 3 implies 1. An almost identical argument shows that 2 implies 1, completing our proof that these three seemingly different definitions of statistical independence are equivalent. If $X$ and $Y$ are continuous, the idea is the same but with densities replacing probability mass functions, e.g. $f_{XY}(x,y) = f_X(x) f_Y(y)$ and so on.

In most examples, it’s easier to show independence (or the lack thereof) using 2 or 3 rather than 1. These latter two definitions are also more intuitively appealing. To say that the conditional distribution of $X|Y$ is the same as the unconditional distribution of $X$ is the same thing as saying that

$Y$ provides absolutely no information about $X$ whatsoever.

If learning $Y$ tells us anything at all about $X$, then $X$ and $Y$ are not independent. Similarly, if $X$ tells us anything about $Y$ at all, then $X$ and $Y$ are not independent.

Example #1: $X$ and $Y$ are NOT independent.

If I tell you that $X = 0$, then you know for sure that $Y = 1$. Before I told you this, you did not know that $Y$ would equal one: it’s a random variable with support set $\{0,1\}$. Since learning $X$ has the potential to tell you something about $Y$, $X$ and $Y$ are not independent. That was easy! For extra credit, $p_{XY}(-1,0) = 1/3$ but $p_X(-1)p_Y(0) = 1/3 \times 2/3 = 2/9$. Since these are not equal, $p_{XY}\neq p_X p_Y$ so the joint doesn’t equal the product of the marginals. We didn’t need to check this, but it’s reassuring to see that everything works out as it should.
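The “extra credit” check can be carried out for every cell of the table at once. As before, the joint pmf below is my reconstruction from the probabilities quoted in the post, and the object names are illustrative.

```r
# Joint pmf for Example #1: rows are x in {-1, 0, 1}, columns are y in {0, 1}
p_xy <- matrix(c(1/3, 0,
                 0,   1/3,
                 1/3, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(x = c("-1", "0", "1"), y = c("0", "1")))

p_x <- rowSums(p_xy)  # marginal pmf of X
p_y <- colSums(p_xy)  # marginal pmf of Y

# What the joint pmf would have to be if X and Y were independent
product_of_marginals <- outer(p_x, p_y)

p_xy
product_of_marginals

# Independence requires equality in every cell
is_independent <- isTRUE(all.equal(p_xy, product_of_marginals,
                                   check.attributes = FALSE))
is_independent
```

A single mismatched cell is enough to rule out independence, so comparing the two tables entry by entry settles the question.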

Example #2: $W$ and $Z$ are NOT independent.

Again, this one is easy: learning that $W = w$ tells us that $Z = w^2$. We didn’t know this before, so $W$ and $Z$ cannot be independent.

Relating the Three Properties

Now that we’ve described uncorrelatedness, mean independence, and statistical independence, we’re ready to see how these properties relate to one another. Let’s start by reviewing what we learned from the examples given above. In example #1:

$X$ and $Y$ are uncorrelated
$X$ is mean independent of $Y$
$Y$ is not mean independent of $X$
$X$ and $Y$ are not independent.

In example #2, we found that

$W$ and $Z$ are uncorrelated
$W$ is mean independent of $Z$
$Z$ is not mean independent of $W$
$W$ and $Z$ are not independent.

These are worth remembering, because they are relatively simple and provide a source of counterexamples to help you avoid making tempting but incorrect statements about correlation, mean independence, and statistical independence. For example:

Uncorrelatedness does NOT IMPLY statistical independence: $X$ and $Y$ are not independent, but they are uncorrelated. (Ditto for $W$ and $Z$.)
Mean independence does NOT IMPLY statistical independence: $W$ is mean independent of $Z$ but these random variables are not independent.
Mean independence is NOT SYMMETRIC: $X$ is mean independent of $Y$, but $Y$ is not mean independent of $X$.

Now that we have a handle on what’s not true, let’s see what can be said about correlation, mean independence, and statistical independence.

Uncorrelatedness and Statistical Independence are Symmetric

In the equality $\mathbb{E}(XY) = \mathbb{E}(X) \mathbb{E}(Y)$, nothing changes if we swap the roles of $X$ and $Y$; this statement is equivalent to $\mathbb{E}(YX) = \mathbb{E}(Y) \mathbb{E}(X)$. This shows that uncorrelatedness is symmetric. The same goes for statistical independence: we showed that $p_{Y|X} = p_Y$ is equivalent to $p_{X|Y} = p_X$ above. In contrast, mean independence is not symmetric: $X$ can be mean independent of $Y$ without $Y$ being mean independent of $X$.

Here’s an analogy: uncorrelatedness and independence are like the relation “being biological siblings.” If $X$ is the sibling of $Y$, then $Y$ must be the sibling of $X$ because “being siblings” is defined as “having the same parents.” In contrast, mean independence is like the relation “being in love.” Sadly, it’s possible for $X$ to be in love with $Y$ despite $Y$ not being in love with $X$.⁴

Statistical Independence Implies Conditional Mean Independence

Statistical independence is the “strongest” of the three properties: it implies both mean independence and uncorrelatedness. We’ll show this in two steps. In the first step, we’ll show that statistical independence implies mean independence. In the second step we’ll show that mean independence implies uncorrelatedness. Then we’ll bring this overly-long blog post to a close! Suppose that $X$ and $Y$ are discrete random variables. (For the continuous case, replace sums with integrals.) If $X$ is statistically independent of $Y$, then $p_{Y|X} = p_Y$ and $p_{X|Y} = p_X$. Hence, \[ \begin{aligned} \mathbb{E}(Y|X=x) &\equiv \sum_{\text{all } y} y \cdot p_{Y|X}(y|x) = \sum_{\text{all } y} y \cdot p_Y(y) \equiv \mathbb{E}(Y)\\ \mathbb{E}(X|Y=y) &\equiv \sum_{\text{all } x} x \cdot p_{X|Y}(x|y) = \sum_{\text{all } x} x \cdot p_X(x) \equiv \mathbb{E}(X) \end{aligned} \] so $Y$ is mean independent of $X$ and $X$ is mean independent of $Y$.

Conditional Mean Independence Implies Uncorrelatedness

If either $\mathbb{E}(Y|X) = \mathbb{E}(Y)$ or $\mathbb{E}(X|Y) = \mathbb{E}(X)$, then $X$ and $Y$ are uncorrelated. To show this, we use the Law of Iterated Expectations and the “taking out what is known” property, along with the fact that $\mathbb{E}(X)$ and $\mathbb{E}(Y)$ are constants. Suppose first that $Y$ is mean independent of $X$, i.e. $\mathbb{E}(Y|X) = \mathbb{E}(Y)$. Then, taking iterated expectations over $X$, \[ \mathbb{E}(XY) = \mathbb{E}[\mathbb{E}(XY|X)] = \mathbb{E}[X \mathbb{E}(Y|X)] = \mathbb{E}[X \mathbb{E}(Y)] = \mathbb{E}(X) \mathbb{E}(Y). \] Alternatively, suppose that $X$ is mean independent of $Y$, i.e. $\mathbb{E}(X|Y) = \mathbb{E}(X)$. Then, taking iterated expectations over $Y$, \[ \mathbb{E}(XY) = \mathbb{E}[\mathbb{E}(XY|Y)] = \mathbb{E}[Y\mathbb{E}(X|Y)] = \mathbb{E}[Y \mathbb{E}(X)] = \mathbb{E}(Y) \mathbb{E}(X). \] Therefore, if either $X$ is mean independent of $Y$, or $Y$ is mean independent of $X$, or both, then $X$ and $Y$ are uncorrelated. Since statistical independence implies mean independence, it follows that statistical independence implies uncorrelatedness. And we’re finally done!

Summary

In this post we have shown that:

Statistical Independence $\implies$ Mean Independence $\implies$ Uncorrelatedness.
Uncorrelatedness does not imply mean independence or statistical independence.
Mean independence does not imply statistical independence.
Statistical independence and uncorrelatedness are symmetric; mean independence is not.

Reading the figure from the very beginning of this post from top to bottom: statistical independence is the strongest notion, followed by mean independence, followed by uncorrelatedness.

It turns out that teaching well is extremely hard. I am incredibly grateful to those intrepid souls who bravely raise their hand and inform me that no one in the room has any idea what I’m talking about!↩︎
I used a small number of simulation draws so it would be easier to see the data in the plot. If you use a larger number of simulations, the correlation will be even closer to zero and the line almost perfectly flat.↩︎
Throughout this post, I make the tacit assumption that all means–conditional or unconditional–exist and are finite.↩︎
But on the plus side, we got a lot of great pop songs out of the deal!↩︎

From the Poisson Distribution to Stirling's Approximation

Fri, 18 Nov 2022 00:00:00 +0000

The Poisson distribution is the most famous probability model for counts, i.e. non-negative integer values. Many real-world phenomena are well approximated by this distribution, including the number of German bombs that landed in 1/4km grid squares in south London during WWII. Formally, we say that a discrete random variable $X$ follows a Poisson distribution with rate parameter $\mu > 0$, abbreviated $X \sim \text{Poisson}(\mu)$, if $X$ has support set $\{0, 1, 2, ...\}$ and probability mass function \[ p(x) \equiv \mathbb{P}(X=x) = \frac{e^{-\mu }\mu^x}{x!}. \] Using some clever algebra with sums it’s not too hard to show that the rate parameter, $\mu$, is both the mean and the variance of $X$.
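If you’d like to check the mean and variance claim numerically before doing the algebra, here is a small sketch. It assumes an arbitrary illustrative rate of $\mu = 4$ and truncates the infinite sums at 200, far enough into the tail that the omitted terms are negligible.

```r
mu <- 4       # illustrative rate parameter
x  <- 0:200   # truncation point: the pmf is negligible this far above the mean
p  <- dpois(x, mu)

poisson_mean <- sum(x * p)                     # probability-weighted average
poisson_var  <- sum((x - poisson_mean)^2 * p)  # probability-weighted squared deviation

c(mean = poisson_mean, variance = poisson_var)
```

Changing `mu` to any other moderate positive value should give the same pattern: both numbers match the rate.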

Numerical problems? Try taking logs.

Now, suppose that we wanted to plot the pmf of a Poisson RV with rate $\mu = 171$. The R function for the pmf of a Poisson RV is dpois(), so we can make our plot as follows (indicating the rate parameter as a vertical line)

library(tidyverse)
tibble(x = 0:300) %>%
  mutate(p = dpois(x, 171)) %>%
  ggplot(aes(x, p)) +
  geom_point() +
  geom_vline(xintercept = 171) +
  ylab('Poisson(171) pmf')

For such a large value of $\mu$, this distribution looks decidedly bell-shaped. And indeed, it turns out to be extremely well-approximated by a normal distribution, as we’ll see below. It’s also clear that $X$ is most likely to take on a value relatively close to 171. We can use dpois() to calculate the exact probability that $X = 171$ as follows: the answer is just over 3%.

dpois(171, 171)

## [1] 0.03049301

Now let’s try to calculate exactly the same probability by hand, that is by using the formula for the Poisson pmf from above.

my_dpois <- function(x, mu) {
  exp(-mu) * mu^x / factorial(x)
}
my_dpois(171, 171)

## [1] NaN

What gives?! The abbreviation NaN stands for “not a number.” The problem in this case is that both the numerator and denominator of the fraction inside of my_dpois() evaluate to infinity when mu and x are 171, and the ratio $\infty/\infty$ is undefined.¹

c(numerator = exp(-171) * 171^171, denominator = factorial(171))

## numerator denominator
## Inf Inf

As I discussed in an earlier post, computers can only store a finite number of distinct numeric values. It’s not literally true that factorial(171) equals $\infty$. What’s really going on here is that factorial(171) is such a large number that it can’t be stored as a floating-point number. In this case there’s a very simple fix. If you haven’t seen this trick before, it’s a helpful one to keep up your sleeve: if you run into numerical problems with very large or very small values, try taking logs.² The log of the Poisson pmf is simply \[ \log p(x) = -\mu + x \log(\mu) - \log(x!). \] R even has a convenient, built-in function for evaluating the natural log of a factorial: lfactorial(). Now we can compute the log of our desired probability as follows:

-171 + 171 * log(171) - lfactorial(171)

## [1] -3.490258

To obtain the probability, simply exponentiate:

exp(-171 + 171 * log(171) - lfactorial(171))

## [1] 0.03049301
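If you expect to do this more than once, it may be convenient to wrap the log-scale calculation in a function. The sketch below does just that; the name `my_dpois_log` is my own invention, and the comparison against dpois() is only a sanity check.

```r
# Poisson pmf computed on the log scale and exponentiated only at the very end
my_dpois_log <- function(x, mu) {
  exp(-mu + x * log(mu) - lfactorial(x))
}

my_dpois_log(171, 171)

# Compare with R's built-in pmf over the whole range plotted earlier
all.equal(my_dpois_log(0:300, 171), dpois(0:300, 171))
```

If all you need is the log probability itself, dpois() also accepts `log = TRUE`, which avoids the final exponentiation altogether.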

Of course this just passes the buck to lfactorial(). So how does this mysterious function work? The bad news is that I’m not going to tell you; the good news is that I’m going to show you something even better, namely Stirling’s approximation: a way to understand how $n!$ behaves qualitatively that turns out to give a pretty darned good approximation to lfactorial(). This may seem like an odd topic for a blog devoted to econometrics and statistics, so allow me to offer a few words of justification. First, computations involving $n!$ come up all the time in applied work. Second, it can be extremely helpful for certain theoretical arguments to have good approximations to $n!$ for large values of $n$. Finally, and most importantly from my perspective, the heuristic argument I’ll use below relies on none other than the central limit theorem. So even if you’ve seen a more traditional proof of Stirling’s approximation, I hope you’ll enjoy this alternative approach.³

Stirling’s Approximation

The key step in our argument is to show that the pmf of a $\text{Poisson}(\mu)$ random variable is well-approximated by the $\text{Normal}(\mu, \mu)$ density. This explains the bell-shaped curve that we plotted above. To obtain this result, we’ll use the central limit theorem. But there is one fact that you will need to take on faith if you don’t already know it: if $X_1 \sim \text{Poisson}(\mu_1)$ is independent of $X_2 \sim \text{Poisson}(\mu_2)$ then $X_1 + X_2 \sim \text{Poisson}(\mu_1 + \mu_2)$. Proceeding by induction we can view a Poisson(171) random variable as the sum of 171 independent Poisson(1) random variables. More generally, we can view a Poisson RV with rate parameter $n$ as the sum of $n$ iid Poisson(1) random variables. By the central limit theorem, it follows that \[ \sqrt{n}(\bar{X}_n - 1) \rightarrow_d \text{N}(0,1) \] since the mean and variance of a Poisson(1) RV are both equal to one. From a practical perspective, this means that $\sqrt{n}(\bar{X}_n - 1)$ is approximately equal to $Z$, a standard normal random variable. Re-arranging, \[ X_1 + X_2 + ... + X_n = n\bar{X}_n = n + \sqrt{n} \times [\sqrt{n}(\bar{X}_n - 1)] \approx n + \sqrt{n} Z \] and $n + \sqrt{n} Z$ is simply a $\text{N}(n, n)$ random variable! This is a quick way of seeing why the $\text{Poisson}(\mu)$ distribution is well-approximated by the $\text{N}(\mu, \mu)$ distribution when $\mu$ is large.
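To get a feel for how good this approximation is, we can put the Poisson(171) pmf next to the N(171, 171) density. The sketch below is only a rough check: the window of values from 150 to 190 is an arbitrary choice of mine, covering roughly a standard deviation and a half on either side of the mean.

```r
mu <- 171
x  <- 150:190  # arbitrary window around the mean

poisson_pmf    <- dpois(x, mu)
normal_density <- dnorm(x, mean = mu, sd = sqrt(mu))

# Ratio of pmf to density exactly at the mean
ratio_at_mean <- dpois(mu, mu) / dnorm(mu, mean = mu, sd = sqrt(mu))

# Largest relative discrepancy anywhere in the window
max_rel_gap <- max(abs(poisson_pmf / normal_density - 1))

c(ratio_at_mean = ratio_at_mean, max_rel_gap = max_rel_gap)
```

The ratio at the mean should be very close to one, with the discrepancy growing as we move into the tails, where the skewness of the Poisson distribution starts to matter.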

Now let’s run with this. As we just saw, for large $\mu$ the Poisson$(\mu)$ pmf is well-approximated by the Normal$(\mu, \mu)$ density: \[ \frac{e^{-\mu}\mu^x}{x!} \approx \frac{1}{\sqrt{2\pi \mu}} \exp\left\{ -\frac{1}{2}\left( \frac{x - \mu}{\sqrt{\mu}}\right)^2\right\} \] This approximation is particularly accurate for $x$ near the mean. This is convenient, because substituting $\mu$ for $x$ considerably simplifies the right hand side: \[ \frac{e^{-\mu}\mu^\mu}{\mu!} \approx \frac{1}{\sqrt{2\pi\mu}} \] Re-arranging, we obtain \[ \mu! \approx \mu^\mu e^{-\mu} \sqrt{2 \pi \mu} \] Taking logs of both sides gives: \[ \log(\mu!) \approx \mu \log(\mu) - \mu + \frac{1}{2} \log(2 \pi \mu) \] Writing this with $n$ in place of $\mu$ gives the following: \[ \log(n!) \approx n \log(n) - n + \frac{1}{2} \log(2 \pi n) \] This is called Stirling’s Approximation. The usual way of writing this excludes the $\log(2\pi n)/2$ term, yielding $\log(n!) \approx n\log(n) - n$, which is fairly easy to remember. Including the extra term, however, gives increased accuracy for smaller values of $n$. While I haven’t formally proved this, it turns out that \[ \log(n!) \sim n \log(n) - n + \frac{1}{2} \log(2 \pi n) \] as $n \rightarrow \infty$. In other words, the ratio of the LHS and RHS tends to one in the large $n$ limit. Perhaps surprisingly, this approximation is extremely accurate even for fairly small values of $n$, as we can see by comparing it against lfactorial().

stirling1 <- function(n) n * log(n) - n
stirling2 <- function(n) n * log(n) - n + 0.5 * log(2 * pi * n)
tibble(n = 1:20) %>%
  mutate(Stirling1 = stirling1(n),
         Stirling2 = stirling2(n),
         R = lfactorial(n)) %>%
  knitr::kable(digits = 3)

n	Stirling1	Stirling2	R
1	-1.000	-0.081	0.000
2	-0.614	0.652	0.693
3	0.296	1.764	1.792
4	1.545	3.157	3.178
5	3.047	4.771	4.787
6	4.751	6.565	6.579
7	6.621	8.513	8.525
8	8.636	10.594	10.605
9	10.775	12.793	12.802
10	13.026	15.096	15.104
11	15.377	17.495	17.502
12	17.819	19.980	19.987
13	20.344	22.546	22.552
14	22.947	25.185	25.191
15	25.621	27.894	27.899
16	28.361	30.667	30.672
17	31.165	33.500	33.505
18	34.027	36.391	36.395
19	36.944	39.335	39.340
20	39.915	42.331	42.336
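The gap between the last two columns of the table is small and shrinks as $n$ grows. As a side note that I have not derived in this post, the next term in the standard Stirling series is $1/(12n)$, so we should expect the remaining error to be roughly that size. Here is a quick sketch comparing the two; `approx_error` is a name I made up for the purpose.

```r
# Stirling's approximation including the log(2 * pi * n) / 2 term
stirling2 <- function(n) n * log(n) - n + 0.5 * log(2 * pi * n)

n <- 1:20
approx_error <- lfactorial(n) - stirling2(n)  # what the approximation misses

# Compare the error with the leading correction term of the Stirling series
round(cbind(n, approx_error, one_over_12n = 1 / (12 * n)), 4)
```

The two columns should track each other closely, which also explains why the approximation is already respectable at single-digit values of $n$.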

Epilogue

I have a bad habit of trying to add a “moral” or “lesson” to the end of my posts, but I suppose there’s no point trying to break the habit today! While there are easier ways to derive Stirling’s approximation, there are two things I enjoy about this one. First, we get a more accurate approximation than $n \log(n) - n$ with practically no effort. Second, making unexpected connections between facts that we already know both deepens our understanding and helps us “compress” information. If you ever forget Stirling’s approximation, now you know how to very quickly re-derive it on the spot!

There’s an important but subtle difference between NA and NaN. The former is synonymous with “missing.” If x equals NA this means “we don’t know the value of x.” If instead x equals NaN, this means “x isn’t missing, but it’s not a well-defined numeric value either.”↩︎
Unless otherwise specified log always means “natural logarithm” on this blog :)↩︎
I first came across this argument from the late David MacKay’s fantastic book Information Theory, Inference, and Learning Algorithms. His book on sustainable energy, while a bit out-of-date at this point, is also spectacularly good.↩︎

Local Asymptotics: The Simplest Possible Example

Sat, 12 Nov 2022 00:00:00 +0000

If you study enough econometrics, you will eventually come across an asymptotic argument in which some parameter is assumed to change with sample size. This peculiar notion goes by a variety of names including “Pitman drift,” a “sequence of local alternatives,” and “local mis-specification,” and crops up in a wide range of problems from weak instruments, to model selection, to power analysis.¹ Whatever you choose to call it, the idea of a parameter that changes with sample size is bizarre, and I remember spending weeks trying to understand it when I was a graduate student. How could parameters, fixed quantities that we’re trying to estimate, possibly know anything about our sample size?

Do we expect parameters to be smaller when we have more data? Do we expect them to be larger when we have less data? The answer to both questions is a resounding NO. Like all asymptotics, what I will call local asymptotics are nothing more than a thought experiment that we set up for mathematical convenience. Ideally we would derive finite sample results for every problem of interest, but this is rarely possible in practice. For this reason we turn to asymptotic results, such as the central limit theorem. Sometimes this works out OK, and sometimes it’s a disaster. The goal of local asymptotics is to derive results that more closely approximate the finite sample behavior that we can understand from simple examples, in the hope that this will lead to better approximations in more complicated problems. In this post, I’ll illustrate the usefulness of local asymptotics in the simplest example I could think of: a one-sided test for the mean of a normal distribution with known variance. No advanced statistics or econometrics are used below, so even if you found the preceding paragraph off-putting give the rest a go: you may be pleasantly surprised!

Suppose that we observe \[ X_1, X_2, \dots, X_{n} \overset{iid}{\sim}N(\mu, 1) \] and want to test $H_0\colon \mu = 0$ against the one-sided alternative $H_1\colon \mu >0$. In this admittedly very simple example, the Econometrics 101 test statistic is \[ T_{n} = \sqrt{n} \bar{X}_{n} \sim N\left(\mu \sqrt{n}, 1\right) \] where $\bar{X}_{n}$ is the sample mean. We reject when $\sqrt{n} \bar{X}_{n}>z_{1-\alpha}$ where $z_{1-\alpha}$ is the $1-\alpha$ quantile of a standard normal distribution. Let’s calculate the power of this test: the probability of rejecting the null hypothesis given that it is false. We find that \[ \begin{eqnarray*} \mbox{Power}(T_{n}) &=& P\left(\sqrt{n} \bar{X}_{n}>z_{1-\alpha}\right) = P\left(Z + \mu\sqrt{n} >z_{1-\alpha}\right)\\ &=&P\left(Z >z_{1-\alpha} - \mu\sqrt{n}\right) = 1 - \Phi\left(z_{1-\alpha} - \mu\sqrt{n}\right) \end{eqnarray*} \] where $Z$ is a standard normal random variable and $\Phi$ is the standard normal CDF.

Now suppose we decided to do something completely crazy: throw away half our sample. Let $\bar{X}_{n/2}$ denote the sample mean based on observations $1, 2, \dots, \lfloor n/2 \rfloor$ only, where $\lfloor x \rfloor$ denotes the floor function, i.e. the greatest integer less than or equal to $x$.² We can still construct a perfectly valid test with size $\alpha$ as follows. Define \[ T_{n/2} = \sqrt{\lfloor n/2 \rfloor } \bar{X}_{n/2} \sim N\left(\mu \sqrt{\lfloor n/2 \rfloor }, 1\right) \] and reject if $T_{n/2} > z_{1-\alpha}$. But there’s an obvious problem here: there must be a cost for throwing away perfectly good data. Indeed, if we calculate the power for this crazy test, we’ll find that it’s strictly lower than that of the sensible test based on the full sample. In particular, \[\mbox{Power}(T_{n/2}) = 1 - \Phi\left(z_{1-\alpha} - \mu\sqrt{\lfloor n/2 \rfloor }\right)\] using the same argument as above with $\lfloor n/2 \rfloor$ in place of $n$. Unsurprisingly, it turns out to be a bad idea to throw away half of your data!
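The two power formulas are simple enough to code up directly. The sketch below uses pnorm() for $\Phi$ and qnorm() for $z_{1-\alpha}$; the function names, the choice $\alpha = 0.05$, and the alternative $\mu = 0.2$ are all hypothetical values picked for illustration.

```r
# Power of the full-sample test against a fixed alternative mu
power_full <- function(mu, n, alpha = 0.05) {
  1 - pnorm(qnorm(1 - alpha) - mu * sqrt(n))
}

# Power of the test that throws away half of the sample
power_half <- function(mu, n, alpha = 0.05) {
  1 - pnorm(qnorm(1 - alpha) - mu * sqrt(floor(n / 2)))
}

mu <- 0.2                   # hypothetical fixed alternative
n  <- c(10, 50, 100, 500)   # hypothetical sample sizes

round(cbind(n, full = power_full(mu, n), half = power_half(mu, n)), 3)
```

At every sample size the full-sample test has higher power, and setting `mu = 0` in either function returns $\alpha$, confirming that both tests have the correct size.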

Now, for an example this simple we’d never resort to asymptotics, but suppose we did. How do these two tests compare as the sample size goes to infinity? The asymptotic size in this example is the same as the finite-sample size since we know the exact sampling distribution of the test statistics under the null and neither depends on sample size. But what about the power? We have, \[ \begin{eqnarray*} \lim_{n\rightarrow \infty} \mbox{Power}(T_{n}) &=& \lim_{n\rightarrow \infty}\left[1 - \Phi\left(z_{1-\alpha} - \mu\sqrt{n}\right) \right] = 1\\ \lim_{n\rightarrow \infty} \mbox{Power}(T_{n/2}) &=& \lim_{n\rightarrow \infty}\left[1 - \Phi\left(z_{1-\alpha} - \mu\sqrt{\lfloor n/2 \rfloor }\right) \right] = 1 \end{eqnarray*} \] In other words, both of these tests are consistent: as the sample size goes to infinity, the power goes to one. Think about this for a moment: we know that for any fixed sample size a test based on the full sample is strictly more powerful but in the limit this difference disappears. This strongly suggests that something is wrong with comparing two tests on the basis of their asymptotic power. Clearly the second test is worse than the first, but the asymptotics obscure this.

You might object that I’ve cooked up a particularly perverse example, but it turns out that this phenomenon is quite general. It’s easy to find consistent tests; in fact, it’s difficult to find tests that aren’t consistent. But we know from simulation studies that not all consistent tests are created equal: some have much better finite sample power than others and it’s ultimately finite sample performance that we care about. One way around this problem would be to only compare the finite-sample properties of different tests and never use asymptotics. But we almost never know the exact sampling distribution of our test statistics.³

This is where local alternatives come in. Rather than evaluating our tests against a fixed alternative $\mu$, suppose we were to evaluate them against a sequence of local alternatives that drift towards the null at rate $n^{-1/2}$. In other words, our alternative becomes $H_{1,n} \colon \mu = \delta / \sqrt{n}$ where, for this one-sided test, $\delta > 0$. If we substitute $\delta/\sqrt{n}$ for $\mu$ and take the limit as $n\rightarrow \infty$, we find \[ \begin{eqnarray*} \lim_{n\rightarrow \infty} \mbox{Power}(T_{n}) &=& \lim_{n\rightarrow \infty}\left[1 - \Phi\left(z_{1-\alpha} - \frac{\delta}{\sqrt{n}}\sqrt{n}\right) \right]\\ &=& 1 - \Phi\left(z_{1-\alpha} - \delta \right) \end{eqnarray*} \] and similarly \[ \begin{eqnarray*} \lim_{n\rightarrow \infty} \mbox{Power}(T_{n/2}) &=& \lim_{n\rightarrow \infty}\left[1 - \Phi\left(z_{1-\alpha} - \frac{\delta}{\sqrt{n}}\sqrt{\lfloor n/2 \rfloor }\right) \right]\\ &=& 1 - \Phi\left(z_{1-\alpha} - \frac{\delta}{\sqrt{2}} \right) \end{eqnarray*} \] Wow! Our problem has disappeared! The asymptotic power of the two tests now differs in essentially the same way as the finite sample power. Also note that the power no longer converges to one. Intuitively, this is because the drifting sequence of alternatives $\delta/\sqrt{n}$ shrinks towards the null just fast enough to offset the extra information in a growing sample, but not so fast that the power collapses to the size of the test, $\alpha$. This type of calculation is called a local power analysis. A test whose asymptotic power exceeds its size $\alpha$ in such a setting is said to have “power against local alternatives.”
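Here is a small numerical sketch of the local power calculation. The function `local_power` and its argument `m` (the number of observations the test actually uses) are my own constructions, and $\delta = 2$ and $\alpha = 0.05$ are hypothetical values. I use odd sample sizes so that $\lfloor n/2 \rfloor / n$ is not exactly one half and the convergence is visible.

```r
# Power against the local alternative mu_n = delta / sqrt(n),
# for a test based on m of the n observations
local_power <- function(delta, n, m = n, alpha = 0.05) {
  mu_n <- delta / sqrt(n)
  1 - pnorm(qnorm(1 - alpha) - mu_n * sqrt(m))
}

delta <- 2                         # hypothetical local parameter
n <- c(11, 101, 10001, 1000001)    # odd sample sizes

full <- local_power(delta, n)                    # test using all n observations
half <- local_power(delta, n, m = floor(n / 2))  # test using half of them

# Limiting values from the two displayed formulas
limit_full <- 1 - pnorm(qnorm(0.95) - delta)
limit_half <- 1 - pnorm(qnorm(0.95) - delta / sqrt(2))

round(cbind(n, full, half), 4)
c(limit_full = limit_full, limit_half = limit_half)
```

The full-sample column is constant in $n$ because the $\sqrt{n}$ terms cancel exactly, while the half-sample column settles down to a strictly smaller limit.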

This example was a bit silly since we already knew the answer. But this is precisely what made it so obvious that local asymptotics make more sense in this setting than fixed-parameter asymptotics. Now that you understand this basic intuition, I hope you’ll feel more confident tackling examples of local asymptotics that come up in the econometrics literature.

I’ve used this idea in some of my own work on moment selection and model selection.↩︎
If you don’t like this, ignore it and pretend that $n$ is an even number!↩︎
This blog post is a very special exception created for pedagogical purposes :)↩︎

Three Ways of Thinking About Instrumental Variables

Sun, 06 Nov 2022 00:00:00 +0000

In this post we’ll examine a very simple instrumental variables model from three different perspectives: two familiar and one a bit more exotic. While all three yield the same solution in this particular model, they lead in different directions in more complicated examples. Crucially, each gives us a different way of thinking about the problem of endogeneity and how to solve it.

The Setup

Consider a simple linear causal model of the form $Y \leftarrow \alpha + \beta X + U$ where $X$ is endogenous, i.e. related to the unobserved random variable $U$. Our goal is to learn $\beta$, the causal effect of $X$ on $Y$. To take a simple example, suppose that $Y$ is wage and $X$ is years of schooling. Then $\beta$ is the causal effect of one additional year of schooling on a person’s wage. The random variable $U$ is a catchall, representing all unobserved causes of wage, such as ability, family background, and so on. A linear regression of $Y$ on $X$ will not allow us to learn $\beta$. For example, if you’re very smart, you will probably find school easier and stay in school longer. But being smarter likely has its own effect on your wage, separate from years of education. Ability is a confounder because it causes both years of schooling and wage.

Now suppose that $Z$ is an instrumental variable: something that is uncorrelated with $U$ (exogenous) but correlated with $X$ (relevant). For example, a very famous paper pointed out that quarter of birth is correlated with years of schooling in the US and argued that it is unrelated to other causes of wages. Finding a good instrumental variable is very hard in practice. Indeed, I remain skeptical that quarter of birth is really unrelated to $U$. But that’s a conversation for another day. For the moment, suppose we have a bona fide exogenous and relevant instrument at our disposal. To make things even simpler, suppose that the true causal effect $\beta$ is homogeneous, i.e. the same for everyone.

1st Perspective: The IV Approach

Regress $Y$ on $Z$ to find the causal effect of $Z$ on $Y$. Rescale it to obtain the causal effect of $X$ on $Y$.

If $Z$ is a valid and relevant instrument, then \[ \beta_{\text{IV}} \equiv \frac{\text{Cov}(Z,Y)}{\text{Cov}(Z,X)} = \frac{\text{Cov}(Z, \alpha + \beta X + U)}{\text{Cov}(Z,X)} = \frac{\beta\text{Cov}(Z,X) + \text{Cov}(Z,U)}{\text{Cov}(Z,X)} = \beta \] which is precisely the causal effect we’re after! The ratio of $\text{Cov}(Z,Y)$ to $\text{Cov}(Z,X)$ is called the instrumental variables (IV) estimand, but it seems to come out of nowhere. A more intuitive way to write this quantity divides the numerator and denominator by $\text{Var}(Z)$ to yield \[ \beta_{\text{IV}} \equiv \frac{\text{Cov}(Z,Y)}{\text{Cov}(Z,X)} = \frac{\text{Cov}(Y,Z)/\text{Var}(Z)}{\text{Cov}(X,Z)/\text{Var}(Z)} \equiv \frac{\gamma}{\pi}. \] We see that $\beta_{\text{IV}}$ is the ratio of two linear regression slopes: the slope $\gamma$ from a regression of $Y$ on $Z$ divided by the slope $\pi$ from a regression of $X$ on $Z$. This makes intuitive sense if we think about units. Because $Z$ is unrelated to $U$, $\gamma$ gives the causal effect of $Z$ on $Y$. If $Y$ is measured in dollars and $Z$ is measured in miles (e.g. distance to college), then $\gamma$ is measured in dollars per mile. If $X$ is years of schooling, then $\beta$ should be measured in dollars per year. To convert from dollars/mile to dollars/year, we need to multiply by miles/year or equivalently to divide by years/mile. And indeed, $\pi$ is measured in years/mile as required! This is yet another example of my favorite maxim: most formulas in statistics and econometrics are obvious if you keep track of the units.

2nd Perspective: The TSLS Approach

Construct $\tilde{X}$ by using $Z$ to “clean out” the part of $X$ that is correlated with $U$. Then regress $Y$ on $\tilde{X}$.

Let $\delta$ be the intercept and $\pi$ be the slope from a population linear regression of $X$ on $Z$. Defining $V \equiv X - \delta - \pi Z$, we can write \[ X = \tilde{X} + V, \quad \tilde{X} \equiv \delta + \pi Z, \quad \pi \equiv \frac{\text{Cov}(X,Z)}{\text{Var}(Z)}, \quad \delta \equiv \mathbb{E}(X) - \pi\mathbb{E}(Z). \] By definition $\tilde{X} \equiv \delta + \pi Z$ is the best linear predictor of $X$ based on $Z$, in that $\delta$ and $\pi$ solve the optimization problem \[ \min_{a, b} \mathbb{E}[(X - a - bZ)^2]. \] What’s more, $\text{Cov}(Z,V) = 0$ by construction since: \[ \begin{align*} \text{Cov}(Z,V) &= \text{Cov}(Z, X - \delta - \pi Z) = \text{Cov}(Z,X) - \pi \text{Var}(Z)\\ &= \text{Cov}(Z,X) - \frac{\text{Cov}(X,Z)}{\text{Var}(Z)} \text{Var}(Z) = 0. \end{align*} \] And since $Z$ is uncorrelated with $U$, so is $\tilde{X}$: \[ \text{Cov}(\tilde{X}, U) = \text{Cov}(\delta + \pi Z, U) = \pi\text{Cov}(Z,U) = 0. \] So now we have a variable $\tilde{X}$ that is a good predictor of $X$ but is uncorrelated with $U$. In essence, we’ve used $Z$ to “clean out” the endogeneity from $X$ and we did this using a first stage regression of $X$ on $Z$. Two-stage least squares (TSLS) combines this with a second stage regression of $Y$ on $\tilde{X}$ to recover $\beta$. To see why this works, substitute $\tilde{X} +V$ for $X$ in the causal model, yielding \[ \begin{align*} Y &= \alpha + \beta X + U = \alpha + \beta (\tilde{X} + V) + U\\ &= \alpha + \beta \tilde{X} + (\beta V + U)\\ &= \alpha + \beta \tilde{X} + \tilde{U} \end{align*} \] where we define $\tilde{U} \equiv \beta V + U$. Finally, since \[ \begin{align*} \text{Cov}(\tilde{X}, \tilde{U}) &= \text{Cov}(\tilde{X}, \beta V + U)\\ &= \beta\text{Cov}(\tilde{X}, V) + \text{Cov}(\tilde{X}, U)\\ &= \beta\text{Cov}(\delta + \pi Z , V) + 0 \\ &= \beta\pi\text{Cov}(Z, V) = 0 \end{align*} \] a regression of $Y$ on $\tilde{X}$ recovers the causal effect $\beta$ of $X$ on $Y$.
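
As a quick check that this second-stage slope is the same object as the IV estimand from the first perspective, recall that a population regression of $Y$ on $\tilde{X}$ has slope $\text{Cov}(\tilde{X},Y)/\text{Var}(\tilde{X})$. Assuming $\pi \neq 0$ (relevance) and writing $\gamma \equiv \text{Cov}(Y,Z)/\text{Var}(Z)$ for the slope from a regression of $Y$ on $Z$, we have \[ \begin{align*} \frac{\text{Cov}(\tilde{X},Y)}{\text{Var}(\tilde{X})} &= \frac{\text{Cov}(\delta + \pi Z, Y)}{\text{Var}(\delta + \pi Z)} = \frac{\pi\text{Cov}(Z,Y)}{\pi^2\text{Var}(Z)}\\ &= \frac{1}{\pi} \cdot \frac{\text{Cov}(Z,Y)}{\text{Var}(Z)} = \frac{\gamma}{\pi} = \frac{\text{Cov}(Z,Y)}{\text{Cov}(Z,X)}. \end{align*} \] So, at least in this one-instrument model, the TSLS slope and the ratio $\text{Cov}(Z,Y)/\text{Cov}(Z,X)$ coincide.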

3rd Perspective: The Control Function Approach

Use $Z$ to solve for $V$, the part of $U$ that is correlated with $X$. Then regress $Y$ on $X$ controlling for $V$.

I’m willing to bet that you haven’t seen this approach before! The so-called control function approach starts from the same place as TSLS: the first-stage regression of $X$ on $Z$ from above, namely \[ X = \delta + \pi Z + V, \quad \text{Cov}(Z,V) = 0. \] Like the error term $U$ from the causal model $Y \leftarrow \alpha + \beta X + U$, the first stage regression error $V$ is unobserved. But as strange as it sounds, imagine running a regression of $U$ on $V$. Then we would obtain \[ U = \kappa + \lambda V + \epsilon, \quad \lambda \equiv \frac{\text{Cov}(U,V)}{\text{Var}(V)}, \quad \kappa \equiv \mathbb{E}(U) - \lambda \mathbb{E}(V) \] where $\text{Cov}(V, \epsilon) = 0$ by construction. Now, since the causal model for $Y$ includes an intercept, $\mathbb{E}(U) = 0$. And since the first-stage linear regression model that defines $V$ likewise includes an intercept, $\mathbb{E}(V) = 0$ as well. This means that $\kappa = 0$ so the regression of $U$ on $V$ becomes \[ U = \lambda V + \epsilon, \quad \lambda \equiv \frac{\text{Cov}(U,V)}{\text{Var}(V)} \quad \text{Cov}(V, \epsilon) = 0. \] Now, substituting for $U$ in the causal model gives \[ Y = \alpha + \beta X + U = \alpha + \beta X + \lambda V + \epsilon. \] By construction $\text{Cov}(V, \epsilon) = 0$. And since $X = \delta + \pi Z + V$, it follows that \[ \begin{align*} \text{Cov}(X,\epsilon) &= \text{Cov}(\delta + \pi Z + V, \epsilon)\\ &= \pi \text{Cov}(Z,\epsilon) + \text{Cov}(V, \epsilon) \\ &= \pi \text{Cov}(Z, U - \lambda V) + 0\\ &= \pi \left[ \text{Cov}(Z,U) - \lambda \text{Cov}(Z,V)\right] = 0. \end{align*} \] Therefore, if only we could observe $V$, a regression of $Y$ on $X$ that controls for $V$ would allow us to recover the causal effect of interest, namely $\beta$. Such a regression would also give us $\lambda$. 
To see why this is interesting, notice that \[ \begin{align*} \text{Cov}(X,U) &= \text{Cov}(\delta + \pi Z + V, U) = \pi\text{Cov}(Z,U) + \text{Cov}(V,U)\\ &= 0 + \text{Cov}(V, \lambda V + \epsilon) \\ &= \lambda \text{Var}(V). \end{align*} \] Since $\text{Var}(V) > 0$, $\lambda$ tells us the direction of endogeneity in $X$. If $\lambda >0$ then $X$ is positively correlated with $U$, if $\lambda < 0$ then $X$ is negatively correlated with $U$, and if $\lambda = 0$ then $X$ is exogenous. If $U$ is ability and ability has a positive effect on years of schooling, for example, then $\lambda$ will be positive.

Now it’s time to address the elephant in the room: $V$ is unobserved! It’s all fine and well to say that if $V$ were observed our problems would be solved, but given that it is not in fact observed what are we supposed to do? Here’s where the TSLS first stage regression comes to the rescue. Both $X$ and $Z$ are observed, so we can learn $\delta$ and $\pi$ by regressing $X$ on $Z$. Given these coefficients, we can simply solve for the unobserved error: $V = X - \delta - \pi Z$. Like TSLS, the control function approach relies crucially on the first stage regression. But whereas TSLS uses it to construct $\tilde{X} = \delta + \pi Z$, the control function approach uses it to construct $V = X - \delta - \pi Z$. We don’t replace $X$ with its exogenous component $\tilde{X}$; instead we “pull out” the component of $U$ that is correlated with $X$, namely $V$. In effect we control for the “omitted variable” $V$, hence the name control function.

Simulating the Three Approaches

Perhaps that was all a bit abstract. Let’s make it concrete by simulating some data and actually calculating estimates of $\beta$ using each of the three approaches described above. Because this exercise relies on a sample of data rather than a population, estimates will replace parameters and residuals will replace error terms.

To begin, we need to simulate $Z$ independently of $(U,V)$. For simplicity I’ll make these standard normal and set the correlation between $U$ and $V$ to 0.5.

set.seed(1983) # for replicability of pseudo-random draws
n <- 1000
Z <- rnorm(n)
library(mvtnorm)
cor_mat <- matrix(c(1, 0.5,
0.5, 1), 2, 2, byrow = TRUE)
errors <- rmvnorm(n, sigma = cor_mat)
head(errors)

## [,1] [,2]
## [1,] 0.1612255 -0.96692422
## [2,] 1.4020130 1.55818062
## [3,] 1.7212525 -0.01997204
## [4,] -0.6972637 -0.68551762
## [5,] 1.3471669 -0.01766333
## [6,] -1.0441467 -0.23113677

U <- errors[,1]
V <- errors[,2]
rm(errors)

Since this is a simulation we actually can observe $U$ and $V$ and hence could regress the one on the other. Since I set the standard deviation of both of them equal to one, $\lambda$ will simply equal the correlation between them, namely 0.5:

coef(lm(U ~ V - 1)) # exclude an intercept

## V
## 0.5047334

Excellent! Everything is working as it should. The next step is to generate $X$ and $Y$. Again to keep things simple, in my simulation I’ll set $\alpha = \delta = 0$.

pi <- 0.3
beta <- 1.1
X <- pi * Z + V
Y <- beta * X + U

Now we’re ready to run some regressions! We’ll start with an OLS regression of $Y$ on $X$. This substantially overestimates $\beta$ because $X$ is in fact positively correlated with $U$.

OLS <- coef(lm(Y ~ X))[2]
OLS

## X
## 1.567642

In contrast, the IV approach works well.

IV <- cov(Y, Z) / cov(X, Z)
IV

## [1] 1.049043

For the TSLS and control function approaches we need to run the first-stage regression of $X$ on $Z$ and store the results.

first_stage <- lm(X ~ Z)

The TSLS approach uses the fitted values of this regression as $\tilde{X}$.

Xtilde <- predict(first_stage)
TSLS <- coef(lm(Y ~ Xtilde))[2] # drop the intercept since we're not interested in it
TSLS

## Xtilde
## 1.049043

In contrast, the control function approach uses the residuals from the first stage regression. It also gives us $\lambda$ in addition to $\beta$.

Vhat <- residuals(first_stage)
CF <- coef(lm(Y ~ X + Vhat))[-1] # drop the intercept since we're not interested in it
CF # The coefficient on Vhat is lambda

## X Vhat
## 1.0490432 0.5558904

Notice that we obtain precisely the same estimates for $\beta$ using each of the three approaches.

c(IV, TSLS, CF[1])

## Xtilde X
## 1.049043 1.049043 1.049043

It turns out that in this simple linear model with a single endogenous regressor and a single instrument, the three approaches are numerically equivalent. In other words, they give exactly the same answer. This will not necessarily be true in more complicated models, so be careful!
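
For readers who want to check this equivalence without the session above, here is a self-contained sketch that uses only base R. The helper name compare_iv_estimates(), its argument names, its default seed, and the way it builds correlated errors (mixing two independent normals instead of calling mvtnorm) are all assumptions of mine, not part of the simulation above. It also computes the ratio of the two regression slopes, $\gamma/\pi$, from the first perspective.

```r
# Hypothetical helper (not from the post): simulate the model once and
# return the estimate of beta from each approach.
compare_iv_estimates <- function(n = 1000, beta = 1.1, pi_coef = 0.3,
                                 rho = 0.5, seed = 42) {
  set.seed(seed)
  Z <- rnorm(n)
  U <- rnorm(n)
  V <- rho * U + sqrt(1 - rho^2) * rnorm(n)  # Var(V) = 1, Cor(U, V) = rho
  X <- pi_coef * Z + V                       # first stage with delta = 0
  Y <- beta * X + U                          # causal model with alpha = 0

  first_stage <- lm(X ~ Z)
  Xtilde <- fitted(first_stage)              # TSLS: first-stage fitted values
  Vhat <- residuals(first_stage)             # control function: residuals

  c(IV    = cov(Y, Z) / cov(X, Z),
    ratio = coef(lm(Y ~ Z))[["Z"]] / coef(first_stage)[["Z"]],
    TSLS  = coef(lm(Y ~ Xtilde))[["Xtilde"]],
    CF    = coef(lm(Y ~ X + Vhat))[["X"]])
}

est <- compare_iv_estimates()
est
# Do the other three entries match the IV ratio up to floating-point error?
all.equal(unname(est[-1]), rep(unname(est[1]), 3))
```

Changing the seed or the parameter values changes the estimates themselves, but in this just-identified design the four numbers should continue to agree with one another.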

Epilogue

It’s time to admit that this post had a secret agenda: to introduce the idea of a control function in the simplest way possible! If you’re interested in learning more about control functions, a canonical example that does not turn out to be identical to IV is the so-called Heckman Selection Model, which you can learn more about here. (Scroll down until you see the heading “Heckman Selection Model.”) The basic logic is similar: to solve an endogeneity problem, use a first-stage regression to estimate an unobserved quantity that “soaks up” the part of the error term that is correlated with your endogenous regressor of interest. If these videos whet your appetite for more control function fun, Wooldridge (2015) provides a helpful overview along with many references to the econometrics literature.

Why Econometrics is Confusing Part 1: The Error Term

Sun, 24 Jul 2022 00:00:00 +0000

“Suppose that $Y = \alpha + \beta X + U$.” A sentence like this is bound to come up dozens of times in an introductory econometrics course, but if I had my way it would be stamped out completely. Without further clarification, this sentence could mean any number of different things. Even with clarification, it is a source of endless confusion for beginning students. What is $U$ exactly? What is the meaning of “$=$” in this context? We can do better. Here are a few suggestions.

Population Linear Regression

Sometimes $Y = \alpha + \beta X + U$ is nothing more than the population linear regression model. In other words $(\alpha, \beta)$ are the solutions to \[ \min_{\alpha, \beta} \mathbb{E}[(Y - \alpha - \beta X)^2]. \]

The usual way to signal this is by adding the “assumptions” that $\mathbb{E}[XU] = \mathbb{E}[U] = 0$. It’s no wonder that students find this confusing. Neither of these equalities is in fact an assumption; each is true by construction. Rather than “let $Y = \alpha + \beta X + U$,” I suggest

Define $U \equiv Y - (\alpha + \beta X)$ where $\alpha$ and $\beta$ are the intercept and slope from a population linear regression of $Y$ on $X$.

This makes it clear that $U$ has no life of its own; it is defined by the coefficients $\alpha$ and $\beta$. In this way, the equalities $\mathbb{E}[XU] = \mathbb{E}[U] = 0$ become a theorem to be deduced rather than a spurious “assumption” of linear regression. Repeat after me: the population linear regression model has no assumptions. We can always choose $\alpha$ and $\beta$ to ensure that $U$ satisfies the equalities from above. The solution to the population least squares problem is \[ \beta = \text{Cov}(X,Y)/\text{Var}(X),\quad \alpha = \mathbb{E}[Y] - \beta \mathbb{E}[X]. \] By the linearity of expectation, it follows that \[ \mathbb{E}[U] = \mathbb{E}[Y - \alpha - \beta X] = \mathbb{E}[Y] - (\mathbb{E}[Y] - \beta \mathbb{E}[X]) - \beta \mathbb{E}[X] = 0 \] and similarly, although with a bit more algebra¹ \[ \begin{align} \mathbb{E}[XU] &= \mathbb{E}[X(Y - \alpha - \beta X)] \\ &= \mathbb{E}[X(Y - \left\{\mathbb{E}(Y) - \beta \mathbb{E}(X)\right\} - \beta X)]\\ &= \mathbb{E}[X\left\{Y - \mathbb{E}(Y) \right\}] - \beta \mathbb{E}[X\left\{X - \mathbb{E}(X)\right\} ]\\ &= \text{Cov}(X,Y) - \beta \text{Var}(X) = 0. \end{align} \]
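
The sample analogue of this result is easy to check numerically. In the sketch below the data-generating process (a quadratic relationship with standard normal noise) is an assumption made up for illustration. The point is that the least squares residuals average to zero and are orthogonal to the regressor by construction, even though a straight line is the wrong shape for these data.

```r
set.seed(1234)
n <- 5000
x <- rnorm(n)
y <- x^2 + rnorm(n)        # assumed DGP: deliberately nonlinear in x

fit <- lm(y ~ x)           # sample analogue of the population linear regression
u_hat <- residuals(fit)    # sample analogue of U = Y - (alpha + beta * X)

# Both averages are zero up to floating-point error
c(mean_u = mean(u_hat), mean_xu = mean(x * u_hat))
```

Nothing about the true relationship between x and y was used to get these zeros; they follow from the least squares first-order conditions alone.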

Conditional Mean Function

In other situations $Y = \alpha + \beta X + U$ is intended to represent a conditional mean function. This is usually signaled by the assumption $\mathbb{E}[U|X] = 0$. This time around I haven’t written the word assumption in “scare quotes.” That’s because there is an assumption lurking here, unlike in the population linear regression model from above. Still, this is a hopelessly confusing way of indicating it. Here’s a better way:

Define $U \equiv Y - \mathbb{E}(Y|X)$ and assume that $\mathbb{E}(Y|X) = \alpha + \beta X$.

Again, this makes it clear that $U$ has no life of its own. It is constructed from $Y$ and $X$. The conditional mean function $\mathbb{E}(Y|X)$ is simply the minimizer of $\mathbb{E}[\left\{ Y - f(X)\right\}^2]$ over all (well-behaved) functions.² By construction $\mathbb{E}[U|X] = 0$ since \[ \mathbb{E}[U|X] = \mathbb{E}[Y - \mathbb{E}(Y|X)|X] = \mathbb{E}[Y|X] - \mathbb{E}[Y|X] = 0 \] by the linearity of conditional expectation and the fact that $\mathbb{E}(Y|X)$ is a function of $X$. But how can we be sure that the conditional mean function is linear? This is a bona fide assumption: it may be true or it may be false. Either way, it is much clearer to emphasize that we are making an assumption about the form of the conditional mean function, not an assumption about the error term $U \equiv Y - \mathbb{E}(Y|X)$.
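
To see that linearity of the conditional mean really is an assumption with bite, here is a small sketch under an assumed quadratic data-generating process (again invented for illustration). It compares the residual from a linear regression with $U \equiv Y - \mathbb{E}(Y|X)$ computed from the true conditional mean, averaging each within quartiles of x as a crude stand-in for conditioning on $X$.

```r
set.seed(1234)
n <- 5000
x <- rnorm(n)
y <- x^2 + rnorm(n)                 # assumed DGP, so that E(Y | X) = X^2

u_lin <- residuals(lm(y ~ x))       # residual from the linear regression
u_cef <- y - x^2                    # U = Y - E(Y | X) under the assumed DGP

# Average each residual within quartiles of x
bins <- cut(x, breaks = quantile(x, 0:4 / 4), include.lowest = TRUE)
rbind(linear = tapply(u_lin, bins, mean),
      cef    = tapply(u_cef, bins, mean))
```

In this design the linear-regression residuals should average above zero in the outer quartiles of x and below zero in the inner ones, while the averages of y - x^2 stay close to zero in every bin. Both residuals have mean zero overall; only the second has (approximately) mean zero given x.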

Causal Model

Both interpretations of $Y = \alpha + \beta X + U$ from above are purely predictive; they say nothing about whether $X$ causes $Y$. To indicate that a linear model is meant to be causal, it is traditional to write something like “suppose that $Y = \alpha + \beta X + U$ where $X$ may be endogenous.” Often “may be endogenous” is replaced by “where $X$ may be correlated with $U$.” What on earth is this supposed to mean? The language is vague, evasive, and imprecise. It also stretches the meaning of “$=$” beyond all reason. Here’s my suggested improvement:

Consider the causal model $Y \leftarrow (\alpha + \beta X + U)$ where $U$ is unobserved and $(X,U)$ may be dependent.

Causality is intrinsically directional: cigarettes cause lung cancer; lung cancer doesn’t cause cigarettes. The notation “$\leftarrow$” makes this clear. In stark contrast, the notion of mathematical equality is symmetric. If $Y = \alpha + \beta X + U$, it is just as true to say that $X = (Y - \alpha - U) / \beta$. Of course this is nonsensical when applied to cigarettes and cancer.

In a causal model, $U$ does have a life of its own; it represents the causes of $Y$ that we cannot observe. Perhaps $Y$ is wage, $X$ is years of schooling and $U$ is “family background” plus “ability.” For this reason I do not write “define $U \equiv (\text{something})$.” We aren’t defining a residual in a prediction problem. We are taking a stand on how the world works by writing down a particular causal model. In a randomized controlled trial, any unobserved causes $U$ would be independent of $X$. Here we have not made this assumption. We have, however, assumed a particular form for the causal relationship: linear with constant coefficients. Each additional year of schooling causes the same increase (or decrease) in wage regardless of who you are or how many years of schooling you already have. This model could be wrong. But right or wrong, it is fundamentally distinct from the population linear regression and conditional mean models described above. Let’s endeavour to make this clear in our notation.

Surprised that $\mathbb{E}[X(Y - \mathbb{E}[Y])] = \text{Cov}(X,Y)$ and $\mathbb{E}[X(X - \mathbb{E}[X])] = \text{Var}(X)$? These are great homework problems! Work through the algebra using the definitions of variance and covariance.↩︎
See Section 2.1 of my lecture notes for details.↩︎

A New Way of Looking at Least Squares

Mon, 14 Mar 2022 00:00:00 +0000

Have you got a ruler handy? Fantastic! Then hold out your right hand, extend your thumb and little finger as far as they’ll go, and measure the distance in centimeters, rounding to the nearest half centimeter. This is your handspan. Mine is around 23.5 centimeters, but is that big, small, or merely average?¹ Fortunately for you, I’ve asked hundreds of introductory statistics students to measure their handspans over the years and (with their consent) posted the resulting data on my website:

library(tidyverse)
dat <- read_csv('https://ditraglia.com/econ103/height-handspan.csv')
quantile(dat$handspan)

## 0% 25% 50% 75% 100%
## 16.5 20.0 21.5 23.0 28.0

If we take these 326 students as representative of the population of which I am a member, my handspan is roundabout the 84th percentile.

The great thing about handspan, and the reason that I used it in my teaching, is that it’s strongly correlated with height but, in contrast to weight, there’s no temptation to shade the truth. (What’s a good handspan anyway?) Here’s a scatterplot of height against handspan along with the regression line and confidence bands:

dat %>%
ggplot(aes(x = handspan, y = height)) +
geom_point(alpha = 0.3) +
geom_smooth(method = 'lm', formula = y ~ x)

Because handspan is only measured to the nearest half of a centimeter and height to the nearest inch, the dataset contains multiple “tied” values. I’ve used darker colors to indicate situations in which more than one student reported a given height and handspan pair.² The correlation between height and handspan is approximately 0.67. From the following simple linear regression, we’d predict approximately a 1.3 inch difference in height between two students whose handspan differed by one centimeter:

coef(lm(height ~ handspan, dat))

## (Intercept) handspan
## 40.943127 1.266961

Where does the regression line come from?

If you’ve taken an introductory statistics or econometrics course, you most likely learned that the least squares regression line $\widehat{\alpha} + \widehat{\beta} x$ minimizes the sum of squared vertical deviations by solving the optimization problem \[ \min_{\alpha, \beta} \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2. \] You probably also learned that the solution is given by \[ \widehat{\alpha} = \bar{y} - \widehat{\beta} \bar{x}, \quad \widehat{\beta} = \frac{\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}. \] In words: the regression slope equals the ratio of the covariance between height and handspan to the variance of handspan, and the regression line passes through the sample average values of height and handspan. We can check that all of these formulas agree with what we calculated above using lm():

b <- with(dat, cov(height, handspan) / var(handspan))
a <- with(dat, mean(height) - b * mean(handspan))
c(a, b)

## [1] 40.943127 1.266961

coef(lm(height ~ handspan, dat))

## (Intercept) handspan
## 40.943127 1.266961

This is all perfectly correct, and an entirely reasonable way of looking at the problem. But I’d now like to suggest a completely different way of looking at regression. Why bother? The more different ways we have of understanding an idea or method, the more deeply we understand how it works, when it works, and when it is likely to fail. So bear with me while I take you on what might at first appear to be a poorly-motivated computational detour. I promise that there’s a payoff at the end!

A Crazy Idea

There is a unique line that passes through any two distinct points in a plane. So here’s a crazy idea: let’s form every unique pair of students from my height and handspan dataset. To understand what I have in mind, consider a small subset of the data, call it test

test <- dat[1:3,]
test

## # A tibble: 3 × 2
## height handspan
## <dbl> <dbl>
## 1 73 22.5
## 2 65 17
## 3 69 21

With three students, there are three unique unordered pairs: $\{1,2\}, \{1,3\}, \{2,3\}$. Corresponding to these three pairs are three line segments, one through each pair:

plot(height ~ handspan, test, pch = 20)
segments(x0 = c(17, 17, 21), # FROM: x-coordinates
y0 = c(65, 65, 69), # FROM: y-coordinates
x1 = c(21, 22.5, 22.5), # TO: x-coordinates
y1 = c(69, 73, 73), # TO: y-coordinates
lty = 2)

And associated with each line segment is a slope

y_differences <- c(69 - 65, 73 - 65, 73 - 69)
x_differences <- c(21 - 17, 22.5 - 17, 22.5 - 21)
slopes <- y_differences / x_differences
slopes

## [1] 1.000000 1.454545 2.666667

And here’s my question for you: what, if anything, is the relationship between these three slopes and the slope of the regression line? While it’s a bit silly to run a regression with only three observations, the results are as follows:

coef(lm(height ~ handspan, test))

## (Intercept) handspan
## 41.556701 1.360825

The slope of the regression line doesn’t equal any of the three slopes we calculated above, but it does lie between them. This makes sense: if the regression line were steeper or less steep than all three line segments from above, it couldn’t possibly minimize the sum of squared vertical deviations. Perhaps the regression slope is the arithmetic mean of slopes? No such luck:

mean(slopes)

## [1] 1.707071

Something interesting is going on here, but it’s not clear what. To learn more, it would be helpful to play with more than three points. But doing this by hand would be extremely tedious. Time to write a function!

All Pairs of Students

The following function generates all unique pairs of elements taken from a vector x and stores them in a matrix:

make_pairs <- function(x) {
# Returns a matrix whose rows contain each unordered pair of elements of x
# i.e. all committees of two with members drawn from x
n <- length(x)
pair_indices <- combn(1:n, 2)
matrix(x[c(pair_indices)], ncol = 2, byrow = TRUE)
}

For example, applying make_pairs() to the vector c(1, 2, 3, 4, 5) gives

make_pairs(1:5)

## [,1] [,2]
## [1,] 1 2
## [2,] 1 3
## [3,] 1 4
## [4,] 1 5
## [5,] 2 3
## [6,] 2 4
## [7,] 2 5
## [8,] 3 4
## [9,] 3 5
## [10,] 4 5

Notice that make_pairs() is constructed such that order doesn’t matter: we don’t distinguish between $(4,5)$ and $(5,4)$, for example. This makes sense in our setting: Alice and Bob denotes the same pair of students as Bob and Alice.

To generate all possible pairs of students from dat, we apply make_pairs() to dat$handspan and dat$height separately and then combine the result into a single dataframe:

regression_pairs <- data.frame(make_pairs(dat$handspan),
make_pairs(dat$height))
head(regression_pairs)

## X1 X2 X1.1 X2.1
## 1 22.5 17.0 73 65
## 2 22.5 21.0 73 69
## 3 22.5 25.5 73 71
## 4 22.5 25.0 73 78
## 5 22.5 20.0 73 68
## 6 22.5 20.5 73 75

The names are ugly, so let’s clean them up a bit. Handspan is our “x” variable and height is our “y” variable, so we’ll refer to the measurements from each pair as x1, x2, y1, y2

names(regression_pairs) <- c('x1', 'x2', 'y1', 'y2')
head(regression_pairs)

## x1 x2 y1 y2
## 1 22.5 17.0 73 65
## 2 22.5 21.0 73 69
## 3 22.5 25.5 73 71
## 4 22.5 25.0 73 78
## 5 22.5 20.0 73 68
## 6 22.5 20.5 73 75

The 1 and 2 indices indicate a particular student: in a given row x1 and y1 are the handspan and height of the “first student” in the pair while x2 and y2 are the handspan and height of the “second student.” Each student appears many times in the regression_pairs dataframe. This is because there are many pairs of students that include Alice: we can pair her up with any other student in the class. For this reason, regression_pairs has a tremendous number of rows, 52975 to be precise. This is the number of ways to form a committee of size 2 from a collection of 326 people when order doesn’t matter:

choose(326, 2)

## [1] 52975

I just calculated 52,975 slopes.

Corresponding to each row of regression_pairs is a slope. We can calculate and summarize them as follows, using the dplyr package from the tidyverse to make things easier to read:

library(dplyr)
regression_pairs <- regression_pairs %>%
mutate(slope = (y2 - y1) / (x2 - x1))
regression_pairs %>%
pull(slope) %>%
summary

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -Inf 0.000 1.333 NaN 2.667 Inf 419

The sample mean slope is NaN, the minimum is -Inf, the maximum is Inf, and there are 419 missing values. So what on earth does this mean? First things first: the abbreviation NaN stands for “not a number.” This is R’s way of expressing $0/0$, and an NaN “counts” as a missing value:

0/0

## [1] NaN

is.na(0/0)

## [1] TRUE

In contrast, Inf and -Inf are R’s way of expressing $\pm \infty$. These do not count as missing values, and they also arise when a number is too big or too small for your computer to represent:

c(-1/0, 1/0, exp(9999999), -exp(9999999))

## [1] -Inf Inf Inf -Inf

is.na(c(Inf, -Inf))

## [1] FALSE FALSE

So where do these Inf and NaN values come from? Our slope calculation from above was (y2 - y1) / (x2 - x1). If x2 == x1, the denominator is zero. This occurs when the two students in a given pair have the same handspan. Because we only measured handspan to the nearest 0.5cm, there are many such pairs. Indeed, handspan only takes on 23 distinct values in our dataset but there are 326 students:

sort(unique(dat$handspan))

## [1] 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0 20.5 21.0 21.5 22.0 22.5 23.0 23.5
## [16] 24.0 24.5 25.0 25.5 26.0 26.5 27.0 28.0

If y1 and y2 are different but x1 and x2 are the same, the slope will either be Inf or -Inf, depending on whether y1 > y2 or the reverse. When y1 == y2 and x1 == x2 the slope is NaN.

This isn’t an arcane numerical problem. When y1 == y2 and x1 == x2, our pair contains only a single point, so there’s no way to draw a line segment. With no line to draw, there’s clearly no slope to calculate. When y1 != y2 but x1 == x2 we can draw a line segment, but it will be vertical. Should we call the slope of this vertical line $+\infty$? Or should we call it $-\infty$? Because the labels 1 and 2 for the students in a given pair were arbitrary–order doesn’t matter–there’s no way to choose between Inf and -Inf. From the perspective of calculating a slope, it simply doesn’t make sense to construct pairs of students with the same handspan.

With this in mind, let’s see what happens if we average all of the slopes that are not -Inf, Inf, or NaN

regression_pairs %>%
filter(!is.na(slope) & !is.infinite(slope)) %>%
pull(slope) %>%
mean

## [1] 1.259927

This is tantalizingly close to the slope of the regression line from above: 1.266961. But it’s still slightly off.

Not All Slopes are Created Equal

The median handspan in my dataset is 21.5. Let’s take a closer look at the heights of students whose handspans are close to this value:

boxplot(height ~ handspan, dat, subset = handspan %in% c(21, 21.5, 22))

There is a large amount of variation in height for a given value of handspan. Indeed, from this boxplot alone you might not even guess that there is a strong positive relationship between height and handspan in the dataset as a whole! The “boxes” in the figure, representing the middle 50% of heights for a given handspan, overlap substantially. If we were to randomly choose one student with a handspan of 21 and another with a handspan of 21.5, it’s quite likely that the slope between them would be negative. It’s true that students with bigger hands are taller on average. But the difference in height that we’d predict for two students who differed by 0.5cm in handspan is very small: 0.6 inches according to the linear regression from the beginning of this post. In contrast, the standard deviation of height among students with the median handspan is more than five times as large:

sd(subset(dat, handspan == 21.5)$height)

## [1] 3.172248

The 25th percentile of handspan is 20 while the 75th percentile is 23. Comparing the heights of students with these handspans rather than those close to the median, gives a very different picture:

boxplot(height ~ handspan, dat, subset = handspan %in% c(20, 23))

Now there’s much less overlap in the distributions of height. This accords with the predictions of the linear regression from above: for two students whose handspan differs by 3cm, we would predict a difference of 3.8 inches in height. This difference is big enough to discern in spite of the variation in height for students with the same handspan. If we were to choose one student with a handspan of 20cm and another with a handspan of 23cm, it’s fairly unlikely that the slope of the line between them would be negative.

So where does this leave us? Above we saw that forming a pair of students with the same handspan does not allow us to calculate a slope. Now we’ve seen that the slope for a pair of students with a very similar handspan can give a misleading impression about the overall relationship. This turns out to be the key to our puzzle from above. The ordinary least squares slope estimate does equal an average of the slopes for each pair of students, but this average gives more weight to pairs with a larger difference in handspan. As I’ll derive in my next post: \[ \widehat{\beta}_{\text{OLS}} = \sum_{(i,j)\in C_2^n} \omega_{ij} \left(\frac{y_i - y_j}{x_i - x_j}\right), \quad \omega_{ij} \equiv \frac{(x_i - x_j)^2}{\sum_{(i,j)\in C_2^n} (x_i-x_j)^2} \] The notation $C_2^n$ is shorthand for the set $\{(i,j)\colon 1 \leq i < j \leq n\}$, in other words the set of all unique pairs $(i,j)$ where order doesn’t matter. The weights $\omega_{ij}$ are between zero and one and sum to one over all pairs. Pairs with $x_i = x_j$ are given zero weight; pairs in which $x_i$ is far from $x_j$ are given more weight than pairs where these values are closer. And you don’t have to wait for my next post to see that it works:

regression_pairs <- regression_pairs %>%
  mutate(x_dist = abs(x2 - x1),
         weight = x_dist^2 / sum(x_dist^2))
regression_pairs %>%
  filter(!is.infinite(slope) & !is.na(slope)) %>%
  summarize(sum(weight * slope)) %>%
  pull

## [1] 1.266961

coef(lm(height ~ handspan, dat))[2]

## handspan
## 1.266961
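The identity is not specific to the handspan data. The following is a minimal, self-contained sketch on simulated data; the variables `x` and `y` and the simulation design are assumptions made purely for illustration, and base R's `combn()` is used here in place of the `regression_pairs` data frame from above.

```r
# Simulated data (hypothetical; not the handspan dataset from this post)
set.seed(42)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# All unique pairs (i, j) with i < j: a 2 x choose(n, 2) matrix of indices
pairs <- combn(n, 2)
dx <- x[pairs[1, ]] - x[pairs[2, ]]
dy <- y[pairs[1, ]] - y[pairs[2, ]]

# Weights proportional to the squared difference in x; they sum to one
w <- dx^2 / sum(dx^2)

# Weighted average of pairwise slopes versus the OLS slope
weighted_avg_slope <- sum(w * dy / dx)
ols_slope <- unname(coef(lm(y ~ x))[2])
all.equal(weighted_avg_slope, ols_slope)
```

Because `x` is continuous here, no two observations share the same value, so there are no undefined slopes to filter out.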

So there you have it: in a simple linear regression, the OLS slope estimate is a weighted average of the slopes of the line segments between all pairs of observations. The weights are proportional to the squared Euclidean distance between $x$-coordinates. I’ll leave things here for today, but there’s much more to say on this topic. Stay tuned for the next installment!

If you play the piano, this may help: I can play parallel 10ths but only just.↩︎
Another way to show this is by “jittering” the data: simply replace geom_point(alpha = 0.3) with geom_jitter(alpha = 0.3).↩︎

The Wilson Confidence Interval for a Proportion

Sat, 05 Feb 2022 00:00:00 +0000

This is the second in a series of posts about how to construct a confidence interval for a proportion. (Simple problems sometimes turn out to be surprisingly complicated in practice!) In the first part, I discussed the serious problems with the “textbook” approach, and outlined a simple hack that works amazingly well in practice: the Agresti-Coull confidence interval.

Somewhat unsatisfyingly, my earlier post gave no indication of where the Agresti-Coull interval comes from, how to construct it when you want a confidence level other than 95%, and why it works. In this post I’ll fill in some of the gaps by discussing yet another confidence interval for a proportion: the Wilson interval, so-called because it first appeared in Wilson (1927). While it’s not usually taught in introductory courses, it easily could be. Not only does the Wilson interval perform extremely well in practice, it packs a powerful pedagogical punch by illustrating the idea of “inverting a hypothesis test.” Spoiler alert: the Agresti-Coull interval is a rough-and-ready approximation to the Wilson interval.

To understand the Wilson interval, we first need to remember a key fact about statistical inference: hypothesis testing and confidence intervals are two sides of the same coin. We can use a test to create a confidence interval, and vice-versa. In case you’re feeling a bit rusty on this point, let me begin by refreshing your memory with the simplest possible example. If this is old hat to you, skip ahead to the next section.

Tests and CIs – Two Sides of the Same Coin

Suppose that we observe a random sample $X_1, \dots, X_n$ from a normal population with unknown mean $\mu$ and known variance $\sigma^2$. Under these assumptions, the sample mean $\bar{X}_n \equiv \left(\frac{1}{n} \sum_{i=1}^n X_i\right)$ follows a $N(\mu, \sigma^2/n)$ distribution. Centering and standardizing, \[ \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0,1).\] Now, suppose we want to test $H_0\colon \mu = \mu_0$ against the two-sided alternative $H_1\colon \mu \neq \mu_0$ at the 5% significance level. If $\mu = \mu_0$, then the test statistic \[T_n \equiv \frac{\bar{X}_n - \mu_0}{\sigma/\sqrt{n}}\] follows a standard normal distribution. If $\mu \neq \mu_0$, then $T_n$ does not follow a standard normal distribution. To carry out the test, we reject $H_0$ if $|T_n|$ is greater than $1.96$, the $(1 - \alpha/2)$ quantile of a standard normal distribution for $\alpha = 0.05$. To put it another way, we fail to reject $H_0$ if $|T_n| \leq 1.96$. So for what values of $\mu_0$ will we fail to reject? By the definition of absolute value and the definition of $T_n$ from above, $|T_n| \leq 1.96$ is equivalent to \[ - 1.96 \leq \frac{\bar{X}_n - \mu_0}{\sigma/\sqrt{n}} \leq 1.96. \] Re-arranging, this in turn is equivalent to \[ \bar{X}_n - 1.96 \times \frac{\sigma}{\sqrt{n}} \leq \mu_0 \leq \bar{X}_n + 1.96 \times \frac{\sigma}{\sqrt{n}}. \] This tells us that the values of $\mu_0$ we will fail to reject are precisely those that lie in the interval $\bar{X}_n \pm 1.96 \times \sigma/\sqrt{n}$. Does this look familiar? It should: it’s the usual 95% confidence interval for the mean of a normal population with known variance. The 95% confidence interval corresponds exactly to the set of values $\mu_0$ that we fail to reject at the 5% level.

This example is a special case of a more general result. If you give me a $(1 - \alpha)\times 100\%$ confidence interval for a parameter $\theta$, I can use it to test $H_0\colon \theta = \theta_0$ against $H_1 \colon \theta \neq \theta_0$. All I have to do is check whether $\theta_0$ lies inside the confidence interval, in which case I fail to reject, or outside, in which case I reject. Conversely, if you give me a two-sided test of $H_0\colon \theta = \theta_0$ with significance level $\alpha$, I can use it to construct a $(1 - \alpha) \times 100\%$ confidence interval for $\theta$. All I have to do is collect the values of $\theta_0$ that are not rejected. This procedure is called inverting a test.
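To see test inversion at work in the normal-mean example, here is a rough numerical sketch. The simulated sample, the value of `sigma`, and the grid of candidate values `mu0_grid` are all assumptions chosen for illustration. The code tests each candidate $\mu_0$ on a fine grid and compares the set that survives with the usual 95% confidence interval.

```r
# Simulated sample from a normal population with known sigma (illustrative values)
set.seed(1)
sigma <- 2
n <- 30
x <- rnorm(n, mean = 1, sd = sigma)
x_bar <- mean(x)
crit <- qnorm(0.975)

# The usual 95% confidence interval for the mean
CI <- x_bar + c(-1, 1) * crit * sigma / sqrt(n)

# Invert the test: keep every mu0 on the grid that the two-sided test fails to reject
step <- 0.001
mu0_grid <- seq(x_bar - 2, x_bar + 2, by = step)
T_n <- (x_bar - mu0_grid) / (sigma / sqrt(n))
not_rejected <- mu0_grid[abs(T_n) <= crit]

# The endpoints of the non-rejected set agree with the CI up to the grid spacing
rbind("inverted test" = range(not_rejected), "confidence interval" = CI)
```

The agreement is only as fine as the grid spacing, which is why an analytical inversion is preferable whenever the algebra is tractable.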

How to Confuse Your Introductory Statistics Students

Around the same time as we teach students the duality between testing and confidence intervals–you can use a confidence interval to carry out a test or a test to construct a confidence interval–we throw a wrench into the works. The most commonly-presented test for a population proportion $p$ does not coincide with the most commonly-presented confidence interval for $p$. To quote from page 355 of Kosuke Imai’s fantastic textbook Quantitative Social Science: An Introduction:

the standard error used for confidence intervals is different from the standard error used for hypothesis testing. This is because the latter standard error is derived under the null hypothesis … whereas the standard error for confidence intervals is computed using the estimated proportion.

Let’s translate this into mathematics. Suppose that $X_1, ..., X_n \sim \text{iid Bernoulli}(p)$ and let $\widehat{p} \equiv (\frac{1}{n} \sum_{i=1}^n X_i)$. The two standard errors that Imai describes are \[ \text{SE}_0 \equiv \sqrt{\frac{p_0(1 - p_0)}{n}} \quad \text{versus} \quad \widehat{\text{SE}} \equiv \sqrt{\frac{\widehat{p}(1 - \widehat{p})}{n}}. \] Following the advice of our introductory textbook, we test $H_0\colon p = p_0$ against $H_1\colon p \neq p_0$ at the $5\%$ level by checking whether $|(\widehat{p} - p_0) / \text{SE}_0|$ exceeds $1.96$. This is called the score test for a proportion. Again following the advice of our introductory textbook, we report $\widehat{p} \pm 1.96 \times \widehat{\text{SE}}$ as our 95% confidence interval for $p$. As you may recall from my earlier post, this is the so-called Wald confidence interval for $p$. Because the two standard error formulas in general disagree, the relationship between tests and confidence intervals breaks down.

To make this more concrete, let’s plug in some numbers. Suppose that $n = 25$ and our observed sample contains 5 ones and 20 zeros. Then $\widehat{p} = 0.2$ and we can calculate $\widehat{\text{SE}}$ and the Wald confidence interval as follows

n <- 25
n1 <- 5
p_hat <- n1 / n
alpha <- 0.05
SE_hat <- sqrt(p_hat * (1 - p_hat) / n)
p_hat + c(-1, 1) * qnorm(1 - alpha / 2) * SE_hat

## [1] 0.04320288 0.35679712

The value 0.07 is well within this interval. This suggests that we should fail to reject $H_0\colon p = 0.07$ against the two-sided alternative. But when we compute the score test statistic we obtain a value well above 1.96, so that $H_0\colon p = 0.07$ is soundly rejected:

p0 <- 0.07
SE0 <- sqrt(p0 * (1 - p0) / n)
abs((p_hat - p0) / SE0)

## [1] 2.547551

The test says reject $H_0\colon p = 0.07$ and the confidence interval says don’t. Upon encountering this example, your students decide that statistics is a tangled mess of contradictions, despair of ever making sense of it, and resign themselves to simply memorizing the requisite formulas for the exam.

Should we teach the Wald test instead?

How can we dig our way out of this mess? One idea is to use a different test, one that agrees with the Wald confidence interval. If we had used $\widehat{\text{SE}}$ rather than $\text{SE}_0$ to test $H_0\colon p = 0.07$ above, our test statistic would have been

abs((p_hat - p0) / SE_hat)

## [1] 1.625

which is clearly less than 1.96. Thus we would fail to reject $H_0\colon p = 0.07$ exactly as the Wald confidence interval instructed us above. This procedure is called the Wald test for a proportion. Its main benefit is that it agrees with the Wald interval, unlike the score test, restoring the link between tests and confidence intervals that we teach our students. Unfortunately the Wald confidence interval is terrible and you should never use it. Because the Wald test is equivalent to checking whether $p_0$ lies inside the Wald confidence interval, it inherits all of the latter’s defects.

Indeed, compared to the score test, the Wald test is a disaster, as I’ll now show. Suppose we carry out a 5% test. If the null is true, we should reject it 5% of the time. Because the Wald and Score tests are both based on an approximation provided by the central limit theorem, we should allow a bit of leeway here: the actual rejection rates may be slightly different from 5%. Nevertheless, we’d expect them to at least be fairly close to the nominal value of 5%. The following plot shows the actual type I error rates of the score and Wald tests, over a range of values for the true population proportion $p$ with sample sizes of 25, 50, and 100. In each case the nominal size of each test, shown as a dashed red line, is 5%.¹

The score test isn’t perfect: if $p$ is extremely close to zero or one, its actual type I error rate can be appreciably higher than its nominal type I error rate: as much as 10% compared to 5% when $n = 25$. But in general, its performance is good. In contrast, the Wald test is absolutely terrible: its actual type I error rate is systematically higher than 5% even when $n$ is not especially small and $p$ is not especially close to zero or one.
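For a single configuration, the comparison can be made without any plotting. The sketch below computes the exact type I error rate of each test by summing binomial probabilities over the outcomes that lead to rejection. The choice of $n = 25$ and $p = 0.1$ is an assumption made for illustration, not a value singled out in the discussion above.

```r
# Exact type I error rates at one (n, p) pair; values chosen for illustration
n <- 25
p <- 0.1
crit <- qnorm(0.975)

x <- 0:n                    # every possible number of successes
p_hat <- x / n
prob <- dbinom(x, n, p)     # probability of each outcome when the null is true

# Score test uses the null standard error; Wald test uses the estimated one
score_stat <- (p_hat - p) / sqrt(p * (1 - p) / n)
wald_stat <- (p_hat - p) / sqrt(p_hat * (1 - p_hat) / n)

# Note: when p_hat is 0 or 1 the Wald statistic is infinite, so the test rejects
size_score <- sum(prob[abs(score_stat) > crit])
size_wald <- sum(prob[abs(wald_stat) > crit])
c(score = size_score, wald = size_wald)
```

In this configuration the Wald test over-rejects largely because it always rejects when no successes are observed, an outcome that is far from rare when $p = 0.1$ and $n = 25$.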

Granted, teaching the Wald test alongside the Wald interval would reduce confusion in introductory statistics courses. But it would also equip students with lousy tools for real-world inference. There is a better way: rather than teaching the test that corresponds to the Wald interval, we could teach the confidence interval that corresponds to the score test.

Inverting the Score Test

Suppose we collect all values $p_0$ that the score test does not reject at significance level $\alpha$. If the score test is working well–if its actual type I error rate is close to the nominal rate $\alpha$–the resulting set of values $p_0$ will be an approximate $(1 - \alpha) \times 100\%$ confidence interval for $p$. Why is this so? Suppose that $p_0$ is the true population proportion. Then an interval constructed in this way will cover $p_0$ precisely when the score test does not reject $H_0\colon p = p_0$. This occurs with probability approximately $(1 - \alpha)$. Because the score test is much more accurate than the Wald test, the confidence interval that we obtain by inverting it will be much more accurate than the Wald interval. This interval is called the score interval or the Wilson interval.

So let’s do it: let’s invert the score test. Our goal is to find all values $p_0$ such that $|(\widehat{p} - p_0)/\text{SE}_0|\leq c$ where $c$ is the normal critical value for a two-sided test with significance level $\alpha$. Squaring both sides of the inequality and substituting the definition of $\text{SE}_0$ from above gives \[ (\widehat{p} - p_0)^2 \leq c^2 \left[ \frac{p_0(1 - p_0)}{n}\right]. \] Multiplying both sides of the inequality by $n$, expanding, and re-arranging leaves us with a quadratic inequality in $p_0$, namely \[ (n + c^2) p_0^2 - (2n\widehat{p} + c^2) p_0 + n\widehat{p}^2 \leq 0. \] Remember: we are trying to find the values of $p_0$ that satisfy the inequality. The terms $(n + c^2)$ along with $(2n\widehat{p} + c^2)$ and $n\widehat{p}^2$ are constants. Once we choose $\alpha$, the critical value $c$ is known. Once we observe the data, $n$ and $\widehat{p}$ are known. Since $(n + c^2) > 0$, the left-hand side of the inequality is a parabola in $p_0$ that opens upwards. This means that the values of $p_0$ that satisfy the inequality must lie between the roots of the quadratic equation \[ (n + c^2) p_0^2 - (2n\widehat{p} + c^2) p_0 + n\widehat{p}^2 = 0. \] By the quadratic formula, these roots are \[ p_0 = \frac{(2 n\widehat{p} + c^2) \pm \sqrt{4 c^2 n \widehat{p}(1 - \widehat{p}) + c^4}}{2(n + c^2)}. 
\] Factoring $2n$ out of the numerator and denominator of the right-hand side and simplifying, we can re-write this as \[ \begin{align*} p_0 &= \frac{1}{2\left(n + \frac{n c^2}{n}\right)}\left\{\left(2n\widehat{p} + \frac{2n c^2}{2n}\right) \pm \sqrt{4 n^2c^2 \left[\frac{\widehat{p}(1 - \widehat{p})}{n}\right] + 4n^2c^2\left[\frac{c^2}{4n^2}\right] }\right\} \\ \\ p_0 &= \frac{1}{2n\left(1 + \frac{ c^2}{n}\right)}\left\{2n\left(\widehat{p} + \frac{c^2}{2n}\right) \pm 2nc\sqrt{ \frac{\widehat{p}(1 - \widehat{p})}{n} + \frac{c^2}{4n^2}} \right\} \\ \\ p_0 &= \left( \frac{n}{n + c^2}\right)\left\{\left(\widehat{p} + \frac{c^2}{2n}\right) \pm c\sqrt{ \widehat{\text{SE}}^2 + \frac{c^2}{4n^2} }\right\}\\ \\ \end{align*} \] using our definition of $\widehat{\text{SE}}$ from above. And there you have it: the right-hand side of the final equality is the $(1 - \alpha)\times 100\%$ Wilson confidence interval for a proportion, where $c = \texttt{qnorm}(1 - \alpha/2)$ is the normal critical value for a two-sided test with significance level $\alpha$, and $\widehat{\text{SE}}^2 = \widehat{p}(1 - \widehat{p})/n$.
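As a sanity check on the algebra, the closed-form limits can be compared with a brute-force inversion of the score test. The sketch below reuses the numbers from the earlier example, $n = 25$ with 5 successes. The grid `p0_grid` and its spacing are assumptions chosen for illustration.

```r
# Data from the running example: 5 successes in 25 trials
n <- 25
p_hat <- 5 / 25
crit <- qnorm(0.975)

# Brute force: run the score test at every p0 on a fine grid and keep the survivors
step <- 1e-4
p0_grid <- seq(step, 1 - step, by = step)
score_stat <- (p_hat - p0_grid) / sqrt(p0_grid * (1 - p0_grid) / n)
not_rejected <- p0_grid[abs(score_stat) <= crit]

# Closed form from the final line of the derivation
omega <- n / (n + crit^2)
A <- p_hat + crit^2 / (2 * n)
B <- crit * sqrt(p_hat * (1 - p_hat) / n + crit^2 / (4 * n^2))
wilson <- omega * c(lower = A - B, upper = A + B)

rbind("grid inversion" = range(not_rejected), "closed form" = wilson)
```

The two agree up to the grid spacing. Note also that the lower limit exceeds 0.07, which is consistent with the score test having rejected $H_0\colon p = 0.07$ in the earlier example.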

Compared to the Wald interval, $\widehat{p} \pm c \times \widehat{\text{SE}}$, the Wilson interval is certainly more complicated. But it is constructed from exactly the same information: the sample proportion $\widehat{p}$, two-sided critical value $c$ and sample size $n$. Computing it by hand is tedious, but programming it in R is a snap:

get_wilson_CI <- function(x, alpha = 0.05) {
  #-----------------------------------------------------------------------------
  # Compute the Wilson (aka Score) confidence interval for a popn. proportion
  #-----------------------------------------------------------------------------
  # x       vector of data (zeros and ones)
  # alpha   1 - (confidence level)
  #-----------------------------------------------------------------------------
  n <- length(x)
  p_hat <- mean(x)
  SE_hat_sq <- p_hat * (1 - p_hat) / n
  crit <- qnorm(1 - alpha / 2)
  omega <- n / (n + crit^2)
  A <- p_hat + crit^2 / (2 * n)
  B <- crit * sqrt(SE_hat_sq + crit^2 / (4 * n^2))
  CI <- c('lower' = omega * (A - B),
          'upper' = omega * (A + B))
  return(CI)
}

Notice that this is only slightly more complicated to implement than the Wald confidence interval:

get_wald_CI <- function(x, alpha = 0.05) {
  #-----------------------------------------------------------------------------
  # Compute the Wald confidence interval for a popn. proportion
  #-----------------------------------------------------------------------------
  # x       vector of data (zeros and ones)
  # alpha   1 - (confidence level)
  #-----------------------------------------------------------------------------
  n <- length(x)
  p_hat <- mean(x)
  SE_hat <- sqrt(p_hat * (1 - p_hat) / n)
  ME <- qnorm(1 - alpha / 2) * SE_hat
  CI <- c('lower' = p_hat - ME,
          'upper' = p_hat + ME)
  return(CI)
}

With a computer rather than pen and paper there’s very little cost to using the more accurate interval. Indeed, the built-in R function prop.test() reports the Wilson confidence interval rather than the Wald interval:

set.seed(1234)
x <- rbinom(20, 1, 0.5)
prop.test(sum(x), length(x), correct = FALSE) # no continuity correction

##
## 1-sample proportions test without continuity correction
##
## data: sum(x) out of length(x), null probability 0.5
## X-squared = 0.2, df = 1, p-value = 0.6547
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.3420853 0.7418021
## sample estimates:
## p
## 0.55

get_wilson_CI(x)

## lower upper
## 0.3420853 0.7418021

Understanding the Wilson Interval

You could stop reading here and simply use the code from above to construct the Wilson interval. But computing is only half the battle: we want to understand our measures of uncertainty. While the Wilson interval may look somewhat strange, there’s actually some very simple intuition behind it. It amounts to a compromise between the sample proportion $\widehat{p}$ and $1/2$.

The Wald interval is centered around $\widehat{p}$, but the Wilson interval is not. Manipulating our expression from the previous section, we find that the midpoint of the Wilson interval is \[ \begin{align*} \widetilde{p} &\equiv \left(\frac{n}{n + c^2} \right)\left(\widehat{p} + \frac{c^2}{2n}\right) = \frac{n \widehat{p} + c^2/2}{n + c^2} \\ &= \left( \frac{n}{n + c^2}\right)\widehat{p} + \left( \frac{c^2}{n + c^2}\right) \frac{1}{2}\\ &= \omega \widehat{p} + (1 - \omega) \frac{1}{2} \end{align*} \] where the weight $\omega \equiv n / (n + c^2)$ is always strictly between zero and one. In other words, the center of the Wilson interval lies between $\widehat{p}$ and $1/2$. In effect, $\widetilde{p}$ pulls us away from extreme values of $p$ and towards the middle of the range of possible values for a population proportion. For a fixed confidence level, the smaller the sample size, the more that we are pulled towards $1/2$. For a fixed sample size, the higher the confidence level, the more that we are pulled towards $1/2$.
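A few numbers make the shrinkage towards $1/2$ tangible. In the sketch below, the sample proportion of 0.1 and the particular sample sizes are assumptions chosen for illustration.

```r
# How far is the Wilson midpoint pulled towards 1/2? (illustrative values)
p_hat <- 0.1
n <- c(10, 50, 100, 1000)
crit <- qnorm(0.975)              # 95% confidence level

omega <- n / (n + crit^2)         # weight on the sample proportion
p_tilde <- omega * p_hat + (1 - omega) * 0.5

cbind(n, omega, p_tilde)
```

As $n$ grows the weight $\omega$ approaches one and the midpoint approaches $\widehat{p}$. Re-running with a larger critical value, say `qnorm(0.995)`, lowers $\omega$ at every sample size.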

Continuing to use the shorthand $\omega \equiv n /(n + c^2)$ and $\widetilde{p} \equiv \omega \widehat{p} + (1 - \omega)/2$, we can write the Wilson interval as \[ \widetilde{p} \pm c \times \widetilde{\text{SE}}, \quad \widetilde{\text{SE}} \equiv \omega \sqrt{\widehat{\text{SE}}^2 + \frac{c^2}{4n^2}}. \] So what can we say about $\widetilde{\text{SE}}$? It turns out that the value $1/2$ is lurking behind the scenes here as well. The easiest way to see this is by squaring $\widetilde{\text{SE}}$ to obtain \[ \begin{align*} \widetilde{\text{SE}}^2 &= \omega^2\left(\widehat{\text{SE}}^2 + \frac{c^2}{4n^2} \right) = \left(\frac{n}{n + c^2}\right)^2 \left[\frac{\widehat{p}(1 - \widehat{p})}{n} + \frac{c^2}{4n^2}\right]\\ &= \frac{1}{n + c^2} \left[\frac{n}{n + c^2} \cdot \widehat{p}(1 - \widehat{p}) + \frac{c^2}{n + c^2}\cdot \frac{1}{4}\right]\\ &= \frac{1}{\widetilde{n}} \left[\omega \widehat{p}(1 - \widehat{p}) + (1 - \omega) \frac{1}{2} \cdot \frac{1}{2}\right] \end{align*} \] defining $\widetilde{n} = n + c^2$. To make sense of this result, recall that $\widehat{\text{SE}}^2$, the quantity that is used to construct the Wald interval, is a ratio of two terms: $\widehat{p}(1 - \widehat{p})$ is the usual estimate of the population variance based on iid samples from a Bernoulli distribution and $n$ is the sample size. Similarly, $\widetilde{\text{SE}}^2$ is a ratio of two terms. The first is a weighted average of the population variance estimator and $1/4$, the population variance under the assumption that $p = 1/2$. Once again, the Wilson interval “pulls” away from extremes. In this case it pulls away from extreme estimates of the population variance towards the largest possible population variance: $1/4$.² We divide this by the sample size augmented by $c^2$, a strictly positive quantity that depends on the confidence level.³
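The two expressions for $\widetilde{\text{SE}}^2$ can also be checked numerically. This is a small sketch using $n = 25$ and $\widehat{p} = 0.2$, values assumed for illustration; the variable names are likewise illustrative.

```r
# Numerical check that the two forms of the squared Wilson standard error agree
n <- 25
p_hat <- 0.2
crit <- qnorm(0.975)

omega <- n / (n + crit^2)
n_tilde <- n + crit^2
SE_hat_sq <- p_hat * (1 - p_hat) / n

# Form 1: omega^2 times (Wald variance plus the extra term)
form1 <- omega^2 * (SE_hat_sq + crit^2 / (4 * n^2))

# Form 2: weighted average of p_hat(1 - p_hat) and 1/4, divided by n + c^2
form2 <- (omega * p_hat * (1 - p_hat) + (1 - omega) * 0.25) / n_tilde

all.equal(form1, form2)
```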

To make this more concrete, consider the case of a 95% Wilson interval. In this case $c^2 \approx 4$ so that $\omega \approx n / (n + 4)$ and $(1 - \omega) \approx 4/(n+4)$.⁴ Using this approximation we find that \[ \widetilde{p} \approx \frac{n}{n + 4} \cdot \widehat{p} + \frac{4}{n + 4} \cdot \frac{1}{2} = \frac{n \widehat{p} + 2}{n + 4} \] which is precisely the midpoint of the Agresti-Coull confidence interval. And while \[ \widetilde{\text{SE}}^2 \approx \frac{1}{n + 4} \left[\frac{n}{n + 4}\cdot \widehat{p}(1 - \widehat{p}) +\frac{4}{n + 4} \cdot \frac{1}{2} \cdot \frac{1}{2}\right] \] is slightly different from the quantity that appears in the Agresti-Coull interval, $\widetilde{p}(1 - \widetilde{p})/\widetilde{n}$, the two expressions give very similar results in practice. The Agresti-Coull interval is nothing more than a rough-and-ready approximation to the 95% Wilson interval. This not only provides some intuition for the Wilson interval, it shows us how to construct an Agresti-Coull interval with a confidence level that differs from 95%: just construct the Wilson interval!
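To get a feel for how close the two intervals are, here is a rough comparison for 5 successes in 25 trials. The Agresti-Coull limits below use the "add two successes and two failures" form with midpoint $(n\widehat{p} + 2)/(n + 4)$ and variance $\widetilde{p}(1 - \widetilde{p})/(n + 4)$ described above; pairing it with the critical value `qnorm(0.975)` is an assumption of this sketch.

```r
# Compare Agresti-Coull and Wilson limits for one dataset (illustrative values)
n <- 25
successes <- 5
p_hat <- successes / n
crit <- qnorm(0.975)

# Agresti-Coull: treat the data as if it had 2 extra successes and 2 extra failures
n_ac <- n + 4
p_ac <- (successes + 2) / n_ac
agresti_coull <- p_ac + c(-1, 1) * crit * sqrt(p_ac * (1 - p_ac) / n_ac)

# Wilson: closed form from the derivation above
omega <- n / (n + crit^2)
A <- p_hat + crit^2 / (2 * n)
B <- crit * sqrt(p_hat * (1 - p_hat) / n + crit^2 / (4 * n^2))
wilson <- omega * c(A - B, A + B)

rbind(agresti_coull, wilson)
```

For these inputs the corresponding limits differ by less than 0.01.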

Comparing the Wald and Wilson Intervals

Another way of understanding the Wilson interval is to ask how it will differ from the Wald interval when computed from the same dataset. In large samples, these two intervals will be quite similar. This is because $\omega \rightarrow 1$ as $n \rightarrow \infty$. Using the expressions from the preceding section, this implies that $\widehat{p} \approx \widetilde{p}$ and $\widehat{\text{SE}} \approx \widetilde{\text{SE}}$ for very large sample sizes. For smaller values of $n$, however, the two intervals can differ markedly. To make a long story short, the Wilson interval gives a much more reasonable description of our uncertainty about $p$ for any sample size. Wilson, unlike Wald, is always an interval; it cannot collapse to a single point. Moreover, unlike the Wald interval, the Wilson interval is always bounded below by zero and above by one.

Wald Can Collapse to a Single Point; Wilson Can’t

A strange property of the Wald interval is that its width can be zero. Suppose that $\widehat{p} = 0$, i.e. that we observe zero successes. In this case, regardless of sample size and regardless of confidence level, the Wald interval only contains a single point: zero \[ \widehat{p} \pm c \sqrt{\widehat{p}(1 - \widehat{p})/n} = 0 \pm c \times \sqrt{0(1 - 0)/n} = \{0 \}. \] This is clearly insane. If we observe zero successes in a sample of ten observations, it is reasonable to suspect that $p$ is small, but ridiculous to conclude that it must be zero. We encounter a similarly absurd conclusion if $\widehat{p} = 1$. In contrast, the Wilson interval can never collapse to a single point. Using the expression from the preceding section, we see that its width is given by \[ 2c \left(\frac{n}{n + c^2}\right) \times \sqrt{\frac{\widehat{p}(1 - \widehat{p})}{n} + \frac{c^2}{4n^2}} \] The first factor in this product is strictly positive. And even when $\widehat{p}$ equals zero or one, the second factor is also positive: the additive term $c^2/(4n^2)$ inside the square root ensures this. For $\widehat{p}$ equal to zero or one, the width of the Wilson interval becomes \[ 2c \left(\frac{n}{n + c^2}\right) \times \sqrt{\frac{c^2}{4n^2}} = \left(\frac{c^2}{n + c^2}\right) = (1 - \omega). \] Compared to the Wald interval, this is quite reasonable. A sample proportion of zero (or one) conveys much more information when $n$ is large than when $n$ is small. Accordingly, the Wilson interval is shorter for large values of $n$. Similarly, higher confidence levels should demand wider intervals at a fixed sample size. The Wilson interval, unlike the Wald, retains this property even when $\widehat{p}$ equals zero or one.
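The collapse is easy to reproduce. The sketch below assumes zero successes in ten trials, matching the example in the text, and computes both intervals directly from their formulas.

```r
# Zero successes in ten trials
n <- 10
p_hat <- 0
crit <- qnorm(0.975)

# Wald interval: the standard error is zero, so both limits equal p_hat
wald <- p_hat + c(-1, 1) * crit * sqrt(p_hat * (1 - p_hat) / n)

# Wilson interval: the c^2 / (4 n^2) term keeps the width positive
omega <- n / (n + crit^2)
A <- p_hat + crit^2 / (2 * n)
B <- crit * sqrt(p_hat * (1 - p_hat) / n + crit^2 / (4 * n^2))
wilson <- omega * c(A - B, A + B)

c(wald_width = diff(wald), wilson_width = diff(wilson), one_minus_omega = 1 - omega)
```

The Wilson width matches $(1 - \omega)$, as the algebra above predicts, while the Wald width is exactly zero.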

Wald Can Include Impossible Values; Wilson Can’t

A population proportion necessarily lies in the interval $[0,1]$, so it would make sense that any confidence interval for $p$ should as well. An awkward fact about the Wald interval is that it can extend beyond zero or one. In contrast, the Wilson interval always lies within $[0,1]$. For example, suppose that we observe two successes in a sample of size 10. Then the 95% Wald confidence interval is approximately [-0.05, 0.45] while the corresponding Wilson interval is [0.06, 0.51]. Similarly, if we observe eight successes in ten trials, the 95% Wald interval is approximately [0.55, 1.05] while the Wilson interval is [0.49, 0.94].
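These figures can be reproduced in a few lines. The sketch below relies on `prop.test()` with `correct = FALSE` for the Wilson limits, as in the earlier example, and on a small helper for the Wald limits. The helper name `wald_limits` is hypothetical and introduced only for this illustration.

```r
# Hypothetical helper: Wald limits from a success count rather than a data vector
wald_limits <- function(successes, n, alpha = 0.05) {
  p_hat <- successes / n
  p_hat + c(-1, 1) * qnorm(1 - alpha / 2) * sqrt(p_hat * (1 - p_hat) / n)
}

# Two successes in ten trials
wald_2 <- wald_limits(2, 10)
wilson_2 <- as.numeric(prop.test(2, 10, correct = FALSE)$conf.int)

# Eight successes in ten trials
wald_8 <- wald_limits(8, 10)
wilson_8 <- as.numeric(prop.test(8, 10, correct = FALSE)$conf.int)

round(rbind(wald_2, wilson_2, wald_8, wilson_8), 2)
```

By symmetry, the limits for eight successes are one minus the limits for two successes, in reverse order.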

With a bit of algebra we can show that the Wald interval will include negative values whenever $\widehat{p}$ is less than $(1 - \omega) \equiv c^2/(n + c^2)$. Why is this so? The lower confidence limit of the Wald interval is negative if and only if $\widehat{p} < c \times \widehat{\text{SE}}$. Substituting the definition of $\widehat{\text{SE}}$ and re-arranging, this is equivalent to \[ \begin{align} \widehat{p} &< c \sqrt{\widehat{p}(1 - \widehat{p})/n}\\ n\widehat{p}^2 &< c^2(\widehat{p} - \widehat{p}^2)\\ 0 &> \widehat{p}\left[(n + c^2)\widehat{p} - c^2\right] \end{align} \] The right-hand side of the preceding inequality is a quadratic function of $\widehat{p}$ that opens upwards. Its roots are $\widehat{p} = 0$ and $\widehat{p} = c^2/(n + c^2) = (1 - \omega)$. Thus, whenever $\widehat{p} < (1 - \omega)$, the Wald interval will include negative values of $p$. A nearly identical argument, exploiting symmetry, shows that the upper confidence limit of the Wald interval will extend beyond one whenever $\widehat{p} > \omega \equiv n/(n + c^2)$. Putting these two results together, the Wald interval lies within $[0,1]$ if and only if $(1 - \omega) < \widehat{p} < \omega$. This is equivalent to \[ \begin{align} n(1 - \omega) &< \sum_{i=1}^n X_i < n \omega\\ \left\lceil n\left(\frac{c^2}{n + c^2} \right)\right\rceil &\leq \sum_{i=1}^n X_i \leq \left\lfloor n \left( \frac{n}{n + c^2}\right) \right\rfloor \end{align} \] where $\lceil \cdot \rceil$ is the ceiling function and $\lfloor \cdot \rfloor$ is the floor function.⁵ Using this inequality, we can calculate the minimum and maximum number of successes in $n$ trials for which a 95% Wald interval will lie inside the range $[0,1]$ as follows:

n <- 10:20
omega <- n / (n + qnorm(0.975)^2)
cbind("n" = n,
      "min_success" = ceiling(n * (1 - omega)),
      "max_success" = floor(n * omega))

## n min_success max_success
## [1,] 10 3 7
## [2,] 11 3 8
## [3,] 12 3 9
## [4,] 13 3 10
## [5,] 14 4 10
## [6,] 15 4 11
## [7,] 16 4 12
## [8,] 17 4 13
## [9,] 18 4 14
## [10,] 19 4 15
## [11,] 20 4 16

This agrees with our calculations for $n = 10$ from above. With a sample size of ten, any number of successes outside the range $\{3, ..., 7\}$ will lead to a 95% Wald interval that extends beyond zero or one. With a sample size of twenty, this range becomes $\{4, ..., 16\}$.

Finally, we’ll show that the Wilson interval can never extend beyond zero or one. There’s nothing more than algebra to follow, but there’s a fair bit of it. If you feel that we’ve factorized too many quadratic equations already, you have my express permission to skip ahead. Suppose by way of contradiction that the lower confidence limit of the Wilson confidence interval were negative. The only way this could occur is if $\widetilde{p} - c \times \widetilde{\text{SE}} < 0$, i.e. if \[ \omega\left\{\left(\widehat{p} + \frac{c^2}{2n}\right) - c\sqrt{ \widehat{\text{SE}}^2 + \frac{c^2}{4n^2}} \,\,\right\} < 0. \] But since $\omega$ is between zero and one, this is equivalent to \[ \left(\widehat{p} + \frac{c^2}{2n}\right) < c\sqrt{ \widehat{\text{SE}}^2 + \frac{c^2}{4n^2}}. \] We will show that this leads to a contradiction, proving that the lower confidence limit of the Wilson interval cannot be negative. To begin, factorize each side as follows \[ \frac{1}{2n}\left(2n\widehat{p} + c^2\right) < \frac{c}{2n}\sqrt{ 4n^2\widehat{\text{SE}}^2 + c^2}. \] Cancelling the common factor of $1/(2n)$ from both sides and squaring, we obtain \[ \left(2n\widehat{p} + c^2\right)^2 < c^2\left(4n^2\widehat{\text{SE}}^2 + c^2\right). \] Expanding, subtracting $c^4$ from both sides, and dividing through by $4n$ gives \[ n\widehat{p}^2 + \widehat{p}c^2 < nc^2\widehat{\text{SE}}^2 = c^2 \widehat{p}(1 - \widehat{p}) = \widehat{p}c^2 - c^2 \widehat{p}^2 \] by the definition of $\widehat{\text{SE}}$. Subtracting $\widehat{p}c^2$ from both sides and rearranging, this is equivalent to $\widehat{p}^2(n + c^2) < 0$. Since the left-hand side cannot be negative, we have a contradiction.

A similar argument shows that the upper confidence limit of the Wilson interval cannot exceed one. Suppose by way of contradiction that it did. This can only occur if $\widetilde{p} + c \times \widetilde{\text{SE}} > 1$, i.e. if \[ \left(\widehat{p} + \frac{c^2}{2n}\right) - \frac{1}{\omega} > -c \sqrt{\widehat{\text{SE}}^2 + \frac{c^2}{4n^2}}. \] By the definition of $\omega$ from above, the left-hand side of this inequality simplifies to \[ -\frac{1}{2n} \left[2n(1 - \widehat{p}) + c^2\right] \] so the original inequality is equivalent to \[ \frac{1}{2n} \left[2n(1 - \widehat{p}) + c^2\right] < c \sqrt{\widehat{\text{SE}}^2 + \frac{c^2}{4n^2}}. \] Now, if we introduce the change of variables $\widehat{q} \equiv 1 - \widehat{p}$, we obtain exactly the same inequality as we did above when studying the lower confidence limit, only with $\widehat{q}$ in place of $\widehat{p}$. This is because $\widehat{\text{SE}}^2$ is symmetric in $\widehat{p}$ and $(1 - \widehat{p})$. Since we’ve reduced our problem to one we’ve already solved, we’re done!
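For readers who skipped the algebra, a brute-force check offers some reassurance. The sketch below evaluates the 95% Wilson limits for every possible number of successes at each sample size from 1 to 200. The helper name `wilson_limits`, the range of sample sizes, and the small tolerance for floating-point rounding are all assumptions of this illustration.

```r
# Hypothetical helper: Wilson limits for a vector of success counts x out of n trials
wilson_limits <- function(x, n, alpha = 0.05) {
  p_hat <- x / n
  crit <- qnorm(1 - alpha / 2)
  omega <- n / (n + crit^2)
  A <- p_hat + crit^2 / (2 * n)
  B <- crit * sqrt(p_hat * (1 - p_hat) / n + crit^2 / (4 * n^2))
  cbind(lower = omega * (A - B), upper = omega * (A + B))
}

# Check every outcome 0, 1, ..., n for n = 1, ..., 200
tol <- 1e-12  # the limits equal 0 or 1 exactly when x is 0 or n, up to rounding
inside <- sapply(1:200, function(n) {
  ci <- wilson_limits(0:n, n)
  all(ci[, "lower"] >= -tol & ci[, "upper"] <= 1 + tol)
})
all(inside)
```

A numerical check over finitely many cases is of course no substitute for the proof, but it is a useful guard against algebra slips.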

More to Come on Inference for a Proportion!

This has been a post of epic proportions, pun very much intended. Amazingly, we have yet to fully exhaust this seemingly trivial problem. In a future post I will explore yet another approach to inference: the likelihood ratio test and its corresponding confidence interval. This will complete the classical “trinity” of tests for maximum likelihood estimation: Wald, Score (Lagrange Multiplier), and Likelihood Ratio. In yet another future post, I will revisit this problem from a Bayesian perspective, uncovering many unexpected connections along the way. Until then, be sure to maintain a sense of proportion in all your inferences and never use the Wald confidence interval for a proportion.

Appendix: R Code

get_test_size <- function(p_true, n, test, alpha = 0.05) {
  # Compute the size of a hypothesis test for a population proportion
  # p_true    true population proportion
  # n         sample size
  # test      function of p_hat, n, and p_0 that computes test stat
  # alpha     nominal size of the test
  x <- 0:n
  p_x <- dbinom(x, n, p_true)
  test_stats <- test(p_hat = x / n, sample_size = n, p0 = p_true)
  reject <- abs(test_stats) > qnorm(1 - alpha / 2)
  sum(reject * p_x)
}
get_score_test_stat <- function(p_hat, sample_size, p0) {
  SE_0 <- sqrt(p0 * (1 - p0) / sample_size)
  return((p_hat - p0) / SE_0)
}
get_wald_test_stat <- function(p_hat, sample_size, p0) {
  SE_hat <- sqrt(p_hat * (1 - p_hat) / sample_size)
  return((p_hat - p0) / SE_hat)
}
plot_size <- function(n, test, nominal = 0.05, title = '') {
  p_seq <- seq(from = 0.01, to = 0.99, by = 0.001)
  size <- sapply(p_seq, function(p) get_test_size(p, n, test, nominal))
  plot(p_seq, size, type = 'l', xlab = 'p',
       ylab = 'Type I Error Rate',
       main = title)
  text(0.5, 0.98 * max(size), bquote(n == .(n)))
  abline(h = nominal, lty = 2, col = 'red', lwd = 2)
}
plot_size_comparison <- function(n, nominal = 0.05) {
  par(mfrow = c(1, 2))
  plot_size(n, get_score_test_stat, nominal, title = 'Score Test')
  plot_size(n, get_wald_test_stat, nominal, title = 'Wald Test')
  par(mfrow = c(1, 1))
}

For the R code used to generate these plots, see the Appendix at the end of this post.↩︎
The value of $p$ that maximizes $p(1-p)$ is $p=1/2$ and $(1/2)^2 = 1/4$.↩︎
If you know anything about Bayesian statistics, you may be suspicious that there’s a connection to be made here. Indeed this whole exercise looks very much like a dummy observation prior in which we artificially augment the sample with “fake data.” There is a Bayesian connection here, but the details will have to wait for a future post.↩︎
As far as I’m concerned, 1.96 is effectively 2. If you disagree, please replace all instances of “95%” with “95.45%”.↩︎
The final inequality follows because $\sum_{i=1}^n X_i$ can only take on a value in $\{0, 1, ..., n\}$ while $n\omega$ and $n(1 - \omega)$ may not be integers, depending on the values of $n$ and $c^2$.↩︎

Lessons from the Oxford Vaccination Survey

Fri, 31 Dec 2021 00:00:00 +0000

Back in November a colleague pointed me to a website describing the recent COVID-19 Student Vaccination Survey carried out by my employer, the University of Oxford. At the time I briefly tweeted my concerns at the University:

Sorry @UniofOxford, but this is wildly misleading. Students who choose to reply to a vaccination survey are likely very different from those who choose not to reply. If your estimate is seriously biased, constructing a confidence interval is pointless. https://t.co/YB7HUa6Y0k pic.twitter.com/FYQA97QmU1
— Francis DiTraglia (@economictricks) November 19, 2021

but never received a response. In this post, armed with far more than 280 characters, I’ll explain what went wrong in the Oxford Vaccination Survey and suggest some ways of doing better next time.

What We Know About the Survey

While the university website doesn’t provide detailed information on the survey’s methodology, it does allow us to establish a few key facts. First: this was not in fact a survey; it was a census.

All students were invited to complete the COVID-19 Student Vaccination Survey in 3rd and 4th Weeks of Michaelmas term.¹ This very short form asked students to confirm their COVID-19 vaccination status.

Surveys use a sample to learn about a population. Pollsters don’t ask all American voters if they approve of Biden; they ask a small subset of voters, using a carefully-designed sampling scheme. A census, on the other hand, attempts to reach each person in the target population. This is how the Oxford Vaccination “Survey” was conducted.

Second: roughly half of Oxford students chose not to respond:

The response rate was 49.3%, and there were virtually no differences in vaccination rates between different colleges and departments.

Third: among those who did complete the questionnaire, a very high proportion indicated that they were vaccinated:

A total of 98% of respondents reported that they were vaccinated (95% fully and 3% partially).

How Oxford Interpreted the Results

Their headline conclusion was that “the survey indicated that the vast majority of students are now vaccinated.” Further down, under the FAQs, they added the following clarification:

Given that the response rate was 50%, how can you be sure that the vaccination rate reflects the whole student population?

50% is considered a high response rate in surveys of this type. It allows for a reliable and powerful statistical test to be conducted on the response data. Based on our evaluation, we can be 95% certain that the true vaccination level among all Oxford students (based on those who responded) is between 97.8% and 98.3%. We can also be 99% certain that the true vaccination level among all Oxford students is between 97.7% and 98.4%. In light of this, we can be extremely confident that the vast majority of Oxford students are either fully or partially vaccinated.

This is wrong. First, the response rates for other “surveys of this type” are irrelevant: what matters is what we can learn from this dataset given the observed response rate and the reasons why students chose not to respond. It seems quite implausible that the vaccination rate among respondents reflects the whole student population. If, as I suspect, conscientious students are both more likely to be vaccinated and more likely to respond to a questionnaire, the 98% figure is almost certainly too high. Second, statistical tests and confidence intervals are designed to quantify sampling error: non-systematic differences in vaccination rates in the sample compared to the population that arise when a sample is drawn at random. But the problem here is non-sampling error: students who respond are likely different in their rate of vaccination from those who do not. This kind of systematic difference between sample and population is not accounted for in a confidence interval or statistical test. In light of this, we cannot conclude that “the vast majority of Oxford students are either fully or partially vaccinated.”

So what can we conclude? It depends on what we’re willing to assume. I’ll start off by considering the worst case: suppose we’re not willing to assume anything about the reasons why some students chose not to respond.

Worst-Case Bounds for the Proportion Vaccinated

I have a bowl containing 100 balls; $W$ of them are white and the rest are black. I remove 50 balls from the bowl and lay them on the table so you can see them all at once. Of these, 49 are white and 1 is black. What can you conclude about $W$? The answer depends crucially on how I chose which balls to show you.

If I drew the balls at random, then you can use standard statistical tools–hypothesis testing, confidence intervals, a Bayesian posterior distribution–to make inferences about $W$. Random sampling ensures that the balls you see before you on the table are representative of the balls that remain in the bowl. This creates an inferential link between what you can observe and what you can’t, and allows you to quantify uncertainty about $W$ using the language of probability.

But what if you know nothing about how I chose which balls to show you? Perhaps I decided to show you all of the white balls in the bowl; or perhaps I decided to show you all of the black balls: you have no idea. In this case the inferential link between what you can see on the table and what remains in the bowl is broken and the familiar tools of statistical inference cannot be directly applied. Unless you are willing to take a stand of some kind on how I chose which balls to remove from the bowl, all that you can conclude about $W$ is that it must be at least 49 and no more than 99. Based on an observed fraction of 49/50 white balls on the table, you can conclude only that between 49/100 and 99/100 of the balls in the bowl are white.

Now replace 100 balls in the bowl with 25820 students in the university, 50 balls on the table with 12729 respondents to the questionnaire, and 49 white balls on the table with 12475 vaccinated students among those who responded.² Suppose for simplicity that everyone who responds to the questionnaire does so truthfully. (We’ll revisit this below.) Unless we know something about how and why respondents decided to respond, all we can infer from this information is that no fewer than 12475 and no more than 25566 Oxford students have been vaccinated. Expressed as a proportion, this works out to between 48% and 99% of students at Oxford with at least one COVID-19 vaccination.
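The arithmetic behind these bounds can be sketched in a few lines of R. The counts are the approximate figures quoted above; the variable names are illustrative.

```r
# Worst-case bounds for the share vaccinated, using the approximate
# counts quoted in the text.
n_students    <- 25820  # all Oxford students
n_respondents <- 12729  # students who answered the questionnaire
n_vaccinated  <- 12475  # respondents reporting at least one dose

n_missing <- n_students - n_respondents  # students who did not respond

# Lower bound: no non-respondent is vaccinated.
lower <- n_vaccinated / n_students
# Upper bound: every non-respondent is vaccinated.
upper <- (n_vaccinated + n_missing) / n_students

round(c(lower = lower, upper = upper), 2)
```

Nothing here involves probability: the width of the interval is driven entirely by the number of students for whom we have no data.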

While this result, [0.48, 0.99], looks superficially similar to a confidence interval, it’s a completely different animal. A confidence interval contains the values of an unknown quantity that are “consistent with the data” in a precise sense.³ As explained above, confidence intervals quantify uncertainty arising from sampling error. But our uncertainty in the vaccine example does not come from random sampling: it comes from missing data. In effect, we have a complete census of the kind of Oxford student who fills out COVID-19 vaccination questionnaires. What we lack is any data whatsoever on the kind of Oxford student who doesn’t reply. Each point in the interval [0.48, 0.99] corresponds to a different assumption about the relative vaccination rates of respondents compared to non-respondents (the fraction of white balls on the table compared to the fraction remaining in the bowl).

Besides reporting the rather pessimistic bound [0.48, 0.99], is there anything more that we can say about the share of vaccinated Oxford students? Yes, but only if we’re willing to make some assumptions.

Missing (Completely) at Random?

Let’s start with the strongest possible assumption: data that are missing completely at random (MCAR). In the vaccination example, MCAR amounts to assuming that students who respond are representative of students who do not respond. In the language of my balls-and-bowl example, MCAR would hold if the 50 balls on the table were drawn from the bowl at random. Under MCAR, we can effectively ignore the problem of non-response and report an estimate of 98% vaccine coverage among Oxford students. But while it’s logically possible for the MCAR assumption to hold in this example, it’s not very plausible. As I mentioned above, it seems likely that more conscientious students will be more likely both to get vaccinated and to reply to a questionnaire. If so, MCAR fails.

More plausible than MCAR is the assumption of data that are missing at random (MAR). This idea is best illustrated with an example. Suppose that there are four kinds of balls in the bowl: large white balls, large black balls, small white balls, and small black balls. We’re not interested in the size of the balls; we only want to know how many are white and how many are black. As before, I draw 50 balls from the bowl at random and lay them on the table. Unfortunately, small balls tend to sink to the bottom of the bowl so I’m disproportionately likely to draw a large ball. In this case, the balls you can see on the table are no longer representative of the balls in the bowl: there are probably too many large balls on the table.

To make progress we need an assumption. Suppose that, after accounting for size, the chance that I draw a particular ball doesn’t depend on its color. Under this assumption the large balls on the table are representative of the large balls that remain in the bowl. Similarly, the small balls on the table are representative of the small balls that remain in the bowl. This is precisely the MAR assumption: conditional on something we know (the size of a ball), the data we can observe (balls on the table) are representative of the missing data (balls in the bowl). MCAR does not hold because the 50 balls on the table are not representative of the 50 balls that remain in the bowl (large balls are disproportionately likely to be drawn). They only become representative after we account for size.

So how might the MAR assumption allow us to learn more in the vaccination example? From the figures on page 85 of this HSA report we see that, within each UK age group, females are more likely to be vaccinated than males. Suppose that this holds among Oxford students as well. If there is any difference between survey response rates for male and female students, MCAR fails and the 98% estimated vaccination rate will be unreliable. But suppose we are willing to assume that female respondents are representative of female non-respondents. In this case, the share of vaccinated female respondents is a good estimate of the share of vaccinated female students, and similarly for male respondents. Because we know that 49% of Oxford students are female, we can use this information to calculate a good estimate of the overall share of vaccinated students.⁴ If MAR conditional on sex seems like too strong an assumption, perhaps you’d be willing to assume that female undergraduates from the UK who respond are representative of all female undergraduates from the UK. After all: overseas students are almost certainly vaccinated, given government travel restrictions. In this case we’d form eight groups (male/female $\times$ overseas/home $\times$ undergrad/post-grad) and estimate the share of vaccinated students separately for each group using the data for respondents. To obtain an overall estimate for the share of vaccinated students, we’d simply average the estimates for each “type” of student, weighting by their respective shares in the Oxford student body. This approach of re-weighting estimates for particular groups is called post-stratification and is widely used in political polling.⁵ The assumption that underlies it is MAR: conditional on characteristics we can observe, whether a person responds to the survey or not is “as good as random,” i.e. unrelated to the question we ask in our poll.
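To make the re-weighting concrete, here is a minimal sketch with two groups. Apart from the 49% female share mentioned above, every number is hypothetical and chosen only to illustrate the arithmetic: the response shares and group-level vaccination rates below are not taken from the Oxford data.

```r
# Post-stratification sketch. All numbers except the 49% female share
# are HYPOTHETICAL and serve only to illustrate the calculation.
pop_share  <- c(female = 0.49, male = 0.51)  # shares in the student body
resp_share <- c(female = 0.55, male = 0.45)  # hypothetical shares among respondents
vax_rate   <- c(female = 0.99, male = 0.97)  # hypothetical rates among respondents

# Unadjusted estimate: weights each group by its share of respondents.
raw_estimate <- sum(resp_share * vax_rate)
# Post-stratified estimate: weights each group by its share of the population.
ps_estimate <- sum(pop_share * vax_rate)

c(raw = raw_estimate, post_stratified = ps_estimate)
```

In this made-up example the group with the higher vaccination rate is over-represented among respondents, so the unadjusted estimate is too high and post-stratification pulls it down. With more groups the calculation is identical: only the length of the three vectors changes.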

So does MAR hold in our vaccination example? Perhaps not. Getting vaccinated is widely viewed as a pro-social act. The phenomenon of social desirability bias suggests that people who are not vaccinated may be less likely to respond even after conditioning on their observed characteristics. If so, MAR will not hold. Even so, I would put much more faith in results that used post-stratification to adjust for observed characteristics like sex and home/overseas that are plausibly related to both response rates and vaccination status. Depending on the precise nature of the data that Oxford collected, it’s possible that this analysis could still be carried out.

Designing a Better Survey

There’s an old saying (often attributed to Fisher) that calling in a statistician after you’ve collected your data is like calling in a doctor after a loved one has died: all she can do is perform a post-mortem; it’s too late to save the patient. So how could Oxford do a better job the next time that they want to carry out a survey?

One of the key lessons of statistics is that a relatively small sample can still provide reliable inferences about the population provided that the sample is drawn at random. Rather than contacting all students, choose a random sample and focus on maximizing the response rate. There are several ways to do this. First is multi-mode surveys: send an email in addition to a letter in addition to a text message. Second is incentives: perhaps offer a gift card upon receipt of the completed survey. In spite of your best efforts, inevitably some people still won’t reply. This is where two-stage sampling can help: take a second random sample of the non-respondents and contact these people a second time. As long as you reach a representative sample of non-respondents, it doesn’t matter if you reach them all. If some non-response remains, try to adjust for it using post-stratification.

All of the discussion thus far has assumed that respondents answer truthfully. But social desirability bias might also lead people to mis-represent their true vaccination status. This is a much harder problem to solve, but ensuring privacy can help. Posting a detailed privacy policy is a helpful first step, but students may still be suspicious when a government regulator asks their university to gather potentially sensitive data about them. Randomized response methods address this concern by designing surveys in which it is impossible for researchers or anyone with whom they share data to infer individual responses. How does this work? In the simplest possible example, I give you a coin and tell you to flip it in secret. I instruct you to answer the survey truthfully if you flip heads, and to simply check “NO I AM NOT VACCINATED” if you flip tails. When I read the survey, I have no way of knowing if you haven’t been vaccinated or merely flipped tails. In spite of this, it’s still possible to construct a reliable estimator of the share of people who are vaccinated.⁶

A New Year’s Resolution

Numbers are only valuable when they tell us something meaningful about the world. If we care enough to ask the question, we should care enough to do a good job answering it. While statistical inference and modeling can be extremely valuable tools for learning about the world, what matters most is collecting good, clean data. May you be blessed with 100% response rates in 2022. If you’re not, make it your new year’s resolution to do something about it!

For those of you who live outside the Oxford bubble, “in 3rd and 4th Weeks of Michaelmas term” means in the last week of October and first week of November 2021.↩︎
These are my approximations based on the information taken from the survey website and student numbers for the University of Oxford.↩︎
A 95% confidence interval for an unknown proportion $p$ contains the values of $p$ that we cannot reject based on a hypothesis test with significance level $0.05$.↩︎
I computed the share of female students from the University statistics posted at this url: https://www.ox.ac.uk/about/facts-and-figures/student-numbers.↩︎
An example I particularly enjoy is this paper by Andrew Gelman and co-authors.↩︎
If you’re studying introductory probability and statistics, see if you can figure out how. It’s a great practice exam question!↩︎

Street Fighting Numerical Analysis - Part 1

Fri, 29 Oct 2021 00:00:00 +0000

Computing is a crucial part of modern applied and theoretical econometrics but most economists, myself included, have little if any formal training in numerical analysis and computer science. This means that we often learn things the hard way: by making boneheaded mistakes and spending hours browsing stackoverflow to try to figure out what went wrong. In preparation for my upcoming course on Empirical Research Methods¹ I’ve started trying to collect and organize the various nuggets of computational wisdom that I’ve picked up over the years. This post is the first of several that I plan to write on that theme. Its origin is an enigmatic bug that I detected in a seemingly trivial line of my R code involving rep().

Is R broken?

For no particular reason, let’s use the R function rep() to print out the string "econometrics.blog" four times:

rep("econometrics.blog", times = 4)

## [1] "econometrics.blog" "econometrics.blog" "econometrics.blog"
## [4] "econometrics.blog"

Since 0.2 multiplied by 20 equals 4, it comes as no surprise that replacing times = 4 with times = 0.2 * 20 gives the same result:

rep("econometrics.blog", times = 0.2 * 20)

## [1] "econometrics.blog" "econometrics.blog" "econometrics.blog"
## [4] "econometrics.blog"

Now let’s try times = (1 - 0.8) * 20 instead. Since 0.2 equals (1 - 0.8) this couldn’t possibly make a difference, could it? Distressingly, it does: we obtain only three copies of "econometrics.blog":

rep("econometrics.blog", times = (1 - 0.8) * 20)

## [1] "econometrics.blog" "econometrics.blog" "econometrics.blog"

What on earth is going on here? Has R made some kind of mistake? Let’s try a sanity check. First we’ll calculate (1 - 0.8) * 20 and call it x. Then we’ll check that x really does equal four:

x <- (1 - 0.8) * 20
x

## [1] 4

What a relief: surely setting times = x should give us four copies of "econometrics.blog". Alas, it does not:

rep('econometrics.blog', times = x)

## [1] "econometrics.blog" "econometrics.blog" "econometrics.blog"

Clearly using open-source software like R is a bad idea and I should switch to STATA.²

Numeric Types in R

Because R is a dynamically-typed programming language, we can almost always ignore the question of precisely how it stores numeric values “under the hood.” In fact R has two numeric types: integer and double. Integers are rare in practice. The operator : returns an integer vector

y <- 1:5
typeof(y)

## [1] "integer"

and the length of a vector is always an integer

n <- length(y)
typeof(n)

## [1] "integer"

but nearly every other numeric value you encounter in R will be stored as a double, i.e. a double precision floating point number:

typeof(4)

## [1] "double"

typeof(4.0)

## [1] "double"

typeof(cos(0))

## [1] "double"

To force R to store a value as an integer rather than double, we can either append an L

z <- 4L
typeof(z)

## [1] "integer"

or coerce, i.e. convert, a double to an integer using as.integer()

a <- 4
typeof(a)

## [1] "double"

b <- as.integer(a)
typeof(b)

## [1] "integer"

The trade-off between integers and doubles is between precision and range. Calculations carried out with integers are always exact, but integers can only be used to represent a fairly limited number of values. Calculations with doubles, on the other hand, are not always exact, but doubles can store a much larger range of values, including decimals.

This post isn’t the right place to delve into the details of floating point numbers, of which doubles are an instance, but there are two points worth emphasizing. First, it’s generally safe to store a value that is “truly” an integer, e.g. 4, as a double. As explained in the help file integer {base}:

current implementations of R use 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9: doubles can hold much larger integers exactly.

This explains why you may never have encountered the L suffix in the wild. Because doubles can represent very large integers exactly, calculations with whole numbers stored as doubles will also be exact. Notice that R automatically converts integers that are “too big” into doubles:

x <- 999999999L
typeof(x)

## [1] "integer"

y <- 9999999999L
typeof(y)

## [1] "double"
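The limits mentioned in the help file can be checked directly. The sketch below relies only on the built-in constant .Machine$integer.max and on the fact that a double carries 53 bits of precision, so whole numbers are exact up to $2^{53}$ and start to collide beyond it.

```r
# Largest value that R's 32-bit integer type can hold
.Machine$integer.max

# Doubles represent every whole number exactly up to 2^53 ...
2^53 == 2^53 - 1   # FALSE: these two values are distinguishable
# ... but beyond that point neighbouring whole numbers start to collide
2^53 + 1 == 2^53   # TRUE: 2^53 + 1 has no exact representation
```

For the sample sizes and counts that arise in everyday econometrics, this ceiling is far out of reach.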

While converting integers to doubles is innocuous, you need to be careful when converting doubles to integers. This turns out to be the root of our problem with rep() from above. Both 0.2 * 20 and (1 - 0.8) * 20 are doubles, and both appear to equal 4

x <- 0.2 * 20
x

## [1] 4

typeof(x)

## [1] "double"

y <- (1 - 0.8) * 20
y

## [1] 4

But x and y are coerced to different integer values:

as.integer(x)

## [1] 4

as.integer(y)

## [1] 3

The function rep() expects its second argument times to be an integer. If we supply a double instead, then rep() makes a conversion in the same way as the function as.integer(), namely by truncating. Far down in the help file for rep() we find this crucial caveat:

Non-integer values of times will be truncated towards zero. If times is a computed quantity it is prudent to add a small fuzz or use round().

But wait: aren’t x and y precisely equal to each other? How can one truncate to 4 while the other truncates to 3? As it turns out, appearances can be deceiving:

identical(x, y)

## [1] FALSE

To find out why these values aren’t equal, we need to learn a bit more about how computers approximate real numbers using doubles.

What you see isn’t always what you get.

R has various handy built-in constants, including $\pi$

pi

## [1] 3.141593

Notwithstanding bill number 246 of the 1897 sitting of the Indiana General Assembly, $\pi$ is an irrational number. By default, however, R only shows us a small number of its digits. To display twenty digits of $\pi$, we can specify the argument digits to the function print() like so

print(pi, digits = 20)

## [1] 3.141592653589793116

To see even more digits, we can use the function sprintf(). Let’s try to display 60 digits of $\pi$:

sprintf("%.60f", pi)

## [1] "3.141592653589793115997963468544185161590576171875000000000000"

Why do the last twelve decimal places display as zero? The answer is that computers cannot represent real numbers to infinite precision. At some point, the remaining digits of $\pi$ get chopped off, and we’re left with an approximation that’s more than sufficient for any practical application.

At first glance, the number 0.8 seems nothing like $\pi$. It is, after all, a rational number: 4/5. But sprintf() reveals that there’s more here than meets the eye:

sprintf("%.54f", 0.8)

## [1] "0.800000000000000044408920985006261616945266723632812500"

The fraction 4/5 cannot be represented exactly as a double; it can only be approximated. The same is true of 1/5. Because 0.2 and 0.8 can only be represented approximately, 1 - 0.8 and 0.2 turn out not to be equal from the computer’s perspective:

identical(1 - 0.8, 0.2)

## [1] FALSE

This in turn explains why (1 - 0.8) * 20 and 0.2 * 20 truncate to different integer values:

sprintf("%.54f", 20 * c(1 - 0.8, 0.2))

## [1] "3.999999999999999111821580299874767661094665527343750000"
## [2] "4.000000000000000000000000000000000000000000000000000000"
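When two doubles “should” be equal but differ in the last few bits, a common remedy is to compare them up to a small tolerance rather than exactly. Here is a minimal sketch using the base R function all.equal(); the variable names a and b are illustrative.

```r
a <- (1 - 0.8) * 20
b <- 0.2 * 20

a == b                    # exact comparison of the stored values
isTRUE(all.equal(a, b))   # comparison up to a small numerical tolerance
abs(a - b)                # size of the discrepancy
.Machine$double.eps       # gap between 1 and the next larger double
```

Wrapping all.equal() in isTRUE() is the usual idiom because all.equal() returns a character description of the difference, rather than FALSE, when the two values are not close.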

Which decimals have an exact representation?

The fraction 1/3 lacks a finite decimal expansion. You might guess that R would simply store 1/3 as a zero, followed by a decimal, followed by a large number of 3s. But in fact it does not:

sprintf("%.54f", 1/3)

## [1] "0.333333333333333314829616256247390992939472198486328125"

Every digit from the 1 onward is wrong. The fraction 1/10, on the other hand, clearly does have a finite decimal expansion, 0.1, but R gets this one wrong as well:

sprintf("%.54f", 1/10)

## [1] "0.100000000000000005551115123125782702118158340454101562"

At the same time, it handles 1/4 with perfect accuracy:

sprintf("%.54f", 1/4)

## [1] "0.250000000000000000000000000000000000000000000000000000"

What’s going on? Here’s a clue: the fraction 1/32 can also be represented exactly as a double. See if you can figure out why before reading further.

The reason why 1/3 lacks a terminating decimal expansion is that it can’t be written as a counting number divided by a power of ten. In other words, we can’t find values $m, n \in \mathbb{N}$ such that $m/10^n$ equals $1/3$. In contrast, 3/4 has a finite decimal expansion because it equals 75/100, corresponding to $m = 75$ and $n = 2$. Of course I’ve left off a crucial qualification: wherever I wrote “finite decimal expansion” above, I should have added “in base 10.” The same number could have a terminating decimal expansion in one base and not in another.

Although R displays numbers on the screen in base 10, it represents and computes with them in binary. So the question becomes: which values have a terminating decimal expansion in base 2? To find out, simply replace 10 with 2 in the expression from above. A rational number has a terminating decimal expansion in base 2 if it can be written as $m/2^n$ for some $m, n \in \mathbb{N}$. Since 1/4 equals $1/2^2$, it has an exact representation. Since 3/4 can be written as $3/2^2$ it also has an exact representation. In contrast, 1/5 lacks an exact representation because there are no natural numbers $m,n$ such that $5 = 2^n/m$. We can get close by making $n$ large and choosing $m$ carefully, but we can never satisfy this equation exactly.
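The rule above is easy to turn into code. The sketch below checks whether a fraction can be written as $m/2^n$ by reducing it to lowest terms and testing whether the denominator is a power of two. Both function names are made up for this illustration.

```r
# Sketch: does the fraction num/den have a terminating expansion in base 2?
# True exactly when den, after cancelling common factors, is a power of 2.
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

terminates_in_base_2 <- function(num, den) {
  den <- den / gcd(num, den)             # reduce to lowest terms
  while (den %% 2 == 0) den <- den / 2   # strip out all factors of 2
  den == 1                               # anything left over is an odd factor
}

terminates_in_base_2(1, 4)
terminates_in_base_2(3, 4)
terminates_in_base_2(1, 32)
terminates_in_base_2(1, 5)
terminates_in_base_2(1, 10)
```

One caveat: this checks only the condition described above. A double has 53 bits of precision, so a fraction whose binary expansion terminates but needs more bits than that will still be rounded.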

Take-home Lesson

High-level programming languages like R and Python are extremely convenient: they allow us to focus on the big picture rather than writing line after line of boilerplate code. But we should never forget that computers cannot represent all numeric values with perfect accuracy. Sometimes this matters. In R, coercing from integer to double is safe but the reverse can be risky, as we have gleaned from a deceptively simple example involving rep(). To learn more about the subtleties of R, I highly recommend The R Inferno by Patrick Burns and Advanced R by Hadley Wickham. Despite what you may have heard, R is quite a capable language, but it does have some quirks!
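In practice, the fix suggested by the help file for rep() is a one-liner: round a computed count before it gets truncated. A minimal sketch, with n_copies as an illustrative variable name:

```r
n_copies <- (1 - 0.8) * 20   # a computed quantity that "should" equal 4

# Passing the double directly: rep() truncates towards zero
length(rep("econometrics.blog", times = n_copies))

# Rounding first gives the intended number of copies
length(rep("econometrics.blog", times = round(n_copies)))
```

The same habit is worth adopting anywhere a computed double is used as a count or an index.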

Content coming soon!↩︎
Health warning: this sentence is satire. The author does not condone the use of STATA or other closed-source statistical packages.↩︎

Don't Use the Textbook CI for a Proportion

Mon, 04 Oct 2021 00:00:00 +0000

In a previous post I showed an example in which the “textbook” confidence interval for a proportion performs poorly despite a fairly large sample size. My aim in that post was to convince you that the oft-repeated advice concerning $n > 30$ and the central limit theorem is worthless. Today I’d like to convince you of something even more subversive: the textbook confidence interval for a proportion is absolutely terrible and you should never use it or teach it under any circumstances. Fortunately there’s a simple fix. For a 95% interval, simply add four “fake” observations to your dataset, two successes and two failures, and then follow the textbook recipe for this “artificially augmented” dataset.

This post draws on two very approachable papers from around the turn of the millennium: Agresti & Coull (1998) and Brown, Cai, & DasGupta (2001). More theoretically-inclined readers may also enjoy a companion paper to the latter reference: Brown, Cai, & DasGupta (2002).

The Wald Confidence Interval

This is what you learned in your introductory statistics course: it’s the textbook interval I alluded to above. Let $X_1, ..., X_n \sim \text{iid Bernoulli}(p)$ and define $\widehat{p} = \sum_{i=1}^n X_i/n$. By the central limit theorem \[ \frac{\widehat{p} - p}{\sqrt{\widehat{p}(1 - \widehat{p})/n}} \rightarrow_d N(0,1) \] leading to the so-called “Wald” 95% confidence interval for a population proportion: \[ \widehat{p} \pm 1.96 \times \sqrt{\frac{\widehat{p}(1- \widehat{p})}{n}}. \]

An Obvious Objection to the Wald Interval

Suppose we want to estimate the proportion of Trump voters in Berkeley, California. We decide to carry out a poll of 25 randomly-sampled Berkeley residents and find that none of them voted for Trump. Then $\widehat{p} = 0$ and the Wald confidence interval is \[ 0 \pm 1.96 \times \sqrt{0 \times (1 - 0) / 25} \] in other words $[0, 0]$. Clearly this is absurd: unless there are literally zero Trump voters in Berkeley, we can be certain that this interval does not contain the true population parameter. A similar problem would emerge if we instead tried to estimate the proportion of Biden voters in Berkeley: if $\widehat{p} = 1$ then our confidence interval would be $[1,1]$. This too is absurd. More broadly, the Wald confidence interval is extremely poorly behaved in situations where $p$ is close to zero or one. You may have encountered a suggestion that $np(1-p)$ should be at least 5 for the Wald interval to perform well. There’s something to this advice, as we’ll see below, but it’s not sufficient. More to the point: we don’t know $p$ in practice so there is no way to apply this rule!
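The degenerate intervals are easy to reproduce. Here is a minimal sketch of the textbook formula; the function name wald_ci and its arguments are illustrative choices.

```r
# Sketch of the textbook (Wald) interval; wald_ci is an illustrative name.
wald_ci <- function(x, n, level = 0.95) {
  p_hat <- x / n
  z <- qnorm(1 - (1 - level) / 2)   # approximately 1.96 when level = 0.95
  SE_hat <- sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - z * SE_hat, upper = p_hat + z * SE_hat)
}

wald_ci(x = 0, n = 25)    # no successes: the interval collapses to [0, 0]
wald_ci(x = 25, n = 25)   # all successes: the interval collapses to [1, 1]
```

The collapse happens because the estimated standard error is exactly zero whenever $\widehat{p}$ equals zero or one, regardless of the sample size.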

The Agresti-Coull Interval

Here’s a quick and dirty fix that is surprisingly effective. Simply add four “fake” observations to the dataset: two zeros (failures) and two ones (successes). The 95% Agresti-Coull confidence interval is constructed in exactly the same way as the 95% Wald interval, only using this “artificially augmented” dataset rather than the original one. In other words, if the sample size is $n$ and the sample proportion is $\widehat{p}$, then the Agresti-Coull interval is constructed from $\widetilde{n} = n + 4$ and \[ \widetilde{p} \equiv \frac{n \widehat{p} + 2}{n + 4} = \frac{\left(\sum_{i=1}^n X_i\right) + 0 + 0 + 1 + 1}{n + 4} \] yielding \[ \widetilde{p} \pm 1.96 \times \sqrt{\frac{\widetilde{p}(1 - \widetilde{p})}{\widetilde{n}}}, \quad \widetilde{n} \equiv n + 4, \quad \widetilde{p} \equiv \frac{n\widehat{p}+ 2}{n + 4}. \] Note that this “add four fake observations” adjustment is specific to the case of a 95% confidence interval. In a future post, I’ll explain where the rule comes from and how to generalize it to other confidence levels. For now, let’s ask ourselves a more fundamental question: does this adjustment make sense? “Wait!” I can hear you object: “adding fake observations introduces a bias!” Indeed it does. Since $\widehat{p}$ is an unbiased estimator of $p$, \[ \begin{aligned} \text{Bias}(\widetilde{p}) &\equiv \mathbb{E}[\widetilde{p} - p] = \mathbb{E}\left[ \frac{n \widehat{p} + 2}{n + 4}\right] - p\\ &= \frac{np + 2}{n+4} - p = (1 - 2p) \left(\frac{2}{n + 4}\right). \end{aligned} \] If $p = 1/2$ this estimator is unbiased. Otherwise, the addition of four fake observations pulls $\widetilde{p}$ away from $\widehat{p}$ and towards $1/2$: when $p>1/2$ the estimator is downward-biased, and when $p < 1/2$ it is upward-biased. The smaller the sample size, the larger the bias.

Let’s try this out on our Trump/Berkeley example. Adding four fake observations gives a sample proportion of \[ \widetilde{p} = \frac{n \widehat{p} + 2}{n + 4} = \frac{25 \times 0 + 2}{25 + 4} = \frac{2}{29} \approx 0.07 \] in the augmented dataset and hence a 95% confidence interval of approximately \[ 0.07 \pm 1.96 \times \sqrt{\frac{0.07 \times (1 - 0.07)}{29}} = [-0.02, 0.16]. \] Of course a proportion can’t be negative, so we would report $[0, 0.16]$. This seems like a much more reasonable summary of our uncertainty than reporting an interval of $[0,0]$, but does it really work? Does adding fake data really improve things?
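The same calculation can be sketched in R. The helper `ac_interval()` below is a hypothetical name used only for this illustration; it hard-codes the "add two successes and two failures" rule, so it applies to the 95% case only.

```r
# Hypothetical helper: 95% Agresti-Coull interval from x successes in n trials
ac_interval <- function(x, n) {
  n_tilde <- n + 4
  p_tilde <- (x + 2) / n_tilde
  z <- qnorm(0.975)
  se <- sqrt(p_tilde * (1 - p_tilde) / n_tilde)
  c(lower = p_tilde - z * se, upper = p_tilde + z * se)
}

ac_raw <- ac_interval(x = 0, n = 25)
# A proportion must lie in [0, 1], so truncate before reporting
ac_reported <- pmin(pmax(ac_raw, 0), 1)
print(round(ac_raw, 2))
print(round(ac_reported, 2))
```

Working with unrounded intermediate values gives the same interval, to two decimal places, as the hand calculation above.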

Comparing the Wald and Agresti-Coull Intervals

To answer the question raised at the end of the last paragraph, let’s use R to calculate the coverage probability of the Wald and Agresti-Coull intervals for a range of values of the sample size $n$ and true population proportion $p$. In other words, let’s see how often these intervals actually contain the true population parameter $p$. If they are bona fide 95% confidence intervals, this should occur with probability close to 0.95. One way to carry out this exercise is via Monte Carlo simulation: repeatedly drawing randomly generated datasets and counting the proportion of our resulting confidence intervals that contain the true value of $p$. In this example, however, it turns out that there’s a quick and easy way to calculate exact coverage probabilities using the R function dbinom. For full details, see the R code appendix at the end of the post.

To begin, let’s compare the two confidence intervals over a grid of values for the true population proportion $p$ while holding the sample size $n$ fixed. When $n = 25$ we obtain the following: In each of these plots, along with those that follow, the solid black curve gives the coverage probability while the dashed red line passes through $0.95$ on the vertical axis. A well-behaved confidence interval should produce a black curve that is close to the dashed red line. To make a long story short: the Agresti-Coull interval is quite well-behaved while the Wald interval is a disaster. For values of $p$ close to zero or one, the Wald interval is extremely erratic: its coverage probability can be exactly 95% or far below depending on the precise value of $p$. Moreover, the Wald interval systematically undercovers. There are very few values of $p$ for which its coverage probability is 0.95 or higher and very many for which it is below this level. In stark contrast, the Agresti-Coull interval at worst undercovers by around 0.01 or 0.02. In general its actual coverage probability is very close to 95%, although it does have a tendency to overcover for values of $p$ that are close to zero or one. It turns out that there is nothing special about $n = 25$. The same basic story holds for larger sample sizes, for example $n=50$ and $n = 100$.

So what do we make of the rule of thumb that the Wald interval will perform well if $np(1-p)>5$? Indeed, values of $p$ that are close to zero or one present the biggest problems for this confidence interval. But like many traditional statistical rules of thumb, this one leaves much to be desired. Suppose that $n = 1270$ and $p = 0.005$. In this case $np(1-p)$ equals 6.3 but the coverage probability of the Wald interval is an unsatisfying 0.875 compared to 0.958 for the Agresti-Coull interval.

Because the central limit theorem is an asymptotic result, one that holds as $n$ approaches infinity, we might hope that at least the performance of the Wald interval improves as the sample size grows. Alas, this is not always the case. The following plot compares the coverage of Wald and Agresti-Coull confidence intervals for $p = 0.005$ as $n$ increases from 200 to 2000. Note the pronounced “sawtooth” pattern in the Wald confidence interval. It improves steadily as the sample size grows only to jump precipitously downward, before beginning a steady upward climb followed by another jump. In contrast, the performance of the Agresti-Coull interval is fairly steady. While a bit less dramatic, a similar qualitative pattern holds for $p=0.2$ as $n$ increases from 25 to 100.

Conclusion

Friends don’t let friends use the Wald interval for a proportion. Fortunately there’s a simple alternative when you’re after a 95% confidence interval: add two successes and two failures to your dataset, then proceed as normal. I encourage you to use my R code to test out different values of $p$ and $n$, making your own comparisons of the Wald and Agresti-Coull intervals. In a future post, I’ll show you where the Agresti-Coull interval comes from, why it works so well, and how to generalize it to construct 90%, 99% and indeed arbitrary $(1 - \alpha) \times 100\%$ confidence intervals.

R Code Appendix

I wrote four R functions to generate the plots shown above: get_Wald_coverage() and get_AC_coverage() calculate the coverage probabilities of the Wald and Agresti-Coull confidence intervals, while plot_n_comparison() and plot_p_comparison() construct the plots comparing coverage probabilities across different values of the sample size $n$ and true population proportion $p$.

get_Wald_coverage <- function(p, n) {
#-----------------------------------------------------------------------------
# Calculates the exact coverage probability of a nominal 95% Wald confidence
# interval for a population proportion.
#-----------------------------------------------------------------------------
# p true population proportion
# n sample size
#-----------------------------------------------------------------------------
x <- 0:n
p_hat <- x / n
z <- qnorm(1 - 0.05 / 2)
SE <- sqrt(p_hat * (1 - p_hat) / n)
cover <- (p >= p_hat - z * SE) & (p <= p_hat + z * SE)
prob_cover <- dbinom(x, n, p)
sum(cover * prob_cover)
}
get_AC_coverage <- function(p, n) {
#-----------------------------------------------------------------------------
# Calculates the exact coverage probability of a nominal 95% Agresti-Coull
# confidence interval for a population proportion.
#-----------------------------------------------------------------------------
# p true population proportion
# n sample size
#-----------------------------------------------------------------------------
x <- 0:n
p_tilde <- (x + 2) / (n + 4)
n_tilde <- n + 4
z <- qnorm(1 - 0.05 / 2)
SE <- sqrt(p_tilde * (1 - p_tilde) / n_tilde)
cover <- (p >= p_tilde - z * SE) & (p <= p_tilde + z * SE)
prob_cover <- dbinom(x, n, p)
sum(cover * prob_cover)
}
plot_n_comparison <- function(n_seq, p) {
#-----------------------------------------------------------------------------
# Plots a comparison of coverage probabilities for Wald and Agresti-Coull
# nominal 95% confidence intervals for a population proportion over a grid of
# values for the sample size, holding the population proportion fixed.
#-----------------------------------------------------------------------------
# n_seq vector of values for the sample size
# p true population proportion
#-----------------------------------------------------------------------------
# Example:
# plot_n_comparison(n_seq = 25:100, p = 0.2)
#-----------------------------------------------------------------------------
wald <- sapply(n_seq, function(n) get_Wald_coverage(p, n))
AC <- sapply(n_seq, function(n) get_AC_coverage(p, n))
cover_min <- min(min(wald), min(AC))
cover_max <- max(max(wald), max(AC))
limits <- c(cover_min, 1)
par(mfrow = c(1, 2))
plot(n_seq, wald, type = 'l', xlab = 'n', ylim = limits, main = 'Wald',
ylab = 'Coverage Prob.')
text(mean(n_seq), 1, labels = bquote(p == .(p)))
abline(h = 0.95, lty = 2, col = 'red')
plot(n_seq, AC, type = 'l', xlab = 'n', ylim = limits,
main = 'Agresti-Coull', ylab = '')
text(mean(n_seq), 1, labels = bquote(p == .(p)))
abline(h = 0.95, lty = 2, col = 'red')
par(mfrow = c(1, 1))
}
plot_p_comparison <- function(p_seq, n) {
#-----------------------------------------------------------------------------
# Plots a comparison of coverage probabilities for Wald and Agresti-Coull
# nominal 95% confidence intervals for a population proportion over a grid of
# values for the population proportion, holding sample size fixed.
#-----------------------------------------------------------------------------
# p_seq vector of values for the true population proportion
# n sample size
#-----------------------------------------------------------------------------
# Example:
# my_p_seq <- seq(0.02, 0.98, 0.0001)
# plot_p_comparison(my_p_seq, n = 25)
#-----------------------------------------------------------------------------
wald <- sapply(p_seq, function(p) get_Wald_coverage(p, n))
AC <- sapply(p_seq, function(p) get_AC_coverage(p, n))
cover_min <- min(min(wald), min(AC))
limits <- c(cover_min, 1)
par(mfrow = c(1, 2))
plot(p_seq, wald, type = 'l', xlab = 'p', ylim = limits, main = 'Wald',
ylab = 'Coverage Prob.')
text(0.5, 1, labels = bquote(n == .(n)))
abline(h = 0.95, lty = 2, col = 'red')
plot(p_seq, AC, type = 'l', xlab = 'p', ylim = limits,
main = 'Agresti-Coull', ylab = '')
text(0.5, 1, labels = bquote(n == .(n)))
abline(h = 0.95, lty = 2, col = 'red')
par(mfrow = c(1, 1))
}

Regressions with a Mis-measured, Binary Outcome

Mon, 30 Aug 2021 00:00:00 +0000

Many outcomes of interest in economics are binary. For example, we may want to learn how employment status $Y^*$ varies with demographics $X$, where $Y^*=1$ means “employed” and $Y^*=0$ means unemployed or not in the labor force. But how do we know if someone is employed? Typically we ask them, perhaps as part of a large, nationally representative survey such as the CPS. Researchers who study labor market dynamics have long known, however, that observed data on labor market status are often inaccurate (Poterba & Summers, 1986). Administrative errors creep into even the most carefully-administered surveys. But more importantly, survey respondents do not always tell the truth, whether by mistake or deliberately, and this problem seems to have gotten worse in recent years (Meyer et al., 2015). Instead of true employment status $Y^*$, researchers only observe a noisy measure $Y\in \{0, 1\}$.

In my previous post I showed that classical measurement error in an outcome variable is basically innocuous. But I also showed that measurement error in a binary random variable cannot be classical. In this post, I’ll explore the consequences of this fact when we want to learn $\mathbb{P}(Y^*=1|X)$ but only observe $Y$ and $X$, not the true outcome variable $Y^*$. To keep things concrete, I will assume throughout that $\mathbb{P}(Y^*=1|X) = F(X'\beta)$ where $F$ is a strictly increasing, differentiable function. This covers all the usual suspects: logit, probit, and the linear probability model.¹ The parameter $\beta$ may have a causal interpretation or may simply have a predictive one. Either way, the question I’ll focus on here is whether, and if so how, $\beta$ can be identified in the presence of measurement error. For simplicity I will assume throughout that the covariates $X$ are measured without error.

What’s the problem?

Why does observing $Y$ rather than $Y^*$ present a problem? To answer this question, we need to derive the relationship between $\mathbb{P}(Y=1|X)$ and $\mathbb{P}(Y^*=1|X)$. Since $Y$ and $Y^*$ are both binary, $\mathbb{E}(Y|X) = \mathbb{P}(Y=1|X)$ and similarly \[ \mathbb{E}(Y^*|X) = \mathbb{P}(Y^*=1|X) = F(X'\beta). \] Now define the measurement error $W$ as $W = Y - Y^*$ so we can write $Y = Y^* + W$. By the linearity of expectation, \[ \begin{aligned} \mathbb{P}(Y=1|X) &= \mathbb{E}(Y|X)\\ &= \mathbb{E}(Y^* + W|X) = \mathbb{E}(Y^*|X) + \mathbb{E}(W|X)\\ &= \mathbb{P}(Y^*=1|X) + \mathbb{E}(W|X) \end{aligned} \] so we see that if $\mathbb{E}(W|X)=0$ then $\mathbb{P}(Y=1|X)$ and $\mathbb{P}(Y^*=1|X)$ will coincide. Unfortunately $\mathbb{E}(W|X)$ in general cannot be zero. This means that learning $\mathbb{P}(Y=1|X)$ will not tell us what we want to know: $\mathbb{P}(Y^*=1|X)$.

To see why $\mathbb{E}(W|X) \neq 0$, first define the mis-classification probabilities $\alpha_0(\cdot)$ and $\alpha_1(\cdot)$ \[ \begin{aligned} \alpha_0(X) &\equiv \mathbb{P}(Y=1|Y^*=0,X)\\ \alpha_1(X) &\equiv \mathbb{P}(Y=0|Y^*=1,X). \end{aligned} \] The subscripts on $\alpha$ refer to the value of $Y^*$ on which we condition: $\alpha_0(\cdot)$ conditions on $Y^*=0$ while $\alpha_1(\cdot)$ conditions on $Y^*=1$. You can interpret the mis-classification probabilities by analogy to null hypothesis testing: $\alpha_0(X)$ is effectively the type I error rate as a function of $X$ and $\alpha_1(X)$ is the type II error rate as a function of $X$.

To keep things fully general for the moment, we allow the mis-classification probabilities to depend on $X$. Perhaps a young male worker with five years of experience is more likely to make an erroneous self-report in the CPS than an older female worker with more experience, for example.² Since $Y$ and $Y^*$ are both binary, $W \in \{-1, 0, 1\}$ and we calculate $\mathbb{E}(W|X)$ as follows: \[ \begin{aligned} \mathbb{E}(W|X) &= -1 \times \mathbb{P}(W=-1|X) + 0 \times \mathbb{P}(W=0|X) + 1 \times \mathbb{P}(W=1|X) \\ &= \mathbb{P}(W=1|X) - \mathbb{P}(W=-1|X). \end{aligned} \] Now consider the event $\{W = -1\}$. The only way that this could occur is if $Y = 0$ and $Y^* = 1$. Accordingly, \[ \begin{aligned} \mathbb{P}(W = -1|X) &= \mathbb{P}(Y = 0, Y^* = 1|X)\\ &= \mathbb{P}(Y=0|Y^*=1,X)\mathbb{P}(Y^*=1|X) \\ &= \alpha_1(X) F(X'\beta). \end{aligned} \] Similarly, the only way that $\{W=1\}$ can occur is if $Y=1$ and $Y^*=0$ so that \[ \begin{aligned} \mathbb{P}(W = 1|X) &= \mathbb{P}(Y = 1, Y^* = 0|X)\\ &= \mathbb{P}(Y=1|Y^*=0,X) \mathbb{P}(Y^*=0|X)\\ &= \alpha_0(X) \left[1 - F(X'\beta)\right]. \end{aligned} \] Therefore, \[ \begin{aligned} \mathbb{E}(W|X) &= \mathbb{P}(W=1|X) - \mathbb{P}(W=-1|X) \\ &= \alpha_0(X)\left[1 - F(X'\beta) \right] -\alpha_1(X) F(X'\beta). \end{aligned} \] So how could $\mathbb{E}(W|X) = 0$? Re-arranging the preceding to solve for $F(X'\beta)$, \[ \mathbb{E}(W|X) = 0 \iff F(X'\beta) = \frac{\alpha_0(X)}{\alpha_0(X) + \alpha_1(X)}. \] This shows that $\mathbb{E}(W|X)$ can only be zero in an extremely peculiar case where $\alpha_0(\cdot)$ and $\alpha_1(\cdot)$ depend on $X$ in just the right way. If the mis-classification probabilities are constants that do not depend on $X$, we would require $F(X'\beta) = \alpha_0/(\alpha_0 + \alpha_1)$. This is only possible if all the elements of $\beta$ besides the intercept are zero. 
Since $\mathbb{E}(W|X)$ will not in general equal zero, $\mathbb{P}(Y=1|X)$ will not in general equal $\mathbb{P}(Y^*=1|X)$.
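A quick numerical illustration may help. The sketch below evaluates $\mathbb{E}(W|X) = \alpha_0[1 - F(X'\beta)] - \alpha_1 F(X'\beta)$ over a grid of values for $F(X'\beta)$, under the assumption of constant mis-classification probabilities. The values $\alpha_0 = 0.1$ and $\alpha_1 = 0.2$ are made up for illustration, and `expected_error()` is a hypothetical helper name.

```r
# E(W | X) as a function of F_val = F(X'beta), with constant alpha0, alpha1
expected_error <- function(F_val, alpha0, alpha1) {
  alpha0 * (1 - F_val) - alpha1 * F_val
}

alpha0 <- 0.1  # illustrative value, not from the post
alpha1 <- 0.2  # illustrative value, not from the post

F_grid <- seq(0.05, 0.95, by = 0.15)
ew <- expected_error(F_grid, alpha0, alpha1)
print(data.frame(F = F_grid, E_W = ew))

# The single value of F(X'beta) at which the mean error vanishes
F_zero <- alpha0 / (alpha0 + alpha1)
ew_at_zero <- expected_error(F_zero, alpha0, alpha1)
print(c(F_zero = F_zero, E_W = ew_at_zero))
```

The mean error is positive when $F(X'\beta)$ is small, negative when it is large, and zero only at the single point $\alpha_0/(\alpha_0 + \alpha_1)$.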

What happens if we ignore the problem?

Substituting our expression for $\mathbb{E}(W|X)$ and factoring the result, \[ \begin{aligned} \mathbb{P}(Y=1|X) &= \mathbb{P}(Y^*=1|X) + \mathbb{E}(W|X) \\ &= F(X'\beta) + \alpha_0(X)\left[1 - F(X'\beta) \right] -\alpha_1(X) F(X'\beta)\\ &= \alpha_0(X) + F(X'\beta) [1 - \alpha_0(X) - \alpha_1(X)]. \end{aligned} \]

Because we observe $(Y, X)$, $\mathbb{P}(Y=1|X)$ is identified. With enough data, we can learn this conditional probability as a function of $X$ as accurately as we wish. The problem is that $\alpha_0(X)$ and $\alpha_1(X)$ drive a wedge between what we can observe, $\mathbb{P}(Y=1|X)$, and what we’re trying to learn $\mathbb{P}(Y^*=1|X) = F(X'\beta)$. Without knowing more about the functions $\alpha_0(\cdot)$ and $\alpha_1(\cdot)$ we can’t say much about how $\mathbb{P}(Y=1|X)$ and $\mathbb{P}(Y^*=1|X)$ will differ. Because they are probabilities, \[ 0\leq \alpha_0(X) \leq 1, \quad 0\leq \alpha_1(X) \leq 1. \] But because they are conditional probabilities that condition on different events, $\{Y^*=0, X=x\}$ versus $\{Y^*=1, X=x\}$, the sum $\alpha_0(x) + \alpha_1(x)$ could be greater than one. This means that $1 - \alpha_0(X) - \alpha_1(X)$ could be negative, at least for certain values of $X$. It’s common in practice, however, to assume that $\alpha_0(X) + \alpha_1(X) < 1$ for all possible values that the covariates $X$ could take on. To understand this assumption, and the problem more generally, it’s helpful to consider a simple special case in which the mis-classification probabilities do not depend on $X$. In this case we can say precisely how measurement error in the outcome affects what we can learn about the relationship between $X$ and $Y^*$ and make clear why $\alpha_0(X) + \alpha_1(X) < 1$ is usually a reasonable assumption.

A Special Case: Fixed Mis-classification

Suppose that the mis-classification probabilities are fixed, i.e. that \[ \begin{aligned} \alpha_0(X) &\equiv \mathbb{P}(Y=1|Y^*=0,X) = \mathbb{P}(Y=1|Y^*=0)\equiv \alpha_0 \\ \alpha_1(X) &\equiv \mathbb{P}(Y=0|Y^*=1,X) = \mathbb{P}(Y=0|Y^*=1)\equiv \alpha_1. \end{aligned} \] This is a fairly strong assumption. It says that both self-reporting and administrative errors occur at the same rate for everyone, regardless of their observed characteristics. In this case, our expression for $\mathbb{P}(Y=1|X)$ from above becomes \[ \begin{aligned} \mathbb{P}(Y=1|X) &= \alpha_0 + F(X'\beta) (1 - \alpha_0 - \alpha_1). \end{aligned} \] Defining $f$ as the derivative of $F$, this means that the observed partial effect of a continuous covariate $X_j$ with respect to $Y$ is \[ \begin{aligned} \frac{\partial}{\partial X_j} \mathbb{P}(Y=1|X) &= \frac{\partial}{\partial X_j} \left[ \alpha_0 + F(X'\beta)(1 - \alpha_0 - \alpha_1)\right]\\ &= f(X'\beta)\beta_j (1 - \alpha_0 - \alpha_1) \end{aligned} \] whereas the true partial effect, with respect to $Y^*$, is \[ \frac{\partial}{\partial X_j} \mathbb{P}(Y^*=1|X) = \frac{\partial}{\partial X_j} F(X'\beta) = f(X'\beta) \beta_j. \] If $(\alpha_0 + \alpha_1) > 1$ then $(1 - \alpha_0 - \alpha_1)$ will be negative. This means that the measurement error problem is so severe that all of the observed partial effects have the wrong sign. A bit of tedious algebra shows that $Y$ and $Y^*$ must be negatively correlated in this case: $Y$ is such a noisy measure of $Y^*$ that when $Y=1$ we’re better off predicting that $Y^*=0$. For this reason, it’s traditional to assume that $\alpha_0 + \alpha_1 < 1$. In this case $0 < (1 - \alpha_0 - \alpha_1) \leq 1$ so the observed partial effects are attenuated versions of the true partial effects: they have the same sign but are no larger in magnitude, in that \[ \left| \frac{\partial}{\partial X_j} \mathbb{P}(Y=1|X) \right| \leq \left| \frac{\partial}{\partial X_j} \mathbb{P}(Y^*=1|X) \right|. 
\] So in this special case, non-classical measurement error in a binary outcome variable has the same effect as classical measurement error in a continuous regressor: attenuation bias.
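To make the attenuation result concrete, the following sketch compares true and observed partial effects in a logit model with one covariate. All parameter values are made-up assumptions chosen for illustration, and the choice of logit for $F$ (via R's `plogis()` and `dlogis()`) is likewise an assumption. A finite-difference derivative is included as a check on the analytical formula.

```r
alpha0 <- 0.05          # illustrative mis-classification probabilities
alpha1 <- 0.15
beta   <- c(-0.5, 1.2)  # illustrative intercept and slope

x_grid <- seq(-2, 2, by = 0.5)
index  <- beta[1] + beta[2] * x_grid

# True partial effect f(x'beta) * beta_j and its attenuated counterpart
true_pe     <- dlogis(index) * beta[2]
observed_pe <- (1 - alpha0 - alpha1) * true_pe

# Check the formula against a central finite difference of P(Y = 1 | X = x)
obs_prob <- function(x) {
  alpha0 + (1 - alpha0 - alpha1) * plogis(beta[1] + beta[2] * x)
}
h <- 1e-6
numeric_pe <- (obs_prob(x_grid + h) - obs_prob(x_grid - h)) / (2 * h)

print(data.frame(x = x_grid, true_pe, observed_pe, numeric_pe))
```

At every value of the covariate the observed partial effect is the true one scaled by $(1 - \alpha_0 - \alpha_1) = 0.8$.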

Observational Equivalence and $(\alpha_0 + \alpha_1)$

You may be wondering: do we really need the assumption $\alpha_0 + \alpha_1<1$ or is it merely convenient? Couldn’t the observed data tell us whether $\alpha_0 + \alpha_1$ is less than one or greater than one? The answer turns out to be no, and it’s easy to show why if $F(t) = 1 - F(-t)$ as in the probit, logit, and the linear probability models. Suppose that this condition on $F$ holds. Then we can write \[ \begin{aligned} \mathbb{P}(Y=1|X) &= \alpha_0 + (1 - \alpha_0 - \alpha_1) F(X'\beta)\\ &= \alpha_0 + (1 - \alpha_0 - \alpha_1) \left[ 1 - F\big(X'(-\beta)\big)\right]\\ &= \left(\alpha_0 + 1 - \alpha_0 - \alpha_1\right) - (1 - \alpha_0 - \alpha_1) F\big(X'(-\beta)\big)\\ &= (1 - \alpha_1) + \left[ \alpha_0 - (1 - \alpha_1) \right] F\big(X' (-\beta)\big)\\ &= (1 - \alpha_1) + \left[1 - (1 - \alpha_1) - (1 - \alpha_0) \right] F\big(X' (-\beta)\big). \end{aligned} \] Defining $\widetilde{\alpha}_0 \equiv (1 - \alpha_1)$, $\widetilde{\alpha}_1 \equiv (1 - \alpha_0)$, and $\widetilde{\beta} \equiv -\beta$, we have established that \[ \begin{aligned} \mathbb{P}(Y=1|X) &= \alpha_0 + (1 - \alpha_0 - \alpha_1) F(X'\beta)\\ &= \widetilde{\alpha}_0 + \left( 1 - \widetilde{\alpha}_0 - \widetilde{\alpha}_1 \right) F\big(X'\widetilde{\beta}\big). \end{aligned} \] Since $Y$ is binary, $\mathbb{P}(Y=1|X)$ tells us everything that there is to know about the distribution of $Y$ given $X$. The preceding pair of equalities shows that the observed conditional distribution $\mathbb{P}(Y=1|X)$ could just as well have arisen from $\mathbb{P}(Y^*=1|X) = F(X'\widetilde{\beta})$ with mis-classification probabilities $(\widetilde{\alpha}_0, \widetilde{\alpha}_1)$ as it could from $\mathbb{P}(Y^*=1|X) = F(X'\beta)$ with mis-classification probabilities $(\alpha_0, \alpha_1)$. From observations of $(Y, X)$ alone there is no way to tell these possibilities apart: we say that they are observationally equivalent. 
Notice that if $\alpha_0 + \alpha_1 < 1$ then $\widetilde{\alpha}_0 + \widetilde{\alpha}_1 > 1$ and vice-versa. This shows that the only way to point identify $\beta$ is to assume either that $\alpha_0 + \alpha_1 < 1$ or the reverse inequality.³ For the reasons discussed above, it usually makes sense to choose $\alpha_0 + \alpha_1 < 1$.
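The observational equivalence argument can be verified numerically. The sketch below assumes a logit model, for which $F(t) = 1 - F(-t)$ holds, and uses made-up parameter values; `obs_prob()` is a hypothetical helper.

```r
# P(Y = 1 | X = x) under fixed mis-classification and a logit F
obs_prob <- function(x, alpha0, alpha1, beta) {
  alpha0 + (1 - alpha0 - alpha1) * plogis(beta[1] + beta[2] * x)
}

alpha0 <- 0.1; alpha1 <- 0.2   # illustrative values with alpha0 + alpha1 < 1
beta   <- c(0.3, -0.8)

# The "flipped" parameterization from the derivation above
alpha0_tilde <- 1 - alpha1
alpha1_tilde <- 1 - alpha0
beta_tilde   <- -beta

x_grid     <- seq(-3, 3, by = 0.25)
p_original <- obs_prob(x_grid, alpha0, alpha1, beta)
p_flipped  <- obs_prob(x_grid, alpha0_tilde, alpha1_tilde, beta_tilde)

max_gap <- max(abs(p_original - p_flipped))
print(max_gap)
print(alpha0_tilde + alpha1_tilde)
```

The two parameterizations imply the same $\mathbb{P}(Y=1|X)$ at every point of the grid, up to floating-point error, even though one has mis-classification probabilities summing to less than one and the other to more than one.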

Some Solutions

So what is an applied researcher to do? If we could somehow learn the mis-classification probabilities, we could use them to “adjust” $\mathbb{P}(Y=1|X)$ and identify $\mathbb{P}(Y^*=1|X) = F(X'\beta)$ as follows: \[ F(X'\beta) = \frac{\mathbb{P}(Y=1|X) - \alpha_0(X)}{1 - \alpha_0(X) - \alpha_1(X)}. \] Broadly speaking there are two ways to learn the mis-classification probabilities. The first approach estimates $\alpha_0(X)$ and $\alpha_1(X)$ using a second dataset. The second approach uses a single dataset and exploits non-linearity in the function $F$ instead. For the remainder of this discussion I will assume that $\alpha_0(X) + \alpha_1(X) < 1$ or $\alpha_0 + \alpha_1 < 1$ if the mis-classification probabilities are fixed.

Method 1: Auxiliary Data

Let’s start by making life simple: assume fixed mis-classification. Now suppose that we observe two random samples from the same population. In the first, we observe pairs $(Y_i,X_i)$ for $i = 1, ..., n$ and in the second we observe pairs $(Y_j, Y^*_j)$ for $j = 1, ..., m$. Notice that neither dataset contains observations of $X$ and $Y^*$ for the same individual. Using the $(Y_i,X_i)$ observations we can estimate $\mathbb{P}(Y=1|X)$, and using the $(Y_j, Y^*_j)$ observations we can estimate \[ \alpha_0 = \mathbb{P}(Y=1|Y^*=0), \quad \alpha_1 = \mathbb{P}(Y=0|Y^*=1). \] This gives us everything we need to determine $F(X'\beta)$ as a function of $X$ and hence $\beta$. The observations of $(Y_j, Y^*_j)$ are called an auxiliary dataset. In theory, auxiliary data provide a simple and general solution to measurement error problems of all stripes. Suppose, for example, that we were uncomfortable with the assumption of fixed mis-classification. If we observed an auxiliary dataset of triples $(Y_j, Y_j^*, X_j)$ then we could directly estimate $\alpha_0(X)$ and $\alpha_1(X)$. Of course, if we observed $(Y_j, Y_j^*, X_j)$ for a random sample drawn from the population of interest we could estimate $\mathbb{P}(Y^*=1|X)$ directly without the need to account for measurement error! And here lies the fundamental tension of the auxiliary data approach: if we had sufficiently rich auxiliary data we wouldn’t really have a measurement error problem in the first place. More typically, we either observe $(Y_j, Y_j^*, X_j)$ for a different population, or only observe a subset of these variables for our population of interest. Either way we need to rely on modeling assumptions to bridge the gap. For example, fixed mis-classification and an auxiliary dataset of $(Y^*_j, Y_j)$ suffice to solve the measurement error problem but only if $\alpha_0(X)$ and $\alpha_1(X)$ do not in fact depend on $X$.
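Here is a minimal simulated sketch of the auxiliary data idea under fixed mis-classification. Everything in it is an assumption made for illustration: the sample size, the share of $Y^*=1$ in the population, the true values of $\alpha_0$ and $\alpha_1$, and the helper name `adjust_prob()`. It estimates the mis-classification probabilities from simulated $(Y_j, Y_j^*)$ pairs and then plugs them into the adjustment formula from the previous section.

```r
set.seed(1234)
alpha0 <- 0.10   # true P(Y = 1 | Y* = 0), illustrative
alpha1 <- 0.20   # true P(Y = 0 | Y* = 1), illustrative
m <- 100000      # size of the simulated auxiliary dataset

# Simulate the auxiliary pairs (Y_j, Y*_j)
y_star    <- rbinom(m, 1, 0.6)
flip_prob <- ifelse(y_star == 1, alpha1, alpha0)
flipped   <- rbinom(m, 1, flip_prob)
y         <- ifelse(flipped == 1, 1 - y_star, y_star)

# Estimate the mis-classification probabilities by conditional sample means
alpha0_hat <- mean(y[y_star == 0] == 1)
alpha1_hat <- mean(y[y_star == 1] == 0)

# Adjustment: F(X'beta) = [P(Y = 1 | X) - alpha0] / (1 - alpha0 - alpha1)
adjust_prob <- function(p_obs, a0, a1) (p_obs - a0) / (1 - a0 - a1)

# Suppose that at some x the truth is F(x'beta) = 0.35, so that the
# observed conditional probability is alpha0 + (1 - alpha0 - alpha1) * 0.35
p_obs     <- alpha0 + (1 - alpha0 - alpha1) * 0.35
recovered <- adjust_prob(p_obs, alpha0_hat, alpha1_hat)

print(c(alpha0_hat = alpha0_hat, alpha1_hat = alpha1_hat,
        recovered = recovered))
```

In practice $\mathbb{P}(Y=1|X)$ would itself be estimated from the $(Y_i, X_i)$ sample; it is treated as known here to keep the sketch short.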

Method 2: Nonlinearity of $F$

The auxiliary data approach is very general in principle but relies on information that we simply may not have in practice: a second dataset from the same population. An alternative approach uses only one dataset, $(Y_i, X_i)$ for $i = 1, ..., n$, and instead exploits the shape of the function $F$. This second approach is a bit less general but doesn’t require any outside sources of information.

To begin, suppose that the mis-classification probabilities are fixed and that $F$ is a known function, e.g. the standard logistic CDF. Suppose further that $F$ is strictly increasing and hence invertible. Then, applying $F^{-1}$ to both sides of our expression for $F(X'\beta)$ from above, \[ X'\beta = F^{-1} \left[\frac{\mathbb{P}(Y = 1|X) - \alpha_0}{1 - \alpha_0 - \alpha_1}\right] \] and thus, pre-multiplying both sides by $X$ and taking expectations, \[ \mathbb{E}[XX']\beta = \mathbb{E}\left\{X F^{-1} \left[\frac{\mathbb{P}(Y = 1|X) - \alpha_0}{1 - \alpha_0 - \alpha_1}\right]\right\}. \] Therefore, if $\mathbb{E}[XX']$ is invertible, \[ \beta = \mathbb{E}[XX']^{-1}\mathbb{E}\left\{X F^{-1} \left[\frac{\mathbb{P}(Y = 1|X) - \alpha_0}{1 - \alpha_0 - \alpha_1}\right]\right\}. \] Since $\mathbb{P}(Y=1|X)$ depends only on the observed data $(Y,X)$, this function is point identified. Since $F$ is assumed to be a known function, it follows that $\beta$ is point identified whenever $\mathbb{E}[XX']$ is invertible and $(\alpha_0, \alpha_1)$ are known.⁴ So if we can find a way to point identify $\alpha_0$ and $\alpha_1$, we will immediately identify $\beta$.
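The identification formula for $\beta$ can be checked in a stylized population-level calculation. The sketch below assumes a logit $F$, known $(\alpha_0, \alpha_1)$, and a covariate that takes values on an evenly weighted grid, so that expectations become simple averages; all numerical values are made up for illustration.

```r
alpha0 <- 0.1; alpha1 <- 0.2     # assumed known for this exercise
beta_true <- c(0.5, -1.0)        # illustrative intercept and slope

# "Population": X takes each grid value with equal probability
x1 <- seq(-2, 2, length.out = 41)
X  <- cbind(1, x1)

# Observed conditional probability P(Y = 1 | X)
p_obs <- alpha0 + (1 - alpha0 - alpha1) * plogis(drop(X %*% beta_true))

# Undo the mis-classification, then apply F^{-1} (qlogis for the logit)
index <- qlogis((p_obs - alpha0) / (1 - alpha0 - alpha1))

# beta = E[XX']^{-1} E[X F^{-1}(.)]; the 1/n factors cancel
beta_recovered <- drop(solve(crossprod(X), crossprod(X, index)))

# For contrast: ignoring mis-classification and inverting P(Y = 1 | X) directly
beta_naive <- drop(solve(crossprod(X), crossprod(X, qlogis(p_obs))))

print(rbind(true = beta_true, recovered = beta_recovered, naive = beta_naive))
```

With the correct $(\alpha_0, \alpha_1)$ the formula returns $\beta$ exactly, while the naive calculation yields a slope that is smaller in magnitude.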

Easier said than done! How can we possibly learn $\alpha_0$ and $\alpha_1$ without auxiliary data? Nonlinearity is the key. If $F$ is a cumulative distribution function, then $\lim_{t\rightarrow \infty} F(t) = 1$ and $\lim_{t\rightarrow -\infty} F(t) = 0$. Now suppose that $X$ contains at least one covariate, call it $V$, that is continuous and has “large support,” i.e. takes on values in a very wide range. Without loss of generality, suppose that the coefficient $\beta_v$ on $V$ is positive. (If it’s negative, then apply the following argument to $-V$ instead.) For $V$ large and positive $X'\beta$ is large and positive so that $F(X'\beta)$ is close to one. In this case \[ \begin{aligned} \alpha_0 + (1 - \alpha_0 - \alpha_1) F(X'\beta) &\approx \alpha_0 + (1 - \alpha_0 - \alpha_1) \times 1 \\ &= (1 - \alpha_1). \end{aligned} \] For $V$ large and negative, on the other hand, $X'\beta$ is large and negative, $F(X'\beta)$ is close to zero, and \[ \begin{aligned} \alpha_0 + (1 - \alpha_0 - \alpha_1) F(X'\beta) &\approx \alpha_0 + (1 - \alpha_0 - \alpha_1) \times 0\\ &= \alpha_0. \end{aligned} \] Intuitively, by examining values of $X_i$ for which $F(X_i'\beta)$ is close to one we can learn $(1 - \alpha_1)$ and by examining values for which $F(X_i'\beta)$ is close to zero we can identify $\alpha_0$.
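The limiting behavior is easy to see numerically. The sketch below assumes a logit $F$ with an intercept and a single regressor $V$ that has a positive coefficient; the parameter values are illustrative assumptions.

```r
alpha0 <- 0.1; alpha1 <- 0.2   # illustrative mis-classification probabilities
beta   <- c(0.5, 1.0)          # intercept and (positive) coefficient on V

# Observed conditional probability P(Y = 1 | V = v)
obs_prob <- function(v) {
  alpha0 + (1 - alpha0 - alpha1) * plogis(beta[1] + beta[2] * v)
}

v_grid <- c(-30, -10, 0, 10, 30)
limits_table <- data.frame(v = v_grid, p_obs = obs_prob(v_grid))
print(limits_table)
```

For very negative $V$ the observed probability flattens out at $\alpha_0$ rather than zero, and for very positive $V$ it flattens out at $1 - \alpha_1$ rather than one.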

You may object that the preceding identification argument sounds suspiciously circular: doesn’t this idea at least implicitly require us to know $\beta$? Fortunately the answer is no. We only need to know the signs of $\beta$. Under the assumption that $\alpha_0 + \alpha_1 < 1$ these are the same as the signs of the observed partial effects $\partial \mathbb{P}(Y=1|X) /\partial X_j$. An example may help. Suppose $Y=1$ means “graduated from college.” Under fixed misclassification, we would learn $\alpha_0$ from the observations of $(Y_i, X_i)$ for people who almost certainly didn’t graduate from college, based on their covariates, and $(1 - \alpha_1)$ from observations of $(Y_i, X_i)$ for people who almost certainly did. By first estimating $\mathbb{P}(Y=1|X)$ we learn attenuated versions of the true partial effects $f(X'\beta) \beta_j$. In other words, we learn how reported education varies with $X$. But this information suffices to show us how to make $F(X'\beta)$ close to zero or one.

The preceding argument crucially relies on the assumption that $F$ is nonlinear. To see why, consider the linear probability model $F(X'\beta) = X'\beta$ and let $X' = (1, X_1')$ and $\beta' = (\beta_0, \beta_1')$. Then, \[ \begin{aligned} \mathbb{P}(Y=1|X) &= \alpha_0 + \left(1 - \alpha_0 - \alpha_1\right) F(X'\beta)\\ &= \alpha_0 + \left(1 - \alpha_0 - \alpha_1\right)(X'\beta) \\ &= \alpha_0 + \left(1 - \alpha_0 - \alpha_1\right)(\beta_0 + X_1' \beta_1) \\ &= \alpha_0 + (1 - \alpha_0 - \alpha_1) \beta_0 + X_1' (1 - \alpha_0 - \alpha_1) \beta_1. \end{aligned} \] Now, defining $\widetilde{\beta}_0 \equiv \alpha_0 + (1 - \alpha_0 - \alpha_1)\beta_0$, $\widetilde{\beta}_1 = (1 - \alpha_0 - \alpha_1) \beta_1$, and $\widetilde{\beta}' = (\widetilde{\beta}_0, \widetilde{\beta}_1')$ we have \[ \mathbb{P}(Y=1|X) = \alpha_0 + (1 - \alpha_0 - \alpha_1) X'\beta = X'\widetilde{\beta}. \] This shows that a linear probability model with coefficient vector $\beta$ and mis-classification probabilities $(\alpha_0, \alpha_1)$ is observationally equivalent to a linear probability model with no mis-classification and coefficient $\widetilde{\beta}$. To put it another way: there is no way to tell whether mis-classification is present or absent in a linear model. Doing so requires non-linearity.
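The linear case can be checked directly. The sketch below uses made-up values for $(\alpha_0, \alpha_1, \beta)$ and a grid of covariate values chosen so that $X'\beta$ stays inside $[0, 1]$.

```r
alpha0 <- 0.1; alpha1 <- 0.2   # illustrative mis-classification probabilities
beta   <- c(0.3, 0.1)          # illustrative LPM intercept and slope

x1 <- seq(0, 4, by = 0.5)      # X'beta ranges over [0.3, 0.7]
X  <- cbind(1, x1)

shrink <- 1 - alpha0 - alpha1
beta_tilde <- c(alpha0 + shrink * beta[1], shrink * beta[2])

# LPM with mis-classification versus an error-free LPM with beta_tilde
p_misclassified <- alpha0 + shrink * drop(X %*% beta)
p_no_error      <- drop(X %*% beta_tilde)

print(data.frame(x1, p_misclassified, p_no_error))
```

The two columns coincide, so data on $(Y, X)$ alone cannot distinguish the mis-classified model from an error-free one with coefficients $\widetilde{\beta}$.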

So how can we use these results in practice? If $(\alpha_0, \alpha_1, \beta)$ are identified and $F$ is assumed known, we can proceed via garden-variety maximum likelihood estimation. The log-likelihood function is only slightly more complicated than in the standard binary outcome setting, in particular: \[ \begin{aligned} \ell_n(\alpha_0, \alpha_1, \beta) &= \frac{1}{n} \sum_{i=1}^n \log\left\{ \mathbb{P}(Y_i=1|X_i)^{\mathbb{1}(Y_i=1)}\mathbb{P}(Y_i=0|X_i)^{\mathbb{1}(Y_i=0)} \right\} \\ &= \frac{1}{n} \sum_{i=1}^n \left[ Y_i \log\left\{ \alpha_0 + (1 - \alpha_0 - \alpha_1) F(X_i'\beta) \right\} + (1 - Y_i) \log\left\{ 1 - \alpha_0 - (1 - \alpha_0 - \alpha_1) F(X_i'\beta) \right\} \right]. \end{aligned} \] If $F$ is unknown, estimation is more complicated but the intuition from above continues to hold: a regressor $V$ with “large support” allows us to identify the mis-classification probabilities, and hence $F$. Indeed, we can even allow $\alpha_0$ and $\alpha_1$ to depend on covariates, as long as they don’t depend on $V$ itself. For more details on the “identification by nonlinearity” approach, see Hausman et al. (1998) and Lewbel (2000).
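The following is a rough sketch of how this log-likelihood might be maximized in R when $F$ is the logistic CDF. It is not the estimator from Hausman et al. (1998) as implemented in any particular package: the simulated design, the starting values, the parameter bounds, and the choice of `optim()` with `method = "L-BFGS-B"` are all assumptions made for illustration.

```r
set.seed(42)
n <- 5000
alpha0_true <- 0.10; alpha1_true <- 0.15   # illustrative true values
beta_true   <- c(-0.5, 1.0)

# Simulate data; the widely dispersed x stands in for "large support"
x      <- rnorm(n, sd = 3)
y_star <- rbinom(n, 1, plogis(beta_true[1] + beta_true[2] * x))
flip   <- rbinom(n, 1, ifelse(y_star == 1, alpha1_true, alpha0_true))
y      <- ifelse(flip == 1, 1 - y_star, y_star)

# Negative average log-likelihood; theta = (alpha0, alpha1, beta0, beta1)
neg_loglik <- function(theta) {
  a0 <- theta[1]; a1 <- theta[2]; b <- theta[3:4]
  p_obs <- a0 + (1 - a0 - a1) * plogis(b[1] + b[2] * x)
  p_obs <- pmin(pmax(p_obs, 1e-10), 1 - 1e-10)  # guard against log(0)
  -mean(y * log(p_obs) + (1 - y) * log(1 - p_obs))
}

# Start from a naive logit fit that ignores mis-classification
naive <- coef(glm(y ~ x, family = binomial()))
start <- c(0.05, 0.05, unname(naive))

# The upper bounds of 0.45 enforce alpha0 + alpha1 < 1
fit <- optim(start, neg_loglik, method = "L-BFGS-B",
             lower = c(0.001, 0.001, -Inf, -Inf),
             upper = c(0.45, 0.45, Inf, Inf))

estimates <- setNames(fit$par, c("alpha0", "alpha1", "beta0", "beta1"))
print(round(estimates, 3))
print(round(naive, 3))
```

Because identification here comes entirely from the shape of $F$ in the tails, estimates of $(\alpha_0, \alpha_1)$ can be noisy in samples where few observations have $F(X_i'\beta)$ near zero or one, so results from a sketch like this should be treated with caution.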

Coming Attractions

That’s more than enough about measurement error for now! When I return to this topic in a few weeks’ time, I’ll consider the problem of a mis-measured binary regressor. In my next installment, however, I’ll put measurement error to one side and revisit a classic problem from introductory statistics: constructing a confidence interval for a population proportion. Sometimes the easiest things turn out to be much harder than they first appear.

For more details of these models, see my lecture notes.↩︎
Below we’ll examine a simpler special case in which $\alpha_0$ and $\alpha_1$ are fixed probabilities that do not depend on $X$.↩︎
Alternatively, you could say that the only way to identify $(\alpha_0, \alpha_1)$ is by making an assumption about the sign of one component of $\beta$.↩︎
We also need $\alpha_0 + \alpha_1 \neq 1$ to avoid division by zero!↩︎

Beyond Classical Measurement Error

Mon, 23 Aug 2021 00:00:00 +0000

Pop Quiz: If $D^*$ and $D$ are binary random variables and $D$ is a noisy measure of $D^*$, is it possible for the measurement error $W \equiv D - D^*$ to be classical? Explain why or why not. (Answer below)

Classical Measurement Error

Classical measurement error is a problem that is easy to understand and relatively easy to address. Roughly speaking, classical measurement error refers to a situation in which the variable we observe equals the truth plus noise \[ \text{Observed} = \text{Truth} + \text{Noise} \] where the noise is unrelated to the truth and “everything else.” (I’ll be precise about the meaning of “unrelated” and “everything else” in a moment.) Mis-measuring a regressor $X$ in this way biases the OLS slope estimator towards zero (attenuation bias) but we can correct for this with a valid instrument. Mis-measuring the outcome $Y$ increases standard errors but doesn’t bias the OLS estimator. You can find all the details in your favorite introductory econometrics textbook, but in the interest of making this post self-contained, here’s a quick review.

Least Squares Attenuation Bias

Suppose that we want to learn the slope coefficient from a population linear regression of $Y$ on $X^*$: \[ \beta \equiv \frac{\text{Cov}(Y,X^*)}{\text{Var}(X^*)}. \] Unfortunately we observe not $X^*$ but a noisy measure $X = X^* + W_X$ where $W_X$ is uncorrelated with both $X^*$ and $Y$. Then \[ \begin{aligned} \text{Cov}(Y, X) &= \text{Cov}(Y, X^* + W_X) = \text{Cov}(Y, X^*)\\ \text{Var}(X) &= \text{Var}(X^* + W_X) = \text{Var}(X^*) + \text{Var}(W_X). \end{aligned} \] Now, define the reliability ratio $\lambda$ as follows: \[ \lambda \equiv \frac{\text{Var}(X^*)}{\text{Var}(X^*) + \text{Var}(W_X)}. \] Measurement error means that $\text{Var}(W_X)$ is positive. Since variances can’t be negative, this implies $0 < \lambda < 1$. Combining our definition of $\lambda$ with the expressions for $\text{Cov}(Y,X)$ and $\text{Var}(X)$ from above, \[ \begin{aligned} \frac{\text{Cov}(Y,X)}{\text{Var}(X)} &= \frac{\text{Cov}(Y, X^*)}{\text{Var}(X^*) + \text{Var}(W_X)} \\ &=\frac{\text{Var}(X^*)}{\text{Var}(X^*) + \text{Var}(W_X)}\cdot \frac{\text{Cov}(Y, X^*)}{\text{Var}(X^*)}\\ &= \lambda \beta \end{aligned} \] so we see that regressing $Y$ on $X$ gives $\lambda \beta$ rather than $\beta$. Since $0 < \lambda < 1$, this phenomenon is called least squares attenuation bias: $\lambda \beta$ has the same sign as $\beta$ but is smaller in magnitude. The greater the extent of measurement error, the larger the variance of $W_X$ and the smaller that $|\lambda \beta|$ becomes.
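A short simulation illustrates the attenuation result. The design below is an assumption made for illustration: $\text{Var}(X^*) = \text{Var}(W_X) = 1$, so that $\lambda = 1/2$, and a true slope of 2.

```r
set.seed(2021)
n    <- 100000
beta <- 2

x_star <- rnorm(n)                   # Var(X*) = 1
w_x    <- rnorm(n)                   # Var(W_X) = 1, independent of X* and Y
y      <- 1 + beta * x_star + rnorm(n)
x      <- x_star + w_x               # noisy measure of X*

lambda <- 1 / (1 + 1)                # reliability ratio implied by the design

slope_true  <- cov(y, x_star) / var(x_star)  # regression on the true regressor
slope_noisy <- cov(y, x) / var(x)            # regression on the noisy regressor

print(c(slope_true = slope_true, slope_noisy = slope_noisy,
        lambda_beta = lambda * beta))
```

The slope from the noisy regressor should be close to $\lambda\beta = 1$ rather than the true value of 2.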

Instrumental variables to the rescue

Suppose that $Y = \alpha + \beta X^* + U$ where $X = X^* + W_X$ as above. Now suppose that we can find a variable $Z$ that is correlated with $X^*$ but uncorrelated with $U$ and $W_X$. Then \[ \begin{aligned} \text{Cov}(Y,Z) &= \text{Cov}(\alpha + \beta X^* + U, Z) = \beta\text{Cov}(X^*,Z)\\ \text{Cov}(X,Z) &= \text{Cov}(X^* + W_X, Z) = \text{Cov}(X^*,Z) \end{aligned} \] so that $\beta = \text{Cov}(Y, Z) / \text{Cov}(X,Z)$. If $X^*$ is measured with classical measurement error, a simple instrumental variables regression solves the problem of attenuation bias.¹ Notice that we haven’t said anything about $U$ in relation to $X^*$. If $\beta$ is the population linear regression slope, then $U$ is uncorrelated with $X^*$ by definition. But this derivation still goes through if $Y = \alpha + \beta X^* + U$ is a causal model in which $X^*$ is correlated with $U$, e.g. if $Y$ is wage and $X^*$ is years of schooling, in which case $U$ might be “unobserved ability.” In this way, a single valid instrument can serve “double-duty,” eliminating both attenuation bias and selection bias.
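The "double-duty" claim can also be checked by simulation. The sketch below assumes a made-up data generating process in which an unobserved `ability` term makes $X^*$ correlated with $U$, and in which $X^*$ is also measured with classical error; every coefficient and variable name here is hypothetical.

```r
# Illustrative simulation: one instrument handles both classical measurement
# error and correlation between X* and U. All values are assumptions.
set.seed(4321)
n <- 1e5
beta <- 2                                      # true causal slope (assumed)
z <- rnorm(n)                                  # instrument: unrelated to U and W_X
ability <- rnorm(n)                            # unobserved, enters both X* and U
x_star <- 0.8 * z + 0.5 * ability + rnorm(n)   # true regressor
u <- ability + rnorm(n)                        # error term, correlated with X*
y <- 1 + beta * x_star + u
x <- x_star + rnorm(n)                         # noisy measure of X*

ols <- cov(y, x) / var(x)                      # contaminated by both problems
iv <- cov(y, z) / cov(x, z)                    # Cov(Y,Z) / Cov(X,Z)
c(OLS = ols, IV = iv, truth = beta)
```

In this design the two sources of bias in OLS push in opposite directions and do not cancel, while the IV ratio should sit close to the true slope.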

Measurement error in the outcome

Now suppose that $X^*$ is observed but the true outcome $Y^*$ is not: we only observe a noisy measure $Y = Y^* + W_Y$. If $W_Y$ is uncorrelated with $X^*$, \[ \frac{\text{Cov}(Y,X^*)}{\text{Var}(X^*)} = \frac{\text{Cov}(Y^* + W_Y, X^*)}{\text{Var}(X^*)} = \frac{\text{Cov}(Y^*,X^*)}{\text{Var}(X^*)} \]

so we’ll obtain the same slope from a regression of $Y$ on $X^*$ as we would from a regression of $Y^*$ on $X^*$. Classical measurement error in the outcome variable doesn’t introduce a bias.
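Here is a minimal sketch of the same point in simulated data. The slope of 2, the noise standard deviation of 2, and the variable names are all assumptions made for illustration.

```r
# Illustrative simulation: classical measurement error in the outcome
# leaves the slope unchanged but inflates its standard error.
set.seed(2468)
n <- 1e5
x_star <- rnorm(n)
y_star <- 1 + 2 * x_star + rnorm(n)        # true outcome
y <- y_star + rnorm(n, sd = 2)             # noisy outcome: Y = Y* + W_Y

fit_true <- lm(y_star ~ x_star)
fit_noisy <- lm(y ~ x_star)

# Compare slope estimates and standard errors across the two regressions
slope_true <- coef(summary(fit_true))["x_star", "Estimate"]
slope_noisy <- coef(summary(fit_noisy))["x_star", "Estimate"]
se_true <- coef(summary(fit_true))["x_star", "Std. Error"]
se_noisy <- coef(summary(fit_noisy))["x_star", "Std. Error"]
rbind(true = c(slope = slope_true, se = se_true),
      noisy = c(slope = slope_noisy, se = se_noisy))
```

The two slopes should be nearly identical, while the standard error from the noisy regression is larger because $W_Y$ adds to the residual variance.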

Solution to the Pop Quiz

Now that we’ve refreshed our memories about classical measurement error, let’s take a look at my pop quiz question from above:

If $D^*$ and $D$ are binary random variables and $D$ is a noisy measure of $D^*$, is it possible for the measurement error $W \equiv D - D^*$ to be classical? Explain why or why not.

If $W$ is a classical measurement error then, among other things, it must be uncorrelated with $D^*$. But this is impossible if both $D^*$ and $D$ are binary. By the definition of $W$, $D = D^* + W$. If $D^* = 1$ then $D = 1 + W$. To ensure that $D$ takes on a value in $\{0, 1\}$, this means that $W$ must be either $0$ or $-1$. If instead $D^* = 0$, then $D = W$, so $W$ must be either $0$ or $1$. Hence, unless $W$ always equals zero, in which case there’s no measurement error, $W$ must be negatively correlated with $D^*$. In other words, measurement error in a binary variable can never be classical. The same basic logic applies whenever $X$ and $X^*$ are bounded: to ensure that $X$ stays within its bounds, any measurement error must be correlated with $X^*$.
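The argument is easy to verify numerically. In the sketch below, the 40% share of ones and the 10% misclassification rate are arbitrary assumptions, and misclassification is deliberately drawn independently of the truth to show that the negative correlation arises anyway.

```r
# Illustrative simulation: measurement error in a binary variable
# is mechanically correlated with the truth. Rates are assumptions.
set.seed(1357)
n <- 1e5
d_star <- rbinom(n, 1, 0.4)                # true binary variable
flip <- rbinom(n, 1, 0.1)                  # misclassify 10% of cases, independently of d_star
d <- ifelse(flip == 1, 1 - d_star, d_star) # observed binary variable
w <- d - d_star                            # measurement error

table(d_star, w)                           # w is 0 or 1 when d_star = 0; 0 or -1 when d_star = 1
cor(w, d_star)                             # negative
```

Even though the decision to misclassify is independent of $D^*$ here, the direction of the error is determined by $D^*$, which is what generates the negative correlation.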

Non-Differential Measurement Error

Classical measurement error, as we’ve seen, is a very special case. Or to put it another way, non-classical measurement error isn’t as exotic as it sounds. Because binary (and, more generally, bounded) random variables cannot be subject to classical measurement error, non-classical measurement error should be on any applied economist’s radar. My next few posts will provide an overview of the simplest case: non-differential measurement error in a binary variable. This assumption allows $D^*$ to be correlated with $W$, but assumes that conditioning on $D^*$ is sufficient to break the dependence between $W$ and everything else. Even in this relatively simple case, everything we’ve learned about classical measurement error goes out the window:

Non-differential measurement error does not necessarily cause attenuation.
The IV estimator doesn’t correct for non-differential measurement error, and a single instrument cannot serve “double-duty.”
Non-classical measurement error in the outcome variable generally does introduce bias.

The good news is that there are methods to address non-differential measurement error. In my next post, I’ll start by considering the case of a mis-measured binary outcome.

Strictly speaking I haven’t used the assumption that $W_X$ is uncorrelated with $X^*$ in this derivation, but it’s implicit in the assumption that $Z$ is correlated with $X^*$ but not with $W_X$.↩︎

Understanding the F Statistic

Sun, 15 Aug 2021 00:00:00 +0000

The F-statistic for a test of multiple linear restrictions is a staple of introductory econometrics courses. In the simplest case, it can be written as \[F \equiv \frac{(SSR_r - SSR_{u})/q}{SSR_{u} / (n - k - 1)}\] where $SSR_r$ is the restricted sum of squared residuals, $SSR_{u}$ is the unrestricted sum of squared residuals, $q$ is the number of restrictions, and $(n - k - 1)$ is the degrees of freedom of the unrestricted model.¹

In my experience, students encountering this expression for the first time find it bewilderingly arbitrary; it becomes just one more item to add to a list of formulas memorized for the exam and promptly forgotten. My aim in this post is to demystify the $F$ statistic. By the end, I hope that you will find the form of this expression intuitive, perhaps even obvious.

This is not a post about asymptotic theory, and it is not a post about heteroskedasticity. I will not prove that $F$ follows an $F$-distribution, and I will blithely assume that we inhabit the idealized textbook realm in which all errors are homoskedastic. I will also dodge the question of whether you should even be carrying out an F-test in the first place.² This is a post about understanding what the $F$-statistic measures and why it takes the form that it does.

The Simplest Possible Example

The best way to understand the $F$-statistic is by looking at an example that’s so simple that there’s no reason to use an $F$-test in the first place. Here’s a dataset of students’ scores on two introductory statistics midterms that I gave many years ago:

midterms <- read.csv('https://ditraglia.com/econ103/midterms.csv')
head(midterms)

## Midterm1 Midterm2
## 1 57.14 60.71
## 2 77.14 77.86
## 3 83.57 93.57
## 4 88.00 NA
## 5 69.29 72.14
## 6 80.71 89.29

As you can see, there is at least one missing observation: student #4 scored 88% on the first midterm, but missed the second. In fact, nine students missed the second midterm:

summary(midterms)

## Midterm1 Midterm2
## Min. :56.43 Min. :47.86
## 1st Qu.:70.53 1st Qu.:74.64
## Median :80.36 Median :84.29
## Mean :79.74 Mean :81.39
## 3rd Qu.:87.86 3rd Qu.:90.71
## Max. :97.86 Max. :99.29
## NA's :9

To keep this example as simple as possible, I’ll drop the missing observations.³

midterms <- na.omit(midterms)

Let’s call student #4 Natalie: she scored 88% on the first midterm but missed the second. Suppose we wanted to predict how well Natalie would have done on the second midterm had she taken it. There are many ways that we could try to make this prediction. One possibility would be to ignore Natalie’s score on the first midterm, and predict that she would have scored 81.4 on the second: the average score among all students who took this exam. Another possibility would be to fit a linear regression to the scores of all students who took both exams and use this to project Natalie’s score on midterm two based on her score on midterm one. If scores on the two exams are correlated, option two seems like a better idea: Natalie outperformed the class average on midterm one by 8.4 or roughly 0.77 standard deviations. It seems reasonable to account for this when predicting her score on the second exam.

In fact both of these prediction rules can be viewed as special cases of linear regression. Let $x_i$ denote student $i$’s score on midterm one and $y_i$ denote her score on midterm two. The sample mean $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$ solves the optimization problem \[ \min_a \sum_{i=1}^n (y_i - a)^2 \] which is simply least squares without a predictor variable.⁴ In contrast, the usual least-squares regression problem is \[ \min_{a,b} \sum_{i=1}^n (y_i - a - b x_i)^2 \] with solutions $\hat{a} = \bar{y} - \hat{b} \bar{x}$ and $\hat{b} = s_{xy} / s_x^2$, where $s_{xy}$ is the sample covariance of scores on the two midterms and $s_x^2$ is the sample variance of scores on the first midterm. Notice how these two optimization problems are related: the first is a restricted (aka constrained) version of the second with the constraint $b = 0$. In the discussion below, I will call the first of these the restricted regression and the second the unrestricted regression.
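Both claims can be checked numerically without any algebra. The sketch below uses simulated scores rather than the midterms data, and the names `x_demo`, `y_demo`, and the chosen coefficients are purely illustrative assumptions.

```r
# Illustrative check of the two least squares solutions on simulated data.
set.seed(99)
x_demo <- rnorm(50, mean = 80, sd = 10)
y_demo <- 30 + 0.6 * x_demo + rnorm(50, sd = 10)

# Restricted problem: minimize the sum of squares over a alone, numerically
a_numeric <- optimize(function(a) sum((y_demo - a)^2),
                      interval = range(y_demo))$minimum
c(numeric_minimizer = a_numeric, sample_mean = mean(y_demo))

# Unrestricted problem: closed-form solutions versus lm()
b_hat <- cov(x_demo, y_demo) / var(x_demo)
a_hat <- mean(y_demo) - b_hat * mean(x_demo)
rbind(closed_form = c(a_hat, b_hat), lm = coef(lm(y_demo ~ x_demo)))
```

The numerical minimizer should agree with the sample mean up to the tolerance of `optimize()`, and the closed-form intercept and slope should match the `lm()` coefficients.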

It’s easy to fit these regressions in R. We’ll start with the restricted:

restricted <- lm(Midterm2 ~ 1, data = midterms)
summary(restricted)

##
## Call:
## lm(formula = Midterm2 ~ 1, data = midterms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.531 -6.746 2.899 9.319 17.899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.391 1.457 55.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.28 on 70 degrees of freedom

The syntax Midterm2 ~ 1 specifies a regression formula containing no predictor variables, only an intercept. Notice that our estimate for the intercept agrees with the sample mean score on the second midterm from above, as it should!

Turning our attention to the unrestricted regression, we see that scores on the first midterm are strongly predictive of scores on the second:

unrestricted <- lm(Midterm2 ~ Midterm1, data = midterms)
summary(unrestricted)

##
## Call:
## lm(formula = Midterm2 ~ Midterm1, data = midterms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.809 -7.127 2.047 8.125 18.549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.575 9.243 3.524 0.000759 ***
## Midterm1 0.613 0.115 5.329 1.17e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.41 on 69 degrees of freedom
## Multiple R-squared: 0.2916, Adjusted R-squared: 0.2813
## F-statistic: 28.4 on 1 and 69 DF, p-value: 1.174e-06

For a pair of students who differed by one point in their scores on the first midterm, we would predict a difference of 0.61 points on the second.

The restricted regression ignores a student’s score on the first midterm when predicting her score on the second. But we’ve seen from the unrestricted regression that scores on midterm #1 are strongly correlated with scores on midterm #2. As such, our best bet is to predict Natalie’s second midterm score using the unrestricted regression model:

predict(unrestricted, newdata = data.frame(Midterm1 = 88))

## 1
## 86.52169

Because Natalie scored above the mean on the first exam, we predict that she will score above the mean on the second exam.

How much better is the fit of the unrestricted regression?

While I didn’t present it in this way, the choice between restricted and unrestricted regressions above could be formulated as a hypothesis test. The restricted regression imposes a zero regression slope, but the unrestricted regression doesn’t. In this case, we can test the restriction that the slope is in fact zero using a simple t-test. Based on the t-statistic of 5.33 from above we would easily reject the restriction at any conventional significance level.⁵

But there’s another way to carry out the same test. Although it would be overkill in this example, the principles that underlie it can be used to carry out tests in more complicated situations where a simple t-test wouldn’t suffice. Rather than examining the slope estimate from the unrestricted regression, this alternative approach compares the sum of squared residuals (SSR) of the two regressions to see which does a better job of fitting the observed data.⁶ It’s easy to compute the SSR of the two regressions from above using the residuals() function:

SSR_u <- sum(residuals(unrestricted)^2)
SSR_r <- sum(residuals(restricted)^2)
c(Unrestricted = SSR_u, Restricted = SSR_r)

## Unrestricted Restricted
## 7475.741 10552.883

The SSR of the restricted regression is higher than that of the unrestricted regression. But what exactly should we make of this? A picture can help to make things clearer. This one has two panels: one for the restricted regression and another for the unrestricted regression. Each panel plots the observations from the midterms dataset along with the fitted regression line, using dashed vertical lines to indicate the residuals: the distance from a given observation to the regression line.

Notice that the restricted regression line is flat because it does not use scores on the first midterm to predict those on the second. The lower SSR of the unrestricted model reflects the fact that the observations in the midterms dataset are on average closer to a line with slope 0.6 and intercept 32.6 than they are to a line with slope zero and intercept 81.4.

To understand this picture, it helps to think about the following question: is it possible for the unrestricted regression to have a higher SSR than the restricted one? Recall from above that each of these regressions is the solution to an optimization problem. The difference between them is that the restricted regression imposes a constraint while the unrestricted regression doesn’t. If the best slope for predicting second midterm scores using first midterm scores is zero, the unrestricted regression is free to set $b = 0$. In this case its estimates would coincide with those of the restricted regression. On the other hand, if the best slope isn’t zero that means some other choice of $b$ by definition results in a lower SSR: linear regression chooses the line whose slope and intercept minimize the squared vertical deviations between the data and the line. The restricted regression is forced to have $b = 0$, so in this case it must do a worse job fitting the data.⁷ This reasoning shows that we will always find that the SSR of the restricted model is at least as large as that of the unrestricted model.

We shouldn’t be surprised to see that the unrestricted regression “fits the data” better than the restricted one: it can’t do otherwise unless the sample correlation between midterm scores is exactly zero. But there is still the question of how much better it fits. Taking differences tells us how much larger the SSR of the restricted model is compared to that of the unrestricted model:

SSR_r - SSR_u

## [1] 3077.142

So is this a big difference or a small difference? The answer depends on the units in which $y$ is measured. A residual is a vertical deviation, i.e. a distance along the $y$-axis. This means that it has the same units as the $y$-variable. If $y$ is measured in inches, so are the residuals; if $y$ is measured in kilometers, so are the residuals. Because the SSR is a sum of squared residuals, it has the same units as $y^2$. If $y$ is measured in inches, the SSR is measured in square inches; if $y$ is measured in kilometers, the SSR is measured in square kilometers. Changing the units of $y$ changes the units of the SSR. For example, an SSR of one becomes an SSR of one million if we change the units of $y$ from kilometers to meters. Accordingly, a comparison of SSR_r to SSR_u is meaningless unless we account for the units of $y$.

The simplest way to account for units is by eliminating them from the problem. This is precisely what we do when we carry out a t-test: $\bar{x}/\text{SE}(\bar{x})$ is unitless because the standard error of $\bar{x}$ has the same units as $\bar{x}$ itself. Any change of units in the numerator would be cancelled out in the denominator. This is a crucial point: test statistics are unitless. We do not compare $\bar{x}$ to a table of normal critical values measured in inches for a distribution with standard deviation $2.4$; we compare $\bar{x}/2.4$ to a unitless standard normal distribution.

The t-statistic eliminates units by taking a ratio, so let’s try the same idea in our comparison of SSR_r to SSR_u. There are various possibilities, and any of them would work just as well from the perspective of eliminating units. The F-test statistic is based on a ratio that asks how much worse the restricted model fits relative to the unrestricted regression. In other words, we ask: how much larger is SSR_r compared to SSR_u as a percentage expressed in decimal terms?

(SSR_r - SSR_u) / SSR_u

## [1] 0.4116171

There is nothing subtle going on here. If we wanted to know how much larger US GDP is in 2021 compared to 1921, we would simply calculate \[ \frac{\text{GDP}_{2021} - \text{GDP}_{1921}}{\text{GDP}_{1921}} \] assuming, of course, that both of these figures are corrected for inflation! This is precisely the same reasoning that we used above: the SSR “grows” when we impose a restriction. We want to know how much it grows as a percentage. The answer is 0.41 or equivalently 41%.
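To see the role of units directly, the following sketch measures the same simulated outcome in kilometers and in meters. The data and the helper `rel_increase()` are hypothetical, introduced only for this illustration.

```r
# Illustrative check: rescaling y changes each SSR by the square of the
# scale factor, but leaves the relative increase in SSR unchanged.
set.seed(7)
x_units <- rnorm(60)
y_km <- 1 + 0.5 * x_units + rnorm(60)      # outcome in kilometers (simulated)
y_m <- 1000 * y_km                         # the same outcome in meters

# Hypothetical helper: restricted SSR, unrestricted SSR, relative increase
rel_increase <- function(y, x) {
  ssr_r <- sum(residuals(lm(y ~ 1))^2)
  ssr_u <- sum(residuals(lm(y ~ x))^2)
  c(SSR_r = ssr_r, SSR_u = ssr_u, ratio = (ssr_r - ssr_u) / ssr_u)
}

units_check <- rbind(km = rel_increase(y_km, x_units),
                     m = rel_increase(y_m, x_units))
units_check
```

Each SSR in the second row is one million times its counterpart in the first, while the `ratio` column is the same in both rows.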

Sampling Uncertainty

We’ve nearly arrived at the F-statistic. To see what’s missing, we’ll use a bit of algebra to re-write it as \[F \equiv \frac{(SSR_r - SSR_{u})/q}{SSR_{u} / (n - k - 1)} = \left(\frac{SSR_r - SSR_u}{SSR_u}\right) \left(\frac{n - k - 1}{q}\right).\] We obtained the first factor on the RHS, $(SSR_r - SSR_u) / SSR_u$, simply by reasoning about units and the nature of constrained versus unconstrained optimization problems. Stop for a minute and appreciate how impressive this is: simple intuition has taken us halfway to this rather formidable-looking expression. To understand the second factor, we need to think about sampling uncertainty.

In the midterms example we found that the restricted regression SSR was 41% larger than the unrestricted one. Is this a big difference or a small one? Units don’t enter into it, because we have already eliminated them. But the midterms dataset only contains information on 71 students. If we merely want to summarize the relationship between test scores for these students, there is no role for statistical inference: summary statistics suffice. Tests and confidence intervals enter the picture when we hope to generalize from an observed sample to the population from which it was drawn. Imagine a large population of introductory statistics students who took my two midterms. Now suppose that we observe a random sample of 71 students from this population. How much information do the observed exam scores for these students provide about the relationship between midterm scores in the population?⁸

The larger the sample size, the more evidence an observed difference in the sample provides about a potential difference in the population. We can see this in the expression for the standard error of the sample mean: $\text{SE}(\bar{X}) = \sigma_x/\sqrt{n}$. The larger the sample size, the smaller the standard error, all else equal. Accordingly, given two datasets with identical summary statistics, the larger sample will have the larger t-statistic. The same reasoning applies to the F-statistic above. The numerator $(n - k - 1)$ in the second factor increases with the sample size $n$. This magnifies the effect of the first factor. An $SSR_r$ that is 41% higher than the $SSR_u$ is “more impressive” evidence when the sample size is $1000$ than when it is $10$. Small samples are intrinsically more variable than large ones, so we should expect them to turn up anomalous results more frequently. The F-statistic takes this into account.

So why $(n - k - 1)$ rather than $n$? This is a so-called “degrees of freedom correction.” By estimating $k$ regression slope parameters and $1$ intercept parameter, we “use up” $(k + 1)$ of the observations, leaving only $(n - k - 1)$ pieces of truly independent information. This is not particularly intuitive. In a stunning departure from my usual advice to introductory statistics and econometrics students, I suggest that you simply memorize this part of the F-statistic. It may help to notice that the same degrees of freedom correction appears in the expression for the standard error of the regression, $SER \equiv \sqrt{SSR/(n - k - 1)}$, a measure of the average distance that the observed data fall from the regression line.

The only as-yet-unexplained quantity in the F-test statistic is $q$. This denotes the number of restrictions imposed by the restricted model. Counting restrictions is the same thing as counting equals signs. In a regression of the form $Y = \beta_0 + \beta_1 X + U$ a restriction of the form $\beta_1 = 1$ gives $q = 1$ because it takes a single equals sign to assert that $\beta_1$ equals one. More complicated regressions allow more complicated kinds of restrictions. For example, in the regression \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + U \] we could consider the restriction $\beta_1 = \beta_2 = \beta_3 = 0$ yielding $q =3$. Alternatively we could consider $\beta_0 = \beta_2 = 7$ yielding $q = 2$. We could even consider $\beta_1 + \beta_2 = 1$ yielding $q = 1$. Again: to count the number of restrictions, count the number of equals signs that it requires to express these restrictions.
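As an illustration of a restriction that isn't simply "set a coefficient to zero," here is one possible way to impose $\beta_1 + \beta_2 = 1$ and compute the resulting F-statistic with $q = 1$ and $k = 3$. The simulated data, the coefficient values, and the substitution trick are assumptions made for this sketch, not part of the midterms example.

```r
# Illustrative sketch: F-statistic for the single restriction beta_1 + beta_2 = 1.
# The data are simulated so that the restriction holds in the population.
set.seed(314)
n_obs <- 200
x1 <- rnorm(n_obs); x2 <- rnorm(n_obs); x3 <- rnorm(n_obs)
y_sim <- 1 + 0.3 * x1 + 0.7 * x2 + 0.5 * x3 + rnorm(n_obs)

fit_u <- lm(y_sim ~ x1 + x2 + x3)          # unrestricted: k = 3 slopes

# Impose the restriction by substituting beta_2 = 1 - beta_1:
#   y - x2 = beta_0 + beta_1 * (x1 - x2) + beta_3 * x3 + u
fit_r <- lm(I(y_sim - x2) ~ I(x1 - x2) + x3)

ssr_u <- sum(residuals(fit_u)^2)
ssr_r <- sum(residuals(fit_r)^2)           # same residuals as the constrained fit of y_sim
f_stat <- ((ssr_r - ssr_u) / ssr_u) * (n_obs - 3 - 1) / 1   # q = 1 equals sign
p_value <- pf(f_stat, df1 = 1, df2 = n_obs - 3 - 1, lower.tail = FALSE)
c(F = f_stat, p_value = p_value)
```

Substituting the restriction into the regression and re-fitting is just one convenient way to obtain the restricted SSR; the count of equals signs determines $q$ regardless of how the restricted model is fitted.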

Now we know how to determine $q$, but the question remains: why does it enter the F-test statistic? Above we discussed why $SSR_u$ cannot exceed $SSR_r$ in a particular dataset. Now we have to think about what happens when sampling uncertainty enters the picture. The sum of squared residuals measures how well a linear regression model fits the observed dataset. Crucially, this is the very same dataset that was used to calculate the regression slope and intercept. In effect, we have used the data twice: first to determine the parameter values that minimize the sum of squared vertical deviations and then to assess how well our regression fits the data, measured by the same vertical deviations. The danger lurking here is a phenomenon called overfitting. We’re not really interested in how well the regression fits this dataset; what we want to know is how well it would help us to predict future observations. In-sample fit, as measured by SSR or related quantities, can be shown to be an over-optimistic measure of out-of-sample fit. This is well-known to machine learning practitioners, who generally use one dataset to fit their models, the training data, and a separate dataset, the test data, to evaluate their predictive performance. In general, the more “flexible” the model, the worse the overfitting problem becomes. Because adding restrictions reduces a model’s flexibility, this creates a challenge for any procedure that compares the in-sample fit of two regressions. Because it is less flexible, we should expect the restricted regression to fit the sample data less well than the unrestricted regression even if the restrictions are true in the population.

To drive this point home, I generated a dataset called sim_dat in which $Y_i = \alpha + \epsilon_i$ where $\epsilon_i \sim \text{Normal}(0, 1)$. I then simulated a large number of regressors $(X_{i1}, X_{i2}, \dots, X_{iq})$ completely independently of $Y_i$. (For the simulation code, see the appendix below.) In the population from which I simulated my data, none of these regressors contains any information to predict $Y$. Nevertheless, if $q$ is relatively large compared to the sample size $n$, some of these regressors will appear to be correlated with $Y$ based on the observed data, purely because of sampling variability. In this example I set $n = 100$ and $q = 50$. Fitting a restricted regression with only an intercept, and an unrestricted regression that includes all 50 regressors from sim_dat we obtain

reg_sim_unrestricted <- lm(y ~ ., sim_dat)
reg_sim_restricted <- lm(y ~ 1, sim_dat)
SSR_sim_r <- sum(residuals(reg_sim_restricted)^2)
SSR_sim_u <- sum(residuals(reg_sim_unrestricted)^2)
(SSR_sim_r - SSR_sim_u) / SSR_sim_u

## [1] 0.9144509

Even though the restrictions are true in this simulation study, the restricted SSR is 91% larger than the unrestricted SSR purely due to sampling variability. The F-statistic explicitly takes this phenomenon into account via the scaling factor $(n - k - 1) / q$. What matters is not the sample size per se, but the sample size relative to the number of restrictions imposed by the restricted regression.
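A single simulated dataset is only one draw, so it can help to repeat the exercise many times. The sketch below is a hypothetical Monte Carlo extension with 500 replications and its own seed; the function name `sim_ratio()` is an assumption introduced here.

```r
# Illustrative Monte Carlo: repeat the "irrelevant regressors" experiment
# many times with n = 100 and q = 50. The restrictions are true by construction.
set.seed(5150)

# Hypothetical helper: relative increase in SSR for one simulated dataset
sim_ratio <- function(n, q) {
  y <- 0.5 + rnorm(n)
  x <- matrix(rnorm(n * q), n, q)          # regressors independent of y
  ssr_r <- sum((y - mean(y))^2)            # intercept-only SSR
  ssr_u <- sum(residuals(lm(y ~ x))^2)     # SSR with all q regressors
  (ssr_r - ssr_u) / ssr_u
}

ratios <- replicate(500, sim_ratio(n = 100, q = 50))
f_stats <- ratios * (100 - 50 - 1) / 50    # rescale by (n - k - 1) / q
crit_5pct <- qf(0.95, df1 = 50, df2 = 49)
c(mean_ratio = mean(ratios),
  mean_F = mean(f_stats),
  share_rejected = mean(f_stats > crit_5pct))
```

Across replications the unscaled ratio is typically large even though every restriction is true, whereas the rescaled statistic should average close to one and exceed the 5% critical value only rarely.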

While we’re here we might as well carry out the test!

Now that we understand why the F-test statistic takes the form that it does, let’s carry out the F-test in each of the two examples from above: midterms and sim_dat. Under the null hypothesis that the constraints imposed by the restricted regression are correct, the F-test statistic follows an $F(q, n-k-1)$ distribution.⁹ For the midterms dataset our test statistic is

F_midterms <- ((SSR_r - SSR_u) / SSR_u) * (71 - 1 - 1) / 1
F_midterms

## [1] 28.40158

while the 10%, 5% and 1% critical values for an $F(1, 69)$ distribution are

alpha <- c(0.1, 0.05, 0.01)
qf(1 - alpha, df1 = 1, df2 = 69)

## [1] 2.779684 3.979807 7.017078

The associated p-value is

1 - pf(F_midterms, df1 = 1, df2 = 69)

## [1] 1.173624e-06

so we resoundingly reject the restriction: first midterm scores clearly do help to predict second midterm scores. The sim_dat example gives a very different result. The test statistic in this example is well below any standard critical value, and the p-value is very large:

F_sim <- ((SSR_sim_r - SSR_sim_u) / SSR_sim_u) * (100 - 50 - 1) / 50
F_sim

## [1] 0.8961619

qf(1 - alpha, df1 = 50, df2 = 49)

## [1] 1.444392 1.604442 1.957803

1 - pf(F_sim, df1 = 50, df2 = 49)

## [1] 0.6497498

In this case we would fail to reject the restrictions. Indeed, the restrictions are true: my simulation generated regressors that are completely independent of $Y$!
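For comparison, base R can carry out the same calculation via `anova()` applied to a pair of nested `lm` fits. The sketch below uses simulated data, and the object names (`fit_small`, `fit_big`, and so on) are illustrative assumptions, so the numbers will not match the midterms output above.

```r
# Illustrative check: the by-hand F-statistic agrees with anova() and summary().
set.seed(2021)
n_demo <- 71
x_demo <- rnorm(n_demo, mean = 80, sd = 10)
y_demo <- 30 + 0.6 * x_demo + rnorm(n_demo, sd = 10)

fit_small <- lm(y_demo ~ 1)                # restricted: intercept only
fit_big <- lm(y_demo ~ x_demo)             # unrestricted: k = 1 slope

ssr_small <- sum(residuals(fit_small)^2)
ssr_big <- sum(residuals(fit_big)^2)
f_by_hand <- ((ssr_small - ssr_big) / ssr_big) * (n_demo - 1 - 1) / 1
p_by_hand <- pf(f_by_hand, df1 = 1, df2 = n_demo - 2, lower.tail = FALSE)

f_table <- anova(fit_small, fit_big)       # built-in comparison of nested models
c(by_hand = f_by_hand,
  anova = f_table$F[2],
  summary = unname(summary(fit_big)$fstatistic["value"]))
```

Using `lower.tail = FALSE` in `pf()` is equivalent to computing one minus the CDF, but is numerically more accurate when the p-value is tiny.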

The Bottom Line

The F-statistic is a product of two factors. The first factor measures how much larger the sum of squared residuals becomes in percentage terms when we impose the restriction. We use a relative comparison to eliminate units from the problem. The second factor accounts for sampling variability. The larger the sample size $n$ relative to the number of restrictions $q$, the more we “inflate” the value of the first factor. The only thing you need to memorize is the degrees of freedom correction: $(n - k - 1)$.

Appendix: Code

I used the following code to generate my plot comparing the SSR of the restricted and unrestricted regression models in the midterm exams dataset from above:

par(mfrow = c(1, 2))
plot(Midterm2 ~ Midterm1, data = midterms, main = 'Unrestricted', pch = 20)
abline(coef(unrestricted), lwd = 2, col = 'blue')
with(midterms, segments(x0 = Midterm1, y0 = Midterm2,
x1 = Midterm1, y1 = fitted(unrestricted),
col = 'blue', lty = 2, lwd = 1))
text(90, 50, bquote(SSR == .(round(SSR_u))))
plot(Midterm2 ~ Midterm1, data = midterms, main = 'Restricted', pch = 20)
with(midterms, segments(x0 = Midterm1, y0 = Midterm2,
x1 = Midterm1, y1 = fitted(restricted),
col = 'red', lty = 2, lwd = 1))
abline(h = coef(restricted), lwd = 2, col = 'red')
text(90, 50, bquote(SSR == .(round(SSR_r))))
par(mfrow = c(1,1))

and the following code to generate the data contained in sim_dat

set.seed(3817)
n <- 100
q <- 50
y <- 0.5 + rnorm(n)
x <- matrix(rnorm(n * q), n, q)
colnames(x) <- paste0('x', 1:q)
sim_dat <- data.frame(x, y)

Specifically, the “simplest case” refers to a setting in which both the restricted and unrestricted models include an intercept and we assume homoskedasticity.↩︎
See F-Tests, R-squared, and Other Distractions for an insightful critique.↩︎
Health warning: it is dangerous to glibly drop missing observations in applied work! The key question is why those nine students missed the second exam. If their reasons for doing so are “as good as randomly assigned,” e.g. a death in the family or an unexpected illness or injury, these observations are missing at random (MAR). In this case, dropping them is statistically innocuous. If instead these students’ reasons for missing the exam are related to their course performance, then dropping their observations could yield a misleading picture of the underlying relationship between midterm scores.↩︎
If you’re taking introductory statistics or econometrics it’s a good idea to try to prove this for yourself!↩︎
If you’re familiar with the “trinity” of classical tests (Score/LM, Likelihood Ratio, and Wald), this reasoning amounts to a Wald test: fit the unrestricted model and use the distance between the parameter estimates and their values under the restriction to form a test statistic. If the unrestricted estimates are close enough to the restriction, don’t reject it.↩︎
Again, if you’re familiar with the aforementioned “trinity” of tests, the procedure I’m about to describe amounts to a Likelihood Ratio Test: fit both the restricted and unrestricted models, and compare their maximized sample likelihoods. If the likelihoods are similar, don’t reject the restriction.↩︎
More generally, imposing a constraint can never result in a better solution to an optimization problem: at best it can leave the optimum unchanged.↩︎
It may be difficult to imagine a population from which the students who happen to have taken my course in a particular semester could be viewed as a random sample. For the purposes of this exercise I kindly ask you to suspend your disbelief.↩︎
Strictly speaking this requires the regression errors to be normally distributed and homoskedastic. If the errors are non-normal but homoskedastic, then the F-statistic is approximately distributed as an $F(q, \infty)$ random variable for large $n$. Life is much more complicated under heteroskedasticity, but this is a topic for a future post!↩︎

(Mis)understanding Selection on Observables

Sun, 08 Aug 2021 00:00:00 +0000

On a recent exam I asked students to extend the logic of propensity score weighting to handle a treatment that takes on three rather than two values: basically a stripped-down version of Imbens (2000). Nearly everyone figured this out without much trouble, which is good news! At the same time, I noticed some common misconceptions about the all-important selection-on-observables assumption: \[ \mathbb{E}[Y_0|D,X] = \mathbb{E}[Y_0|X] \quad \text{and} \quad \mathbb{E}[Y_1|D,X] = \mathbb{E}[Y_1|X] \] where $(Y_0, Y_1)$ are the potential outcomes corresponding to a binary treatment $D$ and $X$ is a vector of observed covariates.¹ Since more than a handful of students made the same mistakes, it seemed like a good opportunity for a short post.

Two Misconceptions

The following two statements about selection on observables are false:

Under selection on observables, if I know the value of someone’s covariate vector $X$, then learning her treatment status $D$ provides no additional information about the average value of her observed outcome $Y$.

Selection on observables requires the treatment $D$ and potential outcomes $(Y_0,Y_1)$ to be conditionally independent given covariates $X$.

If you’ve studied treatment effects, pause for a moment and see if you can figure out what’s wrong with each of them before reading further.

The First Misconception

The first statement:

Under selection on observables, if I know the value of someone’s covariate vector $X$, then learning her treatment status $D$ provides no additional information about the average value of her observed outcome $Y$.

is a verbal description of the following conditional mean independence condition: \[ \mathbb{E}[Y|X,D] = \mathbb{E}[Y|X]. \] So what’s wrong with this equality? The potential outcomes $(Y_0, Y_1)$ and the observed outcome $Y$ are related according to \[ Y = Y_0 + D (Y_1 - Y_0). \] Taking conditional expectations of both sides and using the selection on observables assumption \[ \begin{aligned} \mathbb{E}[Y|X,D] &= \mathbb{E}[Y_0|X,D] + D \mathbb{E}[Y_1 - Y_0|D,X]\\ &= \mathbb{E}[Y_0|X] + D \mathbb{E}[Y_1 - Y_0|X]. \end{aligned} \] In contrast, conditioning on $X$ alone gives \[ \begin{aligned} \mathbb{E}[Y|X] &= \mathbb{E}[Y_0|X] + \mathbb{E}[D(Y_1 - Y_0)|X]\\ &= \mathbb{E}[Y_0|X] + \mathbb{E}_{D|X}[D\mathbb{E}(Y_1 - Y_0|D,X)]\\ &= \mathbb{E}[Y_0|X] + \mathbb{E}_{D|X}[D\mathbb{E}(Y_1 - Y_0|X)]\\ &= \mathbb{E}[Y_0|X] + \mathbb{E}(D|X) \cdot \mathbb{E}(Y_1 - Y_0|X) \end{aligned} \] by iterated expectations and the selection on observables assumption, since $\mathbb{E}(Y_1 - Y_0|X)$ is a measurable function of $X$. Subtracting these expressions, we find that \[ \mathbb{E}(Y|X,D) - \mathbb{E}(Y|X) = \left[ D - \mathbb{E}(D|X) \right] \cdot \mathbb{E}(Y_1 - Y_0|X) \] so that $\mathbb{E}(Y|X,D) = \mathbb{E}(Y|X)$ if and only if the RHS equals zero.
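The final identity can be checked in a simulation where selection on observables holds by construction. Everything in the sketch below is an assumed toy design: a binary covariate, a propensity score of 0.3 or 0.7, and conditional average treatment effects of 1 and 3.

```r
# Illustrative check of E(Y|X,D) - E(Y|X) = [D - p(X)] * ATE(X)
# in a toy design where treatment depends on X alone.
set.seed(808)
n <- 4e5
x <- rbinom(n, 1, 0.5)                     # binary covariate
p_x <- ifelse(x == 1, 0.7, 0.3)            # propensity score p(X) (assumed)
d <- rbinom(n, 1, p_x)                     # treatment: depends only on X
tau_x <- ifelse(x == 1, 3, 1)              # conditional ATE given X (assumed)
y0 <- x + rnorm(n)                         # potential outcome without treatment
y1 <- y0 + tau_x + rnorm(n)                # potential outcome with treatment
y <- y0 + d * (y1 - y0)                    # observed outcome

cells <- expand.grid(x = 0:1, d = 0:1)
# Left-hand side: difference of sample conditional means in each (x, d) cell
cells$lhs <- mapply(function(xv, dv) mean(y[x == xv & d == dv]) - mean(y[x == xv]),
                    cells$x, cells$d)
# Right-hand side: [d - p(x)] * ATE(x) using the known population values
cells$rhs <- with(cells, (d - ifelse(x == 1, 0.7, 0.3)) * ifelse(x == 1, 3, 1))
cells
```

The two columns should agree up to simulation error, and neither is zero: learning $D$ is informative about the observed outcome even though it is uninformative about the potential outcomes.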

So how could the RHS equal zero? One way is if $D = \mathbb{E}(D|X)$. Since $D$ is a binary random variable, this would require $\mathbb{E}(D|X)$ to be a binary random variable as well. But notice that $\mathbb{E}(D|X) = \mathbb{P}(D=1|X)$ is simply the propensity score $p(X)$. Because $X$ is a random variable, so is $p(X)$. But $p(X)$ cannot take on the values zero or one. If it did, this would violate the overlap assumption: $0 < p(X) < 1$.

So we can’t have $D = \mathbb{E}(D|X)$, but what about $\mathbb{E}(Y_1 - Y_0|X)=0$? Since $(Y_1 - Y_0)$ is the treatment effect of $D$, it follows that $\mathbb{E}(Y_1 - Y_0|X)$ is the conditional average treatment effect $\text{ATE}(X)$ given $X$. It’s not a contradiction for $\text{ATE}(X)$ to equal zero, but think about what it would mean: it would require that the average treatment effect for a person with covariates $(X = x)$ is exactly zero regardless of $x$. Moreover, by iterated expectations it would imply that \[ \text{ATE} = \mathbb{E}(Y_1 - Y_0) = \mathbb{E}_X[\mathbb{E}(Y_1 - Y_0| X)] = \mathbb{E}[\text{ATE}(X)] = 0 \] so the average treatment effect would also be zero. Again, this is not a contradiction but it would definitely be odd to assume that the treatment effect is zero before you even try to estimate it!

To summarize: the first statement above cannot be an implication of selection on observables because it would either require a violation of the overlap assumption, or imply that there is no treatment effect whatsoever. To correct the statement, we simply need to change the last three words:

Under selection on observables, if I know the value of someone’s covariate vector $X$, then learning her treatment status $D$ provides no additional information about the average values of her potential outcomes $(Y_0, Y_1)$.

This is a correct verbal statement of the mean exclusion restriction $\mathbb{E}(Y_0|D,X) = \mathbb{E}(Y_0|X)$ and $\mathbb{E}(Y_1|D,X) = \mathbb{E}(Y_1|X)$.
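The identity derived above, $\mathbb{E}(Y|X,D) - \mathbb{E}(Y|X) = [D - \mathbb{E}(D|X)] \cdot \mathbb{E}(Y_1 - Y_0|X)$, can be checked numerically. Here is a minimal sketch under assumptions of my own choosing: a binary covariate, a propensity score of 0.3 or 0.7, and a treatment effect of $2 + X$. None of these values come from the derivation itself; they are purely illustrative.

```r
# Hypothetical data-generating process: selection on observables holds by
# construction, because D depends only on X and independent noise.
set.seed(1234)
n <- 2e5
x <- rbinom(n, size = 1, prob = 0.5)
pscore <- ifelse(x == 1, 0.7, 0.3)       # p(X), strictly between 0 and 1
d <- rbinom(n, size = 1, prob = pscore)
y0 <- 1 + x + rnorm(n)                   # E(Y0 | X) = 1 + X
y1 <- y0 + 2 + x                         # so ATE(X) = 2 + X
y <- y0 + d * (y1 - y0)                  # observed outcome

# Both sides of the identity, evaluated at X = 1 and D = 1
lhs <- mean(y[x == 1 & d == 1]) - mean(y[x == 1])
rhs <- (1 - 0.7) * 3                     # [D - p(X)] * ATE(X) with ATE(1) = 3
c(lhs = lhs, rhs = rhs)

# The mean of the potential outcome Y0, by contrast, should not vary with D
y0_treated <- mean(y0[x == 1 & d == 1])
y0_untreated <- mean(y0[x == 1 & d == 0])
c(treated = y0_treated, untreated = y0_untreated)
```

The two entries of the first vector should agree up to simulation error, and both are far from zero: learning $D$ is informative about the observed outcome even though it is uninformative about the potential outcomes.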

The Second Misconception

And this leads nicely to the second misconception:

Selection on observables requires the treatment $D$ and potential outcomes $(Y_0,Y_1)$ to be conditionally independent given covariates $X$.

To see why this is false, consider an example in which \[ \begin{aligned} Y &= (1 - D) \cdot (\alpha_0 + X'\beta_0 + U_0) + D \cdot (\alpha_1 + X' \beta_1 + U_1)\\ U_0|(D,X) &\sim \text{Normal}(0,1 - D/2)\\ U_1|(D,X) &\sim \text{Normal}(0,1 + D). \end{aligned} \] Notice that the distributions of $U_0$ and $U_1$ given $(D,X)$ depend on $D$. Now, by iterated expectations, \[ \begin{aligned} \mathbb{E}(U_0|X) &= \mathbb{E}_{(D|X)}[\mathbb{E}(U_0|D,X)] = 0\\ \mathbb{E}(U_0) &= \mathbb{E}_{X}[\mathbb{E}(U_0|X)] = 0 \end{aligned} \] and similarly $\mathbb{E}(U_1|X) = \mathbb{E}(U_1)=0$. Substituting $D=0$ and $D=1$, we can calculate the potential outcomes and average treatment effect as follows \[ \begin{aligned} Y_0 &= \alpha_0 + X'\beta_0 + U_0 \\ Y_1 &= \alpha_1 + X'\beta_1 + U_1 \\ \text{ATE} &= \mathbb{E}(Y_1 - Y_0) = (\alpha_1 - \alpha_0) + \mathbb{E}[X'](\beta_1 - \beta_0). \end{aligned} \] It follows that $D$ is not conditionally independent of $(Y_0, Y_1)$ given $X$. In particular, the variance of the potential outcomes depends on $D$ even after conditioning on $X$: \[ \begin{aligned} \text{Var}(Y_0|X,D) &= \text{Var}(U_0|X,D) = 1 - D/2\\ \text{Var}(Y_1|X,D) &= \text{Var}(U_1|X,D) = 1 + D. \end{aligned} \] In spite of this, the selection on observables assumption still holds: \[ \begin{aligned} \mathbb{E}(Y_0|D,X) &= \alpha_0 + X'\beta_0 + \mathbb{E}(U_0|D,X) = \alpha_0 + X'\beta_0\\ \mathbb{E}(Y_0|X) &= \alpha_0 + X'\beta_0 + \mathbb{E}(U_0|X) = \alpha_0 + X'\beta_0\\ \end{aligned} \] and similarly $\mathbb{E}(Y_1|D,X) = \mathbb{E}(Y_1|X) = \alpha_1 + X'\beta_1$. While this example is admittedly a bit peculiar, the point is more general: because the average treatment effect is an expectation, identifying it only requires assumptions about conditional means.² The second statement is even easier to correct than the first: we need only add a single word:

Selection on observables requires the treatment $D$ and potential outcomes $(Y_0,Y_1)$ to be conditionally mean independent given covariates $X$.

Conditional independence implies conditional mean independence, but the converse is false.
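To make the distinction concrete, here is a rough simulation of the example above. The parameter values ($\alpha_0 = 0$, $\alpha_1 = 1$, $\beta_0 = \beta_1 = 1$), the scalar standard normal covariate, and the logistic propensity score are my own illustrative assumptions; only the error distributions follow the example.

```r
set.seed(4321)
n <- 2e5
x <- rnorm(n)
d <- rbinom(n, size = 1, prob = plogis(x))   # treatment depends on X only

# Errors have mean zero given (D, X), but their variances depend on D
u0 <- rnorm(n, mean = 0, sd = sqrt(1 - d / 2))
u1 <- rnorm(n, mean = 0, sd = sqrt(1 + d))

# Illustrative parameter values (assumptions, not taken from the text)
alpha0 <- 0; beta0 <- 1
alpha1 <- 1; beta1 <- 1
y0 <- alpha0 + beta0 * x + u0
y1 <- alpha1 + beta1 * x + u1

# Conditional mean independence: after controlling for X, the coefficient
# on D in a regression of each potential outcome should be close to zero
d_coef_y0 <- unname(coef(lm(y0 ~ x + d))["d"])
d_coef_y1 <- unname(coef(lm(y1 ~ x + d))["d"])
c(y0 = d_coef_y0, y1 = d_coef_y1)

# No full independence: the error variances differ by treatment status
var_u0 <- tapply(u0, d, var)   # about 1 when D = 0, about 0.5 when D = 1
var_u1 <- tapply(u1, d, var)   # about 1 when D = 0, about 2 when D = 1
rbind(u0 = var_u0, u1 = var_u1)
```

In a real dataset we never observe both potential outcomes for the same person, so this check is only possible in a simulation.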

Epilogue

So what’s the moral here? First, it’s crucial to distinguish between the observed outcome $Y$ and the potential outcomes $(Y_0, Y_1)$. Second, the various notions of “unrelatedness” between random variables—independence, conditional mean independence, and uncorrelatedness—can be confusing. Be sure to pay attention to exactly which condition is used and why. In a future post, I’ll have more to say about the relationships between these notions.

For more details see my lecture notes on treatment effects ↩︎
You might object that in the real world it is difficult to think of settings in which conditional mean independence is plausible but full independence is not. This is a fair point. Nevertheless, it’s important to be clear about which assumptions are actually used in a given derivation, and here we only rely on conditional mean independence.↩︎

Untangling Cause and Effect Without Experiments

Sun, 01 Aug 2021 00:00:00 +0000

The following is a piece I wrote for the LMH News, based on a general interest webinar that I gave in November of 2020. If this post inspires you to learn more about causal inference, you may enjoy browsing my teaching materials on treatment effects.

Will earning a PPE degree from Oxford increase your lifetime earnings? Does eating bacon sandwiches cause cancer? Does watching Fox News make you vote Republican? Will owning a dog increase your lifespan? Each of these questions concerns the causal effect of a treatment on an outcome. In social science, a “treatment” is any factor whose causal effect we hope to learn. As far as I know, there has never been an experiment that compelled people to study a particular subject at university, watch Fox News, or own a dog: nonetheless, papers have been written and published that use data to estimate the causal effects of each of these treatments. Datasets in which the treatment of interest is “naturally occurring” rather than randomly assigned as part of an experiment are called observational. Many of the most interesting and important treatments in social science cannot be randomly assigned. Social scientists have therefore developed a set of tools for studying treatment effects using observational data. By introducing you to some of these tools and briefly summarising the ways in which researchers have used them, I’ll shed some light on that age-old question: how much is your education worth?

Alice read PPE at Oxford and currently earns £75,000 a year. Would she have earned as much if she had studied at Oxford Brookes instead? The fundamental problem of causal inference is that we can never observe a person’s counterfactual outcome. In other words, we can never know what her outcome would have been if her treatment had been different. A counterfactual is fundamentally a “within-person” comparison, asking us to imagine two parallel universes, one in which Alice attends Oxford and another in which she attends Brookes. The causal question of interest is how much the Alice in our world earns compared to the Alice who resides through the looking glass. Of course, this comparison can never be more than a thought experiment. To learn about treatment effects in the real world, we develop methods and assumptions that allow us to substitute the idealized within-person comparison with a between-person comparison.

According to recent data from the Department for Education, UCAS and the ONS, the median salary of Oxford graduates is nearly £15,000 higher than that of Brookes graduates.¹ Does this mean that the treatment effect of attending Oxford rather than Brookes is £15,000 a year? Almost certainly not! This is not an apples-to-apples comparison. One of the crucial differences between the two universities is entry requirements: Oxford requires A*AA for Economics and Management applicants, whereas Brookes asks for BCC for a similar degree. Oxford students on average have higher levels of academic preparation and ability upon entering university: accordingly, it’s possible that attending Oxford has no causal effect on wage, but earning high grades at A level does. In statistical parlance, we would say that ability confounds the relationship between university attended and wage.

So how can we solve the problem of confounding in observational datasets? One approach is matching, which compares treated and untreated people with the same values of any confounders. For example, we might compare Oxford Economics students with three A-stars at A-level to Brookes Economics students with the same A level results. Repeating this for every combination of subject and A-levels and averaging the results gives an estimate of the overall causal effect of attending Oxford. A recent report from the IFS used a closely related approach to estimate the relative returns to different undergraduate degrees in the UK.² Their findings suggest that confounding is a very serious problem when comparing raw wages of students across universities. For example, women who graduate from LSE earn over 70% more than the average female graduate. After adjusting for differences in student characteristics, however, this wage premium falls dramatically: female graduates of LSE earn only a little over 35% more than similar women who attended different universities. The same story applies to other elite UK institutions such as Oxford, Cambridge, and UCL.
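For readers who like to see the mechanics, the matching idea described above can be sketched with simulated data. Everything here is hypothetical: the three grade bands, the admission probabilities, the wage equation, and the assumed true effect of 5 are invented for illustration and have nothing to do with the IFS figures.

```r
set.seed(2021)
n <- 1e5
# Hypothetical confounder: A-level band (1 = lowest, 3 = highest)
grades <- sample(1:3, n, replace = TRUE)
# Higher grades make the "treatment" (attending the selective university) more likely
treat <- rbinom(n, size = 1, prob = c(0.1, 0.3, 0.6)[grades])
# Wages depend on both grades and treatment; the true causal effect is set to 5
wage <- 20 + 10 * grades + 5 * treat + rnorm(n, sd = 5)

# Raw comparison of means: confounded by grades
raw_diff <- mean(wage[treat == 1]) - mean(wage[treat == 0])

# Matching: compare treated and untreated people within each grade band,
# then average the within-band differences using each band's population share
within_diff <- sapply(1:3, function(g) {
  mean(wage[treat == 1 & grades == g]) - mean(wage[treat == 0 & grades == g])
})
band_share <- as.numeric(table(grades)) / n
matched_diff <- sum(within_diff * band_share)

c(raw = raw_diff, matched = matched_diff)
```

In this toy setup the raw difference substantially overstates the effect, while the matched difference lands close to the value of 5 that was built into the simulation. This works only because the single confounder is observed.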

For matching methods to be effective, we need to observe all important confounders. In some settings this is a reasonable assumption, but in others it clearly isn’t. For this reason, researchers have developed a number of techniques to address the problem of unobserved confounding. Much of my own research focuses on the use of so-called “instrumental variables.” An instrumental variable, or instrument for short, is something that affects the treatment of interest but is unrelated to any unobserved confounders. To understand this idea, we’ll examine one of the most famous papers to use the instrumental variables approach: a 1991 article by Josh Angrist and Alan Krueger studying the impact of compulsory school attendance on later-life earnings.³ The paper begins with a striking observation: in the US, people born in the first quarter of the year tend to complete fewer years of education. Why might this be the case? According to Angrist and Krueger: “children born in different months of the year start school at different ages, while compulsory schooling laws generally require students to remain in school until their sixteenth or seventeenth birthday. In effect, the interaction of school-entry requirements and compulsory schooling laws compels students born in certain months to attend school longer than students born in other months.”

Angrist and Krueger use quarter of birth as an instrumental variable to estimate the causal effect of schooling on wage. Quarter of birth is indeed related to the treatment of interest, years of schooling. But there are many unobserved factors that influence both how many years of education a person attains, and her later-life outcomes: demographics, family background etc. Is quarter of birth unrelated to these? Angrist and Krueger argue in the affirmative: “one’s birthday is unlikely to be correlated with personal attributes other than age at school entry.” If this is correct, then we can estimate the causal effect of education on wages as follows. First we calculate the difference of wages between men born in the first quarter and those born in the rest of the year. Those born in the first quarter earn less on average, so this difference is negative. Next we calculate the corresponding difference in years of education for these two groups. Those born in the first quarter have fewer years of education on average, so this difference is also negative. Dividing the difference in wages by the difference in education tells us how much wages change for each additional year of education. Since both differences are negative, the ratio is positive. Angrist and Krueger find that an extra year of education causes between a 5% and 15% increase in wages.
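The calculation described above is a ratio of two differences in means, often called the Wald estimator. Here is a minimal sketch using simulated data rather than the census data from the original paper. The variable names, the size of the quarter-of-birth effect on schooling, and the assumed true return of 0.08 log points per year are all hypothetical.

```r
set.seed(1991)
n <- 2e5
# Instrument: born in the first quarter (unrelated to ability by construction)
q1 <- rbinom(n, size = 1, prob = 0.25)
# Unobserved confounder that raises both schooling and wages
ability <- rnorm(n)
# First-quarter births get slightly less schooling (hypothetical magnitude)
educ <- 12 - 0.5 * q1 + ability + rnorm(n)
# The true causal effect of a year of schooling on log wages is set to 0.08
log_wage <- 1 + 0.08 * educ + 0.3 * ability + rnorm(n, sd = 0.5)

# Difference in mean log wages: first quarter versus the rest of the year
diff_wage <- mean(log_wage[q1 == 1]) - mean(log_wage[q1 == 0])
# Difference in mean years of education for the same two groups
diff_educ <- mean(educ[q1 == 1]) - mean(educ[q1 == 0])
# Ratio of the two differences
wald <- diff_wage / diff_educ

# Naive regression slope for comparison: contaminated by ability
ols <- unname(coef(lm(log_wage ~ educ))["educ"])
c(wald = wald, ols = ols)
```

Both differences come out negative, so their ratio is positive. In this simulation the ratio is close to the assumed 0.08, while the naive regression slope is much larger because ability drives both schooling and wages. The instrument is valid here only because I built it that way.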

But is it really true that a person’s birthday is uncorrelated with “personal attributes other than age at school entry?” About seven years ago, Buckles and Hungerman revisited this question, examining US data that includes information on both birth dates and family background⁴. In the years since Angrist and Krueger published their original paper, there have been more than 20 other published papers using season of birth as an instrumental variable. Across these studies, US children born in the first quarter—or more generally in the winter months—earn less, pursue less education, and have lower measured intelligence on average compared to those born in other parts of the year. At the same time, researchers have found a correlation between season of birth and schizophrenia, autism, dyslexia, extreme shyness, and even suicide risk.

What’s going on here? Buckles and Hungerman propose a simple explanation: “children born in different seasons are not initially similar but rather are conceived by different groups of women.” Mothers who give birth in the winter months are disproportionately likely to be teenagers. They are also less educated, and less likely to be married. Buckles and Hungerman conclude that: “The well-known relationship between season of birth and later outcomes is largely driven by differences in fertility patterns across socioeconomic groups, and not merely natural phenomena or schooling laws that intervene after conception.” In other words, quarter of birth is indeed related to confounders that were unobserved by Angrist and Krueger in their original paper.

So where does all of this leave us? Untangling cause and effect is extremely challenging, and always relies upon assumptions. Social scientists have a powerful toolbox for studying treatment effects in settings where randomized experimentation is impossible, impractical, or unethical. But like any tools, matching, instrumental variables, and related methods depend for their success on the care with which they are used. We can indeed learn about cause-and-effect from observational data, but doing so requires knowledge of the problem we’re studying, a willingness to question our assumptions, and some good old-fashioned intellectual humility.

Thirty isn't the magic number

Sat, 08 May 2021 00:00:00 +0000

The simplest version of the central limit theorem (CLT) says that if $X_1, \dots, X_n$ are iid random variables with mean $\mu$ and finite variance $\sigma^2$, then

\[ \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \rightarrow_d N(0,1) \] where $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$. In other words, if $n$ is sufficiently large, the sample mean is approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$, regardless of the distribution of $X_1, \dots, X_n$. This is a pretty impressive result! It is so impressive, in fact, that students encountering it for the first time are usually a little wary. I’m typically asked “but how large is sufficiently large?” or “how do we know when the CLT will provide a good approximation?” My answer is disappointing: without some additional information about the distribution from which $X_1, \dots, X_n$ were drawn, we simply can’t say how large a sample is large enough for the CLT to work well. At this point, someone invariably volunteers “but in my high school statistics course, we learned that $n = 30$ is big enough for the CLT to hold!”

I’ve always been surprised by the prevalence of the $n \geq 30$ dictum. It even appears in Charles Wheelan’s Naked Statistics, an otherwise excellent book that I assign as summer reading for our incoming economics undergraduates: “as a rule of thumb, the sample size must be at least 30 for the central limit theorem to hold true.” In this post I’d like to set the record straight: $n\geq 30$ is neither necessary nor sufficient for the CLT to provide a good approximation, as we’ll see by examining two simple examples. Along the way, we’ll learn about two useful tools for visualizing and comparing distributions: the empirical cdf, and quantile-quantile plots.

A sample size of thirty isn’t necessary.

We’ll start by showing that the CLT can work extremely well even when $n$ is much smaller than $30$ and the random variables that we average are far from normally distributed themselves.

Informally speaking, a Uniform$(0,1)$ random variable is equally likely to take on any continuous value in the range $[0,1]$.¹ Here’s a histogram of 1000 random draws from this distribution:

# set the seed to get the same draws I did
set.seed(12345)
hist(runif(1000), xlab = '', freq = FALSE,
main = 'Histogram of 1000 Uniform(0,1) Draws')

This distribution clearly isn’t normal! Indeed, its probability density function is $f(x) = 1$ for $x \in [0,1]$. This is a flat line rather than a bell curve. But if we average even a relatively small number of Uniform$(0,1)$ draws, the result will be extremely close to normality. To see that this is true, we’ll carry out a simulation in which we draw $n$ Uniform$(0,1)$ RVs, calculate their sample mean, and store the result. Repeating this a large number of times allows us to approximate the sampling distribution of $\bar{X}_n$. I’ll start by writing a function get_unif_sim that takes a single argument n. This function returns the sample mean of n Uniform$(0,1)$ draws:

get_unif_sim <- function(n) {
sims <- runif(n)
xbar <- mean(sims)
return(xbar)
}

Next I’ll use the replicate function to call get_unif_sim a large number of times, nreps, and store the results as a vector called xbar_sims. Here I’ll take $n = 10$ standard uniform draws, blatantly violating the $n \geq 30$ rule-of-thumb:

set.seed(12345)
nreps <- 1e5 # scientific notation for 100,000
xbar_sims <- replicate(nreps, get_unif_sim(10))
hist(xbar_sims, xlab = '', freq = FALSE,
main = 'Sampling Dist. of Sample Mean of 10 Uniform(0,1) Draws')

A beautiful bell curve! This certainly looks normal, but histograms can be tricky to interpret. Their shape depends on how many bins we use to make the plot, something that can be difficult to choose well in practice. In the following two sections, we’ll instead compare distribution functions and quantiles.

The Empirical CDF

If $X \sim$ Uniform$(0,1)$, then $\mathbb{E}(X) = 1/2$ and $\text{Var}(X) = 1/12$, which follows from $\mathbb{E}[X^2]=1/3$ and the definition of variance.² For $n = 10$, \[ \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{12}} \cdot \frac{1}{\sqrt{10}} = \frac{1}{\sqrt{120}} \] so if the CLT provides a good approximation in this example, we should find that \[ \frac{\bar{X}_n - 1/2}{1/\sqrt{120}} = \sqrt{120} (\bar{X}_n - 1/2) \approx N(0,1) \] in the sense that the cumulative distribution function (CDF) of $\sqrt{120} (\bar{X}_n - 1/2)$, call it $F$, is approximately equal to the standard normal CDF pnorm(). An obvious way to see if this holds is to plot $F$ against pnorm() and see how they compare. From now on, we’ll be working with the z-scores of xbar_sims rather than the raw simulation values themselves, so we’ll start by constructing them, subtracting the population mean and dividing by the standard deviation of the sample mean, $\sigma/\sqrt{n} = 1/\sqrt{120}$:

z <- (xbar_sims - 1/2) / (1 / sqrt(120))

We haven’t worked out an expression for the function $F$, but we can approximate it using our simulation draws xbar_sims. We do this by calculating the empirical CDF of our centered and standardized simulation draws z. Recall that if $Z$ is a random variable, its CDF $F$ is defined as $F(t) = \mathbb{P}(Z \leq t)$. Given a large number of observed random draws $z_1, \dots, z_J$ from the distribution of $Z$, we can approximate $\mathbb{P}(Z \leq t)$ by calculating the fraction of observed draws less than or equal to $t$. In other words \[ \mathbb{P}(Z \leq t) \approx \frac{1}{J}\sum_{j=1}^J \mathbf{1}\{z_j \leq t\} \] where $\mathbf{1}\{z_j \leq t \}$ is the indicator function: it equals one if $z_j$ is less than or equal to the threshold $t$ and zero otherwise. The sample average on the right-hand side of the preceding expression is called the empirical CDF. It uses empirical data–in this case our simulation draws $z_j$–to approximate the unknown CDF. By increasing the number of random draws $J$ that we use, we can make this approximation as accurate as we like.³ For example, we don’t know the exact value of $F(0)$, the probability that $Z \leq 0$. But using our simulated values z from above, we can approximate it as

mean(z <= 0)

## [1] 0.49993

and if we wanted the probability that $Z \leq 2$, we could approximate this as

mean(z <= 2)

## [1] 0.97797

So far so good: these values agree with pnorm(0), which equals 0.5, and pnorm(2), which is approximately 0.9772. But we’ve only looked at two values of $t$. While we could continue trying additional values one at a time, it’s much faster to use R’s built-in function for computing an empirical CDF, ecdf(). First we pass our simulated z-scores z into the ecdf() function to calculate the empirical CDF and plot the result. Next we overlay some points from the standard normal CDF, pnorm(), in blue for comparison:

z <- sqrt(120) * (xbar_sims - 1/2)
plot(ecdf(z), xlab = 't', ylab = 'F(t)', main = 'F(t) versus pnorm(t)')
tseq <- seq(-4, 4, by = 0.2)
points(tseq, pnorm(tseq), col = 'blue')

The fit is almost perfect, despite $n=10$ being far below 30. This kind of plot is much more informative than the histogram from above, but it can still be a bit difficult to read. When constructing confidence intervals or calculating p-values it is probabilities in the tails of the distribution that matter most, i.e. values of $t$ that are far from zero in the plot. Ideally, we’d like a plot that makes any discrepancies in the tails jump out at us. That is precisely what we’ll construct next.
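One way to put a number on the visual comparison is to compute the largest vertical gap between the empirical CDF and pnorm() over a fine grid of values for $t$. The sketch below is self-contained, so it regenerates the simulation draws rather than reusing the ones from above. The grid spacing and the vectorized way of simulating the sample means are my own choices, and the draws will not match the earlier ones exactly.

```r
set.seed(12345)
nreps <- 1e5
# Each row of the matrix holds 10 Uniform(0,1) draws; rowMeans gives xbar
xbar_sims <- rowMeans(matrix(runif(10 * nreps), ncol = 10))
z <- sqrt(120) * (xbar_sims - 1/2)

F_hat <- ecdf(z)                         # ecdf() returns a function of t
tgrid <- seq(-4, 4, by = 0.01)
gap <- abs(F_hat(tgrid) - pnorm(tgrid))  # vertical distance at each t
max_gap <- max(gap)
max_gap                                  # largest discrepancy on the grid
tgrid[which.max(gap)]                    # where it occurs
```

A small maximum gap confirms what the plot suggests, but notice that this summary is dominated by the middle of the distribution, where the CDF is steep. It says little about the tails.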

Quantile-Quantile Plots

So far we’ve seen that the histogram of xbar_sims is bell-shaped, and that the empirical CDF of sqrt(120) * (xbar_sims - 0.5) is well-approximated by the standard normal CDF pnorm(). If you’re still not convinced that the CLT can work perfectly well with $n = 10$, the final plot that we’ll make should dispel any remaining doubts. As its name suggests, a quantile-quantile plot compares the quantiles of two probability distributions. But rather than comparing two quantile functions plotted against $p$, it compares the quantiles of two distributions plotted against each other. This is a bit confusing the first time you encounter it, so we’ll take things step-by-step.

If our simulated z-scores from above are well-approximated by a standard normal distribution, then their median should be close to that of a standard normal random variable, i.e. zero. This is indeed the case:

median(z)

## [1] 0.0001169817

But it’s not just the medians that should be close to each other: all the quantiles should be. So now let’s look at the 25th-percentile and 75th-percentile as well. Rather than computing them one-by-one, we can generate them in a single batch by first setting up a vector p of probabilities and using rbind to print the results in a convenient format

p <- c(0.25, 0.5, 0.75)
rbind(normal = qnorm(p), simulation = quantile(z, probs = p))

##                   25%          50%       75%
## normal     -0.6744898 0.0000000000 0.6744898
## simulation -0.6779843 0.0001169817 0.6791994

This looks good as well. If we want to compare quantiles over a finer grid of values for p, it’s more convenient to make a plot rather than a table. Suppose that we treat the values of qnorm(p) as $x$-coordinates and the corresponding quantiles of z as $y$-coordinates. If the CLT is giving us a good approximation, then we should have $x \approx y$ and all of the points should fall near the 45-degree line. This is indeed what we observe:

p <- seq(from = 0.05, to = 0.95, by = 0.05)
x <- qnorm(p)
y <- quantile(z, probs = p)
plot(x, y, xlab = 'std. normal quantiles', ylab = 'quantiles of z')
abline(0, 1) # plot the 45-degree line

The plot that we have just made is called a normal quantile-quantile plot. It is constructed as follows:

Set up a vector p of probabilities.
Calculate the corresponding quantiles of a standard normal RV, qnorm(p). Call these $x$.
Calculate the corresponding quantiles of your data, quantile(your_data_here, probs = p). Call them $y$.
Plot $y$ against $x$.
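The four steps above can be wrapped in a small helper function. The name manual_qq and its default grid of probabilities are hypothetical choices of mine, not part of base R. To show what happens when the data are normal but not standard normal, the sketch applies the function to draws from a $N(3, 2^2)$ distribution.

```r
# Hypothetical helper that follows the four steps listed above
manual_qq <- function(dat, p = seq(from = 0.05, to = 0.95, by = 0.05)) {
  x <- qnorm(p)                     # quantiles of a standard normal
  y <- quantile(dat, probs = p)     # corresponding quantiles of the data
  plot(x, y, xlab = 'std. normal quantiles', ylab = 'quantiles of data')
  invisible(data.frame(x = x, y = unname(y)))
}

set.seed(42)
draws <- rnorm(1e5, mean = 3, sd = 2)   # normal, but not standard normal
qq <- manual_qq(draws)
# The points fall near a line with intercept 3 (the mean) and slope 2 (the sd)
qq_line <- coef(lm(y ~ x, data = qq))
qq_line
abline(3, 2)
```

The intercept and slope arise because the quantiles of a $N(\mu, \sigma^2)$ random variable equal $\mu + \sigma$ times the corresponding standard normal quantiles.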

If the points all fall on a line, then the quantiles of the observed data agree with those of some normal distribution, although perhaps not a standard normal. If we standardize the data before making such a plot, as we did to construct z above, the relevant line will be the 45-degree line. If not, it will be a different line but the interpretation remains the same. The easiest way to make a normal quantile-quantile plot in R is by using the function qqnorm followed by qqline. We could do this either using the centered and standardized simulation draws z or the original draws xbar_sims:

par(mfrow = c(1, 2))
qqnorm(z, ylab = 'Quantiles of z')
qqline(z)
qqnorm(xbar_sims, ylab = 'Quantiles of xbar_sims')
qqline(xbar_sims)

par(mfrow = c(1, 1))

The only difference between these two plots is the scale of the $y$-axis. The plot that uses the original simulation draws xbar_sims has a $y$-axis that runs between $0.1$ and $0.9$ because the sample average of $\text{Uniform}(0,1)$ random variables must lie within the interval $[0,1]$. In contrast, the corresponding $z$-scores lie in the range $[-4,4]$.⁴ For $x$-values between $-3$ and $3$, we can’t even see the line generated by qqline: the quantiles of our simulation draws are extremely close to those of a normal distribution. Outside of this range, however, we see that the black circles curve away from the line. For values of $x$ around $-4$, the quantiles of z are above those of a standard normal, i.e. shifted to the right. For values of $x$ around $4$, the picture is reversed: the quantiles of z are below those of a standard normal, i.e. shifted to the left. This means that z has lighter tails than a standard normal: it is a bit less likely to yield extremely large positive or negative values, for example

cbind(simulation = quantile(z, 0.0001), normal = qnorm(0.0001))

##       simulation    normal
## 0.01%  -3.563576 -3.719016

This makes perfect sense. A standard normal can take on arbitrarily large values, while the sample mean of ten uniforms is necessarily bounded above by $1$. So if you want to carry out a $0.01\%$ test ($\alpha = 0.0001$), the approximation provided by the CLT won’t quite cut it with $n = 10$ in this example. But for any conventional significance level, it’s nearly perfect:

p <- c(0.01, 0.025, 0.05, 0.1)
rbind(normal = qnorm(p), simulation = quantile(z, probs = p))

##                   1%      2.5%        5%       10%
## normal     -2.326348 -1.959964 -1.644854 -1.281552
## simulation -2.302793 -1.958568 -1.654508 -1.290797

A sample size of thirty isn’t sufficient.

Now suppose that $n = 100$ and $X_1, \dots X_n \sim$ iid Bernoulli$(1/60)$. What is the CDF of $\bar{X}_n$? Rather than approximating the answer to this question by simulation, as we did in the uniform example from above, we’ll work out the exact result and compare it to the approximation provided by the CLT. If $X_1, \dots, X_n \sim$ iid Bernoulli$(p)$, then by definition the sum $S_n = \sum_{i=1}^n X_i$ follows a Binomial$(n,p)$ distribution. The probability mass function and CDF of this distribution are available in R via the dbinom() and pbinom() commands. So what about $\bar{X}_n$? Notice that \[ \mathbb{P}(\bar{X}_n = x) = \mathbb{P}(S_n/n = x) = \mathbb{P}(S_n = nx) \] Thus, if $f(s) = \mathbb{P}(S_n = s)$ is the pmf of $S_n$ for $s \in \{0, 1, \dots n\}$, it follows that $f(nx)$ is the pmf of $\bar{X}_n$ for $x \in \{0, 1/n, 2/n, \dots, 1\}$. This means that we can use dbinom to plot the exact sampling distribution of $\bar{X}_n$ when $n = 100$ and $p = 1/60$ as follows⁵

n <- 100
p <- 1/60
x <- seq(from = 0, to = 1, by = 1/n)
P_x_bar <- dbinom(n * x, size = n, prob = p)
plot(x, P_x_bar, type = 'h', xlim = c(0, 0.1), ylab = 'pmf of Xbar',
lwd = 2, col = 'blue')

The result is far from a normal distribution. Not only is it noticeably discrete, it is also seriously asymmetric. Another way to see this is by examining the CDF. If the central limit theorem is working well in this example, we should have $\bar{X}_n \approx N\big(p, p(1 - p)/n\big)$. Extending the idea from above, we can plot the exact CDF of $\bar{X}_n$ using the binomial CDF pbinom() and compare it to the approximation suggested by the CLT:

x <- seq(-0.02, 0.08, by = 0.001)
F_x_bar <- pbinom(n * x, size = n, prob = p)
F_clt <- pnorm(x, p, sqrt(p * (1 - p) / n))
plot(x, F_x_bar, type = 's', ylab = '', lwd = 2, col = 'blue')
points(x, F_clt, type = 'l', lty = 2, lwd = 2, col = 'red')
legend('topleft', legend = c('Exact', 'CLT'), col = c('blue', 'red'),
lty = 1:2, lwd = 2)

The approximation is noticeably poor, but is the problem serious enough to affect any inferences we might hope to draw? Suppose we wanted to construct a 95% confidence interval for $p$. The textbook approach, based on the CLT, would have us report $\widehat{p} \pm 1.96 \times \sqrt{\widehat{p}(1 - \widehat{p})/n}$ where $\widehat{p}$ is the sample proportion, i.e. $\bar{X}_n$. Let’s set up a little simulation experiment to see how well this interval performs when $n = 100$ and $p = 1/60$.

# Simulate 5000 draws for phat
# with p = 1/60, n = 100
set.seed(54321)
draw_sim_phat <- function(p, n) {
x <- rbinom(n, size = 1, prob = p)
phat <- mean(x)
return(phat)
}
p_true <- 1/60
sample_size <- 100
phat_sims <- replicate(5000, draw_sim_phat(p = p_true, n = sample_size))
# What fraction of the CIs cover the true value of p?
SE <- sqrt(phat_sims * (1 - phat_sims) / sample_size)
lower <- phat_sims - 1.96 * SE
upper <- phat_sims + 1.96 * SE
coverage_prob <- mean((lower <= p_true) & (p_true <= upper))
coverage_prob

## [1] 0.824

So only 82% of these supposed 95% confidence intervals actually cover the true value of $p$! Clearly 100 observations aren’t enough to rely upon the CLT in this example.
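Because $S_n$ is binomial, the coverage probability can also be worked out without simulation: enumerate every possible number of successes, check whether the resulting interval covers $p$, and add up the binomial probabilities of the cases where it does. The following is a sketch of that calculation using the same interval formula as above. The variable names are my own, and the answer need not match the simulated value exactly, since the latter is subject to simulation error.

```r
n <- 100
p_true <- 1/60
s <- 0:n                                  # every possible number of successes
phat <- s / n
SE <- sqrt(phat * (1 - phat) / n)
# Does the textbook interval based on s successes cover the true p?
covers <- (phat - 1.96 * SE <= p_true) & (p_true <= phat + 1.96 * SE)
prob_s <- dbinom(s, size = n, prob = p_true)
exact_coverage <- sum(prob_s[covers])
exact_coverage

# With zero successes, phat = 0 and SE = 0, so the interval collapses to the
# single point zero and cannot cover p. This happens with probability
prob_zero <- prob_s[1]
prob_zero
```

Most of the shortfall relative to 95% comes from samples that contain no successes at all, which occur with probability $(59/60)^{100}$, or a bit under 19%.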

Epilogue

I hope these examples have convinced you that, in spite of what you may have heard elsewhere, $n\geq 30$ is neither necessary nor sufficient for the CLT to provide an adequate approximation. But some important questions remain. First, what can we do in situations like the second example, where we want to carry out inference for a small proportion? The problem is hardly academic: at the time of this writing, the most recent estimate of coronavirus prevalence in the UK was approximately 0.2%, i.e. nearly ten times smaller than the value I used for $p$ in my second example.⁶ Second, how did the $n\geq 30$ folk wisdom arise? Is there anything that we can say about $n\geq 30$? Finally, are there any theoretical results that can provide guidance about the quality of the approximation provided by the CLT? These questions will have to wait for a future post!

More formally, $U\sim$ Uniform$(0,1)$ if and only if $\mathbb{P}(a \leq U \leq b) = (b - a)$ for any $0 \leq a \leq b \leq 1$. ↩︎
If you’re taking introductory probability and statistics, filling in the missing details for these calculations would be an excellent homework problem!↩︎
Here we take $J = 100,000$ which is more than enough for the purposes of this exercise.↩︎
Recall that we subtracted $1/2$ and multiplied by $\sqrt{120}\approx 11$ to construct z from xbar_sims.↩︎
Notice that I “zoomed in” on the most interesting part of the plot by setting xlim = c(0, 0.1).↩︎
Source: REACT-1 study of coronavirus transmission: March 2021 final results ↩︎

Econometricians Anonymous

Thu, 01 Apr 2021 00:00:00 +0000

Some years back, I wrote a monologue for the Economics Department Skit Night at UPenn. It was a niche venue, granted, but the crowd seemed to enjoy my contribution. In honor of April Fool’s day I’ve posted a lightly-edited version below. On the off chance that anyone else finds this amusing, I grant unlimited rights for this material to be borrowed, adapted, remixed, stolen, execrated, or burned in effigy as you see fit. It’s funnier when you use the names of your own colleagues so I encourage you to fill in the blanks below!

Group leader walks in and writes “Econometricians Anonymous” in big letters on the blackboard.

Group Leader: Thanks for coming everyone. Tonight we’re going to hear from [YOUR NAME].

Econometrician: I’m [YOUR NAME] and I’m an econometrician. It’s been six months, twelve days, and five hours since my last derivation.

So how did it all start? Like a lot of people, I started out deriving socially: at parties, out clubbing with the econometrics group. [SENIOR ECONOMETRICS COLLEAGUE] would have the bartender set us up with a dozen lemmas and we’d each prove four on the spot: one right after the other. Sure it was a little wild, but I always told myself I was in control. I realize now that I wasn’t.

The more I derived, the more I needed to derive. Sometimes I couldn’t find any co-authors to derive with me, and eventually I started deriving alone. I still remember waking up on the floor of my office after one of my all-night limit theory benders: pads of paper covered with equations strewn about the floor. I was a mess.

Pretty soon they started recognizing me in stationery shops and office supply stores. Sometimes they wouldn’t sell me paper and pencils. One night in my desperation, I broke into the Econ office to steal some notepads. [DEPARTMENT ADMINISTRATOR] was waiting for me: “I think you’ve had enough of those, [YOUR NAME].” It was the most embarrassing moment of my adult life.

I didn’t realize it at the time, but I had completely cut myself off from my family, friends, and colleagues. I kept finding ways to justify my behavior: “Don’t listen to them: Econometrica’s a great journal. Who cares if no one uses your results?” It all sounds so hollow now, but I really believed it at the time.

Pretty soon I started making outrageous assumptions in my papers: “Suppose that X has finite 128th moments; these regularity conditions are basically standard.” I was out of my mind. Eventually, I started experimenting with simulations. After a while, real data just didn’t do it for me. How could it when I could make thousands of pristine pseudo-random draws dance across my laptop screen at the touch of a button, any time of day or night?

I don’t know what would have happened to me if my friends hadn’t staged an intervention. When I got home there were two stacks on the coffee table: one of my recent working papers and the other, ten times as thick, of the corresponding technical appendices. I knew I had hit rock bottom.

But I’ve made a lot of progress since then, thanks in large part to the love and support of my fellow recovering econometricians here at Econometricians Anonymous. Still, it’s a daily struggle to stay clean. I remember calling my sponsor [APPLIED COLLEAGUE] back in December. It was the middle of the night, I was in the computer lab and I had just double-clicked on the Matlab icon. [APPLIED COLLEAGUE] was there in 15 minutes. He logged me off the computer, took me to a diner and ordered us coffee. We stayed up most of the night talking and running cross-country growth regressions in STATA. It really helped.

I’m doing a lot better now. I’m doing applied work, I’m publishing in general interest journals, and people are citing my research. And I’m here to tell you that if I can do it so can you. We’re all here, all of us at Econometricians Anonymous, to help each other kick the habit. Thank you.

Group Leader: Thanks [YOUR NAME]. That’s all for tonight, but we hope you’ll join us tomorrow for Game Theorists Anonymous where we’ll be hearing about [SENIOR THEORY COLLEAGUE]’s exciting new research agenda in [RESEARCH AREA THAT SENIOR THEORY COLLEAGUE DEPLORES].

Past the Peak? Excess Deaths in England and Wales

Wed, 13 May 2020 00:00:00 +0000

Since my previous post, the Office for National Statistics has posted updated data on weekly deaths, and we’ve updated rcovidUK accordingly. Here’s where things stand:

Figure 1: Total Weekly Deaths in England & Wales.

With the caveat that the date at which a death is reported need not agree with the date at which it actually occurred, excess deaths appear to have peaked and begun to decline in each of the ten regions. (Recently the ONS has started posting data on occurrences rather than reports, but these are unavailable for the historical comparison we’re making here.) As in my earlier post, the red curve is an equally-weighted average of reported weekly deaths in a given region for the past five years, up to and including 2019. Weeks are defined by the ONS to end on Fridays. This means that a given week does not necessarily correspond to the same days of the year across years, but always contains exactly one of each day of the week: Monday, Tuesday, etc. This eliminates seasonality from day-of-the-week effects, but creates the possibility of neglected seasonality from holidays that move across weeks during different years, e.g. Easter. For each point on the red curve, we have added simple error bars: $\pm 2 \times \text{SE}$ where SE is the usual standard error for a mean, ignoring any possible dependence between deaths in the same week across different years. The blue curve depicts weekly deaths in 2020.
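To make the error bar calculation concrete, here is a minimal sketch for a single region and week. The five death counts are made-up numbers chosen for easy arithmetic, not ONS data, and the variable names are only illustrative.

```r
# Hypothetical deaths in one region during the same week of 2015-2019
# (made-up numbers, not ONS data)
deaths_prev5 <- c(500, 520, 480, 510, 490)

# Height of the red curve: equally-weighted average over the five years
deaths_mean <- mean(deaths_prev5)

# Usual standard error of a mean, treating the five years as independent
se <- sd(deaths_prev5) / sqrt(length(deaths_prev5))

# Error bars: mean plus/minus two standard errors
error_bar <- c(lower = deaths_mean - 2 * se, upper = deaths_mean + 2 * se)
error_bar
```

With only five observations per point the standard error is itself estimated quite imprecisely, so the bars are best read as a rough guide to year-to-year variation.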

We had hoped to include data from Scotland in this update, but it turns out that the NRS, Scotland’s equivalent of the ONS, uses a different definition of weeks of the year, so it wouldn’t be easy to line up the respective series for purposes of comparison. We may revisit this later on.

If you’d like to replicate our plots, the R code is as follows:

library(tidyverse)
library(rcovidUK)

# Set the most recent week of data for 2020
week_2020 <- ONSweekly %>%
  filter(year == 2020, reg_nm == "North East", !is.na(deaths)) %>%
  pull(week) %>%
  max()

df_2020 <- ONSweekly %>%
  filter(year == 2020)

df_prev5 <- ONSweekly %>%
  filter(year < 2020 & year >= 2015) # Use last 5 years for average

df_prev5 %>%
  group_by(reg_nm, week) %>%
  summarise(deaths_mean = mean(deaths),
            deaths_sd = sd(deaths),
            n = n(),
            se = deaths_sd / sqrt(n)) %>%
  mutate(year = "2015-2019") %>%
  rename(deaths = deaths_mean) %>%
  bind_rows(df_2020) %>%
  filter(week <= week_2020) %>%
  ggplot(aes(x = week, y = deaths, col = year, group = reg_id)) +
  geom_errorbar(aes(ymin = deaths - 2 * se, ymax = deaths + 2 * se),
                width = .2, colour = "black") +
  geom_line(size = 1) +
  facet_wrap(~reg_nm, ncol = 2) +
  scale_color_brewer(palette = "Set1",
                     labels = c("Av. Past 5 Years", "2020")) +
  labs(x = "Week from Start of Year",
       y = "Deaths per Week") +
  theme_bw() +
  theme(legend.position = 'top',
        legend.title = element_blank(),
        legend.key.size = unit(1, "cm"))

UK Excess Deaths by Age Group and Sex

Wed, 13 May 2020 00:00:00 +0000

In an earlier post I looked at regional differences in the effects of Covid-19 by calculating excess deaths in each week of 2020 relative to an average of the preceding five years. There the idea was to get a sense for the number of deaths that were likely caused by Covid-19 regardless of whether they were officially registered as such, e.g. deaths that occurred outside of hospitals. Thanks to my RA Dan Mead, here’s the same analysis carried out by age group and sex rather than region:

Figure 1: Total Weekly Deaths in England & Wales by Sex and Age Group.

The red curve is an equally-weighted average of reported weekly deaths for the past five years up to and including 2019, while the blue curve depicts weekly deaths in 2020. All calculations are based on data from the ONS. Each point on the red curve is accompanied by error bars indicating plus/minus two standard errors of the mean. Roughly speaking these bounds are equivalent to the margin of error from a typical opinion poll. (For the precise calculations, see the R code below.)

Deaths below age 1 show no clear pattern in 2020 relative to the average of past years. There’s a hint of slightly lower deaths among males aged 1-14 since the lockdown began, although the data are quite noisy given the extremely small number of deaths in this age group. For the remaining age groups, we start to see a clear effect around the time of the Covid-19 outbreak, approximately weeks 11 through 22 of 2020. Because death rates vary with age and sex in both ordinary years and 2020, the cleanest way to compare across groups is by computing relative excess deaths. To do this, we add up the differences between the blue curve and the red curve between weeks 11 and 22 and divide the result by the sum of the red curve during the same period of time. The result gives us the percentage by which total deaths in a given age/sex group in weeks 11 through 22 of 2020 exceed the “usual” number of deaths for this age/sex group during the same weeks based on past data.
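The arithmetic behind relative excess deaths can be sketched on a toy example. Everything below is hypothetical: the data frame `toy`, its column names, the helper `relative_excess`, and all of the numbers are invented purely to illustrate the formula and are not taken from the ONS data.

```r
# Hypothetical weekly deaths for a single age/sex group (illustrative numbers only)
toy <- data.frame(
  week        = 10:13,
  deaths_avg  = c(100, 100, 100, 100),  # "red curve": average of past five years
  deaths_2020 = c(98, 110, 130, 150)    # "blue curve": deaths in 2020
)

# Relative excess deaths between two weeks (inclusive):
# sum of (2020 deaths - average deaths) divided by sum of average deaths
relative_excess <- function(df, first_week, last_week) {
  in_window <- df$week >= first_week & df$week <= last_week
  excess <- sum(df$deaths_2020[in_window] - df$deaths_avg[in_window])
  excess / sum(df$deaths_avg[in_window])
}

# Weeks 11-13: (10 + 30 + 50) / 300
rel <- relative_excess(toy, first_week = 11, last_week = 13)
rel
```

Dividing by the usual number of deaths in the same window is what makes the figure comparable across groups whose baseline death rates differ widely.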

In the 15-44 age group, male deaths during weeks 11 through 22 of 2020 were 6.4% higher than usual. In contrast, female deaths were 9.7% higher than usual. Deaths in this age group, however, are relatively rare. Accordingly, the blue curves lie comfortably within the error bars for all but a small number of weeks. For the remaining age groups, the picture becomes much starker and the relative excess deaths much larger:

Age Group	Men	Women
45-64	+42.2%	+32.4%
65-74	+42.9%	+31.6%
75-84	+58.2%	+45.5%
Over 85	+65.3%	+48.5%

In each age group, relative excess deaths are substantially higher for men than women. For example, male deaths in the 45-64 age group were 42.2% higher than usual in weeks 11 through 22 of 2020 compared to 32.4% higher for women. Although relative excess deaths increase markedly with age for both sexes, even the rates in the 45-64 group are alarming, particularly for men.

You can replicate the plot from above by running this R code:

library(tidyverse)
library(rcovidUK)

# Set the most recent week of data for 2020
week_2020 <- ONSweeklyagegender %>%
  filter(year == 2020, age == "<1", !is.na(deaths)) %>%
  pull(week) %>%
  max()

df_2020 <- ONSweeklyagegender %>%
  filter(year == 2020)

df_prev5 <- ONSweeklyagegender %>%
  filter(year < 2020 & year >= 2015) # Use last 5 years for average

df_prev5 %>%
  group_by(age, week, gender) %>%
  summarise(deaths_mean = mean(deaths),
            deaths_sd = sd(deaths),
            n = n(),
            se = deaths_sd / sqrt(n)) %>%
  ungroup() %>%
  mutate(year = "2015-2019") %>%
  rename(deaths = deaths_mean) %>%
  bind_rows(df_2020) %>%
  filter(week <= week_2020) %>%
  ggplot(aes(x = week, y = deaths, col = year)) +
  geom_line(size = 1) +
  geom_errorbar(aes(ymin = deaths - 2 * se, ymax = deaths + 2 * se),
                width = .2, colour = "black") +
  facet_grid(vars(age), vars(gender), scales = "free") + # separate axes per panel
  scale_color_brewer(palette = "Set1",
                     labels = c("Av. Past 5 Years", "2020")) +
  labs(x = "Week from Start of Year",
       y = "Deaths per Week") +
  theme_bw() +
  theme(legend.position = 'top',
        legend.title = element_blank(),
        legend.key.size = unit(1, "cm"))

Excess Deaths in England and Wales

Mon, 04 May 2020 00:00:00 +0000

To get a better idea of the impact of Covid-19 in the UK, Dan Mead and I put together an R package rcovidUK with weekly deaths in England and Wales, taken from the Office for National Statistics (ONS), allowing us to produce a plot of total deaths by region in 2020:

Figure 1: Total Weekly Deaths in England & Wales.

The red curve is an equally-weighted average of weekly deaths in a given region for the past five years, up to and including 2019. Weeks are defined by the ONS to end on Fridays. This means that a given week does not necessarily correspond to the same days of the year across years, but always contains exactly one of each day of the week: Monday, Tuesday, etc. This eliminates seasonality from day-of-the-week effects, but creates the possibility of neglected seasonality from holidays that move across weeks during different years, e.g. Easter. For each point on the red curve, we have added simple error bars: $\pm 2 \times \text{SE}$ where SE is the usual standard error for a mean, ignoring any possible dependence between deaths in the same week across different years. The blue curve depicts weekly deaths in 2020. From week 12 or 13, depending on region, we see a dramatic uptick in deaths across England and Wales, well outside the two standard error bars. R source code for this plot follows below.

library(tidyverse)
library(rcovidUK)

# Set the most recent week of data for 2020
week_2020 <- 16 # original version of this post only had data up to week 16
# week_2020 <- ONSweekly %>%
#   filter(year == 2020, !is.na(deaths)) %>%
#   pull(week) %>%
#   max

df_2020 <- ONSweekly %>%
  filter(year == 2020)

df_prev5 <- ONSweekly %>%
  filter(year < 2020 & year >= 2015)

df_prev5 %>%
  group_by(reg_nm, week) %>%
  summarise(deaths_mean = mean(deaths),
            deaths_sd = sd(deaths),
            n = n(),
            se = deaths_sd / sqrt(n)) %>%
  mutate(year = "2015-2019") %>%
  rename(deaths = deaths_mean) %>%
  bind_rows(df_2020) %>%
  filter(week < week_2020) %>%
  ggplot(aes(x = week, y = deaths, col = year,
             group = reg_id)) +
  geom_errorbar(aes(ymin = deaths - 2 * se,
                    ymax = deaths + 2 * se),
                width = .2, colour = "black") +
  geom_line(size = 1) +
  facet_wrap(~reg_nm, ncol = 2) +
  scale_color_brewer(palette = "Set1",
                     labels = c("Avg Past 5 Yrs", "2020")) +
  labs(x = "Week from Start of Year",
       y = "Deaths per Week") +
  theme_bw() +
  theme(legend.position = 'top',
        legend.title = element_blank(),
        legend.key.size = unit(1, "cm"))


	\(Y=0\)	\(Y=1\)
\(X = -1\)	\(1/3\)	\(0\)
\(X = 0\)	\(0\)	\(1/3\)
\(X= 1\)	\(1/3\)	\(0\)

econometrics.blog

Not Quite the James-Stein Estimator

Warm-up Exercise

Review of Bias, Variance and MSE

A Shrinkage Estimator

Some Algebra

Stein’s Paradox

Recap

Admissibility

A More General Example

Composite MSE

Stein’s Paradox

Where does the James-Stein Estimator Come From?

An Infeasible Estimator When \(p = 2\)

A Simulation Experiment for \(p = 2\)

An Infeasible Estimator: The General Case

Not Quite the James-Stein Estimator

Conclusion

How to Do Regression Adjustment

A Quick Review

Which regression should we run?

Two Simple Alternatives

What if we ignore the interaction?

What about the general case?

Does it really work? An Empirical Example

Computing the ATE the Hard Way

Computing the ATE the Easy Way

Standard Errors

Excluding the Interaction

Conclusion

Appendix: The Missing Algebra

Is it better to improve sensitivity or specificity?

An Open Invitation

Odds aren’t so odd!

The Solution

Epilogue

How to Read an Econometrics Paper

Read Something Else Instead

Don’t Assume You Have to Understand the Whole Thing

Don’t Assume You’re Stupid

Spread Yourself Thin

Explain It to Someone Else

Head Straight for the Simulation / Empirical Example

Make Things Simpler

Don’t Get Hung Up on Technicalities

Be Appropriately Skeptical of Asymptotics

Sims and Uhlig (1991) Replication

A Simple Example

A Not-so-simple Example

The Replication

The Return of econometrics.blog!

A Good Instrument is a Bad Control

The Model

A Simulation Example

The General Result

Regression of \(Y\) on \(X\) and \(Z\)

Regression of \(Y\) on \(X\) Only

Comparing the Results

Some Intuition

The R Formula Cheatsheet

Random Variables Cheatsheet

Why Econometrics is Confusing Part II: The Independence Zoo

Prerequisites

Two Examples

Example #1 - Discrete RVs \((X,Y)\)

Example #2 - Continuous RVs \((W,Z)\)

Uncorrelatedness

Example #1: \(X\) and \(Y\) are uncorrelated.

Example #2: \(W\) and \(Z\) are uncorrelated.

Conditional Mean Independence

Example #1: \(X\) is mean independent of \(Y\).

Example #1: \(Y\) is NOT mean independent of \(X\).

Example #2: \(Z\) is NOT mean independent of \(W\).

Example #2: \(W\) is mean independent of \(Z\).

Statistical Independence

Example #1: \(X\) and \(Y\) are NOT independent.

Example #2: \(W\) and \(Z\) are NOT independent.

Relating the Three Properties

Uncorrelatedness and Statistical Independence are Symmetric

Statistical Independence Implies Conditional Mean Independence