# (Mis)understanding Selection on Observables

On a recent exam I asked students to extend the logic of propensity score weighting to handle a treatment that takes on three rather than two values: basically a stripped-down version of Imbens (2000). Nearly everyone figured this out without much trouble, which is good news! At the same time, I noticed some common misconceptions about the all-important selection-on-observables assumption: $\mathbb{E}[Y_0|D,X] = \mathbb{E}[Y_0|X] \quad \text{and} \quad \mathbb{E}[Y_1|D,X] = \mathbb{E}[Y_1|X]$ where $$(Y_0, Y_1)$$ are the potential outcomes corresponding to a binary treatment $$D$$ and $$X$$ is a vector of observed covariates.1 Since more than a handful of students made the same mistakes, it seemed like a good opportunity for a short post.

# Two Misconceptions

The following two statements about selection on observables are false:

1. Under selection on observables, if I know the value of someone’s covariate vector $$X$$, then learning her treatment status $$D$$ provides no additional information about the average value of her observed outcome $$Y$$.
2. Selection on observables requires the treatment $$D$$ and potential outcomes $$(Y_0,Y_1)$$ to be conditionally independent given covariates $$X$$.

If you’ve studied treatment effects, pause for a moment and see if you can figure out what’s wrong with each of them before reading further.

# The First Misconception

The first statement:

Under selection on observables, if I know the value of someone’s covariate vector $$X$$, then learning her treatment status $$D$$ provides no additional information about the average value of her observed outcome $$Y$$.

is a verbal description of the following conditional mean independence condition: $\mathbb{E}[Y|X,D] = \mathbb{E}[Y|X].$ So what’s wrong with this equality? The potential outcomes $$(Y_0, Y_1)$$ and the observed outcome $$Y$$ are related according to $Y = Y_0 + D (Y_1 - Y_0).$ Taking conditional expectations of both sides and using the selection on observables assumption \begin{aligned} \mathbb{E}[Y|X,D] &= \mathbb{E}[Y_0|X,D] + D \mathbb{E}[Y_1 - Y_0|D,X]\\ &= \mathbb{E}[Y_0|X] + D \mathbb{E}[Y_1 - Y_0|X]. \end{aligned} In contrast, conditioning on $$X$$ alone gives \begin{aligned} \mathbb{E}[Y|X] &= \mathbb{E}[Y_0|X] + \mathbb{E}[D(Y_1 - Y_0)|X]\\ &= \mathbb{E}[Y_0|X] + \mathbb{E}_{D|X}[D\mathbb{E}(Y_1 - Y_0|D,X)]\\ &= \mathbb{E}[Y_0|X] + \mathbb{E}_{D|X}[D\mathbb{E}(Y_1 - Y_0|X)]\\ &= \mathbb{E}[Y_0|X] + \mathbb{E}(D|X) \cdot \mathbb{E}(Y_1 - Y_0|X) \end{aligned} by iterated expectations and the selection on observables assumption, since $$\mathbb{E}(Y_1 - Y_0|X)$$ is a measurable function of $$X$$. Subtracting these expressions, we find that $\mathbb{E}(Y|X,D) - \mathbb{E}(Y|X) = \left[ D - \mathbb{E}(D|X) \right] \cdot \mathbb{E}(Y_1 - Y_0|X)$ so that $$\mathbb{E}(Y|X,D) = \mathbb{E}(Y|X)$$ if and only if the RHS equals zero.

So how could the RHS equal zero? One way is if $$D = \mathbb{E}(D|X)$$. Since $$D$$ is a binary random variable, this would require $$\mathbb{E}(D|X)$$ to be a binary random variable as well. But notice that $$\mathbb{E}(D|X) = \mathbb{P}(D=1|X)$$ is simply the propensity score $$p(X)$$. Because $$X$$ is a random variable, so is $$p(X)$$. But $$p(X)$$ cannot take on the values zero or one. If it did, this would violate the overlap assumption: $$0 < p(X) < 1$$.

So we can’t have $$D = \mathbb{E}(D|X)$$, but what about $$\mathbb{E}(Y_1 - Y_0|X)=0$$? Since $$(Y_1 - Y_0)$$ is the treatment effect of $$D$$, it follows that $$\mathbb{E}(Y_1 - Y_0|X)$$ is the conditional average treatment effect $$\text{ATE}(X)$$ given $$X$$. It’s not a contradiction for $$\text{ATE}(X)$$ to equal zero, but think about what it would mean: it would require that the average treatment effect for a person with covariates $$(X = x)$$ is exactly zero regardless of $$x$$. Moreover, by iterated expectations it would imply that $\text{ATE} = \mathbb{E}(Y_1 - Y_0) = \mathbb{E}_X[\mathbb{E}(Y_1 - Y_0| X)] = \mathbb{E}[\text{ATE}(X)] = 0$ so the average treatment effect would also be zero. Again, this is not a contradiction but it would definitely be odd to assume that the treatment effect is zero before you even try to estimate it!

To summarize: the first statement above cannot be an implication of selection on observables because it would either require a violation of the overlap assumption, or imply that there is no treatment effect whatsoever. To correct the statement, we simply need to change the last three words:

Under selection on observables, if I know the value of someone’s covariate vector $$X$$, then learning her treatment status $$D$$ provides no additional information about the average values of her potential outcomes $$(Y_0, Y_1)$$.

This is a correct verbal statement of the mean exclusion restriction $$\mathbb{E}(Y_0|D,X) = \mathbb{E}(Y_0|X)$$ and $$\mathbb{E}(Y_1|D,X) = \mathbb{E}(Y_1|X)$$.

# The Second Misconception

And this leads nicely to the second misconception:

Selection on observables requires the treatment $$D$$ and potential outcomes $$(Y_0,Y_1)$$ to be conditionally independent given covariates $$X$$.

To see why this is false, consider an example in which \begin{aligned} Y &= (1 - D) \cdot (\alpha_0 + X'\beta_0 + U_0) + D \cdot (\alpha_1 + X' \beta_1 + U_1)\\ U_0|(D,X) &\sim \text{Normal}(0,1 - D/2)\\ U_1|(D,X) &\sim \text{Normal}(0,1 + D). \end{aligned} Notice that the distributions of $$U_0$$ and $$U_1$$ given $$(D,X)$$ depend on $$D$$. Now, by iterated expectations, \begin{aligned} \mathbb{E}(U_0|X) &= \mathbb{E}_{(D|X)}[\mathbb{E}(U_0|D,X)] = 0\\ \mathbb{E}(U_0) &= \mathbb{E}_{X}[\mathbb{E}(U_0|X)] = 0 \end{aligned} and similarly $$\mathbb{E}(U_1|X) = \mathbb{E}(U_1)=0$$. Substituting $$D=0$$ and $$D=1$$, we can calculate the potential outcomes and average treatment effect as follows \begin{aligned} Y_0 &= \alpha_0 + X'\beta_0 + U_0 \\ Y_1 &= \alpha_1 + X'\beta_1 + U_1 \\ \text{ATE} &= \mathbb{E}(Y_1 - Y_0) = (\alpha_1 - \alpha_0) + \mathbb{E}[X'](\beta_1 - \beta_0). \end{aligned} It follows that $$D$$ is not conditionally independent of $$(Y_0, Y_1)$$ given $$X$$. In particular, the variance of the potential outcomes depends on $$D$$ even after conditioning on $$X$$: \begin{aligned} \text{Var}(Y_0|X,D) &= \text{Var}(U_0|X,D) = 1 - D/2\\ \text{Var}(Y_1|X,D) &= \text{Var}(U_1|X,D) = 1 + D. \end{aligned} In spite of this, the selection on observables assumption still holds: \begin{aligned} \mathbb{E}(Y_0|D,X) &= \alpha_0 + X'\beta_0 + \mathbb{E}(U_0|D,X) = \alpha_0 + X'\beta_0\\ \mathbb{E}(Y_0|X) &= \alpha_0 + X'\beta_0 + \mathbb{E}(U_0|X) = \alpha_0 + X'\beta_0\\ \end{aligned} and similarly $$\mathbb{E}(Y_1|D,X) = \mathbb{E}(Y_1|X) = \alpha_1 + X'\beta_0$$. While this example is admittedly a bit peculiar, the point is more general: because the average treatment effect is an expectation, identifying it only requires assumptions about conditional means.2 The second statement is even easier to correct than the first: we need only add a single word:

Selection on observables requires the treatment $$D$$ and potential outcomes $$(Y_0,Y_1)$$ to be conditionally mean independent given covariates $$X$$.

Conditional independence implies conditional mean independence, but the converse is false.

# Epilogue

So what’s the moral here? First, it’s crucial to distinguish between the observed outcome $$Y$$ and the potential outcomes $$(Y_0, Y_1)$$. Second, the various notions of “unrelatedness” between random variables—independence, conditional mean independence, and uncorrelatedness—can be confusing. Be sure to pay attention to exactly which condition is used and why. In a future post, I’ll have more to say about the relationships between these notions.

1. For more details see my lecture notes on treatment effects↩︎

2. You might object that in the real world it is difficult to think of settings in which conditional mean independence is plausible but full independence does not. This is a fair point. Nevertheless, it’s important to be clear about which assumptions are actually used in a given derivation, and here we only rely on conditional mean independence.↩︎