# (Mis)understanding Selection on Observables

On a recent exam I asked students to extend the logic of propensity score weighting to handle a treatment that takes on *three* rather than two values: basically a stripped-down version of Imbens (2000). Nearly everyone figured this out without much trouble, which is good news! At the same time, I noticed some common misconceptions about the all-important *selection-on-observables* assumption:
\[
\mathbb{E}[Y_0|D,X] = \mathbb{E}[Y_0|X] \quad \text{and} \quad
\mathbb{E}[Y_1|D,X] = \mathbb{E}[Y_1|X]
\]
where \((Y_0, Y_1)\) are the potential outcomes corresponding to a binary treatment \(D\) and \(X\) is a vector of observed covariates.^{1} Since more than a handful of students made the same mistakes, it seemed like a good opportunity for a short post.
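As an aside, the three-valued extension works just as in the binary case: under the multi-valued analogue of selection on observables, each mean potential outcome is identified by weighting with the generalized propensity score, as in Imbens (2000). Here is a minimal simulation sketch; the data-generating process is invented for illustration and is not the one from the exam.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

X = rng.binomial(1, 0.5, n)  # a single binary covariate
# Generalized propensity scores p_d(x) = P(D = d | X = x), all bounded
# away from zero: the multi-valued analogue of the overlap assumption.
probs = np.where(X[:, None] == 1, [0.2, 0.3, 0.5], [0.5, 0.3, 0.2])
D = (rng.random(n)[:, None] > np.cumsum(probs, axis=1)).sum(axis=1)

# Potential outcomes Y_d = d + X + noise, so E[Y_d] = d + 0.5 here.
Y_pot = np.arange(3)[None, :] + X[:, None] + rng.normal(0, 1, (n, 3))
Y = Y_pot[np.arange(n), D]  # observed outcome

# IPW estimate of E[Y_d]: weight each outcome at treatment level d by 1/p_d(X).
ipw = [np.mean((D == d) * Y / probs[:, d]) for d in range(3)]
print([round(v, 2) for v in ipw])  # each estimate should be close to d + 0.5
```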

# Two Misconceptions

The following two statements about selection on observables are *false*:

- Under selection on observables, if I know the value of someone’s covariate vector \(X\), then learning her treatment status \(D\) provides no additional information about the average value of her observed outcome \(Y\).
- Selection on observables requires the treatment \(D\) and potential outcomes \((Y_0,Y_1)\) to be conditionally independent given covariates \(X\).

If you’ve studied treatment effects, pause for a moment and see if you can figure out what’s wrong with each of them before reading further.

# The First Misconception

The first statement:

Under selection on observables, if I know the value of someone’s covariate vector \(X\), then learning her treatment status \(D\) provides no additional information about the average value of her observed outcome \(Y\).

is a verbal description of the following conditional mean independence condition:
\[
\mathbb{E}[Y|X,D] = \mathbb{E}[Y|X].
\]
So what’s wrong with this equality? The potential outcomes \((Y_0, Y_1)\) and the observed outcome \(Y\) are related according to
\[
Y = Y_0 + D (Y_1 - Y_0).
\]
Taking conditional expectations of both sides and using the selection on observables assumption,
\[
\begin{aligned}
\mathbb{E}[Y|X,D] &= \mathbb{E}[Y_0|X,D] + D\, \mathbb{E}[Y_1 - Y_0|D,X]\\
&= \mathbb{E}[Y_0|X] + D\, \mathbb{E}[Y_1 - Y_0|X].
\end{aligned}
\]
In contrast, conditioning on \(X\) alone gives
\[
\begin{aligned}
\mathbb{E}[Y|X] &= \mathbb{E}[Y_0|X] + \mathbb{E}[D(Y_1 - Y_0)|X]\\
&= \mathbb{E}[Y_0|X] + \mathbb{E}_{D|X}[D\,\mathbb{E}(Y_1 - Y_0|D,X)]\\
&= \mathbb{E}[Y_0|X] + \mathbb{E}_{D|X}[D\,\mathbb{E}(Y_1 - Y_0|X)]\\
&= \mathbb{E}[Y_0|X] + \mathbb{E}(D|X) \cdot \mathbb{E}(Y_1 - Y_0|X)
\end{aligned}
\]
by iterated expectations and the selection on observables assumption, since \(\mathbb{E}(Y_1 - Y_0|X)\) is a measurable function of \(X\). Subtracting these expressions, we find that
\[
\mathbb{E}(Y|X,D) - \mathbb{E}(Y|X) = \left[ D - \mathbb{E}(D|X) \right] \cdot \mathbb{E}(Y_1 - Y_0|X)
\]
so that \(\mathbb{E}(Y|X,D) = \mathbb{E}(Y|X)\) if and only if the RHS equals zero.

So how could the RHS equal zero? One way is if \(D = \mathbb{E}(D|X)\). Since \(D\) is a binary random variable, this would require \(\mathbb{E}(D|X)\) to be a binary random variable as well. But notice that \(\mathbb{E}(D|X) = \mathbb{P}(D=1|X)\) is simply the propensity score \(p(X)\). Because \(X\) is a random variable, so is \(p(X)\). But \(p(X)\) *cannot* take on the values zero or one. If it did, this would violate the *overlap assumption*: \(0 < p(X) < 1\).

So we can’t have \(D = \mathbb{E}(D|X)\), but what about \(\mathbb{E}(Y_1 - Y_0|X)=0\)? Since \((Y_1 - Y_0)\) is the treatment effect of \(D\), it follows that \(\mathbb{E}(Y_1 - Y_0|X)\) is the conditional average treatment effect \(\text{ATE}(X)\) given \(X\). It’s not a contradiction for \(\text{ATE}(X)\) to equal zero, but think about what it would mean: the average treatment effect for a person with covariates \(X = x\) would have to be exactly zero *regardless* of \(x\). Moreover, by iterated expectations it would imply that
\[
\text{ATE} = \mathbb{E}(Y_1 - Y_0) = \mathbb{E}_X[\mathbb{E}(Y_1 - Y_0|
X)] = \mathbb{E}[\text{ATE}(X)] = 0
\]
so the average treatment effect would also be zero. Again, this is not a contradiction but it would definitely be odd to assume that the treatment effect is zero before you even try to estimate it!
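We can also check the key identity numerically. Here is a quick simulation, using an invented data-generating process in which overlap holds and \(\text{ATE}(X)\) is nonzero, confirming that \(\mathbb{E}(Y|X,D) - \mathbb{E}(Y|X)\) matches \(\left[D - p(X)\right] \cdot \text{ATE}(X)\) in every \((x, d)\) cell:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

X = rng.binomial(1, 0.5, n)    # binary covariate
p = 0.3 + 0.4 * X              # propensity score: strictly inside (0, 1)
D = rng.binomial(1, p)
tau = 1.0 + X                  # ATE(X): nonzero by construction
Y0 = X + rng.normal(0, 1, n)   # selection on observables holds because the
Y1 = Y0 + tau                  # noise is drawn independently of D given X
Y = Y0 + D * (Y1 - Y0)         # observed outcome

# Compare E[Y|X=x, D=d] - E[Y|X=x] against (d - p(x)) * ATE(x) in each cell.
diffs = []
for x in (0, 1):
    for d in (0, 1):
        lhs = Y[(X == x) & (D == d)].mean() - Y[X == x].mean()
        rhs = (d - (0.3 + 0.4 * x)) * (1.0 + x)
        diffs.append(abs(lhs - rhs))
        print(x, d, round(lhs, 2), round(rhs, 2))  # lhs ≈ rhs in every cell
```

Note that the difference is zero on average over \(D\), but never zero for an individual observation: \(D - p(X)\) is positive for the treated and negative for the untreated.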

To summarize: the first statement above cannot be an implication of selection on observables because it would either require a violation of the overlap assumption, or imply that there is no treatment effect whatsoever. To correct the statement, we simply need to change the last three words:

Under selection on observables, if I know the value of someone’s covariate vector \(X\), then learning her treatment status \(D\) provides no additional information about the average values of her potential outcomes \((Y_0, Y_1)\).

This is a correct verbal statement of the mean exclusion restriction \(\mathbb{E}(Y_0|D,X) = \mathbb{E}(Y_0|X)\) and \(\mathbb{E}(Y_1|D,X) = \mathbb{E}(Y_1|X)\).

# The Second Misconception

And this leads nicely to the second misconception:

Selection on observables requires the treatment \(D\) and potential outcomes \((Y_0,Y_1)\) to be conditionally independent given covariates \(X\).

To see why this is false, consider an example in which
\[
\begin{aligned}
Y &= (1 - D) \cdot (\alpha_0 + X'\beta_0 + U_0) + D \cdot (\alpha_1 + X' \beta_1 + U_1)\\
U_0|(D,X) &\sim \text{Normal}(0,1 - D/2)\\
U_1|(D,X) &\sim \text{Normal}(0,1 + D).
\end{aligned}
\]
Notice that the distributions of \(U_0\) and \(U_1\) given \((D,X)\) *depend on* \(D\). Now, by iterated expectations,
\[
\begin{aligned}
\mathbb{E}(U_0|X) &= \mathbb{E}_{(D|X)}[\mathbb{E}(U_0|D,X)] = 0\\
\mathbb{E}(U_0) &= \mathbb{E}_{X}[\mathbb{E}(U_0|X)] = 0
\end{aligned}
\]
and similarly \(\mathbb{E}(U_1|X) = \mathbb{E}(U_1)=0\). Substituting \(D=0\) and \(D=1\), we can calculate the potential outcomes and average treatment effect as follows
\[
\begin{aligned}
Y_0 &= \alpha_0 + X'\beta_0 + U_0 \\
Y_1 &= \alpha_1 + X'\beta_1 + U_1 \\
\text{ATE} &= \mathbb{E}(Y_1 - Y_0) = (\alpha_1 - \alpha_0) + \mathbb{E}[X'](\beta_1 - \beta_0).
\end{aligned}
\]
It follows that \(D\) is *not* conditionally independent of \((Y_0, Y_1)\) given \(X\). In particular, the *variance* of the potential outcomes depends on \(D\) even after conditioning on \(X\):
\[
\begin{aligned}
\text{Var}(Y_0|X,D) &= \text{Var}(U_0|X,D) = 1 - D/2\\
\text{Var}(Y_1|X,D) &= \text{Var}(U_1|X,D) = 1 + D.
\end{aligned}
\]
In spite of this, the selection on observables assumption still holds:
\[
\begin{aligned}
\mathbb{E}(Y_0|D,X) &= \alpha_0 + X'\beta_0 + \mathbb{E}(U_0|D,X) = \alpha_0 + X'\beta_0\\
\mathbb{E}(Y_0|X) &= \alpha_0 + X'\beta_0 + \mathbb{E}(U_0|X) = \alpha_0 + X'\beta_0\\
\end{aligned}
\]
and similarly \(\mathbb{E}(Y_1|D,X) = \mathbb{E}(Y_1|X) = \alpha_1 + X'\beta_1\). While this example is admittedly a bit peculiar, the point is more general: because the average treatment effect is an *expectation*, identifying it only requires assumptions about *conditional means*.^{2} The second statement is even easier to correct than the first: we need only add a single word:

Selection on observables requires the treatment \(D\) and potential outcomes \((Y_0,Y_1)\) to be conditionally *mean* independent given covariates \(X\).

Conditional independence implies conditional mean independence, but the converse is false.
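A quick simulation makes the example above concrete. The parameter values \((\alpha_0, \beta_0, \alpha_1, \beta_1)\) and the propensity score are arbitrary choices, and \(X\) is taken to be scalar for simplicity:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

X = rng.normal(0, 1, n)                    # scalar covariate
D = rng.binomial(1, 1 / (1 + np.exp(-X)))  # propensity score depends on X
# U_d | (D, X) is mean-zero normal with a variance that depends on D,
# exactly as in the example above.
U0 = rng.normal(0, np.sqrt(1 - D / 2))
U1 = rng.normal(0, np.sqrt(1 + D))
a0, b0, a1, b1 = 0.0, 1.0, 2.0, -1.0       # arbitrary parameter values
Y0 = a0 + b0 * X + U0
Y1 = a1 + b1 * X + U1

# Mean independence holds: E[U_d | D] = 0 in both treatment arms...
means = [U[D == d].mean() for U in (U0, U1) for d in (0, 1)]
# ...but full independence fails: Var(U_d | D) depends on D.
vars0 = [U0[D == d].var() for d in (0, 1)]  # ≈ 1.0 and 0.5
vars1 = [U1[D == d].var() for d in (0, 1)]  # ≈ 1.0 and 2.0
print([round(m, 2) for m in means])
print([round(v, 2) for v in vars0], [round(v, 2) for v in vars1])
```

Since the errors have conditional mean zero in both arms, \(\mathbb{E}(Y_d|D,X) = \alpha_d + X'\beta_d = \mathbb{E}(Y_d|X)\) even though the conditional variances differ across arms.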

# Epilogue

So what’s the moral here? First, it’s crucial to distinguish between the *observed* outcome \(Y\) and the *potential outcomes* \((Y_0, Y_1)\). Second, the various notions of “unrelatedness” between random variables—independence, conditional mean independence, and uncorrelatedness—can be confusing. Be sure to pay attention to exactly which condition is used and why. In a future post, I’ll have more to say about the relationships between these notions.

For more details, see my lecture notes on treatment effects.↩︎

You might object that in the real world it is difficult to think of settings in which conditional mean independence is plausible but full independence is not. This is a fair point. Nevertheless, it’s important to be clear about which assumptions are actually *used* in a given derivation, and here we only rely on conditional mean independence.↩︎