econometrics.blog (https://www.econometrics.blog/)

Thirty isn't the magic number
https://www.econometrics.blog/post/thirty-isn-t-the-magic-number/
Sat, 08 May 2021
<p>The simplest version of the central limit theorem (CLT) says that if <span class="math inline">\(X_1, \dots, X_n\)</span> are iid random variables with mean <span class="math inline">\(\mu\)</span> and finite variance <span class="math inline">\(\sigma^2\)</span>, then</p>
<p><span class="math display">\[
\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \rightarrow_d N(0,1)
\]</span>
where <span class="math inline">\(\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i\)</span>. In other words, if <span class="math inline">\(n\)</span> is sufficiently large, the sample mean is <em>approximately</em> normally distributed with mean <span class="math inline">\(\mu\)</span> and variance <span class="math inline">\(\sigma^2/n\)</span>, regardless of the distribution of <span class="math inline">\(X_1, \dots, X_n\)</span>. This is a pretty impressive result! It is so impressive, in fact, that students encountering it for the first time are usually a little wary. I’m typically asked “but how large is <em>sufficiently large</em>?” or “how do we know when the CLT will provide a good approximation?” My answer is disappointing: without some additional information about the distribution from which <span class="math inline">\(X_1, \dots, X_n\)</span> were drawn, we simply <em>can’t say</em> how large a sample is large enough for the CLT to work well. At this point, someone invariably volunteers “but in my high school statistics course, we learned that <span class="math inline">\(n = 30\)</span> is big enough for the CLT to hold!”</p>
<p>I’ve always been surprised by the prevalence of the <span class="math inline">\(n \geq 30\)</span> dictum. It even appears in Charles Wheelan’s <em>Naked Statistics</em>, an otherwise excellent book that I assign as summer reading for our incoming economics undergraduates: “as a rule of thumb, the sample size must be at least 30 for the central limit theorem to hold true.” In this post I’d like to set the record straight: <span class="math inline">\(n\geq 30\)</span> is neither necessary <em>nor</em> sufficient for the CLT to provide a good approximation, as we’ll see by examining two simple examples. Along the way, we’ll learn about two useful tools for visualizing and comparing distributions: the empirical cdf, and quantile-quantile plots.</p>
<div id="a-sample-size-of-thirty-isnt-necessary." class="section level2">
<h2>A sample size of thirty isn’t necessary.</h2>
<p>We’ll start by showing that the CLT can work extremely well even when <span class="math inline">\(n\)</span> is much smaller than <span class="math inline">\(30\)</span> and the random variables that we average are far from normally distributed themselves. Along the way we’ll learn about the <em>empirical CDF</em> and <em>quantile-quantile</em> plots, two extremely useful tools for comparing probability distributions.</p>
<p>Informally speaking, a Uniform<span class="math inline">\((0,1)\)</span> random variable is equally likely to take on any continuous value in the range <span class="math inline">\([0,1]\)</span>.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> Here’s a histogram of 1000 random draws from this distribution:</p>
<pre class="r"><code># set the seed to get the same draws I did
set.seed(12345)
hist(runif(1000), xlab = '', freq = FALSE,
main = 'Histogram of 1000 Uniform(0,1) Draws')</code></pre>
<p><img src="https://www.econometrics.blog/post/thirty-isn-t-the-magic-number/index_files/figure-html/unnamed-chunk-1-1.png" width="672" /></p>
<p>This distribution clearly isn’t normal!
Indeed, its probability density function is <span class="math inline">\(f(x) = 1\)</span> for <span class="math inline">\(x \in [0,1]\)</span>.
This is a flat line rather than a bell curve.
But if we <em>average</em> even a relatively small number of Uniform<span class="math inline">\((0,1)\)</span> draws, the result will be <em>extremely close</em> to normality. To see that this is true, we’ll carry out a simulation in which we draw <span class="math inline">\(n\)</span> Uniform<span class="math inline">\((0,1)\)</span> RVs, calculate their sample mean, and store the result. Repeating this a large number of times allows us to approximate the sampling distribution of <span class="math inline">\(\bar{X}_n\)</span>. I’ll start by writing a function <code>get_unif_sim</code> that takes a single argument <code>n</code>. This function returns the sample mean of <code>n</code> Uniform<span class="math inline">\((0,1)\)</span> draws:</p>
<pre class="r"><code>get_unif_sim <- function(n) {
sims <- runif(n)
xbar <- mean(sims)
return(xbar)
}</code></pre>
<p>Next I’ll use the <code>replicate</code> function to call <code>get_unif_sim</code> a large number of times, <code>nreps</code>, and store the results as a vector called <code>xbar_sims</code>. Here I’ll take <span class="math inline">\(n = 10\)</span> standard uniform draws, blatantly violating the <span class="math inline">\(n \geq 30\)</span> rule-of-thumb:</p>
<pre class="r"><code>set.seed(12345)
nreps <- 1e5 # scientific notation for 100,000
xbar_sims <- replicate(nreps, get_unif_sim(10))
hist(xbar_sims, xlab = '', freq = FALSE,
main = 'Sampling Dist. of Sample Mean of 10 Uniform(0,1) Draws')</code></pre>
<p><img src="https://www.econometrics.blog/post/thirty-isn-t-the-magic-number/index_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
<p>A beautiful bell curve! This certainly looks normal, but histograms can be tricky to interpret. Their shape depends on how many bins we use to make the plot, something that can be difficult to choose well in practice. In the following two sections, we’ll instead compare <em>distribution functions</em> and <em>quantiles</em>.</p>
<div id="the-empirical-cdf" class="section level3">
<h3>The Empirical CDF</h3>
<p>If <span class="math inline">\(X \sim\)</span> Uniform<span class="math inline">\((0,1)\)</span>, then <span class="math inline">\(\mathbb{E}(X) = 1/2\)</span> and <span class="math inline">\(\text{Var}(X) = 1/12\)</span>, which follows from <span class="math inline">\(\mathbb{E}[X^2]=1/3\)</span> and the definition of variance.<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a> For <span class="math inline">\(n = 10\)</span>,
<span class="math display">\[
\frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{12}} \cdot \frac{1}{\sqrt{10}} = \frac{1}{\sqrt{120}}
\]</span>
so if the CLT provides a good approximation in this example, we should find that
<span class="math display">\[
\frac{\bar{X}_n - 1/2}{1/\sqrt{120}} = \sqrt{120} (\bar{X}_n - 1/2) \approx N(0,1)
\]</span>
in the sense that the <em>cumulative distribution function</em> (CDF) of <span class="math inline">\(\sqrt{120} (\bar{X}_n - 1/2)\)</span>, call it <span class="math inline">\(F\)</span>, is approximately equal to the standard normal CDF <code>pnorm()</code>. An obvious way to see if this holds is to plot <span class="math inline">\(F\)</span> against <code>pnorm()</code> and see how they compare. From now on, we’ll be working with the z-scores of <code>xbar_sims</code> rather than the raw simulation values themselves, so we’ll start by constructing them, subtracting the population mean and dividing by the population standard deviation:</p>
<pre class="r"><code>z <- (xbar_sims - 1/2) / (1 / sqrt(120))</code></pre>
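<p>As a quick numerical sanity check (mine, not part of the original post), we can confirm <span class="math inline">\(\mathbb{E}[X] = 1/2\)</span>, <span class="math inline">\(\mathbb{E}[X^2] = 1/3\)</span>, and <span class="math inline">\(\text{Var}(X) = 1/12\)</span> by integrating against the flat Uniform<span class="math inline">\((0,1)\)</span> density using R’s <code>integrate()</code>:</p>

```r
# Moments of X ~ Uniform(0,1): integrate x * f(x) and x^2 * f(x)
# over [0, 1], where the density is f(x) = 1.
EX  <- integrate(function(x) x,   lower = 0, upper = 1)$value  # E[X]   = 1/2
EX2 <- integrate(function(x) x^2, lower = 0, upper = 1)$value  # E[X^2] = 1/3
EX2 - EX^2  # Var(X) by the "shortcut rule": 1/3 - 1/4 = 1/12
```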
<p>We haven’t worked out an expression for the function <span class="math inline">\(F\)</span>, but we can approximate it using our simulation draws <code>xbar_sims</code>. We do this by calculating the <em>empirical CDF</em> of our centered and standardized simulation draws <code>z</code>. Recall that if <span class="math inline">\(Z\)</span> is a random variable, its CDF <span class="math inline">\(F\)</span> is defined as <span class="math inline">\(F(t) = \mathbb{P}(Z \leq t)\)</span>. Given a large number of observed random draws <span class="math inline">\(z_1, \dots, z_J\)</span> from the distribution of <span class="math inline">\(Z\)</span>, we can approximate <span class="math inline">\(\mathbb{P}(Z \leq t)\)</span> by calculating the fraction of observed draws less than or equal to <span class="math inline">\(t\)</span>. In other words
<span class="math display">\[
\mathbb{P}(Z \leq t) \approx \frac{1}{J}\sum_{j=1}^J \mathbf{1}\{z_j \leq t\}
\]</span>
where <span class="math inline">\(\mathbf{1}\{z_j \leq t \}\)</span> is the <em>indicator function</em>: it equals one if <span class="math inline">\(z_j\)</span> is less than or equal to the threshold <span class="math inline">\(t\)</span> and zero otherwise. The sample average on the right-hand side of the preceding expression is called the <em>empirical CDF</em>. It uses empirical data–in this case our simulation draws <span class="math inline">\(z_j\)</span>–to approximate the unknown CDF. By increasing the number of random draws <span class="math inline">\(J\)</span> that we use, we can make this approximation as accurate as we like.<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a> For example, we don’t know the exact value of <span class="math inline">\(F(0)\)</span>, the probability that <span class="math inline">\(Z \leq 0\)</span>. But using our simulated values <code>z</code> from above, we can approximate it as</p>
<pre class="r"><code>mean(z <= 0)</code></pre>
<pre><code>## [1] 0.49993</code></pre>
<p>and if we wanted the probability that <span class="math inline">\(Z \leq 2\)</span>, we could approximate it as</p>
<pre class="r"><code>mean(z <= 2)</code></pre>
<pre><code>## [1] 0.97797</code></pre>
<p>So far so good: these values agree with <code>pnorm(0)</code>, which equals 0.5, and <code>pnorm(2)</code>, which is approximately 0.9772.
But we’ve only looked at two values of <span class="math inline">\(t\)</span>.
While we could continue trying additional values one at a time, it’s much faster to use R’s built-in function for computing an empirical CDF, <code>ecdf()</code>. First we pass our simulated z-scores <code>z</code> into the <code>ecdf()</code> function to calculate the empirical CDF and plot the result. Next we overlay some points from the standard normal CDF, <code>pnorm</code>, in blue for comparison:</p>
<pre class="r"><code>z <- sqrt(120) * (xbar_sims - 1/2)
plot(ecdf(z), xlab = 't', ylab = 'F(t)', main = 'F(t) versus pnorm(t)')
tseq <- seq(-4, 4, by = 0.2)
points(tseq, pnorm(tseq), col = 'blue')</code></pre>
<p><img src="https://www.econometrics.blog/post/thirty-isn-t-the-magic-number/index_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<p>The fit is almost perfect, despite <span class="math inline">\(n=10\)</span> being far below 30. This kind of plot is much more informative than the histogram from above, but it can still be a bit difficult to read. When constructing confidence intervals or calculating p-values it is probabilities in the tails of the distribution that matter most, i.e. values of <span class="math inline">\(t\)</span> that are far from zero in the plot. Ideally, we’d like a plot that makes any discrepancies in the tails <em>jump out</em> at us. That is precisely what we’ll construct next.</p>
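<p>Incidentally, the indicator-average formula from above is simple enough to implement by hand. Here’s a toy version (the function name is my own; R’s <code>ecdf()</code> returns an equivalent, more efficient step function):</p>

```r
# A hand-rolled empirical CDF: for a scalar threshold t, return the
# fraction of observed draws less than or equal to t.
make_ecdf <- function(draws) {
  function(t) mean(draws <= t)
}

set.seed(12345)
Fhat <- make_ecdf(rnorm(1e5))  # ECDF of 100,000 standard normal draws
Fhat(0)  # should be close to pnorm(0) = 0.5
Fhat(2)  # should be close to pnorm(2), about 0.9772
```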
</div>
<div id="quantile-quantile-plots" class="section level3">
<h3>Quantile-Quantile Plots</h3>
<p>So far we’ve seen that the histogram of <code>xbar_sims</code> is bell-shaped, and that the empirical CDF of <code>sqrt(120) * (xbar_sims - 0.5)</code> is well-approximated by the standard normal CDF <code>pnorm()</code>. If you’re still not convinced that the CLT <em>can</em> work perfectly well with <span class="math inline">\(n = 10\)</span>, the final plot that we’ll make should dispel any remaining doubts.
As its name suggests, a quantile-quantile plot compares the quantiles of two probability distributions. But rather than comparing two quantile functions plotted against <span class="math inline">\(p\)</span>, it compares the quantiles of two distributions <em>plotted against each other</em>. This is a bit confusing the first time you encounter it, so we’ll take things step-by-step.</p>
<p>If our simulated z-scores from above are well-approximated by a standard normal distribution, then their median should be close to that of a standard normal random variable, i.e. zero. This is indeed the case:</p>
<pre class="r"><code>median(z)</code></pre>
<pre><code>## [1] 0.0001169817</code></pre>
<p>But it’s not just the medians that should be close to each other: <em>all</em> the quantiles should be. So now let’s look at the 25th and 75th percentiles as well. Rather than computing them one-by-one, we can generate them in a single batch by first setting up a vector <code>p</code> of probabilities and then using <code>rbind</code> to print the results in a convenient format:</p>
<pre class="r"><code>p <- c(0.25, 0.5, 0.75)
rbind(normal = qnorm(p), simulation = quantile(z, probs = p))</code></pre>
<pre><code>## 25% 50% 75%
## normal -0.6744898 0.0000000000 0.6744898
## simulation -0.6779843 0.0001169817 0.6791994</code></pre>
<p>This looks good as well.
If we want to compare quantiles over a finer grid of values for <code>p</code>, it’s more convenient to make a plot rather than a table. Suppose that we treat the standard normal quantiles <code>qnorm(p)</code> as <span class="math inline">\(x\)</span>-coordinates and the corresponding quantiles of <code>z</code> as <span class="math inline">\(y\)</span>-coordinates. If the CLT is giving us a good approximation, then we should have <span class="math inline">\(x \approx y\)</span> and all of the points should fall near the 45-degree line. This is indeed what we observe:</p>
<pre class="r"><code>p <- seq(from = 0.05, to = 0.95, by = 0.05)
x <- qnorm(p)
y <- quantile(z, probs = p)
plot(x, y, xlab = 'std. normal quantiles', ylab = 'quantiles of z')
abline(0, 1) # plot the 45-degree line</code></pre>
<p><img src="https://www.econometrics.blog/post/thirty-isn-t-the-magic-number/index_files/figure-html/unnamed-chunk-11-1.png" width="672" /></p>
<p>The plot that we have just made is called a <em>normal quantile-quantile plot</em>. It is constructed as follows:</p>
<ol style="list-style-type: decimal">
<li>Set up a vector <code>p</code> of probabilities.</li>
<li>Calculate the corresponding quantiles of a standard normal RV, <code>qnorm(p)</code>. Call these <span class="math inline">\(x\)</span>.</li>
<li>Calculate the corresponding quantiles of your data, <code>quantile(your_data_here, probs = p)</code>. Call them <span class="math inline">\(y\)</span>.</li>
<li>Plot <span class="math inline">\(y\)</span> against <span class="math inline">\(x\)</span>.</li>
</ol>
<p>If the points all fall on a line, then the quantiles of the observed data agree with those of <em>some</em> normal distribution, although perhaps not a standard normal. If we standardize the data before making such a plot, as we did to construct <code>z</code> above, the relevant line will be the 45-degree line. If not, it will be a different line, but the interpretation remains the same. The easiest way to make a normal quantile-quantile plot in R is by using the function <code>qqnorm</code> followed by <code>qqline</code>. We could do this either using the centered and standardized simulation draws <code>z</code> or the original draws <code>xbar_sims</code>:</p>
<pre class="r"><code>par(mfrow = c(1, 2))
qqnorm(z, ylab = 'Quantiles of z')
qqline(z)
qqnorm(xbar_sims, ylab = 'Quantiles of xbar_sims')
qqline(xbar_sims)</code></pre>
<p><img src="https://www.econometrics.blog/post/thirty-isn-t-the-magic-number/index_files/figure-html/unnamed-chunk-12-1.png" width="672" /></p>
<pre class="r"><code>par(mfrow = c(1, 1))</code></pre>
<p>The only difference between these two plots is the scale of the <span class="math inline">\(y\)</span>-axis. The plot that uses the original simulation draws <code>xbar_sims</code> has a <span class="math inline">\(y\)</span>-axis that runs between <span class="math inline">\(0.1\)</span> and <span class="math inline">\(0.9\)</span> because the sample average of <span class="math inline">\(\text{Uniform}(0,1)\)</span> random variables must lie within the interval <span class="math inline">\([0,1]\)</span>. In contrast, the corresponding <span class="math inline">\(z\)</span>-scores lie in the range <span class="math inline">\([-4,4]\)</span>.<a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a> For <span class="math inline">\(x\)</span>-values between <span class="math inline">\(-3\)</span> and <span class="math inline">\(3\)</span>, we can’t even see the line generated by <code>qqline</code>: the quantiles of our simulation draws are extremely close to those of a normal distribution. Outside of this range, however, we see that the black circles <em>curve away</em> from the line. For values of <span class="math inline">\(x\)</span> around <span class="math inline">\(-4\)</span>, the quantiles of <code>z</code> are above those of a standard normal, i.e. shifted to the right. For values of <span class="math inline">\(x\)</span> around <span class="math inline">\(4\)</span>, the picture is reversed: the quantiles of <code>z</code> are below those of a standard normal, i.e. shifted to the left. This means that <code>z</code> has <em>lighter tails</em> than a standard normal: it is a bit less likely to yield <em>extremely</em> large positive or negative values, for example</p>
<pre class="r"><code>cbind(simulation = quantile(z, 0.0001), normal = qnorm(0.0001))</code></pre>
<pre><code>## simulation normal
## 0.01% -3.563576 -3.719016</code></pre>
<p>This makes perfect sense. A standard normal can take on arbitrarily large values, while the sample mean of ten uniforms is necessarily bounded above by <span class="math inline">\(1\)</span>. So if you want to carry out a <span class="math inline">\(0.01\%\)</span> test (<span class="math inline">\(\alpha = 0.0001\)</span>), the approximation provided by the CLT won’t quite cut it with <span class="math inline">\(n = 10\)</span> in this example. But for any conventional significance level, it’s nearly perfect:</p>
<pre class="r"><code>p <- c(0.01, 0.025, 0.05, 0.1)
rbind(normal = qnorm(p), simulation = quantile(z, prob = p))</code></pre>
<pre><code>## 1% 2.5% 5% 10%
## normal -2.326348 -1.959964 -1.644854 -1.281552
## simulation -2.302793 -1.958568 -1.654508 -1.290797</code></pre>
</div>
</div>
<div id="a-sample-size-of-thirty-isnt-sufficient." class="section level2">
<h2>A sample size of thirty isn’t sufficient.</h2>
<p>Now suppose that <span class="math inline">\(n = 100\)</span> and <span class="math inline">\(X_1, \dots, X_n \sim\)</span> iid Bernoulli<span class="math inline">\((1/60)\)</span>. What is the CDF of <span class="math inline">\(\bar{X}_n\)</span>? Rather than approximating the answer to this question by simulation, as we did in the uniform example from above, we’ll work out the <em>exact</em> result and compare it to the approximation provided by the CLT. If <span class="math inline">\(X_1, \dots, X_n \sim\)</span> iid Bernoulli<span class="math inline">\((p)\)</span>, then by definition the sum <span class="math inline">\(S_n = \sum_{i=1}^n X_i\)</span> follows a Binomial<span class="math inline">\((n,p)\)</span> distribution. The probability mass function and CDF of this distribution are available in R via the <code>dbinom()</code> and <code>pbinom()</code> commands. So what about <span class="math inline">\(\bar{X}_n\)</span>? Notice that
<span class="math display">\[
\mathbb{P}(\bar{X}_n = x) = \mathbb{P}(S_n/n = x) = \mathbb{P}(S_n = nx)
\]</span>
Thus, if <span class="math inline">\(f(s) = \mathbb{P}(S_n = s)\)</span> is the pmf of <span class="math inline">\(S_n\)</span> for <span class="math inline">\(s \in \{0, 1, \dots, n\}\)</span>, it follows that <span class="math inline">\(f(nx)\)</span> is the pmf of <span class="math inline">\(\bar{X}_n\)</span> for <span class="math inline">\(x \in \{0, 1/n, 2/n, \dots, 1\}\)</span>. This means that we can use <code>dbinom</code> to plot the <em>exact</em> sampling distribution of <span class="math inline">\(\bar{X}_n\)</span> when <span class="math inline">\(n = 100\)</span> and <span class="math inline">\(p = 1/60\)</span> as follows<a href="#fn5" class="footnote-ref" id="fnref5"><sup>5</sup></a></p>
<pre class="r"><code>n <- 100
p <- 1/60
x <- seq(from = 0, to = 1, by = 1/n)
P_x_bar <- dbinom(n * x, size = n, prob = p)
plot(x, P_x_bar, type = 'h', xlim = c(0, 0.1), ylab = 'pmf of Xbar',
lwd = 2, col = 'blue')</code></pre>
<p><img src="https://www.econometrics.blog/post/thirty-isn-t-the-magic-number/index_files/figure-html/unnamed-chunk-15-1.png" width="672" /></p>
<p>The result is <em>far</em> from a normal distribution. Not only is it noticeably discrete, it is also seriously asymmetric. Another way to see this is by examining the CDF. If the central limit theorem is working well in this example, we should have <span class="math inline">\(\bar{X}_n \approx N\big(p, p(1 - p)/n\big)\)</span>. Extending the idea from above, we can plot the exact CDF of <span class="math inline">\(\bar{X}_n\)</span> using the binomial CDF <code>pbinom()</code> and compare it to the approximation suggested by the CLT:</p>
<pre class="r"><code>x <- seq(-0.02, 0.08, by = 0.001)
F_x_bar <- pbinom(n * x, size = n, prob = p)
F_clt <- pnorm(x, p, sqrt(p * (1 - p) / n))
plot(x, F_x_bar, type = 's', ylab = '', lwd = 2, col = 'blue')
points(x, F_clt, type = 'l', lty = 2, lwd = 2, col = 'red')
legend('topleft', legend = c('Exact', 'CLT'), col = c('blue', 'red'),
lty = 1:2, lwd = 2)</code></pre>
<p><img src="https://www.econometrics.blog/post/thirty-isn-t-the-magic-number/index_files/figure-html/unnamed-chunk-16-1.png" width="672" /></p>
<p>The approximation is noticeably poor, but is the problem serious enough to affect any inferences we might hope to draw? Suppose we wanted to construct a 95% confidence interval for <span class="math inline">\(p\)</span>. The textbook approach, based on the CLT, would have us report <span class="math inline">\(\widehat{p} \pm 1.96 \times \sqrt{\widehat{p}(1 - \widehat{p})/n}\)</span> where <span class="math inline">\(\widehat{p}\)</span> is the sample proportion, i.e. <span class="math inline">\(\bar{X}_n\)</span>. Let’s set up a little simulation experiment to see how well this interval performs when <span class="math inline">\(n = 100\)</span> and <span class="math inline">\(p = 1/60\)</span>.</p>
<pre class="r"><code># Simulate 5000 draws for phat
# with p = 1/60, n = 100
set.seed(54321)
draw_sim_phat <- function(p, n) {
x <- rbinom(n, size = 1, prob = p)
phat <- mean(x)
return(phat)
}
p_true <- 1/60
sample_size <- 100
phat_sims <- replicate(5000, draw_sim_phat(p = p_true, n = sample_size))
# What fraction of the CIs cover the true value of p?
SE <- sqrt(phat_sims * (1 - phat_sims) / sample_size)
lower <- phat_sims - 1.96 * SE
upper <- phat_sims + 1.96 * SE
coverage_prob <- mean((lower <= p_true) & (p_true <= upper))
coverage_prob</code></pre>
<pre><code>## [1] 0.824</code></pre>
<p>So only 82% of these supposed 95% confidence intervals actually cover the true value of <span class="math inline">\(p\)</span>! Clearly 100 observations aren’t enough to rely upon the CLT in this example.</p>
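<p>As an aside that goes beyond the original post: because <span class="math inline">\(S_n = n\widehat{p}\)</span> is Binomial<span class="math inline">\((n, p)\)</span>, the coverage of any interval built from <span class="math inline">\(\widehat{p}\)</span> can be computed <em>exactly</em> by summing <code>dbinom()</code> weights over the possible values of <span class="math inline">\(S_n\)</span>, with no simulation error. The sketch below does this for the textbook (Wald) interval and, for comparison, the Wilson score interval, an alternative not used above:</p>

```r
# Exact coverage of two 95% intervals for p when n = 100, p = 1/60.
n <- 100
p <- 1/60
z <- qnorm(0.975)  # approximately 1.96
s <- 0:n           # all possible values of S_n
phat <- s / n

# Wald (textbook) interval: phat +/- z * sqrt(phat * (1 - phat) / n)
half_wald <- z * sqrt(phat * (1 - phat) / n)
covers_wald <- (phat - half_wald <= p) & (p <= phat + half_wald)

# Wilson score interval: invert the score test rather than plugging
# phat into the standard error.
center <- (phat + z^2 / (2 * n)) / (1 + z^2 / n)
half_wilson <- (z / (1 + z^2 / n)) *
  sqrt(phat * (1 - phat) / n + z^2 / (4 * n^2))
covers_wilson <- (center - half_wilson <= p) & (p <= center + half_wilson)

# Weight each coverage indicator by the exact binomial pmf.
weights <- dbinom(s, size = n, prob = p)
c(wald = sum(weights * covers_wald),
  wilson = sum(weights * covers_wilson))
```

<p>By my reckoning, the exact Wald coverage comes out at roughly 0.81, consistent with the simulation above, while the Wilson interval’s coverage is roughly 0.97, much closer to the nominal 95%. The key difference is that the Wilson interval never collapses to a point when <span class="math inline">\(\widehat{p} = 0\)</span>.</p>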
</div>
<div id="epilogue" class="section level2">
<h2>Epilogue</h2>
<p>I hope these examples have convinced you that, in spite of what you may have heard elsewhere, <span class="math inline">\(n\geq 30\)</span> is neither necessary nor sufficient for the CLT to provide an adequate approximation. But some important questions remain. First, what can we do in situations like the second example, where we want to carry out inference for a small proportion? The problem is hardly academic: at the time of this writing, the most recent estimate of coronavirus prevalence in the UK was approximately 0.2%, i.e. nearly ten times <em>smaller</em> than the value I used for <span class="math inline">\(p\)</span> in my second example.<a href="#fn6" class="footnote-ref" id="fnref6"><sup>6</sup></a> Second, how did the <span class="math inline">\(n\geq 30\)</span> folk wisdom arise? Is there anything that we can say about <span class="math inline">\(n\geq 30\)</span>? Finally, are there any theoretical results that can provide guidance about the quality of the approximation provided by the CLT? These questions will have to wait for a future post!</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>More formally, <span class="math inline">\(U\sim\)</span> Uniform<span class="math inline">\((0,1)\)</span> if and only if <span class="math inline">\(\mathbb{P}(a \leq U \leq b) = (b - a)\)</span> for any <span class="math inline">\(0 \leq a \leq b \leq 1\)</span>. <a href="#fnref1" class="footnote-back">↩︎</a></p></li>
<li id="fn2"><p>If you’re taking introductory probability and statistics, filling in the missing details for these calculations would be an excellent homework problem!<a href="#fnref2" class="footnote-back">↩︎</a></p></li>
<li id="fn3"><p>Here we take <span class="math inline">\(J = 100,000\)</span> which is more than enough for the purposes of this exercise.<a href="#fnref3" class="footnote-back">↩︎</a></p></li>
<li id="fn4"><p>Recall that we subtracted <span class="math inline">\(1/2\)</span> and multiplied by <span class="math inline">\(\sqrt{120}\approx 11\)</span> to construct <code>z</code> from <code>xbar_sims</code>.<a href="#fnref4" class="footnote-back">↩︎</a></p></li>
<li id="fn5"><p>Notice that I “zoomed in” on the most interesting part of the plot by setting <code>xlim = c(0, 0.1)</code>.<a href="#fnref5" class="footnote-back">↩︎</a></p></li>
<li id="fn6"><p>Source: <a href="https://www.gov.uk/government/publications/react-1-study-of-coronavirus-transmission-march-2021-final-results/react-1-study-of-coronavirus-transmission-march-2021-final-results">REACT-1 study of coronavirus transmission: March 2021 final results</a><a href="#fnref6" class="footnote-back">↩︎</a></p></li>
</ol>
</div>

Econometricians Anonymous
https://www.econometrics.blog/post/econometricians-anonymous/
Thu, 01 Apr 2021

<p><em>Some years back, I wrote a monologue for the Economics Department Skit Night at UPenn. It was a niche venue, granted, but the crowd seemed to enjoy my contribution. In honor of April Fool’s day I’ve posted a lightly-edited version below. On the off chance that anyone else finds this amusing, I grant unlimited rights for this material to be borrowed, adapted, remixed, stolen, execrated, or burned in effigy as you see fit. It’s funnier when you use the names of your own colleagues so I encourage you to fill in the blanks below!</em></p>
<hr>
<p><em>Group leader walks in and writes “Econometricians Anonymous” in big letters on the blackboard.</em></p>
<p><strong>Group Leader:</strong> Thanks for coming everyone. Tonight we’re going to hear from [YOUR NAME].</p>
<p><strong>Econometrician:</strong> I’m [YOUR NAME] and I’m an econometrician. It’s been six months, twelve days, and five hours since my last derivation.</p>
<p>So how did it all start? Like a lot of people, I started out deriving socially: at parties, out clubbing with the econometrics group. [SENIOR ECONOMETRICS COLLEAGUE] would have the bartender set us up with a dozen lemmas and we’d each prove four on the spot: one right after the other. Sure it was a little wild, but I always told myself I was in control. I realize now that I wasn’t.</p>
<p>The more I derived, the more I needed to derive. Sometimes I couldn’t find any co-authors to derive with me, and eventually I started deriving alone. I still remember waking up on the floor of my office after one of my all-night limit theory benders: pads of paper covered with equations strewn about the floor. I was a mess.</p>
<p>Pretty soon they started recognizing me in stationery shops and office supply stores. Sometimes they wouldn’t sell me paper and pencils. One night in my desperation, I broke into the Econ office to steal some notepads. [DEPARTMENT ADMINISTRATOR] was waiting for me: “I think you’ve had enough of those, [YOUR NAME].” It was the most embarrassing moment of my adult life.</p>
<p>I didn’t realize it at the time, but I had completely cut myself off from my family, friends, and colleagues. I kept finding ways to justify my behavior: “Don’t listen to them: Econometrica’s a great journal. Who cares if no one uses your results?” It all sounds so hollow now, but I really believed it at the time.</p>
<p>Pretty soon I started making outrageous assumptions in my papers: “Suppose that X has finite 128th moments; these regularity conditions are basically standard.” I was out of my mind. Eventually, I started experimenting with simulations. After a while, real data just didn’t do it for me. How could it when I could make thousands of pristine pseudo-random draws dance across my laptop screen at the touch of a button, any time of day or night?</p>
<p>I don’t know what would have happened to me if my friends hadn’t staged an intervention. When I got home there were two stacks on the coffee table: one of my recent working papers and the other, ten times as thick, of the corresponding technical appendices. I knew I had hit rock bottom.</p>
<p>But I’ve made a lot of progress since then, thanks in large part to the love and support of my fellow recovering econometricians here at <em>Econometricians Anonymous</em>. Still, it’s a daily struggle to stay clean. I remember calling my sponsor [APPLIED COLLEAGUE] back in December. It was the middle of the night, I was in the computer lab and I had just double-clicked on the Matlab icon. [APPLIED COLLEAGUE] was there in 15 minutes. He logged me off the computer, took me to a diner and ordered us coffee. We stayed up most of the night talking and running cross-country growth regressions in Stata. It really helped.</p>
<p>I’m doing a lot better now. I’m doing applied work, I’m publishing in general interest journals, and people are citing my research. And I’m here to tell you that if I can do it so can you. We’re all here, all of us at <em>Econometricians Anonymous</em>, to help each other kick the habit. Thank you.</p>
<p><strong>Group Leader:</strong> Thanks [YOUR NAME]. That’s all for tonight, but we hope you’ll join us tomorrow for <em>Game Theorists Anonymous</em> where we’ll be hearing about [SENIOR THEORY COLLEAGUE]’s exciting new research agenda in [RESEARCH AREA THAT SENIOR THEORY COLLEAGUE DEPLORES].</p>Past the Peak? Excess Deaths in England and Waleshttps://www.econometrics.blog/post/past-the-peak-excess-deaths-in-england-and-wales/Wed, 13 May 2020 00:00:00 +0000https://www.econometrics.blog/post/past-the-peak-excess-deaths-in-england-and-wales/
<script src="https://www.econometrics.blog/post/past-the-peak-excess-deaths-in-england-and-wales/index_files/header-attrs/header-attrs.js"></script>
Since my previous post, the <a href="https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales">Office for National Statistics</a> has posted updated data on weekly deaths, and we’ve updated <a href="https://github.com/fditraglia/rcovidUK">rcovidUK</a> accordingly. Here’s where things stand:
<div class="figure"><span id="fig:plot"></span>
<img src="https://www.econometrics.blog/post/past-the-peak-excess-deaths-in-england-and-wales/index_files/figure-html/plot-1.png" alt="Total Weekly Deaths in England & Wales." width="672" />
<p class="caption">
Figure 1: Total Weekly Deaths in England & Wales.
</p>
</div>
<p>With the caveat that the date at which a death is <em>reported</em> need not agree with the date at which it actually occurred, excess deaths appear to have peaked and begun to decline in each of the ten regions.
(Recently the ONS has started posting data on occurrences rather than reports, but these are unavailable for the historical comparison we’re making here.)
As in my earlier post, the red curve is an equally-weighted average of <em>reported</em> weekly deaths in a given region for the past five years, up to and including 2019.
Weeks are defined by the ONS to end on Fridays. This means that a given week does not necessarily correspond to the same days of the year across years, but it always contains exactly one of each day of the week. This eliminates seasonality from day-of-the-week effects, but creates the possibility of neglected seasonality from holidays that fall in different weeks in different years, e.g. Easter.
For each point on the red curve, we have added simple error bars: <span class="math inline">\(\pm 2 \times \text{SE}\)</span> where SE is the usual standard error for a mean, ignoring any possible dependence between deaths in the same week across different years.
The blue curve depicts weekly deaths in 2020.</p>
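<p>To make the error-bar construction concrete, here is a toy calculation for a single region-week. The death counts are invented for illustration, not taken from the ONS data:</p>
<pre class="r"><code># Hypothetical deaths in the same week of the year, 2015-2019
deaths <- c(980, 1015, 1002, 990, 1023)
m  <- mean(deaths)                        # a point on the red curve
se <- sd(deaths) / sqrt(length(deaths))   # usual standard error of the mean
c(lower = m - 2 * se, upper = m + 2 * se) # error-bar endpoints: roughly 986.3 and 1017.7</code></pre>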
<p>We had hoped to include data from Scotland in this update, but it turns out that the <a href="https://www.nrscotland.gov.uk/">NRS</a>, Scotland’s equivalent of the <a href="https://www.ons.gov.uk/">ONS</a>, uses a different definition of weeks of the year, so it wouldn’t be easy to line up the respective series for comparison. We may revisit this later on.</p>
<p>If you’d like to replicate our plots, the R code is as follows:</p>
<pre class="r"><code>library(tidyverse)
library(rcovidUK)
# Set the most recent week of data for 2020
week_2020 <- ONSweekly %>%
  filter(year == 2020, reg_nm == "North East", !is.na(deaths)) %>%
  pull(week) %>%
  max()
df_2020 <- ONSweekly %>%
  filter(year == 2020)
df_prev5 <- ONSweekly %>%
  filter(year < 2020 & year >= 2015) # Use last 5 years for average
df_prev5 %>%
  group_by(reg_nm, week) %>%
  summarise(deaths_mean = mean(deaths),
            deaths_sd = sd(deaths),
            n = n(),
            se = deaths_sd / sqrt(n)) %>%
  mutate(year = "2015-2019") %>%
  rename(deaths = deaths_mean) %>%
  bind_rows(df_2020) %>%
  filter(week <= week_2020) %>%
  ggplot(aes(x = week, y = deaths, col = year, group = reg_id)) +
  geom_errorbar(aes(ymin = deaths - 2 * se, ymax = deaths + 2 * se),
                width = .2, colour = "black") +
  geom_line(size = 1) +
  facet_wrap(~reg_nm, ncol = 2) +
  scale_color_brewer(palette = "Set1",
                     labels = c("Av. Past 5 Years", "2020")) +
  labs(x = "Week from Start of Year",
       y = "Deaths per Week") +
  theme_bw() +
  theme(legend.position = "top",
        legend.title = element_blank(),
legend.key.size = unit(1, "cm"))</code></pre>UK Excess Deaths by Age Group and Sexhttps://www.econometrics.blog/post/uk-excess-deaths-are-concentrated-among-older-men/Wed, 13 May 2020 00:00:00 +0000https://www.econometrics.blog/post/uk-excess-deaths-are-concentrated-among-older-men/
<script src="https://www.econometrics.blog/post/uk-excess-deaths-are-concentrated-among-older-men/index_files/header-attrs/header-attrs.js"></script>
<!--One of the most striking facts about covid has been its disproportionate impact on men and older people.-->
<p>In an earlier post I looked at regional differences in the effects of Covid-19 by calculating excess deaths in each week of 2020 relative to an average of the preceding five years. There the idea was to get a sense for the number of deaths that were likely caused by Covid-19 regardless of whether they were officially registered as such, e.g. deaths that occurred outside of hospitals. Thanks to my RA Dan Mead, here’s the same analysis carried out by age group and sex rather than region:</p>
<div class="figure"><span id="fig:plot"></span>
<img src="https://www.econometrics.blog/post/uk-excess-deaths-are-concentrated-among-older-men/index_files/figure-html/plot-1.png" alt="Total Weekly Deaths in England & Wales by Sex and Age Group." width="672" />
<p class="caption">
Figure 1: Total Weekly Deaths in England & Wales by Sex and Age Group.
</p>
</div>
<p>The red curve is an equally-weighted average of reported weekly deaths for the past five years up to and including 2019, while the blue curve depicts weekly deaths in 2020. All calculations are based on data from the <a href="https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales">ONS</a>.
Each point on the red curve is accompanied by error bars indicating plus/minus two standard errors of the mean. Roughly speaking, these bounds are comparable to the margin of error from a typical opinion poll. (For the precise calculations, see the R code below.)</p>
<p>Deaths below age 1 show no clear pattern in 2020 relative to the average of past years.
There’s a hint of slightly <em>lower</em> deaths among males aged 1-14 since the lockdown began, although the data are quite noisy given the extremely small number of deaths in this age group.
For the remaining age groups, we start to see a clear effect around the time of the Covid-19 outbreak, approximately weeks 11 through 22 of 2020.
Because death rates vary with age and sex in both ordinary years and 2020, the cleanest way to compare across groups is by computing <em>relative</em> excess deaths.
To do this, we sum the differences between the blue curve and the red curve over weeks 11 through 22, then divide the result by the sum of the red curve over the same period.
The result gives us the percentage by which total deaths in a given age/sex group in weeks 11 through 22 of 2020 exceed the “usual” number of deaths for that group during the same weeks, based on past data.</p>
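<p>Here is a sketch of this calculation for a single age/sex group, using invented numbers rather than the actual ONS data:</p>
<pre class="r"><code>library(dplyr)
library(tibble)
# Toy example: baseline ("red curve") of 100 deaths per week in weeks 11-22,
# versus 130 deaths per week in 2020 ("blue curve"). Numbers are invented.
baseline <- tibble(age = "45-64", gender = "Men",
                   week = 11:22, deaths_base = 100)
df_2020  <- tibble(age = "45-64", gender = "Men",
                   week = 11:22, deaths = 130)
rel_excess <- df_2020 %>%
  inner_join(baseline, by = c("age", "gender", "week")) %>%
  group_by(age, gender) %>%
  summarise(pct_excess = 100 * sum(deaths - deaths_base) / sum(deaths_base),
            .groups = "drop")
rel_excess$pct_excess # 30: deaths were 30% above the baseline in weeks 11-22</code></pre>
<p>Applied to the real data, each value of <code>pct_excess</code> would correspond to one entry of the table below.</p>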
<p>In the 15-44 age group, male deaths during weeks 11 through 22 of 2020 were 6.4% higher than usual. In contrast, female deaths were 9.7% higher than usual. Deaths in this age group, however, are relatively rare. Accordingly, the blue curves lie comfortably within the error bars for all but a small number of weeks. For the remaining age groups, the picture becomes much starker and the relative excess deaths much larger:</p>
<table>
<thead>
<tr class="header">
<th>Age Group</th>
<th>Men</th>
<th>Women</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>45-64</td>
<td>+42.2%</td>
<td>+32.4%</td>
</tr>
<tr class="even">
<td>65-74</td>
<td>+42.9%</td>
<td>+31.6%</td>
</tr>
<tr class="odd">
<td>75-84</td>
<td>+58.2%</td>
<td>+45.5%</td>
</tr>
<tr class="even">
<td>Over 85</td>
<td>+65.3%</td>
<td>+48.5%</td>
</tr>
</tbody>
</table>
<p>In each age group, relative excess deaths are substantially higher for men than women.
For example, male deaths in the 45-64 age group were 42.2% higher than usual in weeks 11 through 22 of 2020 compared to 32.4% higher for women. Although relative excess deaths increase markedly with age for both sexes, even the rates in the 45-64 group are alarming, particularly for men.</p>
<p>You can replicate the plot from above by running this R code:</p>
<pre class="r"><code>library(tidyverse)
library(rcovidUK)
# Set the most recent week of data for 2020
week_2020 <- ONSweeklyagegender %>%
  filter(year == 2020, age == "<1", !is.na(deaths)) %>%
  pull(week) %>%
  max()
df_2020 <- ONSweeklyagegender %>%
  filter(year == 2020)
df_prev5 <- ONSweeklyagegender %>%
  filter(year < 2020 & year >= 2015) # Use last 5 years for average
df_prev5 %>%
  group_by(age, week, gender) %>%
  summarise(deaths_mean = mean(deaths),
            deaths_sd = sd(deaths),
            n = n(),
            se = deaths_sd / sqrt(n)) %>%
  ungroup() %>%
  mutate(year = "2015-2019") %>%
  rename(deaths = deaths_mean) %>%
  bind_rows(df_2020) %>%
  filter(week <= week_2020) %>%
  ggplot(aes(x = week, y = deaths, col = year)) +
  geom_line(size = 1) +
  geom_errorbar(aes(ymin = deaths - 2 * se, ymax = deaths + 2 * se),
                width = .2, colour = "black") +
  facet_grid(vars(age), vars(gender), scale = "free") +
  scale_color_brewer(palette = "Set1",
                     labels = c("Av. Past 5 Years", "2020")) +
  labs(x = "Week from Start of Year",
       y = "Deaths per Week") +
  theme_bw() +
  theme(legend.position = "top",
        legend.title = element_blank(),
legend.key.size = unit(1, "cm"))</code></pre>Excess Deaths in England and Waleshttps://www.econometrics.blog/post/uk-excess-deaths-by-region/Mon, 04 May 2020 00:00:00 +0000https://www.econometrics.blog/post/uk-excess-deaths-by-region/
<script src="https://www.econometrics.blog/post/uk-excess-deaths-by-region/index_files/header-attrs/header-attrs.js"></script>
To get a better idea of the impact of Covid-19 in the UK, Dan Mead and I put together an R package, <a href="https://github.com/fditraglia/rcovidUK">rcovidUK</a>, with weekly deaths in England and Wales taken from the <a href="https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales">Office for National Statistics</a> (ONS), allowing us to produce a plot of total deaths by region in 2020:
<div class="figure"><span id="fig:plot"></span>
<img src="https://www.econometrics.blog/post/uk-excess-deaths-by-region/index_files/figure-html/plot-1.png" alt="Total Weekly Deaths in England & Wales." width="672" />
<p class="caption">
Figure 1: Total Weekly Deaths in England & Wales.
</p>
</div>
<p>The red curve is an equally-weighted average of weekly deaths in a given region for the past five years, up to and including 2019.
Weeks are defined by the ONS to end on Fridays. This means that a given week does not necessarily correspond to the same days of the year across years, but it always contains exactly one of each day of the week. This eliminates seasonality from day-of-the-week effects, but creates the possibility of neglected seasonality from holidays that fall in different weeks in different years, e.g. Easter.
For each point on the red curve, we have added simple error bars: <span class="math inline">\(\pm 2 \times \text{SE}\)</span> where SE is the usual standard error for a mean, ignoring any possible dependence between deaths in the same week across different years.
The blue curve depicts weekly deaths in 2020.
From week 12 or 13, depending on region, we see a dramatic uptick in deaths across England and Wales, well outside the two standard error bars. R source code for this plot follows below.</p>
<pre class="r"><code>library(tidyverse)
library(rcovidUK)
# Set the most recent week of data for 2020
week_2020 <- 16 # original version of this post only had data up to week 16
#week_2020 <- ONSweekly %>%
#  filter(year == 2020, !is.na(deaths)) %>%
#  pull(week) %>%
#  max()
df_2020 <- ONSweekly %>%
  filter(year == 2020)
df_prev5 <- ONSweekly %>%
  filter(year < 2020 & year >= 2015)
df_prev5 %>%
  group_by(reg_nm, week) %>%
  summarise(deaths_mean = mean(deaths),
            deaths_sd = sd(deaths),
            n = n(),
            se = deaths_sd / sqrt(n)) %>%
  mutate(year = "2015-2019") %>%
  rename(deaths = deaths_mean) %>%
  bind_rows(df_2020) %>%
  filter(week < week_2020) %>%
  ggplot(aes(x = week, y = deaths, col = year,
             group = reg_id)) +
  geom_errorbar(aes(ymin = deaths - 2 * se,
                    ymax = deaths + 2 * se),
                width = .2, colour = "black") +
  geom_line(size = 1) +
  facet_wrap(~reg_nm, ncol = 2) +
  scale_color_brewer(palette = "Set1",
                     labels = c("Avg Past 5 Yrs", "2020")) +
  labs(x = "Week from Start of Year",
       y = "Deaths per Week") +
  theme_bw() +
  theme(legend.position = "top",
        legend.title = element_blank(),
        legend.key.size = unit(1, "cm"))</code></pre></item></channel></rss>