A New Way of Looking at Least Squares

Last updated on Mar 14, 2022 12 min read Statistics, regression

Have you got a ruler handy? Fantastic! Then hold out your right hand, extend your thumb and little finger as far as they’ll go, and measure the distance in centimeters, rounding to the nearest half centimeter. This is your handspan. Mine is around 23.5 centimeters, but is that big, small, or merely average?¹ Fortunately for you, I’ve asked hundreds of introductory statistics students to measure their handspans over the years and (with their consent) posted the resulting data on my website:

library(tidyverse)
dat <- read_csv('https://ditraglia.com/econ103/height-handspan.csv')
quantile(dat$handspan)

##   0%  25%  50%  75% 100% 
## 16.5 20.0 21.5 23.0 28.0

If we take these 326 students as representative of the population of which I am a member, my handspan is roundabout the 84th percentile.

The great thing about handspan, and the reason that I used it in my teaching, is that it’s strongly correlated with height but, in contrast to weight, there’s no temptation to shade the truth. (What’s a good handspan anyway?) Here’s a scatterplot of height against handspan along with the regression line and confidence bands:

dat %>%
  ggplot(aes(x = handspan, y = height)) + 
  geom_point(alpha = 0.3) +
  geom_smooth(method = 'lm', formula = y ~ x)

Because handspan is only measured to the nearest half of a centimeter and height to the nearest inch, the dataset contains multiple “tied” values. I’ve used darker colors to indicate situations in which more than one student reported a given height and handspan pair.² The correlation between height and handspan is approximately 0.67. From the following simple linear regression, we’d predict approximately a 1.3 inch difference in height between two students whose handspan differed by one centimeter:

coef(lm(height ~ handspan, dat))

## (Intercept)    handspan 
##   40.943127    1.266961

Where does the regression line come from?

If you’ve taken an introductory statistics or econometrics course, you most likely learned that the least squares regression line $\widehat{\alpha} + \widehat{\beta} x$ minimizes the sum of squared vertical deviations by solving the optimization problem \[ \max_{\alpha, \beta} \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2. \] You probably also learned that the solution is given by \[ \widehat{\alpha} = \bar{y} - \widehat{\beta} \bar{x}, \quad \widehat{\beta} = \frac{\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2} \] In words: the regression slope equals the ratio of the covariance between height and handspan to the variance of handspan, and the regression line passes through the sample average values of height and handspan. We can check that all of these formulas agree with what we calculated above using lm()

b <- with(dat, cov(height, handspan) / var(handspan))
a <- with(dat, mean(height) - b * mean(handspan))
c(a, b)

## [1] 40.943127  1.266961

coef(lm(height ~ handspan, dat))

## (Intercept)    handspan 
##   40.943127    1.266961

This is all perfectly correct, and an entirely reasonable way of looking at the problem. But I’d now like to suggest a completely different way of looking at regression. Why bother? The more different ways we have of understanding an idea or method, the more deeply we understand how it works, when it works, and when it is likely to fail. So bear with me while I take you on what might at first appear to be a poorly-motivated computational detour. I promise that there’s a payoff at the end!

A Crazy Idea

There is a unique line that passes through any two distinct points in a plane. So here’s a crazy idea: let’s form every unique pair of students from my height and handspan dataset. To understand what I have in mind, consider a small subset of the data, call it test

test <- dat[1:3,]
test

## # A tibble: 3 × 2
##   height handspan
##    <dbl>    <dbl>
## 1     73     22.5
## 2     65     17  
## 3     69     21

With three students, there are three unique unordered pairs: $\{1,2\}, \{1,3\}, \{2,3\}$. Corresponding to these three pairs are three line segments, one through each pair:

plot(height ~ handspan, test, pch = 20)
segments(x0 = c(17, 17, 21),     # FROM: x-coordinates
         y0 = c(65, 65, 69),     # FROM: y-coordinates
         x1 = c(21, 22.5, 22.5), # TO:   x-coordinates
         y1 = c(69, 73, 73),     # TO:   y-coordinates
         lty = 2)

And associated with each line segment is a slope

y_differences <- c(69 - 65, 73 - 65, 73 - 69)
x_differences <- c(21 - 17, 22.5 - 17, 22.5 - 21) 
slopes <- y_differences / x_differences
slopes

## [1] 1.000000 1.454545 2.666667

And here’s my question for you: what, if anything, is the relationship between these three slopes and the slope of the regression line? While it’s a bit silly to run a regression with only three observations, the results are as follows:

coef(lm(height ~ handspan, test))

## (Intercept)    handspan 
##   41.556701    1.360825

The slope of the regression line doesn’t equal any of the three slopes we calculated above, but it does lie between them. This makes sense: if the regression line were steeper or less steep than all three line segments from above, it couldn’t possibly minimize the sum of squared vertical deviations. Perhaps the regression slope is the arithmetic mean of slopes? No such luck:

mean(slopes)

## [1] 1.707071

Something interesting is going on here, but it’s not clear what. To learn more, it would be helpful to play with more than three points. But doing this by hand would be extremely tedious. Time to write a function!

All Pairs of Students

The following function generates all unique pairs of elements taken from a vector x and stores them in a matrix:

make_pairs <- function(x) {
# Returns a data frame whose rows contain each unordered pair of elements of x
# i.e. all committees of two with members drawn from x
  n <- length(x)
  pair_indices <- combn(1:n, 2)
  matrix(x[c(pair_indices)], ncol = 2, byrow = TRUE)
}

For example, applying make_pairs() to the vector c(1, 2, 3, 4, 5) gives

make_pairs(1:5)

##       [,1] [,2]
##  [1,]    1    2
##  [2,]    1    3
##  [3,]    1    4
##  [4,]    1    5
##  [5,]    2    3
##  [6,]    2    4
##  [7,]    2    5
##  [8,]    3    4
##  [9,]    3    5
## [10,]    4    5

Notice that I’ve make_pairs() is constructed such that order doesn’t matter: we don’t distinguish between $(4,5)$ and $(5,4)$, for example. This makes sense for our example: Alice and Bob denotes the same pair of students as Bob and Alice.

To generate all possible pairs of students from dat, we apply make_pairs() to dat$handspan and dat$height separately and then combine the result into a single dataframe:

regression_pairs <- data.frame(make_pairs(dat$handspan), 
                               make_pairs(dat$height))
head(regression_pairs)

##     X1   X2 X1.1 X2.1
## 1 22.5 17.0   73   65
## 2 22.5 21.0   73   69
## 3 22.5 25.5   73   71
## 4 22.5 25.0   73   78
## 5 22.5 20.0   73   68
## 6 22.5 20.5   73   75

The names are ugly, so let’s clean them up a bit. Handspan is our “x” variable and height is our “y” variable, so we’ll refer to the measurements from each pair as x1, x2, y1, y2

names(regression_pairs) <- c('x1', 'x2', 'y1', 'y2')
head(regression_pairs)

##     x1   x2 y1 y2
## 1 22.5 17.0 73 65
## 2 22.5 21.0 73 69
## 3 22.5 25.5 73 71
## 4 22.5 25.0 73 78
## 5 22.5 20.0 73 68
## 6 22.5 20.5 73 75

The 1 and 2 indices indicate a particular student: in a given row x1 and y1 are the handspan and height of the “first student” in the pair while x2 and y2 are the handspan and height of the “second student.” Each student appears many times in the regression_pairs dataframe. This is because there are many pairs of students that include Alice: we can pair her up with any other student in the class. For this reason, regression_pairs has a tremendous number of rows, 52975 to be precise. This is the number of ways to form a committee of size 2 from a collection of 326 people when order doesn’t matter:

choose(326, 2)

## [1] 52975

I just calculated 52,975 slopes.

Corresponding to each row of regression_pairs is a slope. We can calculate and summarize them as follows, using the dplyr package from the tidyverse to make things easier to read:

library(dplyr)
regression_pairs <- regression_pairs %>%
  mutate(slope = (y2 - y1) / (x2 - x1))
regression_pairs %>%
  pull(slope) %>%
  summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    -Inf   0.000   1.333     NaN   2.667     Inf     419

The sample mean slope is NaN, the minimum is -Inf, the maximum is Inf, and there are 419 missing values. So what on earth does this mean? First things first: the abbreviation NaN stands for “not a number.” This is R’s way of expressing $0/0$, and an NaN “counts” as a missing value:

0/0

## [1] NaN

is.na(0/0)

## [1] TRUE

In contrast, Inf and -Inf are R’s way of expressing $\pm \infty$. These do not count as missing values, and they also arise when a number is too big or too small for your computer to represent:

c(-1/0, 1/0, exp(9999999), -exp(9999999))

## [1] -Inf  Inf  Inf -Inf

is.na(c(Inf, -Inf))

## [1] FALSE FALSE

So where do these Inf and NaN values come from? Our slope calculation from above was (y2 - y1) / (x2 - x1). If x2 == x1, the denominator is zero. This occurs when the two students in a given pair have the same handspan. Because we only measured handspan to the nearest 0.5cm, there are many such pairs. Indeed, handspan only takes on 23 distinct values in our dataset but there are 326 students:

sort(unique(dat$handspan))

##  [1] 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0 20.5 21.0 21.5 22.0 22.5 23.0 23.5
## [16] 24.0 24.5 25.0 25.5 26.0 26.5 27.0 28.0

If y1 and y2 are different but x1 and x2 are the same, the slope will either be Inf or -Inf, depending on whether y1 > y2 or the reverse. When y1 == y2 and x1 == x2 the slope is NaN.

This isn’t an arcane numerical problem. When y1 == y2 and x1 == x2, our pair contains only a single point, so there’s no way to draw a line segment. With no line to draw, there’s clearly no slope to calculate. When y1 != y2 but x1 == x2 we can draw a line segment, but it will be vertical. Should we call the slope of this vertical line $+\infty$? Or should we call it $-\infty$? Because the labels 1 and 2 for the students in a given pair were arbitrary–order doesn’t matter–there’s no way to choose between Inf and -Inf. From the perspective of calculating a slope, it simply doesn’t make sense to construct pairs of students with the same handspan.

With this in mind, let’s see what happens if we average all of the slopes that are not -Inf, Inf, or NaN

regression_pairs %>%
  filter(!is.na(slope) & !is.infinite(slope)) %>%
  pull(slope) %>%
  mean

## [1] 1.259927

This is tantalizingly close to the slope of the regression line from above: 1.266961. But it’s still slightly off.

Not All Slopes are Created Equal

The median handspan in my dataset is 21.5. Let’s take a closer look at the heights of students whose handspans are close to this value:

boxplot(height ~ handspan, dat, subset = handspan %in% c(21, 21.5, 22))

There is a large amount of variation in height for a given value of handspan. Indeed, from this boxplot alone you might not even guess that there is a strong positive relationship between height and handspan in the dataset as a whole! The “boxes” in the figure, representing the middle 50% of heights for a given handspan, overlap substantially. If we were to randomly choose one student with a handspan of 21 and another with a handspan of 21.5, it’s quite likely that the slope between them would be negative. It’s true that students with bigger hands are taller on average. But the difference in height that we’d predict for two students who differed by 0.5cm in handspan is very small: 0.6 inches according to the linear regression from the beginning of this post. In contrast, the standard deviation of height among students with the median handspan is more than five times as large:

sd(subset(dat, handspan == 21.5)$height)

## [1] 3.172248

The 25th percentile of handspan is 20 while the 75th percentile is 23. Comparing the heights of students with these handspans rather than those close to the median, gives a very different picture:

boxplot(height ~ handspan, dat, subset = handspan %in% c(20, 23))

Now there’s much less overlap in the distributions of height. This accords with the predictions of the linear regression from above: for two students whose handspan differs by 3cm, we would predict a difference of 3.8 inches in height. This difference is big enough to discern in spite of the variation in height for students with the same handspan. If we were to choose one student with a handspan of 20cm and another with a handspan of 23cm, it’s fairly unlikely that the slope between these lines would be negative.

So where does this leave us? Above we saw that forming a pair of students with the same handspan does not allow us to calculate a slope. Now we’ve seen that the slope for a pair of students with a very similar handspan can give a misleading impression about the overall relationship. This turns out to be the key to our puzzle from above. The ordinary least squares slope estimate does equal an average of the slopes for each pair of students, but this average gives more weight to pairs with a larger difference in handspan. As I’ll derive in my next post: \[ \widehat{\beta}_{\text{OLS}} = \sum_{(i,j)\in C_2^n} \omega_{ij} \left(\frac{y_i - y_j}{x_i - x_j}\right), \quad \omega_{ij} \equiv \frac{(x_i - x_j)^2}{\sum_{(i,j)\in C_2^n} (x_i-x_j)^2} \] The notation $C_2^n$ is shorthand for the set $\{(i,j)\colon 1 \leq i < j \leq n\}$, in other words the set of all unique pairs $(i,j)$ where order doesn’t matter. The weights $\omega_{ij}$ are between zero and one and sum to one over all pairs. Pairs with $x_i = x_j$ are given zero weight; pairs in which $x_i$ is far from $x_j$ are given more weight than pairs where these values are closer. And you don’t have to wait for my next post to see that it works:

regression_pairs <- regression_pairs %>%
  mutate(x_dist = abs(x2 - x1),
         weight = x_dist^2 / sum(x_dist^2))

regression_pairs %>%
  filter(!is.infinite(slope) & !is.na(slope)) %>%
  summarize(sum(weight * slope)) %>%
  pull

## [1] 1.266961

coef(lm(height ~ handspan, dat))[2]

## handspan 
## 1.266961

So there you have it: in a simple linear regression, the OLS slope estimate is a weighted average of the slopes of the line segments between all pairs of observations. The weights are proportional to the squared Euclidean distance between $x$-coordinates. I’ll leave things here for today, but there’s much more to say on this topic. Stay tuned for the next installment!

If you play the piano, this may help: I can play parallel 10ths but only just.↩︎
Another way to show this is by “jittering” the data: simply replace geom_point(alpha = 0.3) with geom_jitter(alpha = 0.3).↩︎