Street Fighting Numerical Analysis - Part 1

Computing is a crucial part of modern applied and theoretical econometrics, but most economists, myself included, have little if any formal training in numerical analysis and computer science. This means that we often learn things the hard way: by making boneheaded mistakes and spending hours browsing stackoverflow to try to figure out what went wrong. In preparation for my upcoming course on Empirical Research Methods1 I’ve started trying to collect and organize the various nuggets of computational wisdom that I’ve picked up over the years. This post is the first of several that I plan to write on that theme. Its origin is an enigmatic bug that I detected in a seemingly trivial line of my R code involving rep().

Is R broken?

For no particular reason, let’s use the R function rep() to print out the string "econometrics.blog" four times:

rep("econometrics.blog", times = 4)
## [1] "econometrics.blog" "econometrics.blog" "econometrics.blog"
## [4] "econometrics.blog"

Since 0.2 multiplied by 20 equals 4, it comes as no surprise that replacing times = 4 with times = 0.2 * 20 gives the same result:

rep("econometrics.blog", times = 0.2 * 20)
## [1] "econometrics.blog" "econometrics.blog" "econometrics.blog"
## [4] "econometrics.blog"

Now let’s try times = (1 - 0.8) * 20 instead. Since 0.2 equals (1 - 0.8), this couldn’t possibly make a difference, could it? Distressingly, it does: we obtain only three copies of "econometrics.blog":

rep("econometrics.blog", times = (1 - 0.8) * 20)
## [1] "econometrics.blog" "econometrics.blog" "econometrics.blog"

What on earth is going on here? Has R made some kind of mistake? Let’s try a sanity check. First we’ll calculate (1 - 0.8) * 20 and call it x. Then we’ll check that x really does equal four:

x <- (1 - 0.8) * 20
x
## [1] 4

What a relief: surely setting times = x should give us four copies of "econometrics.blog". Alas, it does not:

rep("econometrics.blog", times = x)
## [1] "econometrics.blog" "econometrics.blog" "econometrics.blog"

Clearly using open-source software like R is a bad idea and I should switch to STATA.2

Numeric Types in R

Because R is a dynamically-typed programming language, we can almost always ignore the question of precisely how it stores numeric values “under the hood.” In fact R has two numeric types: integer and double. Integers are rare in practice. The operator : returns an integer vector

y <- 1:5
typeof(y)
## [1] "integer"

and the length of a vector is always an integer

n <- length(y)
typeof(n)
## [1] "integer"

but nearly every other numeric value you encounter in R will be stored as a double, i.e. a double precision floating point number:

typeof(4)
## [1] "double"
typeof(4.0)
## [1] "double"
typeof(cos(0))
## [1] "double"

To force R to store a value as an integer rather than a double, we can either append an L

z <- 4L
typeof(z)
## [1] "integer"

or coerce, i.e. convert, a double to an integer using as.integer()

a <- 4
typeof(a)
## [1] "double"
b <- as.integer(a)
typeof(b)
## [1] "integer"

The trade-off between integers and doubles is one of precision versus range. Calculations carried out with integers are always exact, but integers can only be used to represent a fairly limited number of values. Calculations with doubles, on the other hand, are not always exact, but doubles can store a much larger range of values, including decimals.
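
To get a sense of just how limited that range is, here’s a quick aside that isn’t part of the rep() story: base R exposes the largest representable integer as .Machine$integer.max, and adding 1L to it overflows, returning NA along with a warning about integer overflow.

.Machine$integer.max
## [1] 2147483647
.Machine$integer.max + 1L
## [1] NA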

This post isn’t the right place to delve into the details of floating point numbers, of which doubles are an instance, but there are two points worth emphasizing. First, it’s generally safe to store a value that is “truly” an integer, e.g. 4, as a double. As explained in the help file integer {base}

current implementations of R use 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9: doubles can hold much larger integers exactly.

This explains why you may never have encountered the L suffix in the wild. Because doubles can represent very large integers exactly, calculations with whole numbers stored as doubles will also be exact. Notice that R automatically converts integers that are “too big” into doubles:

x <- 999999999L
typeof(x)
## [1] "integer"
y <- 9999999999L
typeof(y)
## [1] "double"

Second, while converting integers to doubles is innocuous, you need to be careful when converting doubles to integers. This turns out to be the root of our problem with rep() from above. Both 0.2 * 20 and (1 - 0.8) * 20 are doubles, and both appear to equal 4

x <- 0.2 * 20
x
## [1] 4
typeof(x)
## [1] "double"
y <- (1 - 0.8) * 20
y
## [1] 4

But x and y are coerced to different integer values:

as.integer(x)
## [1] 4
as.integer(y)
## [1] 3

The function rep() expects its second argument times to be an integer. If we supply a double instead, then rep() makes a conversion in the same way as the function as.integer(), namely by truncating. Far down in the help file for rep() we find this crucial caveat:

Non-integer values of times will be truncated towards zero. If times is a computed quantity it is prudent to add a small fuzz or use round().
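
Following that advice with round() does indeed restore the intended behavior; here’s a quick check using the same example as before:

rep("econometrics.blog", times = round((1 - 0.8) * 20))
## [1] "econometrics.blog" "econometrics.blog" "econometrics.blog"
## [4] "econometrics.blog"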

But wait: aren’t x and y precisely equal to each other? How can one truncate to 4 while the other truncates to 3? As it turns out, appearances can be deceiving:

identical(x, y)
## [1] FALSE

To find out why these values aren’t equal, we need to learn a bit more about how computers approximate real numbers using doubles.

What you see isn’t always what you get.

R has various handy built-in constants, including \(\pi\)

pi
## [1] 3.141593

Notwithstanding bill number 246 of the 1897 sitting of the Indiana General Assembly, \(\pi\) is an irrational number. By default, however, R only shows us a small number of its digits. To display twenty digits of \(\pi\), we can specify the argument digits to the function print() like so

print(pi, digits = 20)
## [1] 3.141592653589793116

To see even more digits, we can use the function sprintf(). Let’s try to display 60 digits of \(\pi\):

sprintf("%.60f", pi)
## [1] "3.141592653589793115997963468544185161590576171875000000000000"

Why do the last twelve decimal places display as zero? The answer is that computers cannot represent real numbers to infinite precision. At some point, the remaining digits of \(\pi\) get chopped off, and we’re left with an approximation that’s more than sufficient for any practical application.
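
Just how much precision do we get? A handy yardstick, though nothing we strictly need below, is the machine epsilon: the gap between 1 and the next largest representable double. Its value of roughly 2.2e-16 is why a double carries only about 15 to 16 reliable significant digits:

.Machine$double.eps
## [1] 2.220446e-16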

At first glance, the number 0.8 seems nothing like \(\pi\). It is, after all, a rational number: 4/5. But sprintf() reveals that there’s more here than meets the eye:

sprintf("%.54f", 0.8)
## [1] "0.800000000000000044408920985006261616945266723632812500"

The fraction 4/5 cannot be represented exactly as a double; it can only be approximated. The same is true of 1/5. Because 0.2 and 0.8 can only be represented approximately, 1 - 0.8 and 0.2 turn out not to be equal from the computer’s perspective:

identical(1 - 0.8, 0.2)
## [1] FALSE

This in turn explains why (1 - 0.8) * 20 and 0.2 * 20 truncate to different integer values:

sprintf("%.54f", 20 * c(1 - 0.8, 0.2))
## [1] "3.999999999999999111821580299874767661094665527343750000"
## [2] "4.000000000000000000000000000000000000000000000000000000"
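
This is also why comparing computed doubles with == or identical() is asking for trouble. If you need to check whether two such values agree, base R’s all.equal() compares them up to a small numerical tolerance; wrap it in isTRUE() if you want a strict TRUE or FALSE:

isTRUE(all.equal((1 - 0.8) * 20, 0.2 * 20))
## [1] TRUE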

Which decimals have an exact representation?

The fraction 1/3 lacks a finite decimal expansion. You might guess that R would simply store 1/3 as a zero, followed by a decimal point, followed by a large number of 3s. But in fact it does not:

sprintf("%.54f", 1/3)
## [1] "0.333333333333333314829616256247390992939472198486328125"

Every digit from the 1 onward is wrong. The fraction 1/10, on the other hand, clearly does have a finite decimal expansion, 0.1, but R gets this one wrong as well:

sprintf("%.54f", 1/10)
## [1] "0.100000000000000005551115123125782702118158340454101562"

At the same time, it handles 1/4 with perfect accuracy:

sprintf("%.54f", 1/4)
## [1] "0.250000000000000000000000000000000000000000000000000000"

What’s going on? Here’s a clue: the fraction 1/32 can also be represented exactly as a double. See if you can figure out why before reading further.

The reason why 1/3 lacks a terminating decimal expansion is that it can’t be written as a counting number divided by a power of ten. In other words, we can’t find values \(m, n \in \mathbb{N}\) such that \(m/10^n\) equals \(1/3\). In contrast, 3/4 has a finite decimal expansion because it equals 75/100, corresponding to \(m = 75\) and \(n = 2\). Of course I’ve left off a crucial qualification: wherever I wrote “finite decimal expansion” above, I should have added “in base 10.” The same number could have a terminating decimal expansion in one base and not in another.

Although R displays numbers on the screen in base 10, it represents and computes with them in binary. So the question becomes: which values have a terminating expansion in base 2? To find out, simply replace 10 with 2 in the expression from above. A rational number has a terminating expansion in base 2 if it can be written as \(m/2^n\) for some \(m, n \in \mathbb{N}\). Since 1/4 equals \(1/2^2\), it has an exact representation. Since 3/4 can be written as \(3/2^2\), it also has an exact representation. In contrast, 1/5 lacks an exact representation: writing \(1/5 = m/2^n\) would require \(5m = 2^n\), which is impossible because no power of two is divisible by 5. We can get close by making \(n\) large and choosing \(m\) carefully, but we can never satisfy this equation exactly.
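
This also settles the puzzle about 1/32 from above: since \(1/32 = 1/2^5\), it has an exact representation as a double, and sprintf() agrees:

sprintf("%.54f", 1/32)
## [1] "0.031250000000000000000000000000000000000000000000000000"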

Take-home Lesson

High-level programming languages like R and Python are extremely convenient: they allow us to focus on the big picture rather than writing line after line of boilerplate code. But we should never forget that computers cannot represent all numeric values with perfect accuracy. Sometimes this matters. In R, coercing from integer to double is safe but the reverse can be risky, as we have gleaned from a deceptively simple example involving rep(). To learn more about the subtleties of R, I highly recommend The R Inferno by Patrick Burns and Advanced R by Hadley Wickham. Despite what you may have heard, R is quite a capable language, but it does have some quirks!
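
And if you’d like one more example to file away, here’s the classic illustration of the same phenomenon (nothing to do with rep(), just plain arithmetic):

0.1 + 0.2 == 0.3
## [1] FALSE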


  1. Content coming soon!

  2. Health warning: this sentence is satire. The author does not condone the use of STATA or other closed-source statistical packages.
