Syntactic Translations of Berry-Esséen and the CLT

Probably the most studied problem in probability theory is the following: suppose X_1, X_2, X_3, \dots, X_n are independent, identically distributed random variables from some distribution \mathcal{D} on \mathbb{R}, and we look at the random variable

S_n = X_1 + X_2 + \cdots + X_n

How is S_n distributed?

Typically the answer is the central limit theorem (CLT), sharpened by the Berry-Esséen theorem: “a sum of independent random variables is approximately normally distributed”. This post explains two symbolic translations of that sentence which I find easier to write down at a moment’s notice. The first takes the viewpoint of “asymptotic approximation of S_n”,

S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + O(1)

Here \mu is the mean of \mathcal{D} and \sigma is the standard deviation of \mathcal{D}. The second is

S_n = Y_1 + Y_2 + \cdots + Y_n + O(1)

where the Y_i are i.i.d. Gaussian random variables with the same mean and variance as the X_i, Y_i \sim \mathcal{N}(\mu, \sigma^2). This can be thought of as an “invariance principle” or “replacement lemma” (and this is the viewpoint taken by Lindeberg in his proof of the CLT). The invariance principle is now a standard tool in Boolean Fourier analysis.

Crucial remark: Unfortunately, the equality of these “syntactic forms” is NOT a statement of convergence in distribution. We’ve “blown up” errors by a factor of \sqrt{n} relative to the convergence guaranteed by the CLT. Dividing both sides by \sqrt{n} normalizes both sides to have constant variance and recovers the convergence guaranteed by an appropriate CLT.
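
Explicitly, dividing the first form by \sqrt{n} gives

\frac{S_n - n\mu}{\sqrt{n}} = \mathcal{N}(0, \sigma^2) + O(\frac{1}{\sqrt{n}})

and the error term on the right now vanishes as n \to \infty.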

Asymptotics for S_n

The law of large numbers says that, if \mu is finite, “the sample mean converges to the true mean”, i.e.

S_n / n \to \mu

In our notation, this is a “linear approximation to S_n”,

S_n = n\mu + o(n)
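
As a quick numerical sanity check, here is a minimal sketch in Python/NumPy (the choice of \mathcal{D} as the Exponential(1) distribution, so \mu = 1, is only an illustrative assumption): the sample mean S_n / n settles toward \mu, while the raw error S_n - n\mu is o(n) but need not stay bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0  # mean of the Exponential(1) distribution standing in for D

for n in [10**2, 10**4, 10**6]:
    X = rng.exponential(scale=1.0, size=n)  # X_1, ..., X_n i.i.d. from D
    S_n = X.sum()
    # sample mean approaches mu; the raw error S_n - n*mu grows, but slower than n
    print(n, S_n / n, S_n - n * mu)
```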

The central limit theorem refines this by saying that, in fact, the sample mean is typically \Theta(\frac{1}{\sqrt{n}}) away from the true mean. If \mu and \sigma are finite,

\frac{(S_n - n \mu)}{\sqrt{n}} \to \mathcal{N}(0,\sigma^2)

In our notation, this is a “\sqrt{n} approximation to S_n”,

S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + o(\sqrt{n})
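
A sketch of this \sqrt{n} approximation in the same illustrative setup (Exponential(1), so \mu = \sigma = 1): across many independent copies of S_n, the normalized quantity (S_n - n\mu)/\sqrt{n} behaves like a sample from \mathcal{N}(0, \sigma^2).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0      # Exponential(1) has mean 1 and variance 1
n, trials = 1000, 10000

# 'trials' independent copies of S_n = X_1 + ... + X_n
S = rng.exponential(scale=1.0, size=(trials, n)).sum(axis=1)
Z = (S - n * mu) / np.sqrt(n)          # should look like N(0, sigma^2)

print(Z.mean(), Z.std())               # approximately 0 and sigma
print(np.mean(np.abs(Z) <= sigma))     # approximately 0.68, the standard Gaussian value
```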

The Berry-Esséen theorem strengthens the convergence to the normal distribution. If \rho = E(|X_1 - \mu|^3) denotes (an upper bound on) the third absolute central moment of the distribution \mathcal{D}, which we assume to be finite,

|S_n - n\mu - \sqrt{n}\mathcal{N}(0,\sigma^2)| \leq O(\frac{\rho}{\sigma^2})

In our notation, this improves the error term in the asymptotic approximation of S_n to a constant,

S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + O(\frac{\rho}{\sigma^2})

It should be noted that the Berry-Esséen theorem is tight up to the big-Oh. The binomial distribution achieves this rate of convergence; see the end of these lecture notes for a proof.
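
To see the 1/\sqrt{n} rate concretely, here is a sketch for the Rademacher case (X_i uniform on \{\pm 1\}, so S_n is a shifted binomial with \mu = 0 and \sigma = \rho = 1): the Kolmogorov distance between the CDF of S_n/\sqrt{n} and the standard normal CDF, computed exactly, decays like 1/\sqrt{n} and no faster.

```python
import numpy as np
from math import erf, lgamma, log, sqrt

def Phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def kolmogorov_distance(n):
    # S_n = sum of n uniform +/-1 variables, i.e. S_n = 2*Binomial(n, 1/2) - n
    k = np.arange(n + 1)
    log_pmf = np.array([lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1) for i in k]) - n * log(2)
    pmf = np.exp(log_pmf)
    cdf = np.cumsum(pmf)                      # P(S_n <= 2k - n)
    x = (2 * k - n) / sqrt(n)                 # standardized jump points (sigma = 1)
    phi = np.array([Phi(t) for t in x])
    # sup_x |F_n(x) - Phi(x)|, checked at each jump from both sides
    return max(np.max(np.abs(cdf - phi)), np.max(np.abs(cdf - pmf - phi)))

for n in [16, 64, 256, 1024]:
    d = kolmogorov_distance(n)
    print(n, d, d * sqrt(n))   # last column stays roughly constant (about 0.4)
```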

Invariance Principles

The previous analysis ends with the distribution n \mu + \sqrt{n} \mathcal{N}(0, \sigma^2). We can incorporate everything into one Gaussian to produce the equivalent distribution \mathcal{N}(n\mu, n\sigma^2). Interestingly, because the sum of independent Gaussians is again Gaussian, this distribution has the same PDF as

Y_1 + Y_2 + \cdots + Y_n

for independent Gaussian random variables Y_i \sim \mathcal{N}(\mu, \sigma^2). This leads us to the equality

S_n = X_1 + X_2 + \cdots + X_n = Y_1 + Y_2 + \cdots + Y_n + O(1)

in which we’ve simply replaced each X_i by a Gaussian random variable with the same mean and variance.
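
Here is a sketch of that replacement, again under the illustrative assumption \mathcal{D} = Exponential(1): the distributions of \sum X_i and \sum Y_i sit within O(1) of each other (compare quantiles), even though both sums live on a scale of n.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0
n, trials = 400, 20000

S_X = rng.exponential(scale=1.0, size=(trials, n)).sum(axis=1)       # sum of the X_i
S_Y = rng.normal(loc=mu, scale=sigma, size=(trials, n)).sum(axis=1)  # sum of the Gaussian Y_i

# the quantiles differ by O(1) while the sums themselves are around n = 400
for q in [0.1, 0.5, 0.9]:
    print(q, np.quantile(S_X, q), np.quantile(S_Y, q))
```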

As noted in Remark 29 here, one can improve the constant term by changing the replacement variables. In particular, it can be made o(1) if the first, second, and third moments of the Y_i agree with those of the X_i.

The idea of “replacement invariance” surfaces in theoretical CS in the context of Boolean Fourier analysis. Here we generalize the summation of Boolean (\{\pm1\}-valued) variables X_1 + X_2 + \cdots + X_n to an arbitrary Boolean function

f(X_1, X_2, \dots, X_n)

The invariance principle states that, for a function f: \{\pm 1\}^n \to \mathbb{R} with the X_i drawn uniformly from \{\pm 1\}, the random variable f(X_1, X_2, \dots, X_n) is close in distribution to the random variable

f(Y_1, Y_2, \dots, Y_n)

for independent Y_i \sim \mathcal{N}(0, 1) (assuming we normalize both sides to have variance 1). In this case the “closeness” is determined by the maximum influence of any variable X_i on the value of f, as well as the complexity of f (its degree as a multilinear polynomial); see the previously linked lecture notes for an exact quantitative statement.
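
As a toy illustration (a sketch of the phenomenon, not the general theorem): take f to be the normalized degree-2 multilinear polynomial f(x) = \binom{n}{2}^{-1/2} \sum_{i < j} x_i x_j, whose influences are all O(1/n). Its distribution under uniform \pm 1 inputs and under standard Gaussian inputs should then nearly coincide, even though neither is Gaussian (both approach a shifted chi-squared shape).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 20000

def f(x):
    # normalized degree-2 multilinear polynomial sum_{i<j} x_i x_j / sqrt(C(n, 2));
    # every coordinate has influence O(1/n)
    s = x.sum(axis=1)
    return (s**2 - (x**2).sum(axis=1)) / (2.0 * np.sqrt(n * (n - 1) / 2.0))

X = rng.choice([-1.0, 1.0], size=(trials, n))   # uniform +/-1 inputs
Y = rng.normal(size=(trials, n))                # independent N(0, 1) inputs

# f(X) and f(Y) are close in distribution; compare a few quantiles
fX, fY = f(X), f(Y)
for q in [0.1, 0.5, 0.9]:
    print(q, np.quantile(fX, q), np.quantile(fY, q))
```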