# Syntactic Translations of Berry-Esséen and the CLT

Probably the most studied problem in probability theory is the following: suppose $X_1, X_2, X_3, \dots, X_n$ are independent, identically distributed random variables from some distribution $\mathcal{D}$ on $\mathbb{R}$, and we look at the random variable

$S_n = X_1 + X_2 + \cdots + X_n$

How is $S_n$ distributed?

Typically the answer is the central limit theorem (CLT), strengthened by the Berry-Esséen theorem: “a sum of independent variables is approximately normally distributed”. This post explains two symbolic translations of that sentence which I find easier to write down at a moment’s notice. The first takes the viewpoint of “asymptotic approximation of $S_n$”,

$S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + O(1)$

Here $\mu$ is the mean of $\mathcal{D}$ and $\sigma$ is the standard deviation of $\mathcal{D}$. The second is

$S_n = Y_1 + Y_2 + \cdots + Y_n + O(1)$

where the $Y_i$ are i.i.d. Gaussian random variables with the same mean and variance as the $X_i$, that is, $Y_i \sim \mathcal{N}(\mu, \sigma^2)$. This can be thought of as an “invariance principle” or “replacement lemma” (and this is the viewpoint taken by Lindeberg in his proof of the CLT). The invariance principle is now a tool used in Boolean Fourier analysis.

Crucial remark: Unfortunately, the equality in these “syntactic forms” is NOT itself a statement of convergence in distribution. We’ve “blown up” errors by a factor of $\sqrt{n}$ relative to the convergence guaranteed by the CLT. Dividing both sides by $\sqrt{n}$ normalizes them to have bounded variance and recovers convergence in distribution (or the stronger quantitative convergence guaranteed by an appropriate CLT, such as Berry-Esséen).
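To make the rescaling concrete, here is the first syntactic form with both sides centered and divided by $\sqrt{n}$ (a heuristic manipulation in the same informal notation, not a formal statement):

$\frac{S_n - n\mu}{\sqrt{n}} = \mathcal{N}(0, \sigma^2) + O\left(\frac{1}{\sqrt{n}}\right)$

The $O(1)$ error on the scale of $S_n$ becomes an $O(1/\sqrt{n})$ error on the normalized scale, which is the vanishing error that a CLT-type statement controls.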

## Asymptotics for $S_n$

The law of large numbers says that, if $\mu$ is finite, “the sample mean converges to the true mean”, i.e.

$S_n / n \to \mu$

In our notation, this is a “linear approximation to $S_n$”,

$S_n = n\mu + o(n)$
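A quick numerical sanity check of the linear approximation, as a minimal sketch. The choice of $\mathcal{D}$ as Exponential(1) (so $\mu = 1$), the checkpoints, and the helper name `running_sample_means` are all assumptions made for illustration.

```python
import random

def running_sample_means(checkpoints, seed=0):
    """Return {n: S_n / n} for X_i ~ Exponential(1), which has mean mu = 1."""
    rng = random.Random(seed)
    total = 0.0   # running value of S_n
    drawn = 0     # number of X_i drawn so far
    means = {}
    for n in sorted(checkpoints):
        while drawn < n:
            total += rng.expovariate(1.0)
            drawn += 1
        means[n] = total / n
    return means

means = running_sample_means([100, 10_000, 100_000])
for n, m in means.items():
    # |S_n / n - mu| should shrink as n grows, i.e. S_n = n*mu + o(n).
    print(n, m, abs(m - 1.0))
```

The printed gap $|S_n/n - \mu|$ should shrink as $n$ grows, which is all that the $o(n)$ term asserts.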

The central limit theorem refines this by saying that, in fact, the sample mean is about $\Theta(\frac{1}{\sqrt{n}})$ away from the true mean. If $\mu$ and $\sigma$ are finite,

$\frac{(S_n - n \mu)}{\sqrt{n}} \to \mathcal{N}(0,\sigma^2)$

In our notation, this is a “$\sqrt{n}$ approximation to $S_n$”,

$S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + o(\sqrt{n})$
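The $\sqrt{n}$ approximation can be probed by simulation. The sketch below assumes $\mathcal{D}$ is Uniform(0, 1) (so $\mu = 1/2$ and $\sigma^2 = 1/12$); the sample sizes and the helper name `normalized_sums` are illustrative choices, not part of the original argument.

```python
import math
import random
import statistics

def normalized_sums(n, trials, seed=0):
    """Draw `trials` copies of (S_n - n*mu) / sqrt(n) for X_i ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    mu = 0.5
    return [
        (sum(rng.random() for _ in range(n)) - n * mu) / math.sqrt(n)
        for _ in range(trials)
    ]

sigma = math.sqrt(1 / 12)  # standard deviation of Uniform(0, 1)
samples = normalized_sums(n=200, trials=2000)

# If the normalized sum is close to N(0, sigma^2), its spread should be near
# sigma and roughly 68% of the draws should land within one sigma of zero.
sample_std = statistics.pstdev(samples)
within_one_sigma = sum(abs(z) <= sigma for z in samples) / len(samples)
print(sample_std, sigma, within_one_sigma)
```

Both summary statistics are crude; they only check that the normalized sum has the right scale and roughly Gaussian mass near the center.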

The Berry-Esséen theorem quantifies the rate of convergence to the normal distribution. If $\rho = E(|X_1 - \mu|^3)$ denotes (an upper bound on) the third absolute central moment of the distribution $\mathcal{D}$, which we assume to be finite,

$|S_n - n\mu - \sqrt{n}\mathcal{N}(0,\sigma^2)| \leq O(\frac{\rho}{\sigma^2})$

In our notation, this is an improvement up to a constant to the asymptotic distribution of $S_n$,

$S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + O(\frac{\rho}{\sigma^2})$
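One way to see where the $\rho/\sigma^2$ comes from, sketched under the assumption that the theorem is taken in its classical Kolmogorov-distance form. Here $\Phi$ is the standard normal CDF, $C$ is an absolute constant, and $\rho$ is the third absolute moment of $X_1 - \mu$; the symbols $\Phi$ and $C$ are introduced only for this illustration.

$\sup_x \left| \Pr\left[ \frac{S_n - n\mu}{\sigma\sqrt{n}} \leq x \right] - \Phi(x) \right| \leq \frac{C\rho}{\sigma^3 \sqrt{n}}$

Heuristically, a CDF error of $\epsilon$ on the normalized scale moves quantiles by about $\epsilon$, and undoing the normalization multiplies by $\sigma\sqrt{n}$:

$\sigma\sqrt{n} \cdot \frac{C\rho}{\sigma^3\sqrt{n}} = \frac{C\rho}{\sigma^2}$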

It should be noted that the Berry-Esséen theorem is tight up to the big-Oh. The binomial distribution achieves this rate of convergence; see the end of these lecture notes for a proof.
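The tightness claim can be explored numerically for fair coin flips, where $S_n \sim \mathrm{Binomial}(n, 1/2)$ and the Kolmogorov distance to $\mathcal{N}(n/2, n/4)$ can be computed exactly. The sketch below is an illustration rather than a proof; the function names and the choice of $n$ values are assumptions.

```python
import math

def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def kolmogorov_distance_fair_coins(n):
    """Exact sup_x |P(S_n <= x) - P(N(n/2, n/4) <= x)| for S_n ~ Binomial(n, 1/2)."""
    mean, std = n / 2, math.sqrt(n) / 2
    total = 2 ** n
    cdf_before = 0.0  # P(S_n < k), the left limit of the step CDF at k
    worst = 0.0
    for k in range(n + 1):
        cdf_after = cdf_before + math.comb(n, k) / total  # P(S_n <= k)
        gauss = normal_cdf((k - mean) / std)
        # The supremum against a continuous CDF is attained at a jump point,
        # approached from the left or from the right.
        worst = max(worst, abs(cdf_before - gauss), abs(cdf_after - gauss))
        cdf_before = cdf_after
    return worst

scaled = {n: math.sqrt(n) * kolmogorov_distance_fair_coins(n) for n in (100, 400, 900)}
for n, value in scaled.items():
    print(n, value)
```

If the $1/\sqrt{n}$ rate is tight, the scaled distances should level off near a constant rather than decay to zero. For even $n$ the jump at $n/2$ suggests a limit of about $1/\sqrt{2\pi}$.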

## Invariance Principles

The previous analysis ends with the distribution $n \mu + \sqrt{n} \mathcal{N}(0, \sigma^2)$. We can incorporate everything into one Gaussian to produce the equivalent distribution $\mathcal{N}(n\mu, n\sigma^2)$. Interestingly, because the sum of independent Gaussians is again Gaussian, this distribution has the same PDF as

$Y_1 + Y_2 + \cdots + Y_n$

for independent Gaussian random variables $Y_i \sim \mathcal{N}(\mu, \sigma^2)$. This leads us to the equality

$S_n = X_1 + X_2 + \cdots + X_n = Y_1 + Y_2 + \cdots + Y_n + O(1)$

in which we’ve simply replaced each $X_i$ by a Gaussian random variable with the same mean and variance.
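The closure of Gaussians under independent sums, used above, can be checked with characteristic functions. A short derivation follows, where the notation $\varphi$ for the characteristic function is an addition of this sketch:

$\varphi_{Y_j}(t) = E[e^{itY_j}] = \exp\left(i\mu t - \frac{\sigma^2 t^2}{2}\right)$

$\varphi_{Y_1 + \cdots + Y_n}(t) = \prod_{j=1}^{n} \varphi_{Y_j}(t) = \exp\left(i n\mu t - \frac{n\sigma^2 t^2}{2}\right)$

The right-hand side is the characteristic function of $\mathcal{N}(n\mu, n\sigma^2)$, which is also the law of $n\mu + \sqrt{n}\mathcal{N}(0, \sigma^2)$.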

As noted in Remark 29 here, one can improve the constant term by changing the replacement variables. In particular, it can be made $o(1)$ if the first, second, and third moments of the $Y_i$ agree with those of the $X_i$.

The idea of “replacement invariance” surfaces in theoretical CS in the context of Boolean Fourier analysis. Here we generalize the summation of Boolean ($\{\pm1\}$-valued) variables $X_1 + X_2 + \cdots + X_n$ to an arbitrary Boolean function

$f(X_1, X_2, \dots, X_n)$

The invariance principle states that this random variable, for $f: \{\pm 1\}^n \to \mathbb{R}$ and $X_i$ uniformly drawn from $\{\pm 1\}$, is close in distribution to the random variable

$f(Y_1, Y_2, \dots, Y_n)$

for $Y_i \sim \mathcal{N}(0, 1)$ (assuming we normalize both sides to have variance $1$). In this case the “closeness” is determined by the maximum influence of a variable $X_i$ on the value of $f$, as well as the complexity of $f$ (its degree as a multilinear polynomial); see the previously linked lecture notes for an exact quantitative statement.
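A small experiment can illustrate the role of influence. The sketch below compares $E|f|$ under $\pm 1$ inputs and under Gaussian inputs for two variance-1 functions: a symmetric degree-2 polynomial in which every variable has small influence, and the “dictator” $f(x) = x_1$ in which one variable has all of it. The two functions, the test statistic $E|f|$, and all names here are assumptions chosen for illustration, not the quantitative statement from the lecture notes.

```python
import math
import random

def pairwise_poly(xs):
    """Normalized degree-2 polynomial: sum_{i<j} x_i x_j / sqrt(C(n, 2)).

    Uses the identity sum_{i<j} x_i x_j = ((sum x_i)^2 - sum x_i^2) / 2.
    """
    n = len(xs)
    s = sum(xs)
    sq = sum(x * x for x in xs)
    return (s * s - sq) / 2 / math.sqrt(n * (n - 1) / 2)

def dictator(xs):
    """f(x) = x_1: a single variable carries all of the influence."""
    return xs[0]

def rademacher(rng):
    return rng.choice((-1.0, 1.0))

def gaussian(rng):
    return rng.gauss(0.0, 1.0)

def mean_abs(f, sampler, n, trials, rng):
    """Monte Carlo estimate of E|f(Z_1, ..., Z_n)| for i.i.d. Z_i from `sampler`."""
    total = 0.0
    for _ in range(trials):
        total += abs(f([sampler(rng) for _ in range(n)]))
    return total / trials

def invariance_gap(f, n=64, trials=5000, seed=0):
    """Estimate |E|f(X)| - E|f(Y)|| for X uniform on {-1, 1}^n and Y standard Gaussian."""
    rng = random.Random(seed)
    return abs(mean_abs(f, rademacher, n, trials, rng) - mean_abs(f, gaussian, n, trials, rng))

print("low influence:", invariance_gap(pairwise_poly))
print("dictator:     ", invariance_gap(dictator))
```

The gap for the low-influence polynomial should be small (up to Monte Carlo noise), while the dictator shows a visible gap, since $E|X_1| = 1$ but $E|Y_1| = \sqrt{2/\pi}$.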