Syntactic Translations of Berry-Esséen and the CLT

Probably the most studied problem in probability theory is the following: suppose $X_1, X_2, X_3, \dots, X_n$ are independent, identically distributed random variables from some distribution $\mathcal{D}$ on $\mathbb{R}$, and we look at the random variable

$S_n = X_1 + X_2 + \cdots + X_n$

How is $S_n$ distributed?

Typically the answer is the central limit theorem (CLT) amplified with the Berry-Esséen theorem, “a sum of independent variables is approximately normally distributed”. This post explains two symbolic translations of that sentence which I find easier to write down at a moment’s notice. The first takes the viewpoint of “asymptotic approximation of $S_n$“,

$S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + O(1)$

Here $\mu$ is the mean of $\mathcal{D}$ and $\sigma$ is the standard deviation of $\mathcal{D}$. The second is

$S_n = Y_1 + Y_2 + \cdots + Y_n + O(1)$

where the $Y_i$ are i.i.d Gaussian random variables with the same mean and variance as the $X_i$, $Y_i \sim \mathcal{N}(\mu, \sigma^2)$. This can be thought of as an “invariance principle” or “replacement lemma” (and this is the viewpoint taken by Lindeburg in his proof of the CLT). The invariance principle is now a tool used in Boolean fourier analysis.

Crucial remark: Unfortunately, the equality of these “syntactic forms” is NOT any convergence in distribution. We’ve “blown up” errors by a factor of $\sqrt{n}$ from the convergence guaranteed by the CLT. Dividing both sides by $\sqrt{n}$ normalizes both sides to have finite variance and gives convergence in probability (or the stronger convergence guaranteed by an appropriate CLT).

Asymptotics for $S_n$

The law of large numbers says that, if $\mu$ is finite, “the sample mean converges to the true mean”, i.e.

$S_n / n \to \mu$

In our notation, this is a “linear approximation to $S_n$“,

$S_n = n\mu + o(n)$

The central limit theorem refines this by saying that, in fact, the sample mean is about $\Theta(\frac{1}{\sqrt{n}})$ away from the true mean. If $\mu$ and $\sigma$ are finite,

$\frac{(S_n - n \mu)}{\sqrt{n}} \to \mathcal{N}(0,\sigma^2)$

In our notation, this is a “$\sqrt{n}$ approximation to $S_n$“,

$S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + o(\sqrt{n})$

The Berry-Esséen theorem strengthens the convergence to the normal distribution. If $\rho = E(|X_1|^3)$ denotes (an upper bound on) the third moment of the distribution $\mathcal{D}$, which we assume to be finite,

$|S_n - n\mu - \sqrt{n}\mathcal{N}(0,\sigma^2)| \leq O(\frac{\rho}{\sigma^2})$

In our notation, this is an improvement up to a constant to the asymptotic distribution of $S_n$,

$S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + O(\frac{\rho}{\sigma^2})$

It should be noted that the Berry-Esséen theorem is tight up to the big-Oh. The binomial distribution achieves this rate of convergence; see the end of these lecture notes for a proof.

Invariance Principles

The previous analysis ends with the distribution $n \mu + \sqrt{n} \mathcal{N}(0, \sigma^2)$. We can incorporate everything into one Gaussian to produce the equivalent distribution $\mathcal{N}(n\mu, n\sigma^2)$. Interestingly, because the sum of independent Gaussians is again Gaussian, this distribution has the same PDF as

$Y_1 + Y_2 + \cdots + Y_n$

for independent Gaussian random variables $Y_i \sim \mathcal{N}(\mu, \sigma^2)$. This leads us to the equality

$S_n =X_1 + X_2 + \cdots + X_n = Y_1 + Y_2 + \cdots + Y_n + O(1)$

in which we’ve simply replaced each $X_i$ by a Gaussian random variable with the same mean and variance.

As noted in Remark 29 here, one can improve the constant term by changing the replacement variables. In particular, it can be made $o(1)$ if the first, second, and third moments of the $Y_i$ agree with the $X_i$.

The idea of “replacement invariance” surfaces in theoretical CS in the context of Boolean fourier analysis. Here we generalize the summation of Boolean ($\{\pm1\}$-valued) variables $X_1 + X_2 + \cdots + X_n$ to an arbitrary Boolean function

$f(X_1, X_2, \dots, X_n)$

The invariance principle states that the random variable $f: \{\pm 1\} \to \mathbb{R}$ for $X_i$ uniformly drawn from $\{\pm 1\}$ is close in probability to the random variable

$f(Y_1, Y_2, \dots, Y_n)$

for $Y_i \sim \mathcal{N}(0, 1)$ (assuming we normalize both sides to have variance $1$). In this case the “closeness” is determined by the maximum influence of a variable $X_i$ on the value of $f$, as well as the complexity of $f$ (its degree as a multilinear polynomial); see the previously linked lecture notes for an exact quantitative statement.

Approximating Binomial Coefficients with Entropy

This post recounts the story of one of my favorite “probabilistic method” proofs,

$\log \binom{n}{k} \leq n H(k/n)$

where $H(p)$ is the binary entropy function i.e. the entropy of a coin flipped with bias $p$. Maybe more properly the proof is just counting in two ways?

By a tiny tweak to our entropy argument as done here, one can get a better bound

$\log\displaystyle\sum_{i = 0}^{k} \binom{n}{k} \leq nH(k / n)$

This is what’s typically used in practice to bound e.g. the sum of the first $1/3$ of the binomial coefficients by $2^{n H(1/3)}$. The proof here gives all the essential ideas for that proof. With Stirling’s approximation used to give error bounds, both approximations are tight when $k = \Theta(n)$,

$\binom{n}{k} = (1+o(1))2^{n H(k / n)}$

The experiment: Let the random variable $X$ denote the result of flipping $n$ fair coins i.e. $X$ is binomially distributed. Now suppose we know $X$ has exactly $k$ Heads.

The question: How much entropy is in the conditional distribution of $X$ given $k$ Heads? I.e. what is $H(X \mid k\text{ Heads})$?

The left side: The conditional distribution of $X$ is uniform over $\binom{n}{k}$ possible results. The entropy of the uniform distribution on $N$ objects is $\log N$, therefore

$H(X \mid k\text{ Heads}) = \log \binom{n}{k}$.

The right side: On the other hand, $X$ is specified by its bits $X_i$, so

$H(X \mid k \text{ Heads}) = H(X_1, X_2, \dots, X_n \mid k \text{ Heads})$

$H(X_1, X_2, \dots, X_n \mid k \text{ Heads}) \leq H(X_1 \mid k \text{ Heads}) + H(X_2 \mid k \text{ Heads}) + \cdots + H(X_n \mid k \text{ Heads})$

This expression is easy to compute: when flipping $k$ Heads of $n$ flips, each $X_i$ looks like a coin with bias $k /n$, therefore its entropy is $H(k/n)$. So

$H(X_1 \mid k \text{ Heads}) + H(X_2 \mid k \text{ Heads}) + \cdots + H(X_n \mid k \text{ Heads})= n H(k/n)$

giving the right-hand side.

Further Generalization: Multinomial Coefficients

We can naturally generalize the above argument to provide an approximation for multinomial coefficients. Fix natural numbers $k_1, k_2, \dots, k_m$ with $\sum_i k_i = n$. Let $K$ be a distribution over $[m]$ taking $i$ with probability $k_i / n$. Then

$\log \binom{n}{k_1, k_2, \dots, k_m} \leq nH(K)$

The proof generalizes perfectly: the underlying set is $[m]^{n}$ (place a letter from $\{1, 2, \dots, m\}$ at each of $n$ indices), and we consider the entropy of a uniform choice of a string constrained to have the fixed letter histogram $k_1, k_2, \dots, k_m$. This is exactly the left-hand side, and subadditivity yields the right-hand side.

Moreover, by Stirling again the approximation is tight when $k_i = \Theta(n)$.

Calculus-Free Asymptotic for Harmonic Numbers

Here we prove that $H_n = \Theta(\log_2 n)$ without touching calculus (the most common proof interprets $H_n$ as a Riemann sum of $\int 1/x$). Seeing as this is probably the most fundamental asymptotic for computer scientists, I was surprised to see this Stack Exchange question didn’t mention the calculus-free approach. The proof here pretty much matches my own answer to the Stack Exchange post.

We first consider powers of 2, $n = 2^k$. For these $n$, we can break up $\sum \frac{1}{j}$ into the “chunks”

$1$

$\frac{1}{2} + \frac{1}{3}$

$\frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7}$

$\dots$

$\frac{1}{2^{k-1}} + \cdots + \frac{1}{2^k - 1}$

and we have one extra term $1/2^k$. There are $k + 1 = \log_2 n + 1$ chunks, and of course $H_n$ is the sum of these chunks (plus the extra term), hence to show $H_n = \Theta(\log_2 n)$ we show each chunk is $\Theta(1)$. Inside each chunk we bound above by the power of 2:

$1 \leq 1$

$\frac{1}{2} + \frac{1}{3} \leq \frac{1}{2} + \frac{1}{2} = 1$

$\frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7} \leq \frac{1}{4} + \frac{1}{4} + \frac{1}{4} + \frac{1}{4} =1$

$\dots$

$\frac{1}{2^{k-1}} + \cdots + \frac{1}{2^k - 1} \leq \frac{1}{2^{k-1}} + \cdots + \frac{1}{2^{k - 1}} = 1$

Thus each chunk is at most 1, and taken together with the extra term $\frac{1}{2^k}$, we have $H_n \leq \log_2 n + \frac{1}{2^k} \leq \log_2 n + 1$.

On the other hand, if we lower bound the elements by the next power of 2,

$1 \geq 1$

$\frac{1}{2} + \frac{1}{3} \geq \frac{1}{4} + \frac{1}{4} = \frac{1}{2}$

$\frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7} \geq \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} =\frac{1}{2}$

$\dots$

$\frac{1}{2^{k-1}} + \cdots + \frac{1}{2^k - 1} \geq \frac{1}{2^k} + \cdots + \frac{1}{2^k} = \frac{1}{2}$

Every chunk is at least $\frac{1}{2}$, hence $H_n \geq \frac{1}{2}\log_2 n$.

This essentially is the proof. One technicality: we’ve only shown $H_n = \Theta (\log_2 n)$ for $n$ a power of 2. Monotonicity of $H_n$ completes the proof for all $n$: taking the nearest powers of 2 above and below $n$, call them $n_-$ and $n_+$,

$n/2 < n_- < n < n_+ < 2n$

Applying our bounds on $H_{n_-}$ and $H_{n_+}$,

$H_n \leq H_{n_+} \leq \log_2 n_+ + 1 < \log_2 (2n) +1= \log_2 n + 2$

$H_n \geq H_{n_-} \geq \frac{1}{2}\log_2 n_- > \frac{1}{2}\log_2 (n/2) = \frac{1}{2}\log_2 n - \frac{1}{2}$

Thus $H_n = \Theta(\log_2 n)$.

(In general, for any monotonic linear or sublinear function, it suffices to show $\Theta(\cdot)$ on only the powers of 2).