Syntactic Translations of Berry-Esséen and the CLT

Probably the most studied problem in probability theory is the following: suppose X_1, X_2, X_3, \dots, X_n are independent, identically distributed random variables from some distribution \mathcal{D} on \mathbb{R}, and we look at the random variable

S_n = X_1 + X_2 + \cdots + X_n

How is S_n distributed?

Typically the answer is the central limit theorem (CLT) amplified with the Berry-Esséen theorem, “a sum of independent variables is approximately normally distributed”. This post explains two symbolic translations of that sentence which I find easier to write down at a moment’s notice. The first takes the viewpoint of “asymptotic approximation of S_n“,

S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + O(1)

Here \mu is the mean of \mathcal{D} and \sigma is the standard deviation of \mathcal{D}. The second is

S_n = Y_1 + Y_2 + \cdots + Y_n + O(1)

where the Y_i are i.i.d Gaussian random variables with the same mean and variance as the X_i, Y_i \sim \mathcal{N}(\mu, \sigma^2). This can be thought of as an “invariance principle” or “replacement lemma” (and this is the viewpoint taken by Lindeburg in his proof of the CLT). The invariance principle is now a tool used in Boolean fourier analysis.

Crucial remark: Unfortunately, the equality of these “syntactic forms” is NOT any convergence in distribution. We’ve “blown up” errors by a factor of \sqrt{n} from the convergence guaranteed by the CLT. Dividing both sides by \sqrt{n} normalizes both sides to have finite variance and gives convergence in probability (or the stronger convergence guaranteed by an appropriate CLT).

Asymptotics for S_n

The law of large numbers says that, if \mu is finite, “the sample mean converges to the true mean”, i.e.

S_n / n \to \mu

In our notation, this is a “linear approximation to S_n“,

S_n = n\mu + o(n)

The central limit theorem refines this by saying that, in fact, the sample mean is about \Theta(\frac{1}{\sqrt{n}}) away from the true mean. If \mu and \sigma are finite,

\frac{(S_n - n \mu)}{\sqrt{n}} \to \mathcal{N}(0,\sigma^2)

In our notation, this is a “\sqrt{n} approximation to S_n“,

S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + o(\sqrt{n})

The Berry-Esséen theorem strengthens the convergence to the normal distribution. If \rho = E(|X_1|^3) denotes (an upper bound on) the third moment of the distribution \mathcal{D}, which we assume to be finite,

|S_n - n\mu - \sqrt{n}\mathcal{N}(0,\sigma^2)| \leq O(\frac{\rho}{\sigma^2})

In our notation, this is an improvement up to a constant to the asymptotic distribution of S_n,

S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + O(\frac{\rho}{\sigma^2})

It should be noted that the Berry-Esséen theorem is tight up to the big-Oh. The binomial distribution achieves this rate of convergence; see the end of these lecture notes for a proof.

Invariance Principles

The previous analysis ends with the distribution n \mu + \sqrt{n} \mathcal{N}(0, \sigma^2). We can incorporate everything into one Gaussian to produce the equivalent distribution \mathcal{N}(n\mu, n\sigma^2). Interestingly, because the sum of independent Gaussians is again Gaussian, this distribution has the same PDF as

Y_1 + Y_2 + \cdots + Y_n

for independent Gaussian random variables Y_i \sim \mathcal{N}(\mu, \sigma^2). This leads us to the equality

S_n =X_1 + X_2 + \cdots + X_n =  Y_1 + Y_2 + \cdots + Y_n + O(1)

in which we’ve simply replaced each X_i by a Gaussian random variable with the same mean and variance.

As noted in Remark 29 here, one can improve the constant term by changing the replacement variables. In particular, it can be made o(1) if the first, second, and third moments of the Y_i agree with the X_i.

The idea of “replacement invariance” surfaces in theoretical CS in the context of Boolean fourier analysis. Here we generalize the summation of Boolean (\{\pm1\}-valued) variables X_1 + X_2 + \cdots + X_n to an arbitrary Boolean function

f(X_1, X_2, \dots, X_n)

The invariance principle states that the random variable f: \{\pm 1\} \to \mathbb{R} for X_i uniformly drawn from \{\pm 1\} is close in probability to the random variable

f(Y_1, Y_2, \dots, Y_n)

for Y_i \sim \mathcal{N}(0, 1) (assuming we normalize both sides to have variance 1). In this case the “closeness” is determined by the maximum influence of a variable X_i on the value of f, as well as the complexity of f (its degree as a multilinear polynomial); see the previously linked lecture notes for an exact quantitative statement.

How I Remember the Chernoff Bound

The Chernoff bound is of critical importance in computer science and probability. When I need it, though, I almost always have to look it up to make sure I get all the parameters correct. Not any more, though. I’ll explain how the Chernoff bound is essentially the PDF of the “Gaussian you would expect” from the Central Limit Theorem.

The most common form of the Chernoff bound is the following: suppose you have n independent and identically distributed coin flips X_1, X_2, \dots, X_n, say the result of repeatedly flipping a coin that comes up Heads with probability p. The number of Heads is by definition a Binomial distribution; let S denote this random variable. Then with all but exponentially small probability, S is within O(\sqrt{n}) of its mean:

\text{Pr}[\left|S - np\right| > \delta \sqrt{n}] \leq 2\exp(-\delta^2 / 2p(1-p))

The Central Limit Theorem (CLT) tells us that as n \to \infty, the distribution (S - \mathbb{E}(S)) / \sqrt{\text{Var}(S)} approaches a standard normal distribution. Philosophically, S approaches a normal distribution with the parameters you would expect: mean \mathbb{E}(S) and variance \text{Var}(S) (though the “scaled up” distributions may not necessarily converge from the CLT alone).

In our case, \mathbb{E}(S) = np and \text{Var}(S) = np(1-p). The PDF of the “Gaussian you would expect” is

p(S = x) \approx C \cdot \exp(-(x - np)^2 / 2np(1-p))

for an appropriate normalizing constant C = 1/\sqrt{2\pi n p(1-p)}. Let’s rewrite that with x = np + \delta\sqrt{n}:

p(S = np + \delta\sqrt{n}) \approx C \cdot \exp(-(\delta \sqrt{n})^2 / 2np(1-p)) =C \cdot \exp(-\delta^2/2p(1-p))

This is (up to the factor of C) the expression appearing in the Chernoff bound! Thus we can think of the Chernoff bound as expressing an “even when n is small” version of the CLT, with a little bit of loss from 1/\sqrt{2\pi} to 2 in the multiplicative factor.

Using the CDF instead of the PDF

We lost some credibility at one point in the technique above: we actually should have been looking at the cumulative probability

\text{Pr}[\left|S - np \right| \geq \delta \sqrt{n}]

Instead we noticed that the Chernoff bound can be remembered by looking at the PDF (and ignoring a nonconstant factor)

p(S = np+\delta\sqrt{n})

But using the CDF instead of the PDF actually gives the same expression, and with a truly constant factor. Let’s compute:

\text{Pr}[\left|S - np \right| \geq \delta \sqrt{n}] = 2\int_{np+\delta\sqrt{n}}^\infty p(S = x) d x

= 2\int_{\delta\sqrt{n}}^\infty 1/ \sqrt{2\pi np(1-p)} \cdot \exp(-x^2 / 2np(1-p)) d x

If we break up the integral into little \sqrt{n} size pieces from \delta \sqrt{n} to \infty, the integral on a piece [k\sqrt{n}, (k+1)\sqrt{n} looks like

\int_{k\sqrt{n}}^{(k+1)\sqrt{n}} 1/ \sqrt{2\pi np(1-p)} \cdot \exp(-x^2 / 2np(1-p)) d x

\leq1/ \sqrt{2\pi np(1-p)} \int_{k\sqrt{n}}^{(k+1)\sqrt{n}} \exp(-k^2 / 2p(1-p)) d x

= 1/ \sqrt{2\pi p(1-p)} \exp(-k^2 / 2p(1-p))

The exponent increases faster than linearly,  and it is the only thing that changes in k, so this contribution to the integral goes to 0 faster than geometrically.  Therefore the integral is bounded by an infinite geometric series with initial term proportional to \exp( -\delta^2/2p(1-p)). Therefore the cumulative probability is (up to a constant) the expression in the Chernoff bound


and this time the constant is actually constant!


How to Easily Compute Matrix Derivatives, Fréchet Style

The routine, traditional interpretation of a derivative is as a slope of the tangent line. This is great for a function of a single variable, but if you have a multivariate function f: \mathbb{R}^m \to \mathbb{R}^n, how can you talk about slopes? One approach is to make a conceptual leap: a derivative is a linear approximation to your function; the tangent line itself is the derivative, rather than the slope of the tangent line. This is called the Fréchet derivative.

Suppose you have a multivariate function f: \mathbb{R}^m \to \mathbb{R}^n. At some point in the domain x_0 \in \mathbb{R}^m, we want to say that if you make a small change \epsilon \in \mathbb{R}^m around x_0, the change in f is approximately a linear function in \epsilon:

f(x_0 + \epsilon) \approx f(x_0) + Df_{x_0}(\epsilon)

Here Df_{x_0} : \mathbb{R}^m \to \mathbb{R}^n is our notation for the derivative function (which is linear, and can be represented as a matrix applied to \epsilon). Df_{x_0} could depend highly on x_0; after all, f will probably have different slopes at different points. More formally, we want a function Df_{x_0} such that

f(x_0 + \epsilon) = f(x_0) + Df_{x_0}(\epsilon) + o(\epsilon)

where o(\epsilon) is a function that “is tiny compared to \epsilon” e.g. every entry of the vector is the square of the largest entry of \epsilon.

Computing Fréchet Derivatives is Incredibly Elegant

To compute Df_{x_0}(\epsilon), just plug in x_0 + \epsilon to f, collect all the terms that are linear in \epsilon, and drop all the terms that “get small as \epsilon \to 0“. In practice, this means we drop all “higher-order terms” in which we multiply two coordinates of \epsilon. The best way to illustrate is with examples.

Example 1: f(x) = x^Tx

If we change x a little, how does x^\top x change? Here f: \mathbb{R}^n \to \mathbb{R} is f(x) = x^\top x.

f(x + \epsilon) = (x + \epsilon)^\top (x + \epsilon) = x^\top x + x^\top \epsilon + \epsilon^\top x + \epsilon^\top \epsilon

= f(x) +x^\top \epsilon + \epsilon^\top x + \epsilon^\top \epsilon

The part linear in \epsilon here is x^\top \epsilon + \epsilon^\top x, and \epsilon^\top \epsilon is tiny compared to \epsilon. So

Df_x(\epsilon) = x^\top \epsilon + \epsilon^\top x = 2x^\top \epsilon

f(x+\epsilon) \approx f(x) + 2x^\top \epsilon

If we want to see where x^\top x is minimized, the critical points are found by finding where the function Df_x is 0; that is, where we can’t move a little bit to go up or down. In this case,

\forall \epsilon.\; 2x^\top \epsilon = 0 \iff x = 0

Example 2: Derivative of the Matrix Inverse

Let’s look at a function of a matrix,

f: \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}

f(M) = M^{-1}

As we change M by a matrix X with small entries (in particular, small enough to ensure M+X is still invertible), how does (M+X)^{-1} change from M^{-1}? Just thinking about dimensions, the answer to that question should be by adding a small matrix to M^{-1}.

f(M+X) = (M+X)^{-1} \approx M^{-1} + Df_M(X)

The hard part is massaging (M+X)^{-1} to look like M^{-1} plus a linear function of X. To start,

(M+X)^{-1} = M^{-1}(I + XM^{-1})^{-1}

The matrix I + XM^{-1} can be inverted using the Neumann series, which is just a fancy name for the geometric series summation formula

{1 \over I + XM^{-1}} = I - XM^{-1} + (XM^{-1})^2 - (XM^{-1})^3 + \cdots

Note that the series converges for small enough X because XM^{-1} has all eigenvalues less than 1.

Because X is small, X^2 is negligibly tiny. For our linear approximation, we can drop the terms that are too tiny to be noticed by the linear approximation, which is all but the first two terms of this series.

{1 \over I + XM^{-1}} \approx I - XM^{-1}

M^{-1}(I + XM^{-1})^{-1} \approx M^{-1}(I - XM^{-1}) = M^{-1} - M^{-1}XM^{-1}

Remembering f(M) = M^{-1}, altogether we just computed

f(M+X) \approx f(M) -  M^{-1}XM^{-1}

Noting that this second term is linear in X, we have the derivative

Df_{M}(X) = -M^{-1}XM^{-1}

As a remark: whenever we wrote \approx, we could have written + O(X^2) to be a little more formal about carrying the higher-order terms through the computation.

Example 3: Derivative of the Log of the Determinant

Now let’s do another example of a matrix function, the log of the determinant

f: \mathbb{R}^{n \times n} \to \mathbb{R}

f(M) = \log |M|

This function arises frequently in machine learning and statistics when computing log likelihoods.

Change M by a matrix X which has small entries

f(M + X) = \log |M + X| = \log |M| |I + M^{-1}X| = \log |M| + \log |I + M^{-1}X|

Now let’s look at |I + M^{-1}X|. Each of the entries of M^{-1}X is “small”, though they’re essentially the same size as elements of X (just differing by a few constants depending on M^{-1}). The determinant of a matrix A is a sum over all permutations

|A| = \displaystyle\sum_{\sigma \in S_n} \prod_{i = 1}^n A_{i, \sigma(i)}

When performing this sum for I + M^{-1}X, if a product contains two off-diagonal terms, those will be two “small” terms from M^{-1}X multiplied together, and multiplying two small terms makes a negligible term that we drop. So we only need to sum up \sigma that include at least all but 1 diagonal elements. The only permutation that does this is the identity permutation.

|I + M^{_1}X| \approx \prod_{i = 1}^n (I + M^{-1}X)_{i,i} =\prod_{i = 1}^n (1 + M^{-1}X_{i,i})

Same trick: this sum has 2^n terms, but any term that multiples an M^{-1}X_{i,i} with a M^{-1}X_{j,j} is negligibly small. To get a term with at most one of the M^{-1}X_{j,j}, either we (1) take 1 from every factor, or (2) take 1 from all but one factor

\prod_{i = 1}^n (1 + M^{-1}X_{i,i}) \approx 1^n + 1^{n-1}M^{-1}X_{1,1} +1^{n-1}M^{-1}X_{2,2} + \cdots + 1^{n-1}M^{-1}X_{n,n}

= 1 + \text{tr}(M^{-1}X)

This is our linear approximation |I + M^{-1}X| \approx 1 +\text{tr}(M^{-1}X). To incorporate the log, note that we’re taking the log of something that is pretty much 1. The normal derivative of log at 1 is 1, and in our Fréchet derivative formulation

\log (1 + \epsilon) = \epsilon +  O(\epsilon^2)

\log |I + M^{-1}X| \approx  \log(1 + \text{tr}(M^{-1}X)) \approx\text{tr}(M^{-1}X)

And so the derivative is Df_{M}(X) = \text{tr}(M^{-1}X).

As a remark: incorporating log illustrates the composability of Fréchet derivatives. This makes sense when spoken aloud: if we have a linear approximation Df to f and a linear approximation Dg to g, a linear approximation of f \circ g is Df \circ Dg.

Other Viewpoints

Actually, Df_{x_0}(\epsilon) = J\epsilon where J is the Jacobian matrix of partial derivatives {\partial f_i \over \partial x_j}. But it’s often easier to not compute the Jacobian entirely, and keep Df_{x_0} in an “implicit” form like in Example 2.

See another description of Frechet derivatives in statistics.