Syntactic Translations of Berry-Esséen and the CLT

Probably the most studied problem in probability theory is the following: suppose X_1, X_2, X_3, \dots, X_n are independent, identically distributed random variables from some distribution \mathcal{D} on \mathbb{R}, and we look at the random variable

S_n = X_1 + X_2 + \cdots + X_n

How is S_n distributed?

Typically the answer is the central limit theorem (CLT), amplified by the Berry-Esséen theorem: “a sum of independent random variables is approximately normally distributed”. This post explains two symbolic translations of that sentence which I find easier to write down at a moment’s notice. The first takes the viewpoint of an “asymptotic approximation of S_n”,

S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + O(1)

Here \mu is the mean of \mathcal{D} and \sigma is the standard deviation of \mathcal{D}. The second is

S_n = Y_1 + Y_2 + \cdots + Y_n + O(1)

where the Y_i are i.i.d. Gaussian random variables with the same mean and variance as the X_i, Y_i \sim \mathcal{N}(\mu, \sigma^2). This can be thought of as an “invariance principle” or “replacement lemma” (and this is the viewpoint taken by Lindeberg in his proof of the CLT). The invariance principle is now a tool used in Boolean Fourier analysis.

Crucial remark: Unfortunately, the equality of these “syntactic forms” is NOT a statement about convergence in distribution; we’ve “blown up” the errors by a factor of \sqrt{n} relative to the convergence guaranteed by the CLT. Dividing both sides by \sqrt{n} normalizes both sides to have finite variance and shrinks the error term to o(1), so the difference of the two sides vanishes in probability, which is stronger than (and recovers) the convergence in distribution guaranteed by the CLT.

Asymptotics for S_n

The law of large numbers says that, if \mu is finite, “the sample mean converges to the true mean”, i.e.

S_n / n \to \mu

In our notation, this is a “linear approximation to S_n”,

S_n = n\mu + o(n)
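
As a quick numerical sanity check, here is a minimal simulation sketch (assuming NumPy; the exponential distribution and the sample sizes are arbitrary choices of mine): the gap between the sample mean and \mu should shrink as n grows.

import numpy as np

rng = np.random.default_rng(0)
mu = 1.0  # mean of the exponential distribution used below

# The sample mean S_n / n should approach mu as n grows.
for n in [10, 1_000, 100_000]:
    S_n = rng.exponential(scale=mu, size=n).sum()
    print(n, abs(S_n / n - mu))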

The central limit theorem refines this by saying that, in fact, the sample mean is typically \Theta(\frac{1}{\sqrt{n}}) away from the true mean. If \mu and \sigma are finite,

\frac{(S_n - n \mu)}{\sqrt{n}} \to \mathcal{N}(0,\sigma^2)

In our notation, this is a “\sqrt{n} approximation to S_n”,

S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + o(\sqrt{n})
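
One can watch the \sqrt{n} scale concretely in simulation; a sketch under the same assumptions as before (NumPy, exponential samples, so \mu = \sigma = 1): the recentered, rescaled sums should have mean near 0 and standard deviation near \sigma.

import numpy as np

rng = np.random.default_rng(0)
n, trials = 10_000, 2_000
mu, sigma = 1.0, 1.0  # mean and standard deviation of the exponential below

# Each row is one realization of X_1, ..., X_n; row sums are samples of S_n.
S = rng.exponential(scale=mu, size=(trials, n)).sum(axis=1)
Z = (S - n * mu) / np.sqrt(n)

# Z should look like N(0, sigma^2): mean near 0, standard deviation near sigma.
print(Z.mean(), Z.std())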

The Berry-Esséen theorem strengthens the convergence to the normal distribution. If \rho = E(|X_1 - \mu|^3) denotes (an upper bound on) the third central absolute moment of the distribution \mathcal{D}, which we assume to be finite,

|S_n - n\mu - \sqrt{n}\mathcal{N}(0,\sigma^2)| \leq O(\frac{\rho}{\sigma^2})

In our notation, this improves the error in the asymptotic approximation of S_n to a constant,

S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + O(\frac{\rho}{\sigma^2})

It should be noted that the Berry-Esséen theorem is tight up to the big-Oh. The binomial distribution achieves this rate of convergence; see the end of these lecture notes for a proof.
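
To see the rate numerically, here is a sketch of the Kolmogorov distance between Binomial(n, 1/2) and its normal approximation (assuming NumPy and SciPy; since a step CDF deviates most from a continuous one at its jumps, it suffices to check both sides of each jump point):

import numpy as np
from scipy.stats import binom, norm

# Kolmogorov distance between Binomial(n, 1/2) and N(n/2, n/4).
# The sup is attained at the jump points k = 0, 1, ..., n.
for n in [100, 400, 1600]:
    k = np.arange(n + 1)
    gauss = norm.cdf(k, loc=n / 2, scale=np.sqrt(n) / 2)
    dist = max(np.abs(binom.cdf(k, n, 0.5) - gauss).max(),
               np.abs(binom.cdf(k - 1, n, 0.5) - gauss).max())
    print(n, dist, dist * np.sqrt(n))  # last column should hover near a constant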

Invariance Principles

The previous analysis ends with the distribution n \mu + \sqrt{n} \mathcal{N}(0, \sigma^2). We can incorporate everything into one Gaussian to produce the equivalent distribution \mathcal{N}(n\mu, n\sigma^2). Interestingly, because the sum of independent Gaussians is again Gaussian, this distribution has the same PDF as

Y_1 + Y_2 + \cdots + Y_n

for independent Gaussian random variables Y_i \sim \mathcal{N}(\mu, \sigma^2). This leads us to the equality

S_n = X_1 + X_2 + \cdots + X_n = Y_1 + Y_2 + \cdots + Y_n + O(1)

in which we’ve simply replaced each X_i by a Gaussian random variable with the same mean and variance.
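
A quick empirical sketch of the replacement (NumPy again; the exponential \mathcal{D} is an arbitrary choice): the two sums should have nearly coinciding quantiles.

import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000, 20_000
mu, sigma = 1.0, 1.0  # mean and standard deviation of the exponential below

# Sum of n i.i.d. exponentials vs. sum of n i.i.d. matching Gaussians.
S = rng.exponential(scale=mu, size=(trials, n)).sum(axis=1)
T = rng.normal(loc=mu, scale=sigma, size=(trials, n)).sum(axis=1)

# A few empirical quantiles of the two distributions, for comparison.
qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.quantile(S, qs))
print(np.quantile(T, qs))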

As noted in Remark 29 here, one can improve the constant term by changing the replacement variables. In particular, it can be made o(1) if the first, second, and third moments of the Y_i agree with those of the X_i.

The idea of “replacement invariance” surfaces in theoretical CS in the context of Boolean Fourier analysis. Here we generalize the sum of Boolean (\{\pm1\}-valued) variables X_1 + X_2 + \cdots + X_n to an arbitrary Boolean function

f(X_1, X_2, \dots, X_n)

The invariance principle states that the random variable f(X_1, \dots, X_n), for f: \{\pm 1\}^n \to \mathbb{R} and the X_i drawn uniformly from \{\pm 1\}, is close in distribution to the random variable

f(Y_1, Y_2, \dots, Y_n)

for Y_i \sim \mathcal{N}(0, 1) (assuming we normalize both sides to have variance 1). In this case the “closeness” is determined by the maximum influence of a variable X_i on the value of f, as well as the complexity of f (its degree as a multilinear polynomial); see the previously linked lecture notes for an exact quantitative statement.
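
As a toy illustration (a sketch, not the quantitative theorem; the degree-2 polynomial below is my own arbitrary choice, with every influence O(1/n)), replacing uniform \pm 1 inputs by Gaussians barely moves the distribution:

import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000, 20_000

def f(x):
    # A degree-2 multilinear polynomial with small influences:
    # f(x) = (x_1 x_2 + x_2 x_3 + ... + x_{n-1} x_n) / sqrt(n - 1),
    # normalized so that both versions below have variance 1.
    return (x[:, :-1] * x[:, 1:]).sum(axis=1) / np.sqrt(n - 1)

X = rng.choice([-1.0, 1.0], size=(trials, n))  # uniform Boolean inputs
Y = rng.normal(size=(trials, n))               # standard Gaussian inputs

# The two distributions should have nearly coinciding quantiles.
qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.quantile(f(X), qs))
print(np.quantile(f(Y), qs))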

Wreath Products: Sum and Product Actions

During Laci Babai’s course on graph isomorphism, one of the tools we encountered is the wreath product. Here I’ll give some intuitive descriptions of the wreath product L \wr M.

To make sure we’re all on the same page: the wreath product L \wr M, for an arbitrary group L and a permutation group M \leq S_k, is the semidirect product L^k \rtimes M, where M acts on L^k by permuting the coordinates.

That is, we fix some number k and some group of permutations M \leq S_k on [k]. The group we construct takes k disjoint and independent copies of L, namely L^k, but also allows “limited rearrangements according to M”. There are two distinct ideas going on here, and it’s the semidirect product that joins them into a single group once you tell it how the two groups should interact.
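
To make the multiplication rule concrete, here is a minimal sketch in Python (my own modeling choices: L is \mathbb{Z}_3 under addition, a permutation is a tuple m with m[i] = m(i), and an element of L \wr M is a pair (a, m)):

j, k = 3, 3  # L = Z_3 (integers mod 3), with k coordinates to permute

def act(m, b):
    # m permutes the coordinates of b in L^k: (m . b)[m[i]] = b[i].
    out = [0] * len(b)
    for i, bi in enumerate(b):
        out[m[i]] = bi
    return tuple(out)

def mul(g, h):
    # Semidirect-product multiplication: (a, m)(b, m') = (a + m.b, m o m').
    (a, m), (b, mp) = g, h
    ab = tuple((ai + bi) % j for ai, bi in zip(a, act(m, b)))
    comp = tuple(m[mp[i]] for i in range(len(m)))
    return (ab, comp)

# Example: a coordinate-wise shift combined with a cyclic coordinate permutation.
g = ((1, 0, 2), (1, 2, 0))  # a in (Z_3)^3, m the cycle i -> i+1 mod 3
h = ((0, 1, 0), (0, 1, 2))  # a pure L^3 element (identity permutation)
print(mul(g, h))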

L \wr S_k is a Graph Automorphism Group

There are two “trivial” wreath products, L \wr 1 (where 1 denotes the trivial group) and L \wr S_k. L \wr 1 is k disjoint and independent versions of L, with no interdependencies introduced by M; this is just L^k.

Let G be a connected graph with automorphism group L (connectivity ensures that an automorphism can’t blend distinct copies together). Claim: if you stick k disjoint copies of G next to each other, this new graph has automorphism group L \wr S_k. To see this, note that some automorphisms apply elements of L “in parallel” to the different copies of G. These automorphisms correspond to L^k. But any rearrangement of the copies of G is also an automorphism. All copies are isomorphic, so rearranging can be done without restriction by S_k.
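
A brute-force check of the claim on a tiny example (pure Python; P_3 is the path on 3 vertices, with automorphism group \mathbb{Z}_2, so two disjoint copies should have |\mathbb{Z}_2 \wr S_2| = 2^2 \cdot 2! = 8 automorphisms):

from itertools import permutations

# Two disjoint copies of the path 0-1-2 (vertices 0..2 and 3..5).
edges = {frozenset(e) for e in [(0, 1), (1, 2), (3, 4), (4, 5)]}

# Count the vertex permutations that map edges to edges.
count = 0
for p in permutations(range(6)):
    if all(frozenset((p[u], p[v])) in edges for u, v in map(tuple, edges)):
        count += 1
print(count)  # 8 = |Z_2 wr S_2|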

Product Actions of Wreath Products

The previous section shows that, if L acts on a set \Omega, the wreath product L \wr M acts naturally on k \cdot \Omega. This is called the sum action (or “imprimitive action”, because it’s nearly always an imprimitive group action) of the wreath product. There is another natural action, called the product action (or “primitive action”). Here the domain is the set \Omega^k, and the action is defined by (1) applying elements of L^k coordinate-wise and (2) permuting the coordinates with M.

I think of this like a “simultaneous tracking” version of the sum action. In the sum action, we kept track of one point/vertex across the k copies of our graph, and looked at where that point went when hit with a group element. In the product action, we track one point from each of the k graphs simultaneously, and look at how those points move as a whole.
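
Both actions are easy to write down under the same modeling choices as the multiplication sketch above (a hedged illustration; the order of the two steps is chosen to be compatible with that multiplication rule):

j, k = 3, 3  # L = Z_3 acting on Omega = {0, 1, 2} by addition; k copies

def sum_action(g, point):
    # Sum action on k disjoint copies of Omega: a point is (copy c, omega).
    # m relocates the copy; the coordinate of a at the new copy acts on omega.
    (a, m), (c, w) = g, point
    return (m[c], (w + a[m[c]]) % j)

def product_action(g, omegas):
    # Product action on Omega^k: permute the k coordinates with m,
    # then act coordinate-wise with a.
    (a, m) = g
    out = [0] * k
    for i, w in enumerate(omegas):
        out[m[i]] = w
    return tuple((wi + ai) % j for wi, ai in zip(out, a))

g = ((1, 0, 2), (1, 2, 0))
print(sum_action(g, (0, 2)))         # track one point in one copy
print(product_action(g, (0, 2, 1)))  # track one point from each copy jointly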

H \wr K contains all extensions of H by K

It would be remiss to discuss wreath products without mentioning the Kaluzhnin-Krasner theorem (English version). We say that the group G is an extension of H by K if there is a short exact sequence

1 \rightarrow H \rightarrow G \rightarrow K \rightarrow 1

Theorem (Kaluzhnin-Krasner, 1951) If G is an extension of H by K, then G is isomorphic to a subgroup of H \wr K.

That is, H \wr K is a group that’s big enough to contain all extensions of the two groups. In a sense that can probably be made more formal through order theory, H \wr K is an “upper bound” on joining H and K together. Not bad!
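
A tiny instance, checkable by hand or by machine (a sketch under the same element representation as above): \mathbb{Z}_4 is an extension of \mathbb{Z}_2 by \mathbb{Z}_2, and indeed \mathbb{Z}_2 \wr \mathbb{Z}_2, a group of order 8, contains an element of order 4.

# Z_2 wr Z_2 has order 8; the element below has order 4, exhibiting a
# copy of Z_4 (an extension of Z_2 by Z_2) inside the wreath product.
def mul(g, h):
    (a, m), (b, mp) = g, h
    bm = [0, 0]
    for i, bi in enumerate(b):
        bm[m[i]] = bi  # m permutes the coordinates of b
    ab = tuple((ai + bi) % 2 for ai, bi in zip(a, bm))
    comp = tuple(m[mp[i]] for i in range(2))
    return (ab, comp)

e = ((0, 0), (0, 1))  # identity: zero base element, identity permutation
g = ((1, 0), (1, 0))  # a base element combined with the coordinate swap
x, order = g, 1
while x != e:
    x = mul(x, g)
    order += 1
print(order)  # 4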