Syntactic Translations of Berry-Esséen and the CLT

Probably the most studied problem in probability theory is the following: suppose $X_1, X_2, X_3, \dots, X_n$ are independent, identically distributed random variables from some distribution $\mathcal{D}$ on $\mathbb{R}$, and we look at the random variable

$S_n = X_1 + X_2 + \cdots + X_n$

How is $S_n$ distributed?

The usual answer is the central limit theorem (CLT), sharpened by the Berry-Esséen theorem: “a sum of independent variables is approximately normally distributed”. This post explains two symbolic translations of that sentence which I find easier to write down at a moment’s notice. The first takes the viewpoint of an “asymptotic approximation of $S_n$”,

$S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + O(1)$

Here $\mu$ is the mean of $\mathcal{D}$ and $\sigma$ is the standard deviation of $\mathcal{D}$. The second is

$S_n = Y_1 + Y_2 + \cdots + Y_n + O(1)$

where the $Y_i$ are i.i.d. Gaussian random variables with the same mean and variance as the $X_i$, $Y_i \sim \mathcal{N}(\mu, \sigma^2)$. This can be thought of as an “invariance principle” or “replacement lemma” (and this is the viewpoint taken by Lindeberg in his proof of the CLT). The invariance principle is now a tool used in Boolean Fourier analysis.

Crucial remark: Unfortunately, the equality in these “syntactic forms” is NOT convergence in distribution. We’ve “blown up” errors by a factor of $\sqrt{n}$ relative to the convergence guaranteed by the CLT. Dividing both sides by $\sqrt{n}$ normalizes both sides to have bounded variance and recovers convergence in distribution (or the stronger convergence guaranteed by an appropriate CLT).

Asymptotics for $S_n$

The law of large numbers says that, if $\mu$ is finite, “the sample mean converges to the true mean”, i.e.

$S_n / n \to \mu$

In our notation, this is a “linear approximation to $S_n$”,

$S_n = n\mu + o(n)$

The central limit theorem refines this by saying that, in fact, the sample mean is about $\Theta(\frac{1}{\sqrt{n}})$ away from the true mean. If $\mu$ and $\sigma$ are finite,

$\frac{(S_n - n \mu)}{\sqrt{n}} \to \mathcal{N}(0,\sigma^2)$

In our notation, this is a “$\sqrt{n}$ approximation to $S_n$”,

$S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + o(\sqrt{n})$
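A quick simulation makes the $\sqrt{n}$ scale concrete. This is only a sketch under my own assumptions: $\mathcal{D}$ is a fair six-sided die (so $\mu = 3.5$ and $\sigma^2 = 35/12$), and the helper names `simulate_centered_sums` and `sample_std` are made up for illustration. It checks that $(S_n - n\mu)/\sqrt{n}$ has spread close to $\sigma$.

```python
import math
import random

def simulate_centered_sums(n, trials, seed=0):
    """Return `trials` samples of (S_n - n*mu) / sqrt(n) for fair six-sided dice."""
    rng = random.Random(seed)
    mu = 3.5  # mean of a fair die
    samples = []
    for _ in range(trials):
        s_n = sum(rng.randint(1, 6) for _ in range(n))
        samples.append((s_n - n * mu) / math.sqrt(n))
    return samples

def sample_std(values):
    """Unbiased sample standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

sigma = math.sqrt(35.0 / 12.0)  # standard deviation of a fair die
samples = simulate_centered_sums(n=400, trials=2000)
print("empirical std:", sample_std(samples), "sigma:", sigma)
```

The two printed numbers should agree up to sampling noise, which is the statement that the fluctuation of $S_n$ around $n\mu$ lives at scale $\sigma\sqrt{n}$.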

The Berry-Esséen theorem quantifies the rate of convergence to the normal distribution. If $\rho = E(|X_1 - \mu|^3)$ denotes (an upper bound on) the third absolute central moment of the distribution $\mathcal{D}$, which we assume to be finite,

$|S_n - n\mu - \sqrt{n}\mathcal{N}(0,\sigma^2)| \leq O(\frac{\rho}{\sigma^2})$

In our notation, this is an improvement up to a constant to the asymptotic distribution of $S_n$,

$S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + O(\frac{\rho}{\sigma^2})$

It should be noted that the Berry-Esséen theorem is tight up to the big-Oh. The binomial distribution achieves this rate of convergence; see the end of these lecture notes for a proof.
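The tightness claim can be probed numerically. The classical statement of the theorem bounds the largest gap between the CDF of $S_n$ and the CDF of $\mathcal{N}(n\mu, n\sigma^2)$ by a constant times $\rho / (\sigma^3 \sqrt{n})$. The sketch below computes that gap exactly for the binomial distribution and rescales it by $\sqrt{n}$. The function names are mine, and the choice $p = 1/2$ is just an example.

```python
import math

def normal_cdf(x, mean, std):
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def binomial_pmf(n, k, p):
    # Evaluate in log space to avoid overflow and underflow for large n.
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)

def kolmogorov_distance(n, p=0.5):
    """sup_x |P(S_n <= x) - P(N(np, np(1-p)) <= x)| for S_n ~ Binomial(n, p)."""
    mean, std = n * p, math.sqrt(n * p * (1.0 - p))
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        gauss = normal_cdf(k, mean, std)
        worst = max(worst, abs(cdf - gauss))  # just below the jump at k
        cdf += binomial_pmf(n, k, p)
        worst = max(worst, abs(cdf - gauss))  # at the jump itself
    return worst

for n in (100, 400, 1600):
    print(n, math.sqrt(n) * kolmogorov_distance(n))
```

If the $1/\sqrt{n}$ rate is tight for the binomial, the rescaled distances printed here should stay roughly constant instead of shrinking toward zero.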

Invariance Principles

The previous analysis ends with the distribution $n \mu + \sqrt{n} \mathcal{N}(0, \sigma^2)$. We can incorporate everything into one Gaussian to produce the equivalent distribution $\mathcal{N}(n\mu, n\sigma^2)$. Interestingly, because the sum of independent Gaussians is again Gaussian, this distribution has the same PDF as

$Y_1 + Y_2 + \cdots + Y_n$

for independent Gaussian random variables $Y_i \sim \mathcal{N}(\mu, \sigma^2)$. This leads us to the equality

$S_n =X_1 + X_2 + \cdots + X_n = Y_1 + Y_2 + \cdots + Y_n + O(1)$

in which we’ve simply replaced each $X_i$ by a Gaussian random variable with the same mean and variance.
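One way to see the replacement at work is through a smooth test function, after normalizing by $\sqrt{n}$ as the crucial remark demands. The sketch below assumes the $X_i$ are uniform $\pm 1$ signs and uses the test function $\cos$ (both are my choices for illustration), because then both expectations have closed forms: $E[\cos(t S_n / \sqrt{n})] = \cos(t/\sqrt{n})^n$ for signs and $e^{-t^2/2}$ for standard Gaussians.

```python
import math

def sign_side(t, n):
    """E[cos(t * (X_1 + ... + X_n) / sqrt(n))] for independent uniform +-1 signs X_i."""
    return math.cos(t / math.sqrt(n)) ** n

def gaussian_side(t):
    """E[cos(t * (Y_1 + ... + Y_n) / sqrt(n))] for independent Y_i ~ N(0, 1).

    The normalized Gaussian sum is exactly N(0, 1), so this does not depend on n.
    """
    return math.exp(-t * t / 2.0)

for n in (10, 100, 1000, 10000):
    print(n, abs(sign_side(1.0, n) - gaussian_side(1.0)))
```

The printed gaps shrink as $n$ grows, which is the replacement statement seen through one particular test function rather than through CDFs.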

As noted in Remark 29 here, one can improve the constant term by changing the replacement variables. In particular, it can be made $o(1)$ if the first, second, and third moments of the $Y_i$ agree with the $X_i$.

The idea of “replacement invariance” surfaces in theoretical CS in the context of Boolean Fourier analysis. Here we generalize the summation of Boolean ($\{\pm1\}$-valued) variables $X_1 + X_2 + \cdots + X_n$ to an arbitrary real-valued function on the Boolean cube

$f(X_1, X_2, \dots, X_n)$

The invariance principle states that the random variable $f(X_1, X_2, \dots, X_n)$, for $f: \{\pm 1\}^n \to \mathbb{R}$ and $X_i$ drawn uniformly from $\{\pm 1\}$, is close in distribution to the random variable

$f(Y_1, Y_2, \dots, Y_n)$

for $Y_i \sim \mathcal{N}(0, 1)$ (assuming we normalize both sides to have variance $1$). In this case the “closeness” is determined by the maximum influence of a variable $X_i$ on the value of $f$, as well as the complexity of $f$ (its degree as a multilinear polynomial); see the previously linked lecture notes for an exact quantitative statement.
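The role of maximum influence already shows up for degree-1 functions $f(x) = \sum_i a_i x_i$ with $\sum_i a_i^2 = 1$, where the influence of $X_i$ is $a_i^2$. The sketch below (hypothetical helper names, and a brute-force enumeration that is only feasible for small $n$) compares a “dictator” $f(x) = x_1$ against the balanced sum $f(x) = (x_1 + \cdots + x_{16})/4$ by computing the largest CDF gap to $\mathcal{N}(0,1)$.

```python
import math
from itertools import product

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def distance_to_gaussian(weights):
    """Kolmogorov distance between sum_i a_i X_i (uniform +-1 signs X_i) and N(0, 1).

    Assumes sum_i a_i^2 == 1, so the Gaussian counterpart sum_i a_i Y_i is N(0, 1).
    """
    values = sorted(sum(a * x for a, x in zip(weights, signs))
                    for signs in product((-1, 1), repeat=len(weights)))
    total = len(values)
    worst = 0.0
    for i, v in enumerate(values):
        g = std_normal_cdf(v)
        # Compare the Gaussian CDF with the empirical CDF on both sides of v.
        worst = max(worst, abs(i / total - g), abs((i + 1) / total - g))
    return worst

n = 16
dictator = [1.0] + [0.0] * (n - 1)  # all of the influence sits on X_1
balanced = [0.25] * n               # influence 1/16 on every variable
print("dictator:", distance_to_gaussian(dictator))
print("balanced:", distance_to_gaussian(balanced))
```

The dictator stays far from Gaussian no matter how many dummy variables are added, while the balanced function is much closer; this is the qualitative dependence on maximum influence, not the quantitative statement from the lecture notes.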

Wreath Products: Sum and Product Actions

During Laci Babai’s course on graph isomorphism, one of the tools we encountered is the wreath product. Here I’ll give some intuitive descriptions of the wreath product $L \wr M$.

To make sure we’re all on the same page, the wreath product $L \wr M$ for an arbitrary group $L$ and a permutation group $M \leq S_k$ is the semidirect product $(L^k) \rtimes M$, where each element of $M$ acts as an automorphism of $L^k$ by permuting the coordinates.

That is, we fix some number $k$ and some group of permutations on $[k]$, $M \leq S_k$. The group we construct takes $k$ disjoint and independent copies of $L$, namely $L^k$, but also allows you to make “limited rearrangements according to $M$”. There are two distinct ideas going on here; the semidirect product, once you tell it how the two groups should interact, joins those two ideas into a single group.
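To make the definition concrete, here is a minimal sketch of $L \wr M$ as a data structure. Everything specific is my own assumption for illustration: $L = \mathbb{Z}_m$ written additively and acting on $\{0, \dots, m-1\}$ by rotation, elements stored as pairs `(l, sigma)`, and the function names `wreath_elements`, `act`, and `multiply`. The multiplication rule is chosen so that it matches composition of the action on pairs (copy index, point).

```python
from itertools import product, permutations

def wreath_elements(L, M, k):
    """All pairs (l, sigma) with l in L^k and sigma in M."""
    return [(l, sigma) for l in product(L, repeat=k) for sigma in M]

def act(g, point, m):
    """Act on a pair (i, w): copy i moves to sigma(i), and l_i shifts w within Z_m."""
    l, sigma = g
    i, w = point
    return (sigma[i], (w + l[i]) % m)

def multiply(h, g, m):
    """The element 'apply g first, then h'."""
    (lh, sh), (lg, sg) = h, g
    k = len(sg)
    l = tuple((lg[i] + lh[sg[i]]) % m for i in range(k))
    sigma = tuple(sh[sg[i]] for i in range(k))
    return (l, sigma)

m, k = 3, 2
M = list(permutations(range(k)))  # here M is all of S_2
elements = wreath_elements(range(m), M, k)
print(len(elements))  # |L|^k * |M| = 3^2 * 2 = 18

# The product of two elements is again an element, and acting with the
# product agrees with acting twice.
element_set = set(elements)
points = [(i, w) for i in range(k) for w in range(m)]
closed = all(multiply(h, g, m) in element_set for h in elements for g in elements)
compatible = all(act(multiply(h, g, m), p, m) == act(h, act(g, p, m), m)
                 for h in elements for g in elements for p in points)
print(closed, compatible)
```

The twist in `multiply` (using `lh[sg[i]]` instead of `lh[i]`) is exactly the “how the two groups should interact” information that the semidirect product needs.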

$L\wr S_k$ is a Graph Automorphism Group

There are two “trivial” wreath products, $L \wr 1$ (where $1$ denotes the trivial group) and $L \wr S_k$. $L \wr 1$ is $k$ disjoint and independent versions of $L$, with no interdependencies introduced by $M$; this is just $L^k$.

Let $G$ be a connected graph with automorphism group $L$. Claim: if you stick $k$ disjoint copies of $G$ next to each other, this new graph has automorphism group $L \wr S_k$. To see this, note that some automorphisms apply elements of $L$ “in parallel” to the different copies of $G$; these correspond to $L^k$. But any rearrangement of the copies of $G$ is also an automorphism: all copies are isomorphic, so they can be permuted without restriction by $S_k$. Connectedness ensures that an automorphism must map each copy onto a whole copy, so there is nothing else.
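The claim is easy to sanity-check by brute force on tiny graphs, at least at the level of group orders: $|L \wr S_k| = |L|^k \cdot k!$. The sketch below (helper names are mine) counts automorphisms of two disjoint copies of a path on three vertices and of a triangle, both of which are connected.

```python
from itertools import permutations
from math import factorial

def disjoint_copies(edges, n_vertices, k):
    """Edge set of k disjoint copies of a graph on vertices 0..n_vertices-1."""
    return {frozenset((u + c * n_vertices, v + c * n_vertices))
            for c in range(k) for (u, v) in edges}

def count_automorphisms(edge_set, n_total):
    """Count vertex permutations that map every edge to an edge."""
    edges = [tuple(e) for e in edge_set]
    return sum(
        1 for perm in permutations(range(n_total))
        if all(frozenset((perm[u], perm[v])) in edge_set for u, v in edges)
    )

path = [(0, 1), (1, 2)]              # automorphism group of order 2
triangle = [(0, 1), (1, 2), (0, 2)]  # automorphism group of order 6
k = 2
for name, edges, aut_order in (("path", path, 2), ("triangle", triangle, 6)):
    found = count_automorphisms(disjoint_copies(edges, 3, k), 3 * k)
    print(name, found, aut_order ** k * factorial(k))
```

Matching orders do not prove the groups are isomorphic, but together with the argument above they are a reassuring check.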

Product Actions of Wreath Products

The previous paragraph shows that, if $L$ acts on a set $\Omega$, the wreath product $L \wr M$ acts naturally on $k\cdot \Omega$, the disjoint union of $k$ copies of $\Omega$. This is called the sum action (or “imprimitive action”, because it’s nearly always an imprimitive group action) of the wreath product. There is another natural action, called the product action (or “primitive action”). Here the domain is the set $\Omega^k$, and the action is defined by (1) applying elements of $L^k$ coordinate-wise and (2) permuting the coordinates with $M$.

I think of this like a “simultaneous tracking” version of the sum action. In the sum action, we kept track of one point/vertex across $k$ copies of our graphs, and looked at where that point went when hit with a group element. In the product action, we track one point from each of the $k$ graphs simultaneously, and look at how as a whole those elements move.
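Here is a small sketch of the two actions side by side, under assumptions of my own: $L = \mathbb{Z}_3$ rotating $\Omega = \{0, 1, 2\}$, $k = 2$, and a wreath product element stored as a pair `(l, sigma)` of shifts and a permutation of the copies. The function names are hypothetical.

```python
from itertools import product

def sum_action(g, point, m):
    """Track one point (i, w): copy i moves to sigma(i), and l_i rotates w."""
    l, sigma = g
    i, w = point
    return (sigma[i], (w + l[i]) % m)

def product_action(g, point, m):
    """Track one point per copy: rotate each coordinate, then permute coordinates."""
    l, sigma = g
    k = len(sigma)
    image = [None] * k
    for i in range(k):
        image[sigma[i]] = (point[i] + l[i]) % m
    return tuple(image)

m, k = 3, 2
g = ((1, 2), (1, 0))  # rotate copy 0 by 1, copy 1 by 2, then swap the copies

sum_domain = [(i, w) for i in range(k) for w in range(m)]  # k * |Omega| = 6 points
product_domain = list(product(range(m), repeat=k))         # |Omega|^k = 9 points

print(sum_action(g, (0, 0), m))      # (1, 1)
print(product_action(g, (0, 0), m))  # (2, 1)
print(len({sum_action(g, p, m) for p in sum_domain}),
      len({product_action(g, p, m) for p in product_domain}))
```

The same group element permutes 6 points in the sum action and 9 points in the product action; in the second case each point already records where one vertex from every copy sits.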

$H \wr K$ contains all extensions of $H$ by $K$

I would be remiss to mention wreath products without mentioning the Kaluzhnin-Krasner theorem (English version). We say that the group $G$ is an extension of $H$ by $K$ if there is an exact sequence

$1 \rightarrow H \rightarrow G \rightarrow K \rightarrow 1$

Theorem (Kaluzhnin-Krasner, 1951) If $G$ is an extension of $H$ by $K$, then $G$ is isomorphic to a subgroup of $H \wr K$, where $K$ is viewed as a permutation group acting on itself by translation.

That is, $H \wr K$ is a group that’s big enough to contain all extensions of the two groups. In a sense that can probably be made more formal through order theory, $H \wr K$ is an “upper bound” on joining $H$ and $K$ together. Not bad!
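The smallest interesting case can be checked by hand or by machine. Take both groups to be $\mathbb{Z}_2$: the extensions are $\mathbb{Z}_4$ and $\mathbb{Z}_2 \times \mathbb{Z}_2$, and $\mathbb{Z}_2 \wr \mathbb{Z}_2$ has order $2^2 \cdot 2 = 8$. The sketch below realizes the wreath product through its sum action on four points (the generator names and helpers are mine) and looks for both extensions inside it.

```python
def compose(p, q):
    """Permutation 'apply q first, then p', with permutations stored as tuples."""
    return tuple(p[q[i]] for i in range(len(q)))

def generate(generators):
    """Closure of a list of permutations under composition."""
    identity = tuple(range(len(generators[0])))
    group = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group

def order(p):
    """Smallest n >= 1 with p^n equal to the identity."""
    identity = tuple(range(len(p)))
    n, q = 1, p
    while q != identity:
        q = compose(p, q)
        n += 1
    return n

# Points 0, 1 form the first copy of {0, 1}; points 2, 3 form the second copy.
flip_a = (1, 0, 2, 3)       # nontrivial element of Z_2 on the first copy
flip_b = (0, 1, 3, 2)       # nontrivial element of Z_2 on the second copy
swap_copies = (2, 3, 0, 1)  # the top Z_2 exchanging the two copies

wreath = generate([flip_a, flip_b, swap_copies])
four_cycle = compose(flip_a, swap_copies)
klein = {(0, 1, 2, 3), flip_a, flip_b, compose(flip_a, flip_b)}

print(len(wreath))                                          # 8
print(order(four_cycle))                                    # 4, so it generates a copy of Z_4
print(klein <= wreath, max(order(p) for p in klein) <= 2)   # a copy of Z_2 x Z_2
```

So the one group of order 8 contains both ways of gluing two copies of $\mathbb{Z}_2$ together, which is the theorem in miniature.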