Recursion vs. Computability, and Kleene’s First Recursion Theorem

The Kleene recursion theorems are two basic (and often confused) results in computability theory. The first theorem guarantees that recursive definitions make sense, while the second one shows (among other things) the existence of quines. This post will explain the first recursion theorem.

Recursion vs. Computability

There are many equivalent definitions of the class of “computable functions” (cf. the Church-Turing thesis). The definition used in the Kleene recursion theorems, and in recursion theory in general, uses the mu minimization operator. In contrast, a “programmer-friendly” definition of computable functions says that computable functions are exactly the functions computed in one’s favorite programming language. Though they define the same class, “recursive” functions and “computable” functions have different definitions, and coupled with the fact that “recursion” is a common tool in computer programming, this has led to some confusion (see this excellent Stack Exchange answer for some discussion). The Kleene recursion theorem is a statement from the “recursion theory” perspective; on the way to it we’ll state it in the computability perspective, and we’ll start by talking about it from the mathematical perspective. I think the discussion of this theorem emphasizes the clarity that can be gained from keeping in mind real-world programming languages even when thinking about “mathematically friendly” objects like the μ-recursive functions.

Say we want a function f that satisfies some recursive equalities,

f(0) = 0\\ f(1) = 1\\ f(n) = f(n-1) + f(n-2)

It’s not a priori clear that such an f exists. There is a slightly more primitive mathematical object that always exists, though, which is the recursive operator \Phi : g \mapsto f obtained by replacing recursive calls with calls to g,

f(0) = 0\\ f(1) = 1\\ f(n) = g(n - 1) + g(n - 2)

Given g, we can always define a new function f in this way. Now, notice that the functions f satisfying our recurrence are exactly the fixed points of our recursive operator \Phi. Kleene’s first recursion theorem says that recursive operators always have fixed points. In this way, the theorem says that our recurrence equations do, in fact, always have solutions.

Theorem (Kleene’s First Recursion Theorem, mathematical version)
Every (mathematical) recursive operator has a fixed point.

In the mathematical case, we require that a recursive operator evaluate g only on smaller natural numbers. This condition says that the recurrence for a desired fixed point f: \mathbb{N}\to \mathbb{N} uses only previous values of f. In this case we can define f by induction, as in an ordinary recursive definition. \hfill\square

Kleene’s Recursion Theorem in Any Programming Language

Let’s now talk about recursive operators from the computability/programming language perspective. Say we’re writing a recursive function in our favorite language,

f(0) = 0
f(1) = 1
f(n) = f(n - 1) + f(n - 2)

As before we have a natural notion of “recursive operator” \Phi : g \mapsto f, which is just a higher-order function,

f(0) = 0
f(1) = 1
f(n) = g(n - 1) + g(n - 2)
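This higher-order-function view is easy to make concrete. Here’s a minimal Python sketch (the names Phi and fib are mine, not canonical):

```python
# A recursive operator as a higher-order function: Phi takes a function g
# and returns a new function f whose "recursive calls" go to g instead.
def Phi(g):
    def f(n):
        if n == 0:
            return 0
        if n == 1:
            return 1
        return g(n - 1) + g(n - 2)
    return f

# A fixed point of Phi is the Fibonacci function: Phi applied to it returns
# a function with the same input/output behavior.
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

assert all(Phi(fib)(n) == fib(n) for n in range(15))
```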

Theorem (Kleene’s First Recursion Theorem, programming language version)
Every (programming language) recursive operator has a fixed point.

This proof is rather different from the mathematical case, and it will be different from the full Kleene recursion theorem, but it is philosophically interesting. Compile the block of code

f(0) = 0
f(1) = 1
f(n) = f(n - 1) + f(n - 2)

The compiler automatically makes this a valid definition of the function f. Note that we didn’t need the restriction we had in the mathematical case, that recursive calls are only made on smaller natural numbers. Consequently the output of compilation might be a partial function (undefined, \bot, on some inputs), but it’s a well-defined mathematical object. \square

Kleene’s First Recursion Theorem

In recursion theory we have a different definition of recursive functions via the mu minimization operator. Let R be the set of μ-recursive functions. This calls for a new definition of a recursive operator, which seems like a hard task: we don’t have any “code” to speak of, yet a recursive operator should be a special case of a function \Phi :R \to R. We can extract one key mathematical property of programming-language recursive operators to be our guide: we require that a recursive operator \Phi :R \to R allow \Phi(g)(n) to be computed from finitely many values of g (notice that every programming-language recursive operator, i.e. higher-order function, has this property). More discussion of recursive operators is at the end of this post; this intuition suffices to understand the proof of Kleene’s first recursion theorem,

Theorem (Kleene’s First Recursion Theorem)
Every (recursion theory) recursive operator \Phi : R \to R has a fixed point. Furthermore, there is a least fixed point i.e. a function f \in R such that \Phi(f) = f, and for any function g \in R satisfying \Phi(g) = g, f \sqsubset g.

The notation f \sqsubset g (pronounced “f is less defined than g“) says that wherever f is defined, g is defined and agrees: f(n) \neq \bot implies g(n) = f(n) (in particular, g(n) = \bot implies f(n) = \bot). Notice that the function defined nowhere (denoted \Lambda) satisfies \Lambda \sqsubset f for every f.

The proof can be thought of as “making recursive calls until termination”. Say we want to know what value we should assign to f(n). If we want \Phi(f) = f, this means we want to know the value of \Phi(f)(n). \Phi(f) tells us a finite list of values of f that, if we define those, will define \Phi(f)(n); this is our recurrence for computing f(n). By repeatedly asking for and evaluating those values, either all recursive calls for f(n) will terminate at base cases, or they won’t; if it ever terminates, we have f(n)‘s value, otherwise f(n) = \bot is undefined. To use our running example of the Fibonacci recurrence, \Phi(f) tells us that in order to define f(n), it suffices to define f(n-1) and f(n-2), which in turn require f(n-3) and f(n-4), and so on until we meet the base cases 0 and 1, which don’t require any other values of f.

A bit more formally, start with the function that loops on every input, \Lambda. Define functions f_i that terminate after i recursive calls by f_0 = \Lambda and f_i = \Phi (f_{i-1}). Finally, let f = \bigcup_{i: \mathbb{N}} f_i i.e. if an input terminates in any finite number of recursive calls, it is defined.

It’s not too hard to see that this f is least: if g is any fixed point of \Phi, then by induction f_i \sqsubset g for every i (each stage records only values forced by finitely many recursive calls), hence f \sqsubset g. \square
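The stages f_0 \sqsubset f_1 \sqsubset \cdots can be simulated directly. A small Python sketch (my own modeling choice: a partial function is a dict, with a missing key playing the role of \bot):

```python
# Kleene iteration: start from the nowhere-defined function and repeatedly
# apply the Fibonacci recursive operator, defining f on exactly those inputs
# whose recursive calls already resolve in the previous stage.
def step(g, max_n=30):
    """One application of the recursive operator to partial function g."""
    f = {}
    for n in range(max_n):
        if n == 0:
            f[n] = 0
        elif n == 1:
            f[n] = 1
        elif n - 1 in g and n - 2 in g:
            f[n] = g[n - 1] + g[n - 2]
    return f

g = {}                 # Lambda: the partial function defined nowhere
for _ in range(12):    # f_i = Phi(f_{i-1})
    g = step(g)

# After i rounds, f is defined wherever recursion bottoms out within i calls.
assert g[10] == 55
```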

Ultimately, the proof is nothing more than “compiling our recursive operator,” in the same way the compiler in the programming language perspective “automatically” defines a function that terminates if all recursive calls terminate, or is undefined if there’s an infinite loop.

Finally, we should mention the idea behind the definition of recursive operators. As seen in the proof, and as holds true when thinking of programming language recursion, the key idea is that computing some value \Phi(f)(n) should only require finitely many values of f. This can be formalized by saying that \Phi is a continuous function on R.

We define the positive information topology on R. Let \mathcal{F} be those functions that are defined on only finitely many inputs, and are \bot elsewhere. We think of functions in \mathcal{F} as specifying finitely many values of functions in R. A basis of open sets of the positive information topology is U_\theta, for \theta \in \mathcal{F}, defined by

U_\theta = \{ g \in R \mid \theta \sqsubset g\}

That is, U_\theta is exactly those functions that meet the specified (finitely many) output values of \theta. Continuity of \Phi means that the preimage of every basis set U_\xi is open, i.e. contains a basis set U_\nu around each of its points. Taking \xi to be a singleton function, so that U_{\xi} is exactly those functions with a single specified output, continuity says that fixing the finitely many values in \nu ensures that the single output is fixed.

A recursive operator is necessarily a continuous function on this space. A “coherence condition” is also necessary to pin down the proper definition, but continuity suffices for intuition. For full definitions, see Cutland’s book on computability theory, Chapter 10.




Eigenfunctions of the Boolean Fourier Transform

The Fourier transform, for \mathbb{R}, finite abelian groups and other groups that are self-dual, is a linear operator. One can ask the next natural question: what are the eigenfunctions of this operator, i.e. the functions f such that \mathcal{F} f  = C \cdot f for some C \in \mathbb{C}?

For the standard Fourier transform on \mathbb{R}, one produces some interesting eigenfunctions, most notably the Gaussians, and in general the Hermite functions. Mysteriously, *why* the Gaussian is an eigenfunction remains “a minor miracle”.

For the Boolean Fourier transform, which takes a function f: \mathbb{Z}_2^n \to \mathbb{C} to its transform \widehat f : \mathbb{Z}_2^n \to \mathbb{C}, the eigenfunctions don’t seem especially exciting to me. But here they are:

  • There are 2^{n-1} eigenfunctions with eigenvalue \frac{1}{\sqrt{2}^n}. Indexed by the even parity bit strings \sigma \in \{0,1\}^n, they are

f(x) = (-1)^{\langle \sigma, x\rangle} \alpha^{2|x \oplus \sigma| - n}

where \alpha = \sqrt{\sqrt{2} - 1} and |x \oplus \sigma| represents the Hamming weight of the bitwise XOR of x and \sigma.

  • There are 2^{n-1} eigenfunctions with eigenvalue -\frac{1}{\sqrt{2}^n}. Indexed by the odd parity bit strings \tau \in \{0,1\}^n, they are given by the same formula (with \tau in place of \sigma).

(This is using the textbook CS definition of the Boolean Fourier transform; if it’s instead normalized to be unitary the eigenvalues will be \pm 1.)

Let’s prove this. The Boolean Fourier transform is implemented by the Hadamard matrix, itself defined by the recurrence H_n = H \otimes H_{n-1}, where

H = \frac{1}{2}\begin{pmatrix}  1 & 1\\  1 & -1  \end{pmatrix}

The eigenvectors of this 2 \times 2 matrix are

\begin{bmatrix}\frac{1}{\sqrt{\sqrt{2} - 1}} \\\\ \sqrt{\sqrt{2} - 1} \end{bmatrix} = \begin{bmatrix}\frac{1}{\alpha} \\\\ \alpha \end{bmatrix} \qquad \text{ to }\frac{1}{\sqrt{2}}

\begin{bmatrix} \sqrt{\sqrt{2} - 1}\\\\ \frac{-1}{\sqrt{\sqrt{2} - 1}} \end{bmatrix} =\begin{bmatrix}\alpha \\\\ -\frac{1}{\alpha} \end{bmatrix}\qquad \text{ to } -\frac{1}{\sqrt{2}}

using \alpha = \sqrt{\sqrt{2} - 1} as before.

Inductively, via the recurrence H_n = H \otimes H_{n-1}, tensor products of these two eigenvectors are eigenvectors of H_n (with the eigenvalues multiplying), and unwinding the tensor products produces exactly the eigenfunctions listed above.
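We can sanity-check the claimed eigenfunctions numerically. This numpy snippet (my own, for n = 3) builds H_n by iterated Kronecker products and verifies the formula and eigenvalues:

```python
import numpy as np

# Check: f(x) = (-1)^{<sigma,x>} * alpha^{2|x xor sigma| - n} is an
# eigenfunction of H_n = H^{(tensor n)} with eigenvalue (+-1)/sqrt(2)^n,
# sign given by the parity of sigma.
alpha = np.sqrt(np.sqrt(2) - 1)
H1 = 0.5 * np.array([[1.0, 1.0], [1.0, -1.0]])

n = 3
H = H1
for _ in range(n - 1):
    H = np.kron(H1, H)

def eigenfunction(sigma):
    f = np.zeros(2 ** n)
    for x in range(2 ** n):
        dot = bin(x & sigma).count("1")       # <sigma, x> mod 2
        weight = bin(x ^ sigma).count("1")    # Hamming weight |x xor sigma|
        f[x] = (-1) ** dot * alpha ** (2 * weight - n)
    return f

for sigma in range(2 ** n):
    f = eigenfunction(sigma)
    sign = 1 if bin(sigma).count("1") % 2 == 0 else -1
    assert np.allclose(H @ f, sign / np.sqrt(2) ** n * f)
```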

Eigenfunctions of the Discrete Fourier Transform

It seems that computing the eigenfunctions of the Fourier transform over \mathbb{Z}_N gets quite a bit messier. This is the question of computing an orthogonal eigenbasis for the DFT matrix

K = \frac{1}{\sqrt{N}}\left[\omega^{-ij}\right]_{0 \leq i, j < N}

A canonical set of eigenvectors can be generated computationally as the solutions to a simpler matrix equation using this seminal work, but it seems that in general there isn’t a simple description of the eigenvectors themselves.
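One structural fact is at least easy to verify: the unitary DFT satisfies K^4 = I, so its eigenvalues can only be the fourth roots of unity \pm 1, \pm i, and the real question is about the eigenspaces. A quick numpy check of mine, for N = 8:

```python
import numpy as np

# The unitary DFT matrix K with entries omega^{-ij}/sqrt(N).
N = 8
omega = np.exp(2j * np.pi / N)
i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
K = omega ** (-i * j) / np.sqrt(N)

assert np.allclose(K.conj().T @ K, np.eye(N))                 # unitary
assert np.allclose(np.linalg.matrix_power(K, 4), np.eye(N))   # K^4 = I
assert np.allclose(np.abs(np.linalg.eigvals(K)), 1.0)         # |eigenvalue| = 1
```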

What seems to make these mathematically weird is that we’re working with maps between a space and its dual. These spaces are not typically isomorphic, and in these cases we can’t even talk about eigenvectors. In the cases where they are (canonically) isomorphic, the eigenfunctions feel like a spontaneous beast.

Perhaps this will be the subject of a blog post in the future.

If you know of any further interpretations of the eigenfunctions (in either the real or finite abelian case), leave a comment!

Proving the Coupling Inequality with LP Duality

This post details the first in a collection of everyday inequalities that arise as special cases of duality in linear programming. We’ll show that the usual coupling inequality is such an example,

d_{TV}(\mu, \nu) \leq \Pr[X \neq Y]

A coupling of two distributions \mu and \nu on \Omega is a joint distribution on \Omega \times \Omega such that the two marginal distributions are \mu and \nu respectively. We write (X, Y) for the joint random variable, with X the marginal in the first coordinate, and Y the marginal in the second coordinate.  We say that \mu and \nu are “coupled”, with the coupling (X, Y) defining the interaction, or dependence, between them.

One of the basic facts about couplings compares the statistical distance between \mu and \nu to the probability the coupled variables agree. For any coupling (X, Y) of \mu and \nu, we have

d_{TV}(\mu, \nu) \leq \Pr[X \neq Y]

Furthermore, there is a coupling that achieves this bound. The above inequality is typically used to upper bound the statistical distance between \mu and \nu by constructing a coupling that’s likely to agree e.g. in the proof of convergence of an ergodic Markov chain to its limiting distribution.

In this post we’ll prove this inequality (and its tightness!) using an LP duality argument.

The Coupling LP

A coupling is specified by a set of linear constraints: for x,y \in \Omega, let p_{x,y} be the probability of seeing the pair (x,y). The constraints are

\forall x. \quad \displaystyle\sum_{y \in \Omega} p_{x,y} = \mu(x)

\forall y. \quad \displaystyle\sum_{x \in \Omega} p_{x,y} = \nu(y)

\forall x,y. \quad p_{x,y} \geq 0

\displaystyle\sum_{x,y \in \Omega} p_{x,y} = 1

This defines a polytope in \mathbb{R}^{|\Omega|^2}, called a transportation polytope. In fact, the last constraint is redundant, as summing the constraints with \mu(x) over all x‘s will also give 1. Over this polytope, the linear objective to minimize is the probability that X and Y disagree,

\min\displaystyle\sum_{\substack{x,y \in \Omega \\ x \neq y}} p_{x, y}

Ok, now that we have the LP, its dual (whose computation is explained e.g. here… or perhaps in a future blog post) is

\max \displaystyle\sum_{\omega \in \Omega} q^\mu_\omega \mu(\omega) + q^\nu_\omega \nu(\omega)

\forall \omega. \quad q^\mu_\omega + q^\nu_\omega \leq 0

\forall \omega_1 \neq \omega_2. \quad q^\mu_{\omega_1} + q^\nu_{\omega_2} \leq 1

By the duality theorems, \Pr[X \neq Y] is always at least as large as the solution to the dual, and furthermore, for some X, Y the two solutions agree. To finish our argument, we’ll show that the value of the dual LP is the statistical distance between \mu and \nu. The definition of the statistical distance we’ll use is the “optimization” version: the maximum difference in probabilities of events in \Omega,

d_{TV}(\mu, \nu) = \max_{A \subseteq \Omega} |\mu (A) - \nu (A)|

= \max_{A \subseteq \Omega} \left|\displaystyle\sum_{a \in A} \mu(a) - \nu(a)\right|

First we argue that the optimum is \{-1, 0, +1\}-valued. The dual problem’s matrix is totally unimodular (Prop. 3.2) and the right-hand sides of the constraints are integral, hence the LP is integral. Fix an optimal solution (q^{\mu*}_\omega, q^{\nu*}_\omega). From the constraints we see only one of the sets \{q^{\mu*}_\omega : \omega \in \Omega\} and \{q^{\nu*}_\omega : \omega \in \Omega\} can contain positive values. WLOG these are the variables for \mu. By optimality, and by looking at the largest of the q^{\mu*}_\omega (resp. the smallest of the q^{\nu*}_\omega), the \{q^{\mu*}_\omega : \omega \in \Omega \} are within 1 of one another, and the q^{\nu*}_\omega are their negations. Now we can shift the solution to be \{-1, 0, +1\}-valued without changing the value.

Now we are in a position to compare the \{-1, 0, +1\}-optimum with d_{TV}(\mu, \nu). As noted above, the set of variables for exactly one of \mu, \nu is positive, which gives the sign in the absolute value. The +1-valued variables indicate the event to take. This shows the value of the dual is exactly d_{TV}(\mu, \nu). As explained, this finishes the proof that

d_{TV}(\mu, \nu) \leq \Pr[X \neq Y]
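As a numerical sanity check (my own setup, using scipy’s linprog as the solver), we can solve the primal coupling LP for a small \Omega and confirm the optimum equals d_{TV}(\mu, \nu):

```python
import numpy as np
from scipy.optimize import linprog

# Minimize Pr[X != Y] over couplings of mu and nu, and compare with d_TV.
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
n = len(mu)

c = (1.0 - np.eye(n)).ravel()          # objective: sum of off-diagonal p_{x,y}

A_eq = []
for x in range(n):                     # row marginals: sum_y p_{x,y} = mu(x)
    row = np.zeros((n, n)); row[x, :] = 1.0
    A_eq.append(row.ravel())
for y in range(n):                     # column marginals: sum_x p_{x,y} = nu(y)
    col = np.zeros((n, n)); col[:, y] = 1.0
    A_eq.append(col.ravel())
b_eq = np.concatenate([mu, nu])

res = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
d_tv = 0.5 * np.abs(mu - nu).sum()     # equals the max-over-events definition
assert res.success and abs(res.fun - d_tv) < 1e-8
```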


Transportation Polytopes

We finish with a few extra details on transportation polytopes.

Fixing \mu and \nu, the collection of possible couplings Z forms a convex polytope in \mathbb{R}^{|\Omega|^2} from a very special class of polytopes, the transportation polytopes. The name comes from the following “transportation” problem in optimization: suppose you have m factories producing a good, such as lawn mowers, in quantities s_1, s_2, \dots, s_m. You want to distribute the lawn mowers among n stores in quantities t_1, t_2, \dots, t_n. The question is, how many lawn mowers should factory i send to store j? A solution that achieves the desired stocking quantities is exactly a nonnegative matrix (a.k.a. a joint probability distribution, after normalization) with the given marginals.

Vertices of a transportation polytope can be enumerated and checked algorithmically. Furthermore, when the marginals \mu and \nu are uniform, the corresponding transportation polytope is exactly the set of doubly stochastic matrices. By the classical Birkhoff–von Neumann theorem, the vertices of this polytope are exactly the permutation matrices, i.e. the doubly stochastic matrices are exactly the convex combinations of permutation matrices.
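A Birkhoff–von Neumann decomposition can be computed greedily: repeatedly find a permutation supported on the positive entries (one is guaranteed to exist by Hall’s theorem) and peel off as much of it as possible. A Python sketch of mine, using scipy’s assignment solver for the matching step:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def birkhoff(D, tol=1e-9):
    """Decompose a doubly stochastic matrix into weighted permutations."""
    D = D.astype(float).copy()
    terms = []                          # list of (weight, permutation) pairs
    while D.max() > tol:
        # find a permutation supported on the positive entries of D
        cost = np.where(D > tol, -1.0, 0.0)
        rows, cols = linear_sum_assignment(cost)
        w = D[rows, cols].min()         # largest weight we can subtract
        P = np.zeros_like(D); P[rows, cols] = 1.0
        terms.append((w, P))
        D -= w * P                      # zeroes at least one entry per round
    return terms

D = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.25, 0.5],
              [0.25, 0.25, 0.5]])
terms = birkhoff(D)
recon = sum(w * P for w, P in terms)
assert np.allclose(recon, D) and abs(sum(w for w, _ in terms) - 1.0) < 1e-9
```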

Syntactic Translations of Berry-Esséen and the CLT

Probably the most studied problem in probability theory is the following: suppose X_1, X_2, X_3, \dots, X_n are independent, identically distributed random variables from some distribution \mathcal{D} on \mathbb{R}, and we look at the random variable

S_n = X_1 + X_2 + \cdots + X_n

How is S_n distributed?

Typically the answer is the central limit theorem (CLT) amplified with the Berry-Esséen theorem, “a sum of independent variables is approximately normally distributed”. This post explains two symbolic translations of that sentence which I find easier to write down at a moment’s notice. The first takes the viewpoint of “asymptotic approximation of S_n“,

S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + O(1)

Here \mu is the mean of \mathcal{D} and \sigma is the standard deviation of \mathcal{D}. The second is

S_n = Y_1 + Y_2 + \cdots + Y_n + O(1)

where the Y_i are i.i.d. Gaussian random variables with the same mean and variance as the X_i, Y_i \sim \mathcal{N}(\mu, \sigma^2). This can be thought of as an “invariance principle” or “replacement lemma” (and this is the viewpoint taken by Lindeberg in his proof of the CLT). The invariance principle is now a tool used in Boolean Fourier analysis.

Crucial remark: Unfortunately, the equality of these “syntactic forms” is NOT any convergence in distribution. We’ve “blown up” errors by a factor of \sqrt{n} from the convergence guaranteed by the CLT. Dividing both sides by \sqrt{n} normalizes both sides to have finite variance and recovers the convergence in distribution guaranteed by an appropriate CLT.

Asymptotics for S_n

The law of large numbers says that, if \mu is finite, “the sample mean converges to the true mean”, i.e.

S_n / n \to \mu

In our notation, this is a “linear approximation to S_n“,

S_n = n\mu + o(n)

The central limit theorem refines this by saying that, in fact, the sample mean is about \Theta(\frac{1}{\sqrt{n}}) away from the true mean. If \mu and \sigma are finite,

\frac{(S_n - n \mu)}{\sqrt{n}} \to \mathcal{N}(0,\sigma^2)

In our notation, this is a “\sqrt{n} approximation to S_n“,

S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + o(\sqrt{n})

The Berry-Esséen theorem strengthens the convergence to the normal distribution. If \rho = E(|X_1|^3) denotes (an upper bound on) the third moment of the distribution \mathcal{D}, which we assume to be finite,

|S_n - n\mu - \sqrt{n}\mathcal{N}(0,\sigma^2)| \leq O(\frac{\rho}{\sigma^2})

In our notation, this is an improvement up to a constant to the asymptotic distribution of S_n,

S_n = n\mu + \sqrt{n} \mathcal{N}(0, \sigma^2) + O(\frac{\rho}{\sigma^2})

It should be noted that the Berry-Esséen theorem is tight up to the big-Oh. The binomial distribution achieves this rate of convergence; see the end of these lecture notes for a proof.
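To see the \Theta(1/\sqrt{n}) rate concretely, this small Python computation of mine measures the sup-CDF (Kolmogorov) distance between Binomial(n, 1/2) and the matching normal:

```python
import math

def normal_cdf(x, mean, std):
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2))))

def kolmogorov_distance(n):
    """Sup-CDF distance between Binomial(n, 1/2) and N(n/2, n/4)."""
    cdf, dist = 0.0, 0.0
    for k in range(n + 1):
        phi = normal_cdf(k, n / 2, math.sqrt(n) / 2)
        dist = max(dist, abs(cdf - phi))   # just before the jump at k
        cdf += math.comb(n, k) / 2 ** n
        dist = max(dist, abs(cdf - phi))   # just after the jump at k
    return dist

d100, d400 = kolmogorov_distance(100), kolmogorov_distance(400)
# quadrupling n roughly halves the distance, consistent with C/sqrt(n)
assert d400 < d100 < 0.05
```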

Invariance Principles

The previous analysis ends with the distribution n \mu + \sqrt{n} \mathcal{N}(0, \sigma^2). We can incorporate everything into one Gaussian to produce the equivalent distribution \mathcal{N}(n\mu, n\sigma^2). Interestingly, because the sum of independent Gaussians is again Gaussian, this distribution has the same PDF as

Y_1 + Y_2 + \cdots + Y_n

for independent Gaussian random variables Y_i \sim \mathcal{N}(\mu, \sigma^2). This leads us to the equality

S_n =X_1 + X_2 + \cdots + X_n =  Y_1 + Y_2 + \cdots + Y_n + O(1)

in which we’ve simply replaced each X_i by a Gaussian random variable with the same mean and variance.

As noted in Remark 29 here, one can improve the constant term by changing the replacement variables. In particular, it can be made o(1) if the first, second, and third moments of the Y_i agree with the X_i.

The idea of “replacement invariance” surfaces in theoretical CS in the context of Boolean Fourier analysis. Here we generalize the summation of Boolean (\{\pm1\}-valued) variables X_1 + X_2 + \cdots + X_n to an arbitrary Boolean function

f(X_1, X_2, \dots, X_n)

The invariance principle states that the random variable f(X_1, X_2, \dots, X_n), for f: \{\pm 1\}^n \to \mathbb{R} and X_i uniformly drawn from \{\pm 1\}, is close in distribution to the random variable

f(Y_1, Y_2, \dots, Y_n)

for Y_i \sim \mathcal{N}(0, 1) (assuming we normalize both sides to have variance 1). In this case the “closeness” is determined by the maximum influence of a variable X_i on the value of f, as well as the complexity of f (its degree as a multilinear polynomial); see the previously linked lecture notes for an exact quantitative statement.

The Ackermann Function Defined By Hyperoperation

One of my associates recently remarked, “I can write down the Ackermann recursion, but I have zero intuition where it comes from”. This expository post explains the answer I gave him, that the Ackermann function arises via hyperoperation.

The Ackermann function A(m, n) is variously defined, but the most popular these days is the Ackermann-Péter function:

A(0, n) = n+1

A(m, 0) = A(m-1, 1)

A(m, n) = A(m - 1, A(m, n-1))

One way to build up to the Ackermann function is through hyperoperation. We’ll show that A(m, n) is pretty much 2 \wedge_m n for the m-th hyperoperator \wedge_m: in fact A(m, n) = 2 \wedge_m (n+3) -3.

Fixing the first argument of A(m, n) yields a function based on the m-th hyperoperator, which is why we often think of the Ackermann function as a family of functions parameterized by the first argument. In retrospect, maybe the better definition for the Ackermann function is the simpler A(m,n) = 2 \wedge_m n, so that the ubiquitous exercise to compute the first few functions x \mapsto A(m, x) becomes just a little bit less of an exercise.
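For reference, the Ackermann–Péter recursion transcribes directly into Python (memoized, since the call tree revisits the same arguments many times), and the families x \mapsto A(m, x) for small m come out as shifted addition, multiplication, and exponentiation:

```python
from functools import lru_cache

# The Ackermann-Peter recursion, transcribed verbatim.
@lru_cache(maxsize=None)
def A(m, n):
    if m == 0:
        return n + 1
    if n == 0:
        return A(m - 1, 1)
    return A(m - 1, A(m, n - 1))

# Fixing small m recovers (shifted) addition, multiplication, exponentiation:
assert [A(1, n) for n in range(5)] == [n + 2 for n in range(5)]
assert [A(2, n) for n in range(5)] == [2 * n + 3 for n in range(5)]
assert [A(3, n) for n in range(5)] == [2 ** (n + 3) - 3 for n in range(5)]
```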


Interpret multiplication x * y as repeated addition: add x to itself, y times. Interpret exponentiation x^y as repeated multiplication: multiply x with itself, y times. Going one step further, interpret tetration x \uparrow y as repeated exponentiation: exponentiate x with itself, y times. For example,

x \uparrow 6 = x^{x^{x^{x^{x^{x}}}}}

Going infinitely many steps further, we can define the k-th hyperoperation x \wedge_k y to “recursively apply the previous hyperoperation on x with itself, y times” (this of course stops being commutative at and beyond exponentiation, though the first hyperoperators constructed were actually commutative). To actually implement this we should use the recursion “compute x hyperoperated with itself y-1 times, then apply x one more time”:

x \wedge_k y \overset{\text{def}}{=} x \wedge_{k-1}(x \wedge_k (y-1))

Notice we fold right here i.e. compute x \wedge_{k-1} (x \wedge_{k-1} (x \wedge_{k-1}\cdots (x \wedge_{k-1} x)\cdots)) rather than ((\cdots(x \wedge_{k-1} x) \cdots\wedge_{k-1} x) \wedge_{k-1} x) \wedge_{k-1} x.

We also need a trio of base cases for what happens when you “apply x to itself 0 times” in order to align with the normal arithmetic operations:

x \wedge_1 0 \overset{\text{def}}{=} x

x \wedge_ 2 0 \overset{\text{def}}{=} 0

x \wedge_k 0 \overset{\text{def}}{=} 1 for k \geq 3

Finally, we set x \wedge_0 y\overset{\text{def}}{=} y+1, to capture that addition, \wedge_1, arises as the iterated successor function, our \wedge_0. After making all these definitions, we have our hyperoperators. These extend the normal arithmetic operators in the following way:

x \wedge_1 y = x + y

x \wedge_2 y = x * y

x \wedge_3 y = x^y

x \wedge_4 y = x \uparrow y

Proof of Ackermann/Hyperoperator Identity

The point of defining these hyperoperations is to note that the Ackermann function is itself a specific case: A(m, n) = 2 \wedge_m (n+3) -3. The proof of this isn’t too exciting, but let’s prove it by induction. Check the base case m = 0,

A(0,n) = n+1

2 \wedge_0 (n+3) - 3 = n+1

For the base case n = 0, we need key properties of using 2 instead of a different constant:

2 \wedge_k 1 = 2 for k \geq 3

2 \wedge_k 2 = 4 for k \geq 3

Now we have

A(m, 0) = A(m-1, 1)

2 \wedge_m (0+3) - 3 = 2 \wedge_{m-1} (2 \wedge_m 2) - 3 = 2 \wedge_{m-1} 4 -3

And the primary recurrence for 2 \wedge_m n and A(m, n):

A(m, n) = A(m-1, A(m, n-1))

2 \wedge_m (n+3) - 3 =2 \wedge_{m-1}(2 \wedge_{m} (n+2)) - 3 =2 \wedge_{m-1}((2 \wedge_{m} (n+2) - 3) + 3) - 3

Substituting 2 \wedge_m (n+2) - 3 for A(m, n-1) by induction, we see they are in fact the same recursion. \square
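The identity is also easy to machine-check for small arguments. A Python sketch implementing the definitions above (the names hyperop and ackermann are mine):

```python
# Hyperoperation x ^_k y, with the base cases and right-fold recursion above.
def hyperop(k, x, y):
    if k == 0:
        return y + 1                                  # iterated successor
    if y == 0:
        return x if k == 1 else (0 if k == 2 else 1)  # base-case trio
    return hyperop(k - 1, x, hyperop(k, x, y - 1))

def ackermann(m, n):
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

# The hyperoperators extend the arithmetic operators...
assert hyperop(1, 2, 3) == 5 and hyperop(2, 2, 3) == 6
assert hyperop(3, 2, 3) == 8 and hyperop(4, 2, 3) == 16
# ...and A(m, n) = 2 ^_m (n+3) - 3 for small arguments:
for m in range(4):
    for n in range(5):
        assert ackermann(m, n) == hyperop(m, 2, n + 3) - 3
```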


Approximating Binomial Coefficients with Entropy

This post recounts the story of one of my favorite “probabilistic method” proofs,

\log \binom{n}{k} \leq n H(k/n)

where H(p) is the binary entropy function i.e. the entropy of a coin flipped with bias p. Maybe more properly the proof is just counting in two ways?

By a tiny tweak to our entropy argument as done here, one can get a better bound

\log\displaystyle\sum_{i = 0}^{k} \binom{n}{i} \leq nH(k / n)

This bound, valid for k \leq n/2, is what’s typically used in practice to bound e.g. the sum of the first 1/3 of the binomial coefficients by 2^{n H(1/3)}. The proof here gives all the essential ideas for that proof. With Stirling’s approximation used to give error bounds, both approximations are tight up to lower-order terms in the exponent when k = \Theta(n),

\binom{n}{k} = 2^{(1+o(1)) n H(k / n)}

The experiment: Let the random variable X denote the result of flipping n fair coins i.e. X is binomially distributed. Now suppose we know X has exactly k Heads.

The question: How much entropy is in the conditional distribution of X given k Heads? I.e. what is H(X \mid k\text{ Heads})?

The left side: The conditional distribution of X is uniform over \binom{n}{k} possible results. The entropy of the uniform distribution on N objects is \log N, therefore

H(X \mid k\text{ Heads}) = \log \binom{n}{k}.

The right side: On the other hand, X is specified by its bits X_i, so

H(X \mid k \text{ Heads}) = H(X_1, X_2, \dots, X_n \mid k \text{ Heads})

By subadditivity of entropy,

H(X_1, X_2, \dots, X_n \mid k \text{ Heads}) \leq H(X_1 \mid k \text{ Heads}) + H(X_2 \mid k \text{ Heads}) + \cdots + H(X_n \mid k \text{ Heads})

This expression is easy to compute: conditioned on k Heads among the n flips, each individual X_i looks like a coin with bias k /n, so its entropy is H(k/n). So

H(X_1 \mid k \text{ Heads}) + H(X_2 \mid k \text{ Heads}) + \cdots + H(X_n \mid k \text{ Heads})= n H(k/n)

giving the right-hand side.
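The bound (and the partial-sum version, for k \leq n/2) is cheap to verify exhaustively for small n; a quick Python check of mine:

```python
import math

# Exhaustive check of log2 C(n,k) <= n*H(k/n), plus the partial-sum bound
# for k <= n/2, over small n.
def H(p):
    """Binary entropy of a coin with bias p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for n in range(1, 60):
    for k in range(n + 1):
        assert math.log2(math.comb(n, k)) <= n * H(k / n) + 1e-9
        if 2 * k <= n:   # the sharper bound on the partial sum
            partial = sum(math.comb(n, i) for i in range(k + 1))
            assert math.log2(partial) <= n * H(k / n) + 1e-9
```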

Further Generalization: Multinomial Coefficients

We can naturally generalize the above argument to provide an approximation for multinomial coefficients. Fix natural numbers k_1, k_2, \dots, k_m with \sum_i k_i = n. Let K be a distribution over [m] taking i with probability k_i / n. Then

\log \binom{n}{k_1, k_2, \dots, k_m} \leq nH(K)

The proof generalizes perfectly: the underlying set is [m]^{n} (place a letter from \{1, 2, \dots, m\} at each of n indices), and we consider the entropy of a uniform choice of a string constrained to have the fixed letter histogram k_1, k_2, \dots, k_m. This is exactly the left-hand side, and subadditivity yields the right-hand side.

Moreover, by Stirling again the approximation is tight when k_i = \Theta(n).

A Heuristic Argument for the Effectiveness of the Union Bound

The union bound is ubiquitous in theoretical computer science, and perhaps in justification of this, Andy Drucker once mentioned in a lecture a heuristic argument for its “tightness”. Just to write it out, the union bound takes

P(A \cup B) = P(A) + P(B) - P(A \cap B) \qquad\qquad P\left(\bigcup_i A_i\right) = \displaystyle\sum_{\emptyset \neq S \subseteq [n]} (-1)^{|S|+1} P\left(\bigcap_{i \in S} A_i\right)

and approximates it as

P(A \cup B) \leq P(A) +P(B)\qquad \qquad P\left(\bigcup_i A_i\right) \leq \displaystyle\sum_i P(A_i)

Imagine if \{A_i \mid 1 \leq i \leq n\} are “bad events” that occur with some tiny probability \delta. The union bound says

P(\text{a bad event happens}) \leq n \delta

If the events were fully independent, we would have the equality

P(\text{a bad event happens}) = 1 - (1 - \delta)^n

Now (1 - \delta)^n \approx 1 - n\delta when \delta = o(1/n). So we have

P(\text{a bad event happens}) \approx 1 - (1 - n\delta) = n\delta

This is equal to the bound spit out by the application of the union bound! In the case of fully independent events with small probabilities, we see the union bound is essentially tight. Full independence is rare, but in applications where low-probability events are slightly correlated, the union bound is also a good approximation.
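The comparison in the independent case is easy to check numerically (a small Python computation of mine):

```python
# For independent rare events, the exact probability 1 - (1 - delta)^n is
# very close to the union bound n * delta.
n, delta = 1000, 1e-6
exact = 1 - (1 - delta) ** n
union = n * delta
assert exact < union                       # union bound is a valid upper bound
assert abs(exact - union) / union < 1e-3   # ...and nearly tight here
```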