Approximating Binomial Coefficients with Entropy

This post recounts the story of one of my favorite “probabilistic method” proofs,

\log \binom{n}{k} \leq n H(k/n)

where H(p) is the binary entropy function, i.e. the entropy of a coin flipped with bias p. Maybe more properly the proof is just counting in two ways?
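
As a quick numerical sanity check, here is a minimal Python sketch of the inequality (the choices of n and k, and the helper name binary_entropy, are my own illustrations, not part of the argument):

```python
from math import comb, log2

def binary_entropy(p):
    """Entropy in bits of a coin with bias p (0 at p = 0 or 1)."""
    return -p * log2(p) - (1 - p) * log2(1 - p) if 0 < p < 1 else 0.0

# Arbitrary test cases: the left side never exceeds the right side.
for n, k in [(10, 3), (100, 40), (1000, 250)]:
    lhs = log2(comb(n, k))
    rhs = n * binary_entropy(k / n)
    assert lhs <= rhs
    print(f"n={n}, k={k}: log2 C(n,k) = {lhs:.2f} <= n H(k/n) = {rhs:.2f}")
```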

By a tiny tweak to our entropy argument as done here, one can get a better bound

\log\displaystyle\sum_{i = 0}^{k} \binom{n}{i} \leq nH(k / n) \qquad (k \leq n/2)

This is what’s typically used in practice to bound, e.g., the sum of the first third of the binomial coefficients by 2^{n H(1/3)}. The proof below contains all the essential ideas needed for that stronger bound. With Stirling’s approximation supplying error bounds, both approximations are tight to first order in the exponent when k = \Theta(n),

\binom{n}{k} = 2^{(1+o(1))\, n H(k / n)}
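
A similarly minimal sketch checks the partial-sum bound and watches the exponent ratio tend to 1 (taking k/n = 1/3, an arbitrary choice with k \leq n/2):

```python
from math import comb, log2

def binary_entropy(p):
    return -p * log2(p) - (1 - p) * log2(1 - p) if 0 < p < 1 else 0.0

for n in (30, 300, 3000):
    k = n // 3                                    # k = Theta(n), here k/n = 1/3
    partial_sum = sum(comb(n, i) for i in range(k + 1))
    assert log2(partial_sum) <= n * binary_entropy(k / n)
    ratio = log2(comb(n, k)) / (n * binary_entropy(k / n))
    print(f"n={n}: log2 C(n,k) / (n H(k/n)) = {ratio:.4f}")  # approaches 1
```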

The experiment: Let the random variable X denote the result of flipping n fair coins, i.e. a uniformly random sequence of n Heads/Tails outcomes (so the number of Heads is binomially distributed). Now suppose we know X has exactly k Heads.

The question: How much entropy is in the conditional distribution of X given k Heads? I.e. what is H(X \mid k\text{ Heads})?

The left side: The conditional distribution of X is uniform over the \binom{n}{k} possible results with exactly k Heads. The entropy of the uniform distribution on N objects is \log N, therefore

H(X \mid k\text{ Heads}) = \log \binom{n}{k}.

The right side: On the other hand, X is specified by its bits X_i, so

H(X \mid k \text{ Heads}) = H(X_1, X_2, \dots, X_n \mid k \text{ Heads})

By subadditivity of entropy,

H(X_1, X_2, \dots, X_n \mid k \text{ Heads}) \leq H(X_1 \mid k \text{ Heads}) + H(X_2 \mid k \text{ Heads}) + \cdots + H(X_n \mid k \text{ Heads})

This expression is easy to compute: conditioned on exactly k Heads among the n flips, each X_i is marginally a coin with bias k/n (by symmetry, \Pr[X_i = \text{Heads}] = k/n), so its entropy is H(k/n). Therefore

H(X_1 \mid k \text{ Heads}) + H(X_2 \mid k \text{ Heads}) + \cdots + H(X_n \mid k \text{ Heads})= n H(k/n)

giving the right-hand side.
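
For small n and k, both halves of the argument can be confirmed by brute-force enumeration; here is a minimal sketch (n = 6, k = 2 are arbitrary choices):

```python
from itertools import combinations
from math import comb, log2, isclose

n, k = 6, 2
# The conditional sample space: all sets of positions for the k Heads.
outcomes = list(combinations(range(n), k))
N = len(outcomes)
assert N == comb(n, k)

# Left side: entropy of the uniform distribution on N outcomes is log2 N = log2 C(n, k).
p = 1 / N
conditional_entropy = -sum(p * log2(p) for _ in outcomes)
assert isclose(conditional_entropy, log2(comb(n, k)))

# Right side: each individual flip X_i is Heads with probability exactly k/n.
for i in range(n):
    p_heads = sum(1 for s in outcomes if i in s) / N
    assert isclose(p_heads, k / n)
```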

Further Generalization: Multinomial Coefficients

We can naturally generalize the above argument to provide an approximation for multinomial coefficients. Fix natural numbers k_1, k_2, \dots, k_m with \sum_i k_i = n. Let K be the distribution over [m] = \{1, 2, \dots, m\} taking value i with probability k_i / n. Then

\log \binom{n}{k_1, k_2, \dots, k_m} \leq nH(K)

The proof generalizes perfectly: the underlying set is [m]^{n} (place a letter from \{1, 2, \dots, m\} at each of n indices), and we consider the entropy of a uniform choice of string constrained to have the fixed letter histogram k_1, k_2, \dots, k_m. There are \binom{n}{k_1, k_2, \dots, k_m} such strings, so this entropy is exactly the left-hand side; subadditivity, together with the fact that each coordinate is marginally distributed as K, yields the right-hand side.
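
A numerical sanity check of the multinomial bound, again as a sketch (the histogram and the helper functions below are arbitrary illustrations):

```python
from math import factorial, log2

def multinomial(ks):
    """The multinomial coefficient (sum ks)! / (k_1! k_2! ... k_m!)."""
    out = factorial(sum(ks))
    for k in ks:
        out //= factorial(k)   # each intermediate quotient is exact
    return out

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

ks = [5, 7, 11, 17]            # an arbitrary histogram with n = 40
n = sum(ks)
lhs = log2(multinomial(ks))
rhs = n * entropy([k / n for k in ks])
assert lhs <= rhs
print(f"log2 multinomial = {lhs:.2f} <= n H(K) = {rhs:.2f}")
```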

Moreover, by Stirling again, the approximation is tight to first order in the exponent when each k_i = \Theta(n).

Calculus-Free Asymptotic for Harmonic Numbers

Here we prove that the harmonic number H_n = \sum_{j=1}^{n} \frac{1}{j} (not to be confused with the entropy function above) satisfies H_n = \Theta(\log_2 n) without touching calculus (the most common proof interprets H_n as a Riemann sum of \int \frac{1}{x}\,dx). Seeing as this is probably the most fundamental asymptotic for computer scientists, I was surprised to see this Stack Exchange question didn’t mention the calculus-free approach. The proof here pretty much matches my own answer to the Stack Exchange post.

We first consider powers of 2, n = 2^k. For these n, we can break up \sum_{j=1}^{n} \frac{1}{j} into the “chunks”

1

\frac{1}{2} + \frac{1}{3}

\frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7}

\dots

\frac{1}{2^{k-1}} + \cdots + \frac{1}{2^k - 1}

and we have one extra term 1/2^k. There are k = \log_2 n chunks, and of course H_n is the sum of these chunks (plus the extra term), hence to show H_n = \Theta(\log_2 n) we show each chunk is \Theta(1). Inside each chunk, we bound each term above by the reciprocal of the chunk’s leading power of 2:

1 \leq 1

\frac{1}{2} + \frac{1}{3} \leq \frac{1}{2} + \frac{1}{2} = 1

\frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7} \leq \frac{1}{4} + \frac{1}{4} + \frac{1}{4} + \frac{1}{4}  =1

\dots

\frac{1}{2^{k-1}} + \cdots + \frac{1}{2^k - 1} \leq \frac{1}{2^{k-1}} + \cdots + \frac{1}{2^{k - 1}} = 1

Thus each chunk is at most 1, and taken together with the extra term \frac{1}{2^k}, we have H_n \leq \log_2 n + \frac{1}{2^k} \leq \log_2 n + 1.

On the other hand, if we lower bound each term by the reciprocal of the next power of 2,

1 \geq 1

\frac{1}{2} + \frac{1}{3} \geq \frac{1}{4} + \frac{1}{4} = \frac{1}{2}

\frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7} \geq \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8}  =\frac{1}{2}

\dots

\frac{1}{2^{k-1}} + \cdots + \frac{1}{2^k - 1} \geq \frac{1}{2^k} + \cdots + \frac{1}{2^k} = \frac{1}{2}

Every chunk is at least \frac{1}{2}, hence H_n \geq \frac{1}{2}\log_2 n.
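
Both per-chunk bounds, and the resulting sandwich for powers of 2, are easy to confirm exactly; here is a minimal sketch using exact rational arithmetic (the choice k = 10 is arbitrary):

```python
from fractions import Fraction

k = 10
n = 2 ** k
# Chunk i collects the terms 1/2^i, ..., 1/(2^(i+1) - 1); there are k chunks.
chunks = [sum(Fraction(1, j) for j in range(2 ** i, 2 ** (i + 1)))
          for i in range(k)]
assert all(Fraction(1, 2) <= c <= 1 for c in chunks)

H_n = sum(chunks) + Fraction(1, n)          # the extra term 1/2^k
assert Fraction(k, 2) <= H_n <= k + 1       # (1/2) log2(n) <= H_n <= log2(n) + 1
print(f"H_{n} = {float(H_n):.4f} lies in [{k / 2}, {k + 1}]")
```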

This is essentially the proof. One technicality: we’ve only shown H_n = \Theta (\log_2 n) for n a power of 2. Monotonicity of H_n completes the proof for all n: for n not itself a power of 2, take the nearest powers of 2 below and above n, call them n_- and n_+, so that

n/2 < n_- < n < n_+ < 2n

Applying our bounds on H_{n_-} and H_{n_+},

H_n \leq H_{n_+} \leq \log_2 n_+ + 1 < \log_2 (2n) +1= \log_2 n + 2

H_n \geq H_{n_-} \geq \frac{1}{2}\log_2 n_- > \frac{1}{2}\log_2 (n/2) = \frac{1}{2}\log_2 n - \frac{1}{2}

Thus H_n = \Theta(\log_2 n).
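
A final sketch checking the resulting bounds for every n up to an arbitrary cutoff, not just powers of 2:

```python
from math import log2

H = 0.0
for n in range(1, 10_001):
    H += 1 / n
    # (1/2) log2(n) - 1/2 <= H_n <= log2(n) + 2
    assert 0.5 * log2(n) - 0.5 <= H <= log2(n) + 2
```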

(In general, the same trick works for any monotone quantity whose target growth rate g satisfies g(2n) = \Theta(g(n)), e.g. g(n) = \log n or a polynomial: it suffices to establish the \Theta(g) bound on powers of 2 only.)