The Ackermann Function Defined By Hyperoperation

One of my associates recently remarked, “I can write down the Ackermann recursion, but I have zero intuition where it comes from”. This expository post explains the answer I gave him: the Ackermann function arises via hyperoperation.

The Ackermann function A(m, n) is variously defined, but the most popular these days is the Ackermann-Péter function:

A(0, n) = n+1

A(m, 0) = A(m-1, 1)

A(m, n) = A(m - 1, A(m, n-1))

One way to build up to the Ackermann function is through hyperoperation. We’ll show that A(m, n) is pretty much 2 \wedge_m n for the m-th hyperoperator \wedge_m: in fact A(m, n) = 2 \wedge_m (n+3) -3.

Fixing the first argument of A(m, n) yields a function based on the m-th hyperoperator, which is why we often think of the Ackermann function as a family of functions parameterized by the first argument. In retrospect, maybe the better definition for the Ackermann function is the simpler A(m,n) = 2 \wedge_m n, so that the ubiquitous exercise to compute the first few functions x \mapsto A(m, x) becomes just a little bit less of an exercise.

Hyperoperators

Interpret multiplication x * y as repeated addition: add x to itself, y times. Interpret exponentiation x^y as repeated multiplication: multiply x by itself, y times. Going one step further, interpret tetration x \uparrow y as repeated exponentiation: exponentiate x with itself, y times. For example,

x \uparrow 6 = x^{x^{x^{x^{x^{x}}}}}

Going infinitely many steps further, we can define the k-th hyperoperation x \wedge_k y to “recursively apply the previous hyperoperation on x with itself, y times” (this of course stops being commutative at and beyond exponentiation, though the first hyperoperators constructed were actually commutative). To actually implement this we should use the recursion “compute x hyperoperated with itself y-1 times, then apply x one more time”:

x \wedge_k y \overset{\text{def}}{=} x \wedge_{k-1}(x \wedge_k (y-1))

Notice we fold right here i.e. compute x \wedge_{k-1} (x \wedge_{k-1} (x \wedge_{k-1}\cdots (x \wedge_{k-1} x)\cdots)) rather than ((\cdots(x \wedge_{k-1} x) \cdots\wedge_{k-1} x) \wedge_{k-1} x) \wedge_{k-1} x.

We also need a trio of base cases for what happens when you “apply x to itself 0 times” in order to align with the normal arithmetic operations:

x \wedge_1 0 \overset{\text{def}}{=} x

x \wedge_2 0 \overset{\text{def}}{=} 0

x \wedge_k 0 \overset{\text{def}}{=} 1 for k \geq 3

Finally, we set x \wedge_0 y\overset{\text{def}}{=} y+1, to capture that addition, \wedge_1, arises as the iterated successor function, our \wedge_0. After making all these definitions, we have our hyperoperators. These extend the normal arithmetic operators in the following way:

x \wedge_1 y = x + y

x \wedge_2 y = x * y

x \wedge_3 y = x^y

x \wedge_4 y = x \uparrow y
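For concreteness, here is a direct (and wildly inefficient) transcription of these definitions into Python; a minimal sketch, with the function name hyper being my own choice:

    def hyper(k, x, y):
        # The k-th hyperoperation applied to x and y.
        if k == 0:                         # base operation: successor
            return y + 1
        if y == 0:                         # the trio of base cases
            return {1: x, 2: 0}.get(k, 1)  # k = 1: x; k = 2: 0; k >= 3: 1
        # fold right: x wedge_{k-1} (x wedge_k (y-1))
        return hyper(k - 1, x, hyper(k, x, y - 1))

    assert hyper(1, 5, 7) == 5 + 7
    assert hyper(2, 5, 7) == 5 * 7
    assert hyper(3, 2, 10) == 2 ** 10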

Proof of Ackermann/Hyperoperator Identity

The point of defining these hyperoperations is to note that the Ackermann function is itself a specific case: A(m, n) = 2 \wedge_m (n+3) -3. The proof of this isn’t too exciting, but let’s prove it by induction. Check the base case m = 0,

A(0,n) = n+1

2 \wedge_0 (n+3) - 3 = n+1

For the base case n = 0, we need key properties of using 2 instead of a different constant (both follow by a quick induction on k):

2 \wedge_k 1 = 2 for k \geq 2

2 \wedge_k 2 = 4 for k \geq 1

Now we have

A(m, 0) = A(m-1, 1)

2 \wedge_m (0+3) - 3 = 2 \wedge_{m-1} (2 \wedge_m 2) - 3 = 2 \wedge_{m-1} 4 -3

And the primary recurrence for 2 \wedge_m n and A(m, n):

A(m, n) = A(m-1, A(m, n-1))

2 \wedge_m (n+3) - 3 =2 \wedge_{m-1}(2 \wedge_{m} (n+2)) - 3 =2 \wedge_{m-1}((2 \wedge_{m} (n+2) - 3) + 3) - 3

Substituting 2 \wedge_m (n+2) - 3 for A(m, n-1) by induction, we see they are in fact the same recursion. \square
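If you’d rather test the identity than re-read the induction, here is a brute-force check on small inputs (a sketch reusing the hyper function from the earlier snippet; m is kept small because the values explode):

    import sys
    sys.setrecursionlimit(100_000)

    def ackermann(m, n):
        if m == 0:
            return n + 1
        if n == 0:
            return ackermann(m - 1, 1)
        return ackermann(m - 1, ackermann(m, n - 1))

    # A(m, n) = 2 wedge_m (n+3) - 3, checked on small inputs
    for m in range(4):
        for n in range(5):
            assert ackermann(m, n) == hyper(m, 2, n + 3) - 3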

 

Approximating Binomial Coefficients with Entropy

This post recounts the story of one of my favorite “probabilistic method” proofs,

\log \binom{n}{k} \leq n H(k/n)

where H(p) is the binary entropy function, i.e. the entropy of a coin flipped with bias p. Perhaps more properly, the proof is just counting in two ways?

By a tiny tweak to our entropy argument as done here, one can get a better bound (for k \leq n/2)

\log\displaystyle\sum_{i = 0}^{k} \binom{n}{i} \leq nH(k / n)

This is what’s typically used in practice to bound e.g. the sum of the first 1/3 of the binomial coefficients by 2^{n H(1/3)}. The argument here gives all the essential ideas for that proof. With Stirling’s approximation used to give error bounds, both approximations are tight to first order in the exponent when k = \Theta(n),

\binom{n}{k} = 2^{(1 + o(1))\, n H(k / n)}

The experiment: Let the random variable X denote the outcome sequence of flipping n fair coins (so the number of Heads in X is binomially distributed). Now suppose we know X has exactly k Heads.

The question: How much entropy is in the conditional distribution of X given k Heads? I.e. what is H(X \mid k\text{ Heads})?

The left side: The conditional distribution of X is uniform over \binom{n}{k} possible results. The entropy of the uniform distribution on N objects is \log N, therefore

H(X \mid k\text{ Heads}) = \log \binom{n}{k}.

The right side: On the other hand, X is specified by its bits X_i, so

H(X \mid k \text{ Heads}) = H(X_1, X_2, \dots, X_n \mid k \text{ Heads})

By subadditivity of entropy,

H(X_1, X_2, \dots, X_n \mid k \text{ Heads}) \leq H(X_1 \mid k \text{ Heads}) + H(X_2 \mid k \text{ Heads}) + \cdots + H(X_n \mid k \text{ Heads})

This expression is easy to compute: conditioned on k Heads among n flips, each X_i looks like a coin with bias k/n, therefore its entropy is H(k/n). So

H(X_1 \mid k \text{ Heads}) + H(X_2 \mid k \text{ Heads}) + \cdots + H(X_n \mid k \text{ Heads})= n H(k/n)

giving the right-hand side.
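As a quick numeric sanity check (a minimal sketch using only the standard library; entropy is measured in bits, so the inequality reads \log_2 \binom{n}{k} \leq n H(k/n)):

    from math import comb, log2

    def H(p):  # binary entropy, in bits
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    n, k = 100, 30
    lhs = log2(comb(n, k))   # log binom(n, k)
    rhs = n * H(k / n)       # n H(k/n)
    print(lhs, rhs)          # lhs sits below rhs, as the bound promises
    assert lhs <= rhs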

Further Generalization: Multinomial Coefficients

We can naturally generalize the above argument to provide an approximation for multinomial coefficients. Fix natural numbers k_1, k_2, \dots, k_m with \sum_i k_i = n. Let K be a random variable over [m] taking value i with probability k_i / n. Then

\log \binom{n}{k_1, k_2, \dots, k_m} \leq nH(K)

The proof generalizes perfectly: the underlying set is [m]^{n} (place a letter from \{1, 2, \dots, m\} at each of n indices), and we consider the entropy of a uniform choice of a string constrained to have the fixed letter histogram k_1, k_2, \dots, k_m. This is exactly the left-hand side, and subadditivity yields the right-hand side.

Moreover, by Stirling again the approximation is tight to first order in the exponent when each k_i = \Theta(n).
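And the corresponding check for the multinomial bound (again a sketch; the multinomial coefficient is assembled as a product of binomial coefficients):

    from math import comb, log2

    def multinomial(ks):  # n! / (k_1! k_2! ... k_m!), built from binomials
        total, result = 0, 1
        for k in ks:
            total += k
            result *= comb(total, k)
        return result

    ks = [20, 30, 50]
    n = sum(ks)
    lhs = log2(multinomial(ks))
    rhs = n * sum(-(k / n) * log2(k / n) for k in ks)  # n H(K)
    assert lhs <= rhs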

A Heuristic Argument for the Effectiveness of the Union Bound

The union bound is ubiquitous in theoretical computer science; perhaps in justification of this, Andy Drucker once mentioned in a lecture a heuristic argument for its “tightness”. Just to write it out, the union bound takes

P(A \cup B) = P(A) + P(B) - P(A \cap B) \qquad\qquad P\left(\bigcup_i A_i\right)

and approximates it as

P(A \cup B) \leq P(A) +P(B)\qquad \qquad P\left(\bigcup_i A_i\right) \leq \displaystyle\sum_i P(A_i)

Imagine that \{A_i \mid 1 \leq i \leq n\} are “bad events”, each occurring with some tiny probability \delta. The union bound says

P(\text{a bad event happens}) \leq n \delta

If the events were fully independent, we would have the equality

P(\text{a bad event happens}) = 1 - (1 - \delta)^n

Now (1 - \delta)^n \approx 1 - n\delta when n\delta = o(1). So we have

P(\text{a bad event happens}) \approx 1 - (1 - n\delta) = n\delta

This is equal to the bound spit out by the application of the union bound! In the case of fully independent events with small probabilities, we see the union bound is essentially tight. Full independence is rare, but in applications where low-probability events are only slightly correlated, the union bound often remains a good approximation.
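The heuristic is easy to watch numerically (a minimal sketch; the numbers are arbitrary):

    n, delta = 50, 1e-4

    union_bound = n * delta              # what the union bound reports
    exact_indep = 1 - (1 - delta) ** n   # exact answer under full independence

    print(union_bound, exact_indep)      # 0.005 vs 0.004987..., within ~0.25%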

Wreath Products: Sum and Product Actions

During Laci Babai’s course on graph isomorphism, one of the tools we encountered is the wreath product. Here I’ll give some intuitive descriptions of the wreath product L \wr M.

To make sure we’re all on the same page: the wreath product L \wr M, for an arbitrary group L and a permutation group M \leq S_k, is the semidirect product (L^k) \rtimes M, where each element of M acts as an automorphism of L^k by permuting the coordinates.

That is, we fix some number k and some group of permutations on [k], M \leq S_k. The group we construct takes k disjoint and independent versions of L, namely L^k, but also allows you to make “limited rearrangements according to M”. There are two distinct ideas going on here; it’s the semidirect product that, if you tell it how the two groups should interact, joins those two ideas into a single group.

L \wr S_k is a Graph Automorphism Group

There are two “trivial” wreath products, L \wr 1 (where 1 denotes the trivial group) and L \wr S_k. L \wr 1 is k disjoint and independent versions of L, with no interdependencies introduced by M; this is just L^k.

Let G be a connected graph with automorphism group L. Claim: if you stick k disjoint copies of G next to each other, this new graph has automorphism group L \wr S_k. To see this: some automorphisms apply elements of L “in parallel” to the different copies of G. These automorphisms correspond to L^k. But any rearrangement of the copies of G is also an automorphism; all copies are isomorphic, so rearranging can be done without restriction by S_k. (Connectedness guarantees every automorphism sends whole copies to whole copies, so these two kinds of automorphism generate everything.)

Product Actions of Wreath Products

The previous paragraph shows that, if L acts on a set \Omega, the wreath product L \wr M acts naturally on k \cdot \Omega, the disjoint union of k copies of \Omega. This is called the sum action (or “imprimitive action”, because it’s nearly always an imprimitive group action) of the wreath product. There is another natural action, called the product action (or “primitive action”). Here the domain is the set \Omega^k, and the action is defined by (1) applying elements of L^k coordinate-wise and (2) permuting the coordinates with M.

I think of this like a “simultaneous tracking” version of the sum action. In the sum action, we kept track of one point/vertex across the k copies of our graph, and looked at where that point went when hit with a group element. In the product action, we track one point from each of the k copies simultaneously, and look at how those points move as a whole.
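Here is a minimal sketch of both actions in Python, under my own conventions (elements of L are functions on \Omega, and sigma is a permutation of range(k) sending i to sigma[i]); conventions for wreath product actions vary, so treat this as an illustration rather than a canonical definition:

    def sum_action(fs, sigma, copy, point):
        # A point of the disjoint union k * Omega is a pair (copy index, point).
        # Apply the local L-element, then move the whole copy with sigma.
        return (sigma[copy], fs[copy](point))

    def product_action(fs, sigma, points):
        # A point of Omega^k is a k-tuple: apply the f_i coordinate-wise,
        # then permute the coordinates with sigma.
        moved = [f(x) for f, x in zip(fs, points)]
        result = [None] * len(points)
        for i in range(len(points)):
            result[sigma[i]] = moved[i]   # coordinate i is carried to sigma[i]
        return tuple(result)

    # Example: Omega = {0, 1, 2}, L = cyclic shifts, k = 2, sigma swaps the copies.
    shift = lambda x: (x + 1) % 3
    print(product_action([shift, lambda x: x], [1, 0], (0, 2)))   # -> (2, 1)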

H \wr K contains all extensions of H by K

It would be remiss to discuss wreath products without mentioning the Kaluzhnin-Krasner theorem (English version). We say that the group G is an extension of H by K if there is an exact sequence

1 \rightarrow H \rightarrow G \rightarrow K \rightarrow 1

Theorem (Kaluzhnin-Krasner, 1951) If G is an extension of H by K, then G is isomorphic to a subgroup of H \wr K.

That is, H \wr K is a group that’s big enough to contain all extensions of the two groups. In a sense that can probably be made more formal through order theory, H \wr K is an “upper bound” on joining H and K together. Not bad!

How I Remember the Chernoff Bound

The Chernoff bound is of critical importance in computer science and probability. When I need it, I almost always have to look it up to make sure I get all the parameters correct. Not anymore, though. I’ll explain how the Chernoff bound is essentially the PDF of the “Gaussian you would expect” from the Central Limit Theorem.

The most common form of the Chernoff bound is the following: suppose you have n independent and identically distributed coin flips X_1, X_2, \dots, X_n, say the result of repeatedly flipping a coin that comes up Heads with probability p. The number of Heads is by definition a Binomial distribution; let S denote this random variable. Then with all but exponentially small probability, S is within O(\sqrt{n}) of its mean:

\text{Pr}[\left|S - np\right| > \delta \sqrt{n}] \leq 2\exp(-\delta^2 / 2p(1-p))

The Central Limit Theorem (CLT) tells us that as n \to \infty, the distribution (S - \mathbb{E}(S)) / \sqrt{\text{Var}(S)} approaches a standard normal distribution. Philosophically, S approaches a normal distribution with the parameters you would expect: mean \mathbb{E}(S) and variance \text{Var}(S) (though the “scaled up” distributions may not necessarily converge from the CLT alone).

In our case, \mathbb{E}(S) = np and \text{Var}(S) = np(1-p). The PDF of the “Gaussian you would expect” is

p(S = x) \approx C \cdot \exp(-(x - np)^2 / 2np(1-p))

for an appropriate normalizing constant C = 1/\sqrt{2\pi n p(1-p)}. Let’s rewrite that with x = np + \delta\sqrt{n}:

p(S = np + \delta\sqrt{n}) \approx C \cdot \exp(-(\delta \sqrt{n})^2 / 2np(1-p)) =C \cdot \exp(-\delta^2/2p(1-p))

This is (up to the factor of C) the expression appearing in the Chernoff bound! Thus we can think of the Chernoff bound as expressing an “even when n is small” version of the CLT, with a little bit of loss from 1/\sqrt{2\pi} to 2 in the multiplicative factor.
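To see how faithful the memory aid is, compare the exact binomial tail to the remembered expression (a minimal sketch using only the standard library):

    from math import comb, exp, sqrt

    n, p, delta = 1000, 0.5, 2.0
    lo, hi = n * p - delta * sqrt(n), n * p + delta * sqrt(n)

    # exact Pr[|S - np| > delta sqrt(n)] for S ~ Binomial(n, p)
    tail = sum(comb(n, s) * p**s * (1 - p)**(n - s)
               for s in range(n + 1) if s < lo or s > hi)
    bound = 2 * exp(-delta**2 / (2 * p * (1 - p)))

    print(tail, bound)   # the exact tail sits below the remembered bound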

Using the CDF instead of the PDF

We lost some credibility at one point in the technique above: we actually should have been looking at the cumulative probability

\text{Pr}[\left|S - np \right| \geq \delta \sqrt{n}]

Instead we noticed that the Chernoff bound can be remembered by looking at the PDF (and ignoring a nonconstant factor)

p(S = np+\delta\sqrt{n})

But using the CDF instead of the PDF actually gives the same expression, and with a truly constant factor. Let’s compute:

\text{Pr}[\left|S - np \right| \geq \delta \sqrt{n}] = 2\int_{np+\delta\sqrt{n}}^\infty p(S = x) d x

= 2\int_{\delta\sqrt{n}}^\infty 1/ \sqrt{2\pi np(1-p)} \cdot \exp(-x^2 / 2np(1-p)) d x

If we break up the integral into little \sqrt{n}-sized pieces from \delta \sqrt{n} to \infty, the integral on a piece [k\sqrt{n}, (k+1)\sqrt{n}] looks like

\int_{k\sqrt{n}}^{(k+1)\sqrt{n}} 1/ \sqrt{2\pi np(1-p)} \cdot \exp(-x^2 / 2np(1-p)) d x

\leq1/ \sqrt{2\pi np(1-p)} \int_{k\sqrt{n}}^{(k+1)\sqrt{n}} \exp(-k^2 / 2p(1-p)) d x

= 1/ \sqrt{2\pi p(1-p)} \exp(-k^2 / 2p(1-p))

The exponent increases faster than linearly, and it is the only thing that changes with k, so the contribution of the k-th piece goes to 0 faster than geometrically. Therefore the integral is bounded by an infinite geometric series with initial term proportional to \exp(-\delta^2/2p(1-p)), and the cumulative probability is (up to a constant) the expression in the Chernoff bound

\exp(-\delta^2/2p(1-p))

and this time the constant is actually constant!

 

How to Easily Compute Matrix Derivatives, Fréchet Style

The routine, traditional interpretation of a derivative is as a slope of the tangent line. This is great for a function of a single variable, but if you have a multivariate function f: \mathbb{R}^m \to \mathbb{R}^n, how can you talk about slopes? One approach is to make a conceptual leap: a derivative is a linear approximation to your function; the tangent line itself is the derivative, rather than the slope of the tangent line. This is called the Fréchet derivative.

Suppose you have a multivariate function f: \mathbb{R}^m \to \mathbb{R}^n. At some point in the domain x_0 \in \mathbb{R}^m, we want to say that if you make a small change \epsilon \in \mathbb{R}^m around x_0, the change in f is approximately a linear function in \epsilon:

f(x_0 + \epsilon) \approx f(x_0) + Df_{x_0}(\epsilon)

Here Df_{x_0} : \mathbb{R}^m \to \mathbb{R}^n is our notation for the derivative function (which is linear, and can be represented as a matrix applied to \epsilon). Df_{x_0} could depend highly on x_0; after all, f will probably have different slopes at different points. More formally, we want a function Df_{x_0} such that

f(x_0 + \epsilon) = f(x_0) + Df_{x_0}(\epsilon) + o(\epsilon)

where o(\epsilon) is a function that “is tiny compared to \epsilon”, i.e. \|o(\epsilon)\|/\|\epsilon\| \to 0 as \epsilon \to 0 (for example, a vector whose every entry is the square of the largest entry of \epsilon).

Computing Fréchet Derivatives is Incredibly Elegant

To compute Df_{x_0}(\epsilon), just plug x_0 + \epsilon into f, collect all the terms that are linear in \epsilon, and drop all the terms that “get small as \epsilon \to 0”. In practice, this means we drop all “higher-order terms” in which two or more coordinates of \epsilon are multiplied together. The best way to illustrate is with examples.

Example 1: f(x) = x^Tx

If we change x a little, how does x^\top x change? Here f: \mathbb{R}^n \to \mathbb{R} is f(x) = x^\top x.

f(x + \epsilon) = (x + \epsilon)^\top (x + \epsilon) = x^\top x + x^\top \epsilon + \epsilon^\top x + \epsilon^\top \epsilon

= f(x) +x^\top \epsilon + \epsilon^\top x + \epsilon^\top \epsilon

The part linear in \epsilon here is x^\top \epsilon + \epsilon^\top x, and \epsilon^\top \epsilon is tiny compared to \epsilon. So

Df_x(\epsilon) = x^\top \epsilon + \epsilon^\top x = 2x^\top \epsilon

f(x+\epsilon) \approx f(x) + 2x^\top \epsilon

If we want to see where x^\top x is minimized, the critical points are where the function Df_x is 0; that is, where we can’t move a little bit to go up or down. In this case,

\forall \epsilon.\; 2x^\top \epsilon = 0 \iff x = 0
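A finite-difference sanity check of this example with numpy (a minimal sketch; any small random \epsilon works):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(5)
    eps = 1e-6 * rng.standard_normal(5)

    f = lambda v: v @ v
    exact_change = f(x + eps) - f(x)
    predicted = 2 * x @ eps          # Df_x(eps) = 2 x^T eps
    print(exact_change, predicted)   # agree up to O(eps^2)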

Example 2: Derivative of the Matrix Inverse

Let’s look at a function of a matrix,

f: \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}

f(M) = M^{-1}

As we change M by a matrix X with small entries (in particular, small enough to ensure M+X is still invertible), how does (M+X)^{-1} change from M^{-1}? Just thinking about dimensions, the answer should be: by adding a small matrix to M^{-1}.

f(M+X) = (M+X)^{-1} \approx M^{-1} + Df_M(X)

The hard part is massaging (M+X)^{-1} to look like M^{-1} plus a linear function of X. To start,

(M+X)^{-1} = M^{-1}(I + XM^{-1})^{-1}

The matrix I + XM^{-1} can be inverted using the Neumann series, which is just a fancy name for the geometric series summation formula

(I + XM^{-1})^{-1} = I - XM^{-1} + (XM^{-1})^2 - (XM^{-1})^3 + \cdots

Note that the series converges for small enough X, because then every eigenvalue of XM^{-1} is less than 1 in absolute value.

Because X is small, X^2 is negligibly tiny. For our linear approximation, we can drop the terms that are too tiny to be noticed by the linear approximation, which is all but the first two terms of this series.

(I + XM^{-1})^{-1} \approx I - XM^{-1}

M^{-1}(I + XM^{-1})^{-1} \approx M^{-1}(I - XM^{-1}) = M^{-1} - M^{-1}XM^{-1}

Remembering f(M) = M^{-1}, altogether we just computed

f(M+X) \approx f(M) -  M^{-1}XM^{-1}

Noting that this second term is linear in X, we have the derivative

Df_{M}(X) = -M^{-1}XM^{-1}

As a remark: whenever we wrote \approx, we could have written + O(X^2) to be a little more formal about carrying the higher-order terms through the computation.
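Numerically, the formula checks out (a minimal sketch with numpy; M is shifted by a multiple of the identity so it is comfortably invertible, and X is scaled small):

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.standard_normal((4, 4)) + 4 * np.eye(4)
    X = 1e-6 * rng.standard_normal((4, 4))

    Minv = np.linalg.inv(M)
    exact_change = np.linalg.inv(M + X) - Minv
    predicted = -Minv @ X @ Minv                      # Df_M(X)
    print(np.max(np.abs(exact_change - predicted)))   # O(||X||^2), tiny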

Example 3: Derivative of the Log of the Determinant

Now let’s do another example of a matrix function, the log of the determinant

f: \mathbb{R}^{n \times n} \to \mathbb{R}

f(M) = \log |M|

This function arises frequently in machine learning and statistics when computing log likelihoods.

Change M by a matrix X with small entries:

f(M + X) = \log |M + X| = \log |M| |I + M^{-1}X| = \log |M| + \log |I + M^{-1}X|

Now let’s look at |I + M^{-1}X|. Each of the entries of M^{-1}X is “small”, though they’re essentially the same size as elements of X (just differing by a few constants depending on M^{-1}). The determinant of a matrix A is a sum over all permutations

|A| = \displaystyle\sum_{\sigma \in S_n} \prod_{i = 1}^n A_{i, \sigma(i)}

When performing this sum for I + M^{-1}X, if a product contains two off-diagonal terms, those will be two “small” terms from M^{-1}X multiplied together, and multiplying two small terms makes a negligible term that we drop. So we only need to sum over \sigma that pick at most one off-diagonal entry; since no permutation moves exactly one index, the only such permutation is the identity.

|I + M^{-1}X| \approx \prod_{i = 1}^n (I + M^{-1}X)_{i,i} = \prod_{i = 1}^n (1 + (M^{-1}X)_{i,i})

Same trick: this product expands into 2^n terms, but any term that multiplies an (M^{-1}X)_{i,i} with an (M^{-1}X)_{j,j} is negligibly small. To get a term with at most one of the (M^{-1}X)_{j,j}, either we (1) take 1 from every factor, or (2) take 1 from all but one factor

\prod_{i = 1}^n (1 + (M^{-1}X)_{i,i}) \approx 1^n + 1^{n-1}(M^{-1}X)_{1,1} + 1^{n-1}(M^{-1}X)_{2,2} + \cdots + 1^{n-1}(M^{-1}X)_{n,n}

= 1 + \text{tr}(M^{-1}X)

This is our linear approximation |I + M^{-1}X| \approx 1 +\text{tr}(M^{-1}X). To incorporate the log, note that we’re taking the log of something that is pretty much 1. The normal derivative of log at 1 is 1, and in our Fréchet derivative formulation

\log (1 + \epsilon) = \epsilon +  O(\epsilon^2)

\log |I + M^{-1}X| \approx  \log(1 + \text{tr}(M^{-1}X)) \approx\text{tr}(M^{-1}X)

And so the derivative is Df_{M}(X) = \text{tr}(M^{-1}X).

As a remark: incorporating the log illustrates the composability of Fréchet derivatives. This makes sense when spoken aloud: if we have a linear approximation Df to f and a linear approximation Dg to g, then a linear approximation of f \circ g at x is Df_{g(x)} \circ Dg_x.
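The log-determinant derivative admits the same style of numeric check (a minimal sketch with numpy; slogdet computes log|M| stably, and M is again shifted to keep its determinant safely positive):

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.standard_normal((4, 4)) + 4 * np.eye(4)
    X = 1e-6 * rng.standard_normal((4, 4))

    logdet = lambda A: np.linalg.slogdet(A)[1]
    exact_change = logdet(M + X) - logdet(M)
    predicted = np.trace(np.linalg.inv(M) @ X)   # Df_M(X) = tr(M^{-1} X)
    print(exact_change, predicted)               # agree up to second order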

Other Viewpoints

Actually, Df_{x_0}(\epsilon) = J\epsilon where J is the Jacobian matrix of partial derivatives {\partial f_i \over \partial x_j}. But it’s often easier not to compute the Jacobian entirely, and instead keep Df_{x_0} in an “implicit” form as in Example 2.

See another description of Fréchet derivatives in statistics.

Calculus-Free Asymptotic for Harmonic Numbers

Here we prove that H_n = \Theta(\log_2 n) without touching calculus (the most common proof interprets H_n as a Riemann sum of \int 1/x). Seeing as this is probably the most fundamental asymptotic for computer scientists, I was surprised to see this Stack Exchange question didn’t mention the calculus-free approach. The proof here pretty much matches my own answer to the Stack Exchange post.

We first consider powers of 2, n = 2^k. For these n, we can break up \sum_{j=1}^{n} \frac{1}{j} into the “chunks”

1

\frac{1}{2} + \frac{1}{3}

\frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7}

\dots

\frac{1}{2^{k-1}} + \cdots + \frac{1}{2^k - 1}

and we have one extra term 1/2^k. There are k = \log_2 n chunks, and of course H_n is the sum of these chunks (plus the extra term), hence to show H_n = \Theta(\log_2 n) we show each chunk is \Theta(1). Inside each chunk, we bound each term above by the chunk’s first term, a reciprocal power of 2:

1 \leq 1

\frac{1}{2} + \frac{1}{3} \leq \frac{1}{2} + \frac{1}{2} = 1

\frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7} \leq \frac{1}{4} + \frac{1}{4} + \frac{1}{4} + \frac{1}{4}  =1

\dots

\frac{1}{2^{k-1}} + \cdots + \frac{1}{2^k - 1} \leq \frac{1}{2^{k-1}} + \cdots + \frac{1}{2^{k - 1}} = 1

Thus each chunk is at most 1, and taken together with the extra term \frac{1}{2^k}, we have H_n \leq \log_2 n + \frac{1}{2^k} \leq \log_2 n + 1.

On the other hand, if we lower bound each term by the reciprocal of the next power of 2,

1 \geq 1

\frac{1}{2} + \frac{1}{3} \geq \frac{1}{4} + \frac{1}{4} = \frac{1}{2}

\frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7} \geq \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8}  =\frac{1}{2}

\dots

\frac{1}{2^{k-1}} + \cdots + \frac{1}{2^k - 1} \geq \frac{1}{2^k} + \cdots + \frac{1}{2^k} = \frac{1}{2}

Every chunk is at least \frac{1}{2}, hence H_n \geq \frac{1}{2}\log_2 n.

This essentially is the proof. One technicality: we’ve only shown H_n = \Theta (\log_2 n) for n a power of 2. Monotonicity of H_n completes the proof for all n: taking the nearest powers of 2 above and below n, call them n_- and n_+,

n/2 < n_- \leq n \leq n_+ < 2n

Applying our bounds on H_{n_-} and H_{n_+},

H_n \leq H_{n_+} \leq \log_2 n_+ + 1 < \log_2 (2n) +1= \log_2 n + 2

H_n \geq H_{n_-} \geq \frac{1}{2}\log_2 n_- > \frac{1}{2}\log_2 (n/2) = \frac{1}{2}\log_2 n - \frac{1}{2}

Thus H_n = \Theta(\log_2 n).

(In general, for any monotonic linear or sublinear function, it suffices to show \Theta(\cdot) on only the powers of 2).
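The two bounds are easy to watch numerically (a minimal sketch; the constants 1/2, -1/2, and +2 are the ones derived above):

    from math import log2

    def harmonic(n):
        return sum(1 / j for j in range(1, n + 1))

    for n in (10, 100, 1000, 10**5):
        h = harmonic(n)
        assert 0.5 * log2(n) - 0.5 <= h <= log2(n) + 2
        print(n, h, log2(n))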