A coupling of two distributions $\mu$ and $\nu$ on a set $\Omega$ is a joint distribution on $\Omega \times \Omega$ such that the two marginal distributions are $\mu$ and $\nu$ respectively. We write $(X, Y)$ for the joint random variable, with the $\mu$ marginal in the first coordinate and the $\nu$ marginal in the second coordinate. We say that $X$ and $Y$ are “coupled”, with the coupling defining the interaction, or dependence, between them.

One of the basic facts about couplings compares the statistical distance between $\mu$ and $\nu$ to the probability that the coupled variables agree. For any coupling $(X, Y)$ of $\mu$ and $\nu$, we have

$$\Pr[X \neq Y] \geq d_{TV}(\mu, \nu).$$

Furthermore, there is a coupling that achieves this bound. The above inequality is typically used to upper bound the statistical distance between $\mu$ and $\nu$ by constructing a coupling that’s likely to agree, e.g. in the proof of convergence of an ergodic Markov chain to its limiting distribution.
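
As a concrete illustration (my own snippet, with made-up distributions `mu` and `nu`, not part of the original argument), the standard “maximal coupling” puts as much mass as possible on the diagonal, and its disagreement probability comes out to exactly the statistical distance:

```python
# Build the maximal coupling of two distributions on {0, 1, 2} and verify
# that P[X != Y] equals the statistical (total variation) distance.

def tv_distance(mu, nu):
    """Statistical distance: max_A |mu(A) - nu(A)| = half the L1 distance."""
    return 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))

def maximal_coupling(mu, nu):
    """Joint distribution p[i][j] with marginals mu and nu maximizing
    agreement: put min(mu_i, nu_i) on the diagonal, then spread the
    leftover mass proportionally off the diagonal."""
    k = len(mu)
    overlap = [min(m, n) for m, n in zip(mu, nu)]
    d = 1.0 - sum(overlap)  # leftover mass; this is exactly the TV distance
    p = [[0.0] * k for _ in range(k)]
    for i in range(k):
        p[i][i] = overlap[i]
        for j in range(k):
            if i != j and d > 0:
                p[i][j] = (mu[i] - overlap[i]) * (nu[j] - overlap[j]) / d
    return p

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
p = maximal_coupling(mu, nu)
disagree = sum(p[i][j] for i in range(3) for j in range(3) if i != j)
print(round(disagree, 6), tv_distance(mu, nu))  # 0.3 0.3
```

One design note: off the diagonal the leftover masses are coupled independently here, but any joint arrangement of the leftovers would do, since only the diagonal mass affects $\Pr[X = Y]$.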

In this post we’ll prove this inequality (and its tightness!) using an LP duality argument.

A coupling is specified by a set of linear constraints: for $i, j \in \Omega$, let $p_{ij}$ be the probability of seeing the pair $(i, j)$. The constraints are

$$\sum_j p_{ij} = \mu(i) \;\;\forall i, \qquad \sum_i p_{ij} = \nu(j) \;\;\forall j, \qquad p_{ij} \geq 0, \qquad \sum_{i,j} p_{ij} = 1.$$

This defines a polytope in $\mathbb{R}^{|\Omega| \times |\Omega|}$, called a transportation polytope. In fact, the last constraint is redundant, as summing the constraints $\sum_j p_{ij} = \mu(i)$ over all $i$’s will also give $\sum_{i,j} p_{ij} = 1$. Over this polytope, the linear objective to minimize is the probability that $X$ and $Y$ disagree,

$$\Pr[X \neq Y] = \sum_{i \neq j} p_{ij}.$$
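
As a sanity check on this LP (my own sketch, not from the post): for distributions on two points the transportation polytope is a one-dimensional segment, so we can sweep it exhaustively and watch the minimum of the objective hit the statistical distance.

```python
# For mu = (a, 1-a) and nu = (b, 1-b), a coupling is determined by the single
# free parameter t = p_00, which ranges over [max(0, a+b-1), min(a, b)].

def couplings_2pt(a, b, steps=1000):
    """Yield (p00, p01, p10, p11) along the segment of all couplings."""
    lo, hi = max(0.0, a + b - 1.0), min(a, b)
    for s in range(steps + 1):
        t = lo + (hi - lo) * s / steps
        yield t, a - t, b - t, 1.0 - a - b + t

a, b = 0.7, 0.4
tv = abs(a - b)  # statistical distance of two-point distributions
disagreements = [p01 + p10 for _, p01, p10, _ in couplings_2pt(a, b)]

assert all(d >= tv - 1e-12 for d in disagreements)  # every coupling obeys the bound
assert min(disagreements) <= tv + 1e-12             # and the bound is achieved
print(round(min(disagreements), 6))  # 0.3
```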

Ok, now that we have the LP, its dual (whose computation is explained e.g. here… or perhaps in a future blog post) is

$$\max \; \sum_i u_i\,\mu(i) + \sum_j v_j\,\nu(j) \quad \text{subject to} \quad u_i + v_j \leq \mathbf{1}[i \neq j].$$

By the duality theorems, the value of the primal LP is always at least as large as the value of the dual, and furthermore, at the respective optima the two values agree. To finish our argument, we’ll show that the value of the dual LP is the statistical distance between $\mu$ and $\nu$. The definition of the statistical distance we’ll use is the “optimization” version: the maximum difference in probabilities over events $A \subseteq \Omega$,

$$d_{TV}(\mu, \nu) = \max_{A \subseteq \Omega} |\mu(A) - \nu(A)|.$$

First we argue that the optimum is $\{0, \pm 1\}$-valued. The dual problem’s matrix is totally unimodular (Prop. 3.2) and the right-hand sides of the constraints are integral, hence the LP is integral. Fix an optimal solution $(u, v)$. From the constraints we see only one of the sets $\{u_i\}$ and $\{v_j\}$ can contain positive values. WLOG these are the variables $u_i$. By optimality, and looking at the largest (resp. smallest) of the $u_i$ (resp. $v_j$), the $u_i$ are within $1$ of one another, and the $v_i$ are their negations. Now we can shift the solution to be $\{0, \pm 1\}$-valued without changing the value.

Now we are in a position to compare the dual optimum with $d_{TV}(\mu, \nu)$. As noted above, the set of variables for exactly one of $\mu, \nu$ is positive, which gives the sign in the absolute value. The $1$-valued variables indicate the event $A$ to take. This shows the value of the dual is exactly $\max_A |\mu(A) - \nu(A)|$. As explained, this finishes the proof that

$$\min_{\text{couplings } (X, Y)} \Pr[X \neq Y] = d_{TV}(\mu, \nu).$$

We finish with a few extra details on transportation polytopes.

Fixing in mind marginal distributions $\mu$ and $\nu$ on $n$ points, the collection of possible couplings forms a convex polytope in $\mathbb{R}^{n \times n}$ from a very special class of polytopes, the transportation polytopes. The name comes from the following “transportation” problem in optimization: suppose you have $n$ factories producing a good, such as lawn mowers, which produce quantities $a_1, \dots, a_n$. You want to distribute the lawn mowers among $m$ stores in quantities $b_1, \dots, b_m$. The question is, how many lawn mowers should factory $i$ send to store $j$? A solution that achieves the desired stocking quantities is exactly a matrix (a.k.a. joint probability distribution, after normalizing) with the given marginals.

Vertices of a transportation polytope can be enumerated and checked algorithmically. Furthermore, when the marginals $\mu$ and $\nu$ are uniform, the corresponding transportation polytope is exactly the set of doubly stochastic matrices (scaled by $1/n$). By the classical Birkhoff-von Neumann theorem, the vertices of this polytope are exactly the permutation matrices, i.e. doubly stochastic matrices are exactly the convex combinations of permutation matrices.


How is $X_1 + X_2 + \cdots + X_n$ distributed?

Typically the answer is the central limit theorem (CLT) amplified with the Berry-Esséen theorem: “a sum of independent variables is approximately normally distributed”. This post explains two symbolic translations of that sentence which I find easier to write down at a moment’s notice. The first takes the viewpoint of an “asymptotic approximation of $\sum_i X_i$”,

$$\sum_{i=1}^n X_i \approx n\mu + \sqrt{n}\,\sigma Z, \qquad Z \sim \mathcal{N}(0, 1).$$

Here $\mu$ is the mean of $X_i$ and $\sigma$ is the standard deviation of $X_i$. The second is

$$\sum_{i=1}^n X_i \approx \sum_{i=1}^n Z_i,$$

where the $Z_i$ are i.i.d. Gaussian random variables with the same mean and variance as the $X_i$, $Z_i \sim \mathcal{N}(\mu, \sigma^2)$. This can be thought of as an “invariance principle” or “replacement lemma” (and this is the viewpoint taken by Lindeberg in his proof of the CLT). The invariance principle is now a tool used in Boolean Fourier analysis.

**Crucial remark:** Unfortunately, the equality of these “syntactic forms” is NOT any convergence in distribution. We’ve “blown up” errors by a factor of $\sqrt{n}$ from the convergence guaranteed by the CLT. Dividing both sides by $n$ normalizes both sides to have finite variance and gives convergence in probability (or the stronger convergence guaranteed by an appropriate CLT).

The law of large numbers says that, if the mean $\mu$ is finite, “the sample mean converges to the true mean”, i.e.

$$\frac{1}{n}\sum_{i=1}^n X_i \to \mu.$$

In our notation, this is a “linear approximation to $\sum_i X_i$”,

$$\sum_{i=1}^n X_i \approx n\mu.$$

The central limit theorem refines this by saying that, in fact, the sample mean is about $\sigma/\sqrt{n}$ away from the true mean. If $\mu$ and $\sigma^2$ are finite,

$$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n X_i - \mu\right) \to \mathcal{N}(0, \sigma^2) \;\text{ in distribution.}$$

In our notation, this is a “$\sqrt{n}$ approximation to $\sum_i X_i$”,

$$\sum_{i=1}^n X_i \approx n\mu + \sqrt{n}\,\sigma Z.$$

The Berry-Esséen theorem strengthens the convergence to the normal distribution. If $\rho$ denotes (an upper bound on) the third absolute moment of the distribution of $X_i$, which we assume to be finite,

$$\sup_x \left| \Pr\left[\frac{\sum_i X_i - n\mu}{\sigma\sqrt{n}} \leq x\right] - \Phi(x) \right| \leq \frac{C\rho}{\sigma^3\sqrt{n}}.$$

In our notation, this is an improvement, up to a constant, to the asymptotic distribution of $\sum_i X_i$,

$$\Pr\left[\sum_{i=1}^n X_i \leq n\mu + \sqrt{n}\,\sigma x\right] = \Phi(x) \pm O\!\left(\frac{1}{\sqrt{n}}\right).$$

It should be noted that the Berry-Esséen theorem is tight up to the big-Oh. The binomial distribution achieves this rate of convergence; see the end of these lecture notes for a proof.
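
Here is a small stdlib-only experiment (mine, not from the lecture notes) with the binomial case: the maximum CDF gap between a standardized $\mathrm{Bin}(n, 1/2)$ and a standard normal, scaled up by $\sqrt{n}$, hovers around a constant, exactly the $\Theta(1/\sqrt{n})$ rate described above.

```python
# Measure sup_x |P[(X - n/2)/sqrt(n/4) <= x] - Phi(x)| for X ~ Bin(n, 1/2).
import math

def binom_cdf_gap(n):
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF
    mean, sd = n / 2.0, math.sqrt(n / 4.0)
    cdf, gap = 0.0, 0.0
    for k in range(n + 1):
        cdf += math.comb(n, k) * 0.5 ** n
        gap = max(gap, abs(cdf - phi((k - mean) / sd)))
    return gap

for n in [16, 64, 256]:
    # sqrt(n) * gap stays roughly constant (about 0.4), as Berry-Esseen predicts
    print(n, round(math.sqrt(n) * binom_cdf_gap(n), 3))
```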

The previous analysis ends with the distribution $n\mu + \sqrt{n}\,\sigma Z$. We can incorporate everything into one Gaussian to produce the equivalent distribution $\mathcal{N}(n\mu, n\sigma^2)$. Interestingly, because the sum of independent Gaussians is again Gaussian, this distribution has the same PDF as

$$Z_1 + Z_2 + \cdots + Z_n$$

for independent Gaussian random variables $Z_i \sim \mathcal{N}(\mu, \sigma^2)$. This leads us to the equality

$$\sum_{i=1}^n X_i \approx \sum_{i=1}^n Z_i,$$

in which we’ve simply replaced each $X_i$ by a Gaussian random variable with the same mean and variance.

As noted in Remark 29 here, one can improve the error term by changing the replacement variables. In particular, it can be improved further if the first, second, **and third** moments of the $Z_i$ agree with those of the $X_i$.

The idea of “replacement invariance” surfaces in theoretical CS in the context of Boolean Fourier analysis. Here we generalize the summation of Boolean ($\pm 1$-valued) variables to an arbitrary Boolean function

$$f : \{-1, 1\}^n \to \mathbb{R}.$$

The invariance principle states that the random variable $f(x)$ for $x$ uniformly drawn from $\{-1, 1\}^n$ is close in probability to the random variable

$$f(Z_1, \dots, Z_n)$$

for i.i.d. $Z_i \sim \mathcal{N}(0, 1)$ (assuming we normalize both sides to have variance $1$). In this case the “closeness” is determined by the maximum influence of a variable on the value of $f$, as well as the complexity of $f$ (its degree as a multilinear polynomial); see the previously linked lecture notes for an exact quantitative statement.


The Ackermann function is variously defined, but the most popular these days is the Ackermann-Péter function:

$$A(0, n) = n + 1, \qquad A(m+1, 0) = A(m, 1), \qquad A(m+1, n+1) = A(m, A(m+1, n)).$$

One way to build up to the Ackermann function is through hyperoperation. We’ll show that $A(m, n)$ is pretty much $2$ under the $m$-th hyperoperator $H_m$: in fact $A(m, n) = H_m(2, n + 3) - 3$.

Fixing the first argument of $A$ yields a function based on the $m$-th hyperoperator, which is why we often think of the Ackermann function as a family of functions parameterized by the first argument. In retrospect, maybe the better definition for the Ackermann function is the simpler $A(m, n) = H_m(2, n)$, so that the ubiquitous exercise to compute the first few functions becomes just a little bit less of an exercise.

Interpret multiplication $a \cdot b$ as repeated addition: add $a$ to itself, $b$ times. Interpret exponentiation $a^b$ as repeated multiplication: multiply $a$ with itself, $b$ times. Going one step further, interpret tetration ${}^{b}a$ as repeated exponentiation: exponentiate $a$ with itself, $b$ times. For example,

$${}^{3}2 = 2^{2^2} = 16.$$

Going infinitely many steps further, we can define the $m$-th hyperoperation $H_m$ to “recursively apply the previous hyperoperation on $a$ with itself, $b$ times” (this of course stops being commutative at and beyond exponentiation, though the first hyperoperators constructed were actually commutative). To actually implement this we should use the recursion “compute $a$ hyperoperated with itself $b - 1$ times, then apply one more time”:

$$H_m(a, b) = H_{m-1}(a, H_m(a, b - 1)).$$

Notice we fold right here, i.e. compute $a^{\left(a^a\right)}$ rather than $\left(a^a\right)^a$.

We also need a trio of base cases for what happens when you “apply $a$ to itself $0$ times” in order to align with the normal arithmetic operations:

$$H_1(a, 0) = a, \qquad H_2(a, 0) = 0, \qquad H_m(a, 0) = 1 \;\text{ for } m \geq 3.$$

Finally, we set $H_0(a, b) = b + 1$, to capture that addition, $H_1$, arises as the iterated successor function, our $H_0$. After making all these definitions, we have our hyperoperators. These extend the normal arithmetic operators in the following way:

$$H_1(a, b) = a + b, \qquad H_2(a, b) = a \cdot b, \qquad H_3(a, b) = a^b, \qquad H_4(a, b) = {}^{b}a, \qquad \dots$$
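
These definitions transcribe directly into code. Here is a sketch of mine, using the conventions above ($H_0(a, b) = b + 1$, the trio of base cases, and the fold-right recursion):

```python
def H(m, a, b):
    """The m-th hyperoperation H_m(a, b)."""
    if m == 0:
        return b + 1                     # successor; ignores a
    if b == 0:
        return [a, 0, 1][min(m - 1, 2)]  # H_1(a,0)=a, H_2(a,0)=0, H_m(a,0)=1 for m>=3
    return H(m - 1, a, H(m, a, b - 1))   # fold-right recursion

assert H(1, 3, 4) == 7    # addition
assert H(2, 3, 4) == 12   # multiplication
assert H(3, 3, 4) == 81   # exponentiation
print(H(4, 2, 3))         # tetration: 2^(2^2) = 16
```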

The point of defining these hyperoperations is to note that the Ackermann function is itself a specific case: $A(m, n) = H_m(2, n + 3) - 3$. The proof of this isn’t too exciting, but let’s prove it by induction. Check the base case $m = 0$,

$$A(0, n) = n + 1 = (n + 3 + 1) - 3 = H_0(2, n + 3) - 3.$$

For the base case $n = 0$, we need key properties of using $2$ instead of a different constant:

$$H_m(2, 1) = 2 \;\text{ for } m \geq 2,$$

$$H_m(2, 2) = 4 \;\text{ for } m \geq 1.$$

Now we have

$$A(m+1, 0) = A(m, 1) \;\longleftrightarrow\; H_{m+1}(2, 3) - 3 = H_m(2, 4) - 3,$$

which holds since $H_{m+1}(2, 3) = H_m(2, H_{m+1}(2, 2)) = H_m(2, 4)$.

And the primary recurrence for $m \geq 1$ and $n \geq 1$:

$$A(m, n) = A(m-1, A(m, n-1)), \qquad H_m(2, n+3) = H_{m-1}(2, H_m(2, n+2)).$$

Substituting $H_m(2, n+3) - 3$ for $A(m, n)$ by induction, we see they are in fact the same recursion.
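
We can also spot-check the identity $A(m, n) = H_m(2, n+3) - 3$ for small arguments (a snippet of mine; both functions are restated so it stands alone):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def A(m, n):
    """The Ackermann-Peter function."""
    if m == 0:
        return n + 1
    if n == 0:
        return A(m - 1, 1)
    return A(m - 1, A(m, n - 1))

def H(m, a, b):
    """The m-th hyperoperation, as defined in the post."""
    if m == 0:
        return b + 1
    if b == 0:
        return [a, 0, 1][min(m - 1, 2)]
    return H(m - 1, a, H(m, a, b - 1))

for m in range(4):
    for n in range(5):
        assert A(m, n) == H(m, 2, n + 3) - 3
print("identity verified for m < 4, n < 5")
```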


$$\binom{n}{k} \leq 2^{n \cdot H(k/n)},$$

where $H(p) = -p \log_2 p - (1-p)\log_2(1-p)$ is the binary entropy function, i.e. the entropy of a coin flipped with bias $p$. Maybe more properly the proof is just counting in two ways?

By a tiny tweak to our entropy argument as done here, one can get a better bound

$$\sum_{i=0}^{k} \binom{n}{i} \leq 2^{n \cdot H(k/n)} \quad \text{for } k \leq n/2.$$

This is what’s typically used in practice to bound e.g. the sum of the first $k$ of the binomial coefficients by $2^{n H(k/n)}$. The argument below gives all the essential ideas for that proof. With Stirling’s approximation used to give error bounds, both approximations are tight when $k = \Theta(n)$,

$$\binom{n}{k} = 2^{n H(k/n) - O(\log n)}.$$

**The experiment:** Let the random variable $X$ denote the result of flipping $n$ fair coins, i.e. the number of Heads is binomially distributed. Now suppose we know $X$ has exactly $k$ Heads.

**The question:** How much entropy is in the conditional distribution of $X$ given $k$ Heads? I.e. what is $H(X \mid k \text{ Heads})$?

**The left side:** The conditional distribution of $X$ is uniform over $\binom{n}{k}$ possible results. The entropy of the uniform distribution on $m$ objects is $\log_2 m$, therefore

$$H(X \mid k \text{ Heads}) = \log_2 \binom{n}{k}.$$

**The right side:** On the other hand, $X$ is specified by its bits $X_1, \dots, X_n$, so

$$H(X \mid k \text{ Heads}) = H(X_1, \dots, X_n \mid k \text{ Heads}).$$

By subadditivity of entropy,

$$H(X_1, \dots, X_n \mid k \text{ Heads}) \leq \sum_{i=1}^n H(X_i \mid k \text{ Heads}).$$

This expression is easy to compute: when flipping $k$ Heads out of $n$ flips, each $X_i$ looks like a coin with bias $k/n$, therefore its entropy is $H(k/n)$. So

$$\sum_{i=1}^n H(X_i \mid k \text{ Heads}) = n \cdot H(k/n),$$

giving the right-hand side.
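
The inequality is easy to check numerically; here is a quick illustration of mine (not part of the proof):

```python
# Check log2 C(n, k) <= n * H(k/n) for all k, and see how tight it is at k = n/2.
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

n = 100
for k in range(n + 1):
    assert math.log2(math.comb(n, k)) <= n * binary_entropy(k / n) + 1e-9

# The slack is only O(log n) bits in the exponent at k = n/2:
print(round(math.log2(math.comb(n, n // 2)), 2), n * binary_entropy(0.5))  # ~96.35 vs 100.0
```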

We can naturally generalize the above argument to provide an approximation for multinomial coefficients. Fix natural numbers $k_1, \dots, k_m$ with $k_1 + \cdots + k_m = n$. Let $\mu$ be a distribution over $[m]$ taking value $i$ with probability $k_i / n$. Then

$$\binom{n}{k_1, \dots, k_m} \leq 2^{n \cdot H(\mu)}.$$

The proof generalizes perfectly: the underlying set is $[m]^n$ (place a letter from $[m]$ at each of $n$ indices), and we consider the entropy of a uniform choice of a string constrained to have the fixed letter histogram $(k_1, \dots, k_m)$. This is exactly the left-hand side, and subadditivity yields the right-hand side.

Moreover, by Stirling again the approximation is tight when each $k_i = \Theta(n)$.


The union bound takes the probability of a union of events, $\Pr[\bigcup_i A_i]$, and approximates it as the sum of the individual probabilities, $\sum_i \Pr[A_i]$.

Imagine if $A_1, \dots, A_n$ are “bad events” that each occur with some tiny probability $\epsilon$. The union bound says

$$\Pr\left[\bigcup_{i=1}^n A_i\right] \leq n\epsilon.$$

If the events were fully independent, we would have the equality

$$\Pr\left[\bigcup_{i=1}^n A_i\right] = 1 - (1 - \epsilon)^n.$$

Now $(1 - \epsilon)^n \approx 1 - n\epsilon$ when $\epsilon \ll 1/n$. So we have

$$\Pr\left[\bigcup_{i=1}^n A_i\right] \approx n\epsilon.$$

This is equal to the bound spit out by the application of the union bound! In the case of fully independent events with small probabilities, we see the union bound is essentially tight. Full independence is rare, but in applications where low-probability events are only slightly correlated, the union bound remains a good approximation.
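
A small numerical comparison (mine, with arbitrary values of $n$ and $\epsilon$):

```python
n, eps = 100, 1e-4

union_bound = n * eps                        # P[union] <= n * eps, always
exact_independent = 1.0 - (1.0 - eps) ** n   # equality under full independence

print(round(union_bound, 6), round(exact_independent, 6))  # 0.01 0.00995
# They agree up to a multiplicative 1 + O(n * eps):
assert exact_independent <= union_bound <= exact_independent * (1.0 + 2 * n * eps)
```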


To make sure we’re all on the same page, the wreath product $G \wr P$ for an arbitrary group $G$ and a permutation group $P \leq S_n$ is the semidirect product $G^n \rtimes P$, where each $\sigma \in P$ acts as an automorphism of $G^n$ by permuting the coordinates.

That is, we fix some number $n$ and some group $P$ of permutations on $n$ letters, $P \leq S_n$. The group we construct takes $n$ disjoint and independent versions of $G$, namely $G^n$, but also allows you to make “limited rearrangements according to $P$”. There are two distinct ideas going on here; it’s the semidirect product that, if you tell it how two groups should interact, joins those two ideas into a single group.

There are two “trivial” wreath products, $G \wr 1$ (where $1$ denotes the trivial group) and $1 \wr P$. $G \wr 1$ is $n$ disjoint and independent versions of $G$, with **no** interdependencies introduced by $P$; this is just $G^n$.

Let $\Gamma$ be a graph with automorphism group $\mathrm{Aut}(\Gamma)$. Claim: if you stick $n$ disjoint copies of $\Gamma$ next to each other, this new graph has automorphism group $\mathrm{Aut}(\Gamma) \wr S_n$. To see this, of course some automorphisms apply elements of $\mathrm{Aut}(\Gamma)$ “in parallel” to the different copies of $\Gamma$. These automorphisms correspond to $\mathrm{Aut}(\Gamma)^n$. But any rearrangement of the copies of $\Gamma$ is also an automorphism. All copies are isomorphic, so rearranging can be done without restriction, by $S_n$.
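
This can be brute-forced for a small case (an assumed example of mine, not from the post): two disjoint triangles should have $|\mathrm{Aut}(K_3)|^2 \cdot 2! = 6^2 \cdot 2 = 72$ automorphisms, matching $S_3 \wr S_2$.

```python
# Count automorphisms of two disjoint triangles by brute force.
from itertools import permutations

# Triangles on vertices {0, 1, 2} and {3, 4, 5}.
edges = {frozenset(e) for e in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]}

def is_automorphism(p):
    """A vertex bijection is an automorphism iff it maps edges to edges."""
    return all(frozenset((p[u], p[v])) in edges for u, v in map(tuple, edges))

count = sum(1 for p in permutations(range(6)) if is_automorphism(p))
print(count)  # 72
assert count == 72
```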

The previous paragraph shows that, if $G$ acts on a set $X$, the wreath product $G \wr P$ acts naturally on $n$ disjoint copies of $X$. This is called the **sum action** (or “imprimitive action”, because it’s nearly always an imprimitive group action) of the wreath product. There is another natural action, called the **product action** (or “primitive action”). Here the domain is the set $X^n$, and the action is defined by (1) applying elements of $G$ coordinate-wise and (2) permuting the coordinates with $P$.

I think of this like a “simultaneous tracking” version of the sum action. In the sum action, we kept track of one point/vertex across copies of our graphs, and looked at where that point went when hit with a group element. In the product action, we track one point from each of the $n$ graphs simultaneously, and look at how those points move as a whole.

It would be remiss to mention wreath products without mentioning the Kaluzhnin-Krasner theorem (English version). We say that the group $E$ is an extension of $A$ by $B$ if there is an exact sequence

$$1 \to A \to E \to B \to 1.$$

**Theorem (Kaluzhnin-Krasner, 1951)** If $E$ is an extension of $A$ by $B$, then $E$ is isomorphic to a subgroup of $A \wr B$.

That is, $A \wr B$ is a group that’s big enough to contain all extensions of the two groups. In a sense that can probably be made more formal through order theory, $A \wr B$ is an “upper bound” on joining $A$ and $B$ together. Not bad!


The most common form of the Chernoff bound is the following: suppose you have independent and identically distributed coin flips $X_1, \dots, X_n$, say the result of repeatedly flipping a coin that comes up Heads with probability $p$. The number of Heads is by definition a binomial distribution; let $X = \sum_i X_i$ denote this random variable. Then with all but exponentially small probability, $X$ is within $\epsilon n$ of its mean:

$$\Pr[|X - pn| \geq \epsilon n] \leq 2 e^{-2\epsilon^2 n}.$$
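
To get a feel for the bound, here is a quick comparison of mine (stdlib only; the parameters are arbitrary) of the exact binomial tail against $2e^{-2\epsilon^2 n}$:

```python
import math

def exact_tail(n, p, eps):
    """P[|X - pn| >= eps * n] for X ~ Bin(n, p), computed exactly."""
    lo, hi = p * n - eps * n, p * n + eps * n
    return sum(math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
               for k in range(n + 1) if k <= lo or k >= hi)

n, p, eps = 200, 0.5, 0.1
bound = 2.0 * math.exp(-2.0 * eps ** 2 * n)
print(round(exact_tail(n, p, eps), 6), round(bound, 6))
assert 0.0 < exact_tail(n, p, eps) <= bound
```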

The Central Limit Theorem (CLT) tells us that as $n \to \infty$, the distribution of $\frac{X - pn}{\sqrt{p(1-p)n}}$ approaches a standard normal distribution. Philosophically, $X$ approaches a normal distribution with the parameters you would expect: mean $pn$ and variance $p(1-p)n$ (though the “scaled up” distributions may not necessarily converge from the CLT alone).

In our case, $\mu = pn$ and $\sigma^2 = p(1-p)n$. The PDF of the “Gaussian you would expect” is

$$C \exp\left(-\frac{(x - pn)^2}{2\,p(1-p)\,n}\right)$$

for an appropriate normalizing constant $C$. Let’s rewrite that with $x = pn + \epsilon n$:

$$C \exp\left(-\frac{\epsilon^2 n}{2\,p(1-p)}\right).$$

This is (up to the factor of $C$) the expression appearing in the Chernoff bound! Thus we can think of the Chernoff bound as expressing an “even when $\epsilon$ is small” version of the CLT, with a little bit of loss from $\frac{1}{2p(1-p)}$ to $2$ in the constant multiplying $\epsilon^2 n$ in the exponent.

We lost some credibility at one point in the technique above: we actually should have been looking at the cumulative probability (under the Gaussian approximation)

$$\Pr[X \geq pn + \epsilon n] \approx \int_{pn + \epsilon n}^{\infty} C \exp\left(-\frac{(x - pn)^2}{2\,p(1-p)\,n}\right) dx.$$

Instead we noticed that the Chernoff bound can be remembered by looking at the PDF (and ignoring a nonconstant factor),

$$C \exp\left(-\frac{\epsilon^2 n}{2\,p(1-p)}\right).$$

But using the CDF instead of the PDF actually gives the same expression, and with a truly constant factor. Let’s compute:

If we break up the integral into little size-$1$ pieces from $pn + \epsilon n$ to $\infty$, the integral on the $k$-th piece looks like

$$C \exp\left(-\frac{(\epsilon n + k)^2}{2\,p(1-p)\,n}\right).$$

The exponent increases faster than linearly in $k$, and it is the only thing that changes from piece to piece, so the contributions to the integral go to 0 faster than geometrically. Therefore the integral is bounded by an infinite geometric series with initial term proportional to $\exp\left(-\frac{\epsilon^2 n}{2p(1-p)}\right)$. Therefore the cumulative probability is (up to a constant) the expression in the Chernoff bound,

$$\Pr[X \geq pn + \epsilon n] \leq C' \exp\left(-\frac{\epsilon^2 n}{2\,p(1-p)}\right),$$

and this time the constant is actually constant!


Suppose you have a multivariate function $f : \mathbb{R}^n \to \mathbb{R}^m$. At some point $x$ in the domain, we want to say that if you make a small change $h$ around $x$, the change in $f$ is approximately a linear function in $h$:

$$f(x + h) - f(x) \approx Df_x(h).$$

Here $Df_x$ is our notation for the derivative function (which is linear, and can be represented as a matrix applied to $h$). $Df_x$ could depend highly on $x$; after all, $f$ will probably have different slopes at different points. More formally, we want a function $Df_x$ such that

$$f(x + h) = f(x) + Df_x(h) + o(h),$$

where $o(h)$ is a function that “is tiny compared to $h$”, e.g. every entry of the vector is at most the square of the largest entry of $h$.

To compute $Df_x$, just plug $x + h$ into $f$, collect all the terms that are linear in $h$, and drop all the terms that “get small as $h \to 0$”. In practice, this means we drop all “higher-order terms” in which we multiply two coordinates of $h$. The best way to illustrate is with examples.

Take $f(x) = \|x\|^2 = \langle x, x \rangle$. If we change $x$ a little, how does $f$ change? Here $f(x + h)$ is

$$\langle x + h, x + h \rangle = \|x\|^2 + 2\langle x, h \rangle + \|h\|^2.$$

The part linear in $h$ here is $2\langle x, h \rangle$, and $\|h\|^2$ is tiny compared to $h$. So

$$Df_x(h) = 2\langle x, h \rangle.$$

If we want to see where $f$ is minimized, the critical points are found by finding where the **function** $Df_x$ is 0; that is, where we can’t move a little bit to go up or down. In this case,

$$Df_x = 2\langle x, \cdot \rangle = 0 \iff x = 0.$$

Let’s look at a function of a matrix,

$$f(A) = A^{-1}.$$

As we change $A$ by a matrix $E$ with small entries (in particular, small enough to ensure $A + E$ is still invertible), how does $f$ change from $A^{-1}$? Just thinking about dimensions, the answer to that question should be by adding a small matrix to $A^{-1}$.

The hard part is massaging $(A + E)^{-1}$ to look like $A^{-1}$ plus a linear function of $E$. To start,

$$(A + E)^{-1} = \left(A(I + A^{-1}E)\right)^{-1} = (I + A^{-1}E)^{-1} A^{-1}.$$

The matrix $I + A^{-1}E$ can be inverted using the Neumann series, which is just a fancy name for the geometric series summation formula

$$(I + M)^{-1} = I - M + M^2 - M^3 + \cdots$$

Note that the series converges for small enough $E$ because $M = A^{-1}E$ has all eigenvalues less than 1 in absolute value.

Because $E$ is small, $(A^{-1}E)^2$ is negligibly tiny. For our linear approximation, we can drop the terms that are too tiny to be noticed by the linear approximation, which is all but the first two terms of this series.

Remembering the factor of $A^{-1}$ on the right, altogether we just computed

$$(A + E)^{-1} \approx (I - A^{-1}E)\,A^{-1} = A^{-1} - A^{-1} E A^{-1}.$$

Noting that this second term is linear in $E$, we have the derivative

$$Df_A(E) = -A^{-1} E A^{-1}.$$
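
A finite-difference check of mine (with hand-rolled $2 \times 2$ matrix helpers to stay dependency-free) confirms the formula:

```python
# Compare (A+E)^{-1} - A^{-1} against the predicted derivative -A^{-1} E A^{-1}.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def inv2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[2.0, 1.0], [1.0, 3.0]]
E = [[1e-6, -2e-6], [3e-6, 1e-6]]  # a small perturbation

A_plus_E = [[A[i][j] + E[i][j] for j in range(2)] for i in range(2)]
actual = [[inv2(A_plus_E)[i][j] - inv2(A)[i][j] for j in range(2)] for i in range(2)]

M = matmul(matmul(inv2(A), E), inv2(A))
predicted = [[-M[i][j] for j in range(2)] for i in range(2)]  # -A^{-1} E A^{-1}

err = max(abs(actual[i][j] - predicted[i][j]) for i in range(2) for j in range(2))
assert err < 1e-10  # the leftover is the quadratic (A^{-1}E)^2 term
print(err)
```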

As a remark: whenever we wrote $\approx$, we could have written $+\,O(\|E\|^2)$ to be a little more formal about carrying the higher-order terms through the computation.

Now let’s do another example of a matrix function, the log of the determinant

$$f(A) = \log \det A.$$

This function arises frequently in machine learning and statistics when computing log likelihoods.

Change $A$ by a matrix $E$ which has small entries:

$$\det(A + E) = \det(A)\det(I + A^{-1}E).$$

Now let’s look at $\det(I + A^{-1}E)$. Each of the entries of $A^{-1}E$ is “small”, though they’re essentially the same size as the entries of $E$ (just differing by a few constants depending on $A$). The determinant of a matrix $M$ is a sum over all permutations,

$$\det(M) = \sum_{\sigma \in S_n} \mathrm{sgn}(\sigma) \prod_{i=1}^n M_{i,\sigma(i)}.$$

When performing this sum for $M = I + A^{-1}E$, if a product contains two off-diagonal terms, those will be two “small” terms from $A^{-1}E$ multiplied together, and multiplying two small terms makes a negligible term that we drop. So we only need to sum up permutations $\sigma$ that include at least all but 1 of the diagonal elements. The only permutation that does this is the identity permutation, leaving

$$\det(I + A^{-1}E) \approx \prod_{i=1}^n \left(1 + (A^{-1}E)_{ii}\right).$$

Same trick: this product expands into $2^n$ terms, but any term that multiplies an $(A^{-1}E)_{ii}$ with an $(A^{-1}E)_{jj}$ is negligibly small. To get a term with at most one of the $(A^{-1}E)_{ii}$, either we (1) take 1 from every factor, or (2) take 1 from all but one factor:

$$\prod_{i=1}^n \left(1 + (A^{-1}E)_{ii}\right) \approx 1 + \sum_{i=1}^n (A^{-1}E)_{ii} = 1 + \mathrm{tr}(A^{-1}E).$$

This is our linear approximation to $\det(I + A^{-1}E)$. To incorporate the log, note that we’re taking the log of something that is pretty much 1. The normal derivative of log at 1 is 1, and in our Fréchet derivative formulation,

$$\log \det(A + E) \approx \log \det A + \mathrm{tr}(A^{-1}E).$$

And so the derivative is $Df_A(E) = \mathrm{tr}(A^{-1}E)$.
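
The same style of finite-difference check (again a $2 \times 2$ sketch of mine) works here:

```python
# Compare log det(A+E) - log det(A) against the predicted derivative tr(A^{-1} E).
import math

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

A = [[3.0, 1.0], [2.0, 4.0]]
E = [[1e-7, 2e-7], [-1e-7, 3e-7]]

A_plus_E = [[A[i][j] + E[i][j] for j in range(2)] for i in range(2)]
actual = math.log(det2(A_plus_E)) - math.log(det2(A))

Ainv = inv2(A)
AinvE = [[sum(Ainv[i][k] * E[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
predicted = AinvE[0][0] + AinvE[1][1]  # tr(A^{-1} E)

assert abs(actual - predicted) < 1e-12  # they agree up to second-order terms
print(actual, predicted)
```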

As a remark: incorporating the log illustrates the composability of Fréchet derivatives. This makes sense when spoken aloud: if we have a linear approximation to $f$ at $x$ and a linear approximation to $g$ at $f(x)$, a linear approximation of $g \circ f$ at $x$ is $Dg_{f(x)} \circ Df_x$.

Actually, $Df_x(h) = Jh$, where $J$ is the Jacobian matrix of partial derivatives $J_{ij} = \partial f_i / \partial x_j$. But it’s often easier to not compute the Jacobian entirely, and keep $Df_x$ in an “implicit” form like in Example 2.

See another description of Fréchet derivatives in statistics.


We first consider powers of 2, $n = 2^k$. For these $n$, we can break up the harmonic sum $H_n = \sum_{i=1}^n \frac{1}{i}$ into the “chunks”

$$\left(\frac{1}{2}\right), \quad \left(\frac{1}{3} + \frac{1}{4}\right), \quad \left(\frac{1}{5} + \cdots + \frac{1}{8}\right), \quad \dots, \quad \left(\frac{1}{2^{k-1} + 1} + \cdots + \frac{1}{2^k}\right),$$

and we have one extra term $\frac{1}{1} = 1$. There are $k = \log_2 n$ chunks, and of course $H_n$ is the sum of these chunks (plus the extra term), hence to show $H_n = O(\log n)$ we show each chunk is $O(1)$. Inside each chunk we bound each element above by the reciprocal of the previous power of 2:

$$\frac{1}{2^{j-1} + 1} + \cdots + \frac{1}{2^j} \leq 2^{j-1} \cdot \frac{1}{2^{j-1}} = 1.$$

Thus each chunk is at most 1, and taken together with the extra term 1, we have $H_n \leq k + 1 = \log_2 n + 1$.

On the other hand, if we lower bound the elements by the reciprocal of the next power of 2,

$$\frac{1}{2^{j-1} + 1} + \cdots + \frac{1}{2^j} \geq 2^{j-1} \cdot \frac{1}{2^j} = \frac{1}{2}.$$

Every chunk is at least $\frac{1}{2}$, hence $H_n \geq \frac{k}{2} = \frac{\log_2 n}{2}$.

This essentially is the proof. One technicality: we’ve only shown $H_n = \Theta(\log n)$ for $n$ a power of 2. Monotonicity of $H_n$ completes the proof for all $n$: taking the nearest powers of 2 above and below $n$, call them $2^{k+1}$ and $2^k$,

$$H_{2^k} \leq H_n \leq H_{2^{k+1}}.$$

Applying our bounds on $H_{2^k}$ and $H_{2^{k+1}}$,

$$\frac{k}{2} \leq H_n \leq (k + 1) + 1.$$

Thus $H_n = \Theta(\log n)$, since $k = \lfloor \log_2 n \rfloor$.
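
The chunk bounds are easy to confirm numerically (an illustration of mine, not part of the proof): for $n = 2^k$, the harmonic number indeed lands between $k/2$ and $k + 1$.

```python
import math

def harmonic(n):
    return sum(1.0 / i for i in range(1, n + 1))

for k in range(1, 15):
    h = harmonic(2 ** k)
    assert k / 2 <= h <= k + 1  # lower and upper chunk bounds

print(round(harmonic(2 ** 14), 3), round(14 * math.log(2), 3))  # H_n vs ln(n): ~10.28 vs ~9.70
```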

(In general, for any monotonic function with linear or sublinear growth, it suffices to establish the bounds on only the powers of 2.)


There are actually only two reasons I started this blog:

- Organization
- Public availability

Right now, I’m a mathematical scribbler. When I write math, it fills up tons of pages, proves theorems, but is often mostly indecipherable afterwards. This blog is to be my “digital paper”, filled with high-quality scribbling only. In this way, this blog is for me: I can keep track of the things I work on.

On my messy sheets of paper are many interesting ideas that are invariably lost to the confines of my desk (and eventually, my recycling bin). Through this blog I can share these ideas with my friends and colleagues. Mathematics is an activity reliant on communication; this blog has ready-made content for me to let others know what I’ve been thinking about.
