Proving the Coupling Inequality with LP Duality

This post details the first in a collection of everyday inequalities that arise as special cases of duality in linear programming. We’ll show that the usual coupling inequality is one such example:

d_{TV}(\mu, \nu) \leq \Pr[X \neq Y]

A coupling of two distributions \mu and \nu on \Omega is a joint distribution on \Omega \times \Omega whose two marginal distributions are \mu and \nu respectively. We write (X, Y) for the joint random variable, with X the marginal in the first coordinate and Y the marginal in the second. We say that \mu and \nu are “coupled”, with the coupling (X, Y) defining the interaction, or dependence, between them.

One of the basic facts about couplings compares the statistical distance between \mu and \nu to the probability that the coupled variables disagree. For any coupling (X, Y) of \mu and \nu, we have

d_{TV}(\mu, \nu) \leq \Pr[X \neq Y]

Furthermore, there is a coupling that achieves this bound. The above inequality is typically used to upper bound the statistical distance between \mu and \nu by constructing a coupling that is likely to agree, e.g., in the proof of convergence of an ergodic Markov chain to its limiting distribution.
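Before turning to the LP argument, it may help to see the tight coupling concretely. Below is a minimal numerical sketch of the standard "maximal coupling" construction for a finite \Omega, using numpy; the function name maximal_coupling and the example distributions mu and nu are illustrative choices, not part of the proof.

```python
import numpy as np

def maximal_coupling(mu, nu):
    """Build a joint distribution p over Omega x Omega with marginals
    mu and nu that attains Pr[X != Y] = d_TV(mu, nu)."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    overlap = np.minimum(mu, nu)          # shared mass goes on the diagonal
    p = np.diag(overlap)
    slack = 1.0 - overlap.sum()           # this equals d_TV(mu, nu)
    if slack > 1e-12:
        # distribute the leftover mass of mu (rows) against the
        # leftover mass of nu (columns) as a product distribution
        p += np.outer(mu - overlap, nu - overlap) / slack
    return p

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
p = maximal_coupling(mu, nu)
pr_disagree = 1.0 - np.trace(p)           # matches 0.5 * ||mu - nu||_1 here
```

The diagonal carries min(\mu(\omega), \nu(\omega)), so all remaining probability sits off the diagonal, which is exactly the statistical distance.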

In this post we’ll prove this inequality (and its tightness!) using an LP duality argument.

The Coupling LP

A coupling is specified by a set of linear constraints: for x,y \in \Omega, let p_{x,y} be the probability of seeing the pair (x,y). The constraints are

\forall x. \quad \displaystyle\sum_{y \in \Omega} p_{x,y} = \mu(x)

\forall y. \quad \displaystyle\sum_{x \in \Omega} p_{x,y} = \nu(y)

\forall x,y. \quad p_{x,y} \geq 0

\displaystyle\sum_{x,y \in \Omega} p_{x,y} = 1

This defines a polytope in \mathbb{R}^{|\Omega|^2}, called a transportation polytope. In fact, the last constraint is redundant: summing the \mu-marginal constraints over all x also gives 1. Over this polytope, the linear objective to minimize is the probability that X and Y disagree,

\min\displaystyle\sum_{\substack{x,y \in \Omega \\ x \neq y}} p_{x, y}
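As a sanity check, this primal LP can be solved numerically. The sketch below builds the marginal constraints and the off-diagonal objective and hands them to scipy.optimize.linprog; the solver choice and the example distributions are assumptions for illustration, not part of the argument.

```python
import numpy as np
from scipy.optimize import linprog

def coupling_lp(mu, nu):
    """Minimize Pr[X != Y] over the transportation polytope
    with marginals mu and nu."""
    n = len(mu)
    # variables: p[x, y] flattened row-major into a length n*n vector
    c = np.ones((n, n))
    np.fill_diagonal(c, 0.0)              # objective: total off-diagonal mass
    A_eq = np.zeros((2 * n, n * n))
    for x in range(n):                    # row marginals: sum_y p[x, y] = mu(x)
        A_eq[x, x * n:(x + 1) * n] = 1.0
    for y in range(n):                    # column marginals: sum_x p[x, y] = nu(y)
        A_eq[n + y, y::n] = 1.0
    b_eq = np.concatenate([mu, nu])
    res = linprog(c.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
value = coupling_lp(mu, nu)               # should equal d_TV(mu, nu)
```

On this example the LP value agrees with 0.5 * ||mu - nu||_1, as the post goes on to prove in general.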

Now that we have the LP, its dual (whose derivation is explained e.g. here… or perhaps in a future blog post) is

\max \displaystyle\sum_{\omega \in \Omega} q^\mu_\omega \mu(\omega) + q^\nu_\omega \nu(\omega)

\forall \omega. \quad q^\mu_\omega + q^\nu_\omega \leq 0

\forall \omega_1 \neq \omega_2. \quad q^\mu_{\omega_1} + q^\nu_{\omega_2} \leq 1
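The dual can be checked numerically as well. The sketch below encodes the dual constraints for scipy.optimize.linprog (which minimizes, so the objective is negated, and the variables are left free); the example distributions are again illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def dual_lp(mu, nu):
    """Maximize sum_w q_mu[w] * mu(w) + q_nu[w] * nu(w) subject to
    q_mu[w1] + q_nu[w2] <= 0 if w1 == w2, and <= 1 otherwise."""
    n = len(mu)
    c = -np.concatenate([mu, nu])         # maximize <=> minimize the negation
    rows, b = [], []
    for w1 in range(n):
        for w2 in range(n):
            row = np.zeros(2 * n)
            row[w1] = 1.0                 # coefficient of q_mu[w1]
            row[n + w2] = 1.0             # coefficient of q_nu[w2]
            rows.append(row)
            b.append(0.0 if w1 == w2 else 1.0)
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(b),
                  bounds=(None, None))    # dual variables are free
    return -res.fun

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
dual_value = dual_lp(mu, nu)
```

By strong duality the dual optimum matches the primal one, and (as proved next) both equal d_{TV}(\mu, \nu).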

By weak duality, \Pr[X \neq Y] is always at least the value of any feasible dual solution, and by strong duality, there is a coupling (X, Y) whose value matches the dual optimum. To finish our argument, we’ll show that the optimal value of the dual LP is the statistical distance between \mu and \nu. The definition of statistical distance we’ll use is the “optimization” version: the maximum difference in probabilities over events in \Omega,

d_{TV}(\mu, \nu) = \max_{A \subseteq \Omega} |\mu (A) - \nu (A)|

= \max_{A \subseteq \Omega} \left|\displaystyle\sum_{a \in A} \left(\mu(a) - \nu(a)\right)\right|
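For a small finite \Omega this "optimization" definition can be evaluated by brute force over all events, which makes a nice cross-check against the usual half-L1-norm formula; the example below is an illustrative sketch with made-up distributions.

```python
import numpy as np
from itertools import combinations

def tv_by_events(mu, nu):
    """Brute-force d_TV as the maximum over events A of |mu(A) - nu(A)|."""
    n = len(mu)
    best = 0.0
    for r in range(n + 1):
        for A in combinations(range(n), r):
            A = list(A)
            best = max(best, abs(mu[A].sum() - nu[A].sum()))
    return best

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
tv = tv_by_events(mu, nu)
```

The maximizing event is the set where \mu exceeds \nu (or its complement), which is why the maximum equals half the L1 distance.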

First we argue that the optimum is \{-1, 0, +1\}-valued. The dual problem’s constraint matrix is totally unimodular (Prop. 3.2) and the right-hand sides of the constraints are integral, hence the LP is integral. Fix an optimal solution (q^{\mu*}, q^{\nu*}). From the constraints we see that only one of the sets \{q^{\mu*}_\omega : \omega \in \Omega\} and \{q^{\nu*}_\omega : \omega \in \Omega\} can contain positive values; WLOG these are the variables for \mu. By optimality, and by looking at the largest (resp. smallest) of the q^{\mu*}_\omega (resp. q^{\nu*}_\omega), the \{q^{\mu*}_\omega : \omega \in \Omega\} are within 1 of one another, and the q^{\nu*}_\omega are their negations. Now we can shift the solution to be \{-1, 0, +1\}-valued without changing its value.

Now we are in a position to compare the \{-1, 0, +1\}-valued optimum with d_{TV}(\mu, \nu). As noted above, the set of variables for exactly one of \mu, \nu is positive, which gives the sign inside the absolute value. The +1-valued variables indicate the event A to take. This shows the value of the dual is exactly d_{TV}(\mu, \nu). As explained, this finishes the proof that

d_{TV}(\mu, \nu) \leq \Pr[X \neq Y]


Transportation Polytopes

We finish with a few extra details on transportation polytopes.

With \mu and \nu fixed, the collection of possible couplings forms a convex polytope in \mathbb{R}^{|\Omega|^2} from a very special class of polytopes, the transportation polytopes. The name comes from the following “transportation” problem in optimization: suppose you have m factories producing a good, such as lawn mowers, in quantities s_1, s_2, \dots, s_m. You want to distribute the lawn mowers among n stores in quantities t_1, t_2, \dots, t_n. The question is: how many lawn mowers should factory i send to store j? A solution that achieves the desired stocking quantities is exactly a matrix (after normalization, a joint probability distribution) with the given marginals.
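The factory/store story is itself an LP once shipping costs are attached. The sketch below solves a tiny min-cost transportation instance with scipy.optimize.linprog; the cost matrix, supplies, and demands are made-up numbers chosen only to illustrate the formulation.

```python
import numpy as np
from scipy.optimize import linprog

def transport(costs, supply, demand):
    """Min-cost transportation plan: costs[i, j] is the per-unit cost
    of shipping from factory i to store j; supply and demand totals
    must agree."""
    m, n = costs.shape
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):                    # factory i ships out supply[i]
        A_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):                    # store j receives demand[j]
        A_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([supply, demand])
    res = linprog(costs.ravel(), A_eq=A_eq, b_eq=b_eq)
    return res.x.reshape(m, n), res.fun

costs = np.array([[4.0, 6.0], [2.0, 3.0]])
supply = np.array([20.0, 30.0])
demand = np.array([25.0, 25.0])
plan, total = transport(costs, supply, demand)
```

The feasible region here is exactly a transportation polytope with marginals proportional to supply and demand; the coupling LP earlier in the post is the special case where the cost is 1 off the diagonal and 0 on it.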

Vertices of a transportation polytope can be enumerated and checked algorithmically. Furthermore, when the marginals \mu and \nu are uniform, the corresponding transportation polytope is exactly the set of doubly stochastic matrices. By the classical Birkhoff–von Neumann theorem, the vertices of this polytope are exactly the permutation matrices, i.e., the doubly stochastic matrices are exactly the convex combinations of permutation matrices.
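The easy direction of Birkhoff–von Neumann is simple to demonstrate in code: any convex combination of permutation matrices is doubly stochastic. The sketch below samples random convex weights with numpy (the Dirichlet sampler and the dimension 3 are illustrative choices).

```python
import numpy as np
from itertools import permutations

def random_doubly_stochastic(n, rng):
    """Sample a doubly stochastic matrix as a random convex combination
    of all n! permutation matrices."""
    perms = list(permutations(range(n)))
    weights = rng.dirichlet(np.ones(len(perms)))  # convex weights summing to 1
    m = np.zeros((n, n))
    for w, perm in zip(weights, perms):
        for i, j in enumerate(perm):
            m[i, j] += w                  # add w times the permutation matrix
    return m

rng = np.random.default_rng(0)
m = random_doubly_stochastic(3, rng)      # rows and columns each sum to 1
```

The harder direction, decomposing a given doubly stochastic matrix into permutation matrices, is the Birkhoff algorithm, which repeatedly extracts a permutation supported on the matrix's nonzero entries.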