How to Easily Compute Matrix Derivatives, Fréchet Style

The routine, traditional interpretation of a derivative is as the slope of the tangent line. This is great for a function of a single variable, but if you have a multivariate function f: \mathbb{R}^m \to \mathbb{R}^n, how can you talk about slopes? One approach is to make a conceptual leap: a derivative is a linear approximation to your function; the tangent line itself is the derivative, rather than the slope of the tangent line. This is called the Fréchet derivative.

Suppose you have a multivariate function f: \mathbb{R}^m \to \mathbb{R}^n. At some point in the domain x_0 \in \mathbb{R}^m, we want to say that if you make a small change \epsilon \in \mathbb{R}^m around x_0, the change in f is approximately a linear function in \epsilon:

f(x_0 + \epsilon) \approx f(x_0) + Df_{x_0}(\epsilon)

Here Df_{x_0} : \mathbb{R}^m \to \mathbb{R}^n is our notation for the derivative function (which is linear, and can be represented as a matrix applied to \epsilon). Df_{x_0} generally depends on x_0; after all, f will probably have different slopes at different points. More formally, we want a function Df_{x_0} such that

f(x_0 + \epsilon) = f(x_0) + Df_{x_0}(\epsilon) + o(\epsilon)

where o(\epsilon) is a function that “is tiny compared to \epsilon,” meaning \|o(\epsilon)\| / \|\epsilon\| \to 0 as \epsilon \to 0; for example, a vector in which every entry is the square of the largest entry of \epsilon.

Computing Fréchet Derivatives is Incredibly Elegant

To compute Df_{x_0}(\epsilon), just plug x_0 + \epsilon into f, collect all the terms that are linear in \epsilon, and drop all the terms that “get small as \epsilon \to 0”. In practice, this means we drop all “higher-order terms” in which two coordinates of \epsilon are multiplied together. The best way to illustrate is with examples.

Example 1: f(x) = x^\top x

If we change x a little, how does x^\top x change? Here f: \mathbb{R}^n \to \mathbb{R} is f(x) = x^\top x.

f(x + \epsilon) = (x + \epsilon)^\top (x + \epsilon) = x^\top x + x^\top \epsilon + \epsilon^\top x + \epsilon^\top \epsilon

= f(x) +x^\top \epsilon + \epsilon^\top x + \epsilon^\top \epsilon

The part linear in \epsilon here is x^\top \epsilon + \epsilon^\top x (two copies of the same scalar, since \epsilon^\top x = x^\top \epsilon), and \epsilon^\top \epsilon is tiny compared to \epsilon. So

Df_x(\epsilon) = x^\top \epsilon + \epsilon^\top x = 2x^\top \epsilon

f(x+\epsilon) \approx f(x) + 2x^\top \epsilon

If we want to see where x^\top x is minimized, the critical points are the points where the function Df_x is identically 0; that is, where we can’t move a little bit in any direction to go up or down. In this case,

\forall \epsilon.\; 2x^\top \epsilon = 0 \iff x = 0
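
As a quick numerical sanity check (my addition, not part of the original derivation), here is a small NumPy sketch comparing the exact change f(x + \epsilon) - f(x) against the linear approximation 2x^\top \epsilon at an arbitrary point with a small random perturbation; the two should agree up to O(\|\epsilon\|^2).

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(5)              # arbitrary point in R^5
    eps = 1e-6 * rng.standard_normal(5)     # small perturbation

    f = lambda v: v @ v                     # f(x) = x^T x
    exact_change = f(x + eps) - f(x)        # true change in f
    linear_change = 2 * x @ eps             # Df_x(eps) = 2 x^T eps
    print(exact_change, linear_change)      # agree up to O(||eps||^2)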

Example 2: Derivative of the Matrix Inverse

Let’s look at a function of a matrix,

f: \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}

f(M) = M^{-1}

As we change M by a matrix X with small entries (in particular, small enough to ensure M+X is still invertible), how does (M+X)^{-1} change from M^{-1}? Just thinking about dimensions, the answer should be: by adding a small n \times n matrix to M^{-1}.

f(M+X) = (M+X)^{-1} \approx M^{-1} + Df_M(X)

The hard part is massaging (M+X)^{-1} to look like M^{-1} plus a linear function of X. To start,

(M+X)^{-1} = M^{-1}(I + XM^{-1})^{-1}

The matrix I + XM^{-1} can be inverted using the Neumann series, which is just a fancy name for the geometric series summation formula

(I + XM^{-1})^{-1} = I - XM^{-1} + (XM^{-1})^2 - (XM^{-1})^3 + \cdots

Note that the series converges for small enough X, since then every eigenvalue of XM^{-1} has absolute value less than 1.
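
If you want to convince yourself of the series numerically, here is a small sketch (my addition) comparing (I + XM^{-1})^{-1} against the first few terms of the Neumann series for a small X; the truncation error is on the order of \|XM^{-1}\|^4.

    import numpy as np

    rng = np.random.default_rng(1)
    M = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # comfortably invertible
    X = 1e-3 * rng.standard_normal((4, 4))            # small perturbation
    A = X @ np.linalg.inv(M)                          # A = X M^{-1}, small

    exact = np.linalg.inv(np.eye(4) + A)
    truncated = np.eye(4) - A + A @ A - A @ A @ A     # first four Neumann terms
    print(np.max(np.abs(exact - truncated)))          # roughly ||A||^4, essentially zero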

Because X is small, anything quadratic or higher in X is negligibly tiny. For our linear approximation, we can therefore drop all but the first two terms of this series.

(I + XM^{-1})^{-1} \approx I - XM^{-1}

M^{-1}(I + XM^{-1})^{-1} \approx M^{-1}(I - XM^{-1}) = M^{-1} - M^{-1}XM^{-1}

Remembering f(M) = M^{-1}, altogether we just computed

f(M+X) \approx f(M) -  M^{-1}XM^{-1}

Noting that this second term is linear in X, we have the derivative

Df_{M}(X) = -M^{-1}XM^{-1}

As a remark: whenever we wrote \approx, we could have written + O(\|X\|^2) to be a little more formal about carrying the higher-order terms through the computation.
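
As before, a quick numerical check (not in the original post) of this formula: for an arbitrary invertible M and a small X, the exact change (M+X)^{-1} - M^{-1} should match -M^{-1}XM^{-1} up to terms of order \|X\|^2.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 4
    M = rng.standard_normal((n, n)) + n * np.eye(n)      # an arbitrary invertible matrix
    X = 1e-6 * rng.standard_normal((n, n))               # small perturbation

    Minv = np.linalg.inv(M)
    exact_change = np.linalg.inv(M + X) - Minv           # f(M + X) - f(M)
    linear_change = -Minv @ X @ Minv                     # Df_M(X)
    print(np.max(np.abs(exact_change - linear_change)))  # on the order of ||X||^2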

Example 3: Derivative of the Log of the Determinant

Now let’s do another example of a matrix function, the log of the determinant

f: \mathbb{R}^{n \times n} \to \mathbb{R}

f(M) = \log |M|

This function arises frequently in machine learning and statistics when computing log likelihoods.

Change M by a matrix X which has small entries

f(M + X) = \log |M + X| = \log \big( |M| \, |I + M^{-1}X| \big) = \log |M| + \log |I + M^{-1}X|

Now let’s look at |I + M^{-1}X|. Each of the entries of M^{-1}X is “small”, though they’re essentially the same size as elements of X (just differing by a few constants depending on M^{-1}). The determinant of a matrix A is a signed sum over all permutations

|A| = \displaystyle\sum_{\sigma \in S_n} \text{sgn}(\sigma) \prod_{i = 1}^n A_{i, \sigma(i)}

When performing this sum for I + M^{-1}X, any product containing two or more off-diagonal entries multiplies together at least two “small” terms from M^{-1}X, and multiplying two small terms makes a negligible term that we drop. So we only need to keep permutations \sigma that fix all but at most one index; since a permutation cannot move exactly one index, the only such permutation is the identity (which has sign +1).

|I + M^{-1}X| \approx \prod_{i = 1}^n (I + M^{-1}X)_{i,i} = \prod_{i = 1}^n \left(1 + (M^{-1}X)_{i,i}\right)

Same trick: expanding this product gives 2^n terms, but any term that multiplies a (M^{-1}X)_{i,i} with a (M^{-1}X)_{j,j} is negligibly small. To get a term with at most one of the (M^{-1}X)_{i,i}, either we (1) take 1 from every factor, or (2) take 1 from all but one factor and (M^{-1}X)_{i,i} from the remaining one

\prod_{i = 1}^n \left(1 + (M^{-1}X)_{i,i}\right) \approx 1^n + 1^{n-1}(M^{-1}X)_{1,1} + 1^{n-1}(M^{-1}X)_{2,2} + \cdots + 1^{n-1}(M^{-1}X)_{n,n}

= 1 + \text{tr}(M^{-1}X)
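
Here is a tiny numerical illustration (my addition) of this first-order expansion of the determinant: for a matrix A with small entries, \det(I + A) is very close to 1 + \text{tr}(A); in the derivation above, A plays the role of M^{-1}X.

    import numpy as np

    rng = np.random.default_rng(3)
    A = 1e-4 * rng.standard_normal((5, 5))   # plays the role of M^{-1} X
    print(np.linalg.det(np.eye(5) + A))      # approximately 1 + tr(A)
    print(1 + np.trace(A))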

This is our linear approximation |I + M^{-1}X| \approx 1 + \text{tr}(M^{-1}X). To incorporate the log, note that we’re taking the log of something that is pretty much 1. The ordinary single-variable derivative of \log at 1 is 1, and in our Fréchet derivative formulation

\log (1 + \epsilon) = \epsilon +  O(\epsilon^2)

\log |I + M^{-1}X| \approx  \log(1 + \text{tr}(M^{-1}X)) \approx\text{tr}(M^{-1}X)

And so the derivative is Df_{M}(X) = \text{tr}(M^{-1}X).
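
Putting it all together, here is a short numerical check (not part of the original post); I use a symmetric positive definite M so the determinant is safely positive, which is also the typical situation in the log-likelihood applications mentioned above.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 4
    A = rng.standard_normal((n, n))
    M = A @ A.T + np.eye(n)                     # symmetric positive definite, det > 0
    X = 1e-6 * rng.standard_normal((n, n))      # small perturbation

    exact_change = np.linalg.slogdet(M + X)[1] - np.linalg.slogdet(M)[1]  # log|M+X| - log|M|
    linear_change = np.trace(np.linalg.inv(M) @ X)                        # Df_M(X) = tr(M^{-1} X)
    print(exact_change, linear_change)                                    # agree up to O(||X||^2)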

As a remark: incorporating the log illustrates the composability of Fréchet derivatives. This makes sense when spoken aloud: if we have a linear approximation Df to f and a linear approximation Dg to g, then a linear approximation to f \circ g is the composition Df \circ Dg (more precisely, D(f \circ g)_x = Df_{g(x)} \circ Dg_x, which is exactly the chain rule).

Other Viewpoints

Actually, Df_{x_0}(\epsilon) = J\epsilon where J is the Jacobian matrix of partial derivatives {\partial f_i \over \partial x_j}. But it’s often easier not to compute the Jacobian at all, and instead keep Df_{x_0} in an “implicit” form like in Example 2.

See another description of Fréchet derivatives in statistics.

Calculus-Free Asymptotic for Harmonic Numbers

Here we prove that the n-th harmonic number H_n = 1 + \frac{1}{2} + \cdots + \frac{1}{n} is \Theta(\log_2 n) without touching calculus (the most common proof interprets H_n as a Riemann sum of \int \frac{1}{x}\,dx). Seeing as this is probably the most fundamental asymptotic for computer scientists, I was surprised to see this Stack Exchange question didn’t mention the calculus-free approach. The proof here pretty much matches my own answer to the Stack Exchange post.

We first consider powers of 2, n = 2^k. For these n, we can break up \sum_{j=1}^{n} \frac{1}{j} into the “chunks”

1

\frac{1}{2} + \frac{1}{3}

\frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7}

\dots

\frac{1}{2^{k-1}} + \cdots + \frac{1}{2^k - 1}

and we have one extra term 1/2^k. There are k = \log_2 n chunks, and of course H_n is the sum of these chunks (plus the extra term), hence to show H_n = \Theta(\log_2 n) we show each chunk is \Theta(1). Inside each chunk we bound each term above by the chunk’s leading power of 2:

1 \leq 1

\frac{1}{2} + \frac{1}{3} \leq \frac{1}{2} + \frac{1}{2} = 1

\frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7} \leq \frac{1}{4} + \frac{1}{4} + \frac{1}{4} + \frac{1}{4}  =1

\dots

\frac{1}{2^{k-1}} + \cdots + \frac{1}{2^k - 1} \leq \frac{1}{2^{k-1}} + \cdots + \frac{1}{2^{k - 1}} = 1

Thus each chunk is at most 1, and taken together with the extra term \frac{1}{2^k}, we have H_n \leq \log_2 n + \frac{1}{2^k} \leq \log_2 n + 1.

On the other hand, if we lower bound the elements by the next power of 2,

1 \geq 1

\frac{1}{2} + \frac{1}{3} \geq \frac{1}{4} + \frac{1}{4} = \frac{1}{2}

\frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7} \geq \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8}  =\frac{1}{2}

\dots

\frac{1}{2^{k-1}} + \cdots + \frac{1}{2^k - 1} \geq \frac{1}{2^k} + \cdots + \frac{1}{2^k} = \frac{1}{2}

Every chunk is at least \frac{1}{2}, hence H_n \geq \frac{1}{2}\log_2 n.
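
For the skeptical, here is a small Python check (my addition, not part of the proof) that each chunk \frac{1}{2^j} + \cdots + \frac{1}{2^{j+1} - 1} really does lie between \frac{1}{2} and 1.

    from fractions import Fraction

    k = 10
    for j in range(k):                      # chunk j covers denominators 2^j, ..., 2^(j+1) - 1
        chunk = sum(Fraction(1, m) for m in range(2**j, 2**(j + 1)))
        assert Fraction(1, 2) <= chunk <= 1
        print(j, float(chunk))              # 1.0, 0.833..., 0.759..., decreasing toward ln 2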

This is essentially the proof. One technicality: we’ve only shown H_n = \Theta(\log_2 n) for n a power of 2. Monotonicity of H_n completes the proof for the remaining n: take the nearest powers of 2 below and above such an n, call them n_- and n_+, so that

n/2 < n_- < n < n_+ < 2n

Applying our bounds on H_{n_-} and H_{n_+},

H_n \leq H_{n_+} \leq \log_2 n_+ + 1 < \log_2 (2n) +1= \log_2 n + 2

H_n \geq H_{n_-} \geq \frac{1}{2}\log_2 n_- > \frac{1}{2}\log_2 (n/2) = \frac{1}{2}\log_2 n - \frac{1}{2}

Thus H_n = \Theta(\log_2 n).
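
And a final sanity check (my addition): the bounds \frac{1}{2}\log_2 n - \frac{1}{2} \leq H_n \leq \log_2 n + 2 we just derived hold in a quick numerical sweep, powers of 2 or not.

    import math

    H = 0.0
    for n in range(1, 10001):
        H += 1.0 / n                        # H now equals H_n
        assert 0.5 * math.log2(n) - 0.5 <= H <= math.log2(n) + 2
    print("bounds hold for all n up to 10000")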

(In general, for any monotonically increasing function, it suffices to establish a \Theta(\cdot) bound on only the powers of 2, provided the bounding function changes by at most a constant factor between n and 2n, as linear, sublinear, and logarithmic functions do.)

Announcing: Digital Paper

Congratulations! You’ve found the very first post in this blog. There’s no computer science in this post—only a bit of philosophy on why this blog exists. And perhaps some justification for why you want a blog, too.

There are actually only two reasons I started this blog:

  1. Organization
  2. Public availability

1. Organization

Right now, I’m a mathematical scribbler. When I write math, it fills up tons of pages, proves theorems, but is often mostly indecipherable afterwards. This blog is to be my “digital paper”, filled with high-quality scribbling only. In this way, this blog is for me: I can keep track of the things I work on.

2. Public Availability

On my messy sheets of paper are many interesting ideas that are invariably lost to the confines of my desk (and eventually, my recycling bin). Through this blog I can share these ideas with my friends and colleagues. Mathematics is an activity reliant on communication; this blog has ready-made content for me to let others know what I’ve been thinking about.