back
·
·
machine-learning

In this note I prove that the InfoNCE loss is a lower bound on mutual information.
While the original work of (van der Oord *et al.*, 2019) already proves this statement, when first reading the paper it was unclear to me
ⅰ. how the authors proved the inequality step
(that is, why equation (9) implies inequality (10) in the appendix) and
ⅱ. why the bound gets tighter with the number of negative samples.
The following presentation offers a different (and hopefully clearer) perspective.

*Proof.* We start from the definition of the InfoNCE loss (which we denote by \(L\)):

$$
L = - \mathbb{E} \left[\log\frac{r(x,y)}{r(x,y) + \sum_{i} r(\bar{x}_i,y)}\right] \\
$$

where \(r\) denotes the score function, that a assigns a non-negative value to pairs of data points. The expectation is over the following random variables:

- a pair \((x, y)\) sampled from the joint distribution \(P_{XY}\) (we can think of this pair as the "positive" example);
- a sequence of \(n - 1\) variables \(\bar x_i\) sampled from the marginal distribution \(P_X\) (when paired with \(y\), the examples \((\bar x_i, y)\) are "negative").

In short:

$$
\begin{align}
x, y &\sim P_{XY} \\
\bar x_1, \dots, \bar x_{n-1} &\sim P_X
\end{align}
$$

We assume that the score function is optimal (that is, it minimizes the loss). As shown in the paper the optimal function is given by the following ratio of distributions:

$$
\begin{align}
r(x,y) &= \frac{p(x|y)}{p(x)} = \frac{p(x, y)}{p(x) p(y)}
\end{align}
$$

Using this observation together with the linarity of the expectation we get:

$$
L = - \mathbb{E}\log\frac{p(x,y)}{p(x)p(y)}
+ \mathbb{E}\log\left[\frac{p(x|y)}{p(x)} + \sum_{i=1}^{n-1} \frac{p(\bar{x}_i|y)}{p(\bar{x}_i)}\right] \\
$$

The first term is the mutual information, while the second term can be simplified, because the values \(\bar x_i\) are sampled independetly from \(y\) and thus we have \(p(\bar x|y) = p(\bar x)\):

$$
L = - I(X;Y) + \mathbb{E}\log\left[\frac{p(x|y)}{p(x)} + n - 1\right]
$$

Now we apply Jensen's inequality to the second term. Jensen's inequality states that \(\mathbb E\left[\phi(z)\right] \ge \phi\left(\mathbb E z \right)\) if the function \(\phi\) is convex. In our case we apply the inequality to the function

$$
\phi(z) = \log\left(\frac{1}{z} + n - 1\right),
$$

which is convex (because it is the composition of a concave and nondecreasing function, \(\log\) with a convex function, \(1/z\); see page 84 in Boyd and Vanderberghe). This implies that

$$
\mathbb{E}\log\left[\frac{p(x|y)}{p(x)} + n - 1\right] \ge \log\left[\frac{1}{\mathbb E\frac{p(x)}{p(x|y)}} + n - 1\right],
$$

but since

$$
\mathbb E \frac{p(x)}{p(x|y)} = \iint \frac{p(x)p(y)}{p(x,y)} p(x,y) \; dx \; dy = 1.
$$

we obtain

$$
L \ge - I(X;Y) + \log n
$$

or, equivalently,

$$
I(X; Y) \ge - L + \log n
$$

The paper also claims that the bound gets tighter with the minibatch size, \(n\). To me it was not obvious why this the case because both \(-L\) and \(\log n\) increase with \(n\). I believe the explanation is that Jensen's inequality becomes tighter when the function \(\phi\) is closer to a linear function; in the limit, if the function is linear we achieve equality: \(\mathbb E\left[\phi(z)\right] = \phi\left(\mathbb E z \right)\). In our case, the function \(\phi\) that we have previously defined is almost constant if \(n\) dominates the other terms, that is, \(n \gg p(x|y) / p(x)\). This happens either when \(n\) is very large or when the two random variables are losely correlated; a similar observation is made by (Poole et al, 2019):

"[A]ccurately estimating mutual information still needs a large batch size at test time if the mutual information is high."

Code that numerically simulates the estimation of the lower bound is available here.