Entropy

In this note I will discuss some properties of entropy, and how they relate to more combinatorial notions in additive combinatorics. This is heavily influenced by a paper [Ta] and blog post [Ta2] of Tao.

Let $X$ be a discrete random variable, with distribution function $\mu(x) = \mathbb{P}(X=x)$. The (Shannon) entropy of $X$ is defined as \[ H(X) = \sum_x \mu(x) \log\frac{1}{\mu(x)}.\] We use the convention that $x\log(1/x)$ is $0$ at $x=0$. This is not meant to be a formal treatment, and my interest is when $X$ takes on only finitely many values anyway, so I'm not going to worry about things like convergence of this sum. The relative entropy is defined to be \[H(X\vert Y) = H(X,Y) - H(Y) = \sum_{y} \mathbb{P}(Y=y)H(X\vert Y=y).\] By the concavity of the function $x\mapsto x\log(1/x)$, we have the fundamental inequality \[H(X\vert Y)\leq H(X).\] Furthermore, by concavity again, if $X$ takes on at most $N$ different values then $H(X) \leq \log N$, which is achieved if and only if $X$ is uniformly distributed on these $N$ different values. These basic facts tell you pretty everything you need to know about entropy. When actually working with it, however, I find it easier to forget the formal definition above and think of it loosely as follows.

Suppose that I know the distribution law of $X$, and somebody has selected $X$ according to this law. The entropy $H(X)$ measures how many bits of information, on average, someone needs to communicate to me before I know what $X$ is. The following facts are intuitively obvious:

  1. $H(X)\geq 0$, with equality if and only if $X$ is deterministic, for then I know what $X$ is without any communication needing to take place,
  2. if $X$ can take at most $N$ different values then $H(X)\leq \log N$, because a trivial way for me to know what $X$ is is to just communicate the full value of $X$, and indeed if $X$ is uniformly distributed this is the best possible,
  3. $H(X\vert Y)\leq H(X)$, because knowing the value of $Y$ can never make it more difficult to know what $X$ is, and
  4. $H(X\vert Y)+H(Y) = H(X,Y)$, because communicating what $(X,Y)$ is jointly is equivalent to communicating first what $Y$ is and then what $X$ is, having fixed the value of $Y$.
Let me record one more elementary fact: if a random variable $X$ determines another random variable $Y$, in the sense that once the value of $X$ is fixed then $Y$ is deterministic, then \[H(X,Y)=H(X\vert Y)+H(Y) = H(Y\vert X)+H(X)=H(X).\] This is often very useful and, indeed, can be taken as a definition of what it means for one random variable to determine another.

Submodularity

The submodularity inequality is the elementary fact that $H(X,Y)\leq H(X)+H(Y)$. The real power of this, as with many entropy inequalities, is that this continues to hold even with relative entropy, conditioning on any other random variable $Z$, which can depend on $X$ and $Y$ in any fashion, so that \[ H(X,Y)-H(Z)=H(X,Y\vert Z) \leq H(X\vert Z) + H(Y\vert Z).\] In particular, if $X$ determines $Z$ and also $Y$ determines $Z$ then the right hand side is $H(X)+H(Y)-2H(Z)$, so that \[H(X,Y)+H(Z) \leq H(X)+H(Y).\] In particular, if there is some $Z$ that is determined by both $X$ and $Y$ then we can squeeze in an extra term of $H(Z)$ into the left-hand side of the more trivial $H(X,Y)\leq H(X)+H(Y)$.

Ruzsa triangle inequality

As an example of how this can be applied, suppose $X,Y,Z$ are three independent discrete random variables taking values in some abelian group $G$. Then $(X,Z)$ and $(X-Y,Y-Z)$ both determine $X-Z$, and so \[H(X,Z,X-Y,Y-Z)+H(X-Z)\leq H(X,Z)+H(X-Y,Y-Z).\] The first term is clearly $H(X,Y,Z)=H(X)+H(Y)+H(Z)$, and separating out the other entropies similarly using independence, we arrive at the entropic equivalent of the Ruzsa triangle inequality, \[H(Y)+H(X-Z) \leq H(X-Y) + H(Y-Z).\] The more conventional form of the Ruzsa triangle inequality says that if $A,B,C$ are finite subsets of an abelian group $G$ then \[\lvert B\rvert \lvert A-C\rvert \leq \lvert A-B\rvert \lvert B-C\rvert,\] and the proof is similar: fixing a unique representative pair $(a,c)$ arbitrarily for each $a-c\in A-C$, the map $(b,a-c)\mapsto (a-b,b-c)$ is an injection.

Iterated sumset inequality

If the entropy $H(X+Y)$ is not much larger than $H(X)$, then this is roughly saying that adding $Y$ does not have much of a smoothing effect on $X$. Viewed this way, one would expect that the same thing holds even if I had more (independent) copies of $Y$, so we expect $H(X+Y+Y')$ to be not much larger than $H(X)$ also. To make this rigorous, we can use the submodularity inequality. Namely, we now observe that $(X+Y,Z)$ and $(X,Y+Z)$ both determine $X+Y+Z$, so that \[H(X+Y,Y+Z,X,Z)+H(X+Y+Z)\leq H(X+Y,Z)+ H(X,Y+Z).\] In particular, if $X,Y,Z$ are independent then rearranging the above gives \[H(X+Y+Z) - H(Y+Z)\leq H(X+Y) - H(X).\] This crucial inequality can be iterated, to obtain results like the following.
Kaimanovich-Vershik [KaVe]
If $X,Y$ are independent random variables with $H(X+Y)\leq H(X)+\log K$ and $Y_1,\ldots,Y_t$ are independently sampled copies of $Y$ then \[ H(X+Y_1+\cdots +Y_t) \leq H(X) + t\log K.\]
This is the entropic equivalent of Plünnecke's inequality: if $\lvert A+B\rvert \leq K\lvert A\rvert$ then $\lvert tB\rvert\leq K^{t}\lvert A\rvert$. Note that, unlike Ruzsa's triangle inequality, this is not an exact equivalence, which would be $\lvert A+tB\rvert\leq K^t\lvert A\rvert$. Indeed, the analogue of the key submodularity inequality, which would be \[ \lvert A+B+C\rvert \lvert A\rvert \leq \lvert A+B\rvert \lvert A+C\rvert,\] for finite sets $A,B,C$, is false. Petridis observed that this becomes true, however, if we pass to some subset $A'\subset A$ depending only on $B$ (but, importantly, independent of $C$).
Petridis [Pe]
For any sets $A,B$ there is $A'\subset A$ (depending on $B$) such that for all sets $C$ \[\lvert A'+B+C \rvert \lvert A\rvert\leq \lvert A+B\rvert \lvert A'+C\rvert.\]
Iterating this lemma gives Plünnecke's inequality, in the slightly stronger form that there exists some $A'\subset A$ such that $\lvert A'+tB\rvert \leq K^{t}\lvert A'\rvert$.

Non-Shannon entropy inequalities

The mutual information between two random variables $X$ and $Y$ is, intuitively, how much the uncertainty inherent in $X$ is reduced once we know the value of $Y$. More precisely, we define the conditional mutual information to be \[ I(X:Y \vert Z) = H(X\vert Z) - H(X\vert Y,Z) = H(X\vert Z) + H(Y\vert Z) - H(X,Y\vert Z).\] Observe that, from the second equality, this quantity is symmetric, so that $I(X:Y\vert Z)=I(Y:X\vert Z)$.

The submodularity inequalities, also known as the Shannon inequalities, are those that follow from the observation that $I(X:Y\vert Z)\geq 0$. One might wonder whether this captures all the interesting behaviour of entropy, and one can show that for collections of at most three random variables, this is it: every possible entropy inequality is just a positive linear combination of inequalities of the shape $I\geq 0$.

For four or more variables, however, there are inequalities that do not follow from submodularity. The first was proved by Zhang and Yeung in 1998. To state it we generalise the notion of mutual information in the obvious fashion, so that the mutual information between $X,Y,Z$ is how much the mutual information between $X$ and $Y$ changes once the value of $Z$ is revealed: that is, $I(X:Y:Z)=I(X:Y)-I(X:Y\vert Z)$, and so on. It is not obvious from the intuitive definition, but can be easily checked, that this notion is indeed (as the notation suggests) symmetric in the roles of $X$, $Y$, and $Z$.

Zhang-Yeung inequality [ZhYe]
\[2I(U:V:Z)\leq I(U:V\vert Z)+I(U:V\vert X)+I(Z:X)+I(U,V : Z).\]
This is a special case of a more general inequality, proved in 2002 by Makarychev, Makarychev, Romaschenko, and Vereshchagin.
MMRV inequality [MMRV]
If $n\geq 2$ then \[nI(U:V:Z) \leq \sum_{i=1}^nI(U:V \vert X_i) + \sum_{i=1}^nH(X_i) - H(X_1,\ldots,X_n)+I(U,V: Z).\]
To recover the Zhang-Yeung inequality, let $n=2$ and $X_2=Z$. The proof of the MMRV inequality largely uses submodularity, with a final twist at the end of introducing a new auxiliary random variable.

We first observe that \[I(X_i:Z)\geq I(X_i:Z:U:V) =I(U:V)-I(U:V\vert X_i)-I(U:V\vert Z)+I(U:V\vert X_i,Z)\geq I(U:V)-I(U:V\vert X_i)-I(U:V\vert Z)\] and so \[ I(U:V)\leq I(U:V \vert X_i) + I(U:V\vert Z) + I(X_i:Z).\] In particular, summing over $i$, \[nI(U:V)\leq \sum_{i=1}^nI(U:V \vert X_i) + nI(U:V\vert Z)+ \sum_{i=1}^n I(X_i:Z).\] Furthermore, \[\sum_{i=1}^n I(X_i:Z)=\sum_{i=1}^nH(X_i)-\sum_{i=1}^nH(X_i\vert Z)\leq \sum_{i=1}^n H(X_i)-H(X_1,\ldots,X_n)+I(X_1,\ldots,X_n\vert Z).\] We have proved the following submodularity inequality: \[nI(U:V:Z)\leq \sum_{i=1}^nI(U:V \vert X_i) + \sum_{i=1}^nH(X_i)-H(X_1,\ldots,X_n)+I(X_1,\ldots,X_n : Z).\] Now, if $X_1,\ldots,X_n$ and $Z$ were independent given $U$ and $V$, then $I(X_1,\ldots,X_n: Z)\leq I(U,V : Z)$. Why? One way to check this is to note that, due to independence, \[H(X_1,\ldots,X_n,Z\vert U,V)=H(X_1,\ldots,X_n,Z,U,V)-H(U,V)=H(X_1,\ldots,X_n\vert U,V)+H(Z\vert U,V) = H(X_1,\ldots,X_n,U,V)+H(Z,U,V)-2H(U,V)\] whence \[H(X_1,\ldots,X_n,Z)-H(X_1,\ldots,X_n)\geq H(X_1,\ldots,X_n,Z,U,V)-H(X_1,\ldots,X_n,U,V)=H(Z,U,V)-H(U,V)\] and so \[H(X_1,\ldots,X_n)+H(U,V,Z)\leq H(U,V)+H(X_1,\ldots,X_n,Z)\] and hence, expanding information out in terms of individual entropies, \[I(X_1,\ldots,X_n: Z) = H(X_1,\ldots,X_n)+H(Z)-H(X_1,\ldots,X_n,Z) \leq H(U,V)+H(Z)-H(U,V,Z)=I(U,V:Z).\] In particular, substituting this inequality into the above, we recover the MMRV inequality. That is, under this independence assumption, the MMRV inequality is a submodularity inequality.

To deduce the MMRV inequality in general, we observe that in the inequality no term involves both $Z$ and $X_i$ simultaneously. Let $Z'$ be a new random variable such that $(Z',U,V)$ has the same distribution as $(Z,U,V)$ and such that $X_1,\ldots,X_n$ and $Z'$ are independent given $U$ and $V$. More precisely, \[\mathbb{P}(Z'=z \vert X_1=x_1,\ldots,X_n=x_n,U=u,V=v)=\mathbb{P}(Z=z \vert U=u, V=v).\] We can now apply the above inequality with $Z'$ replaced by $Z$, and observe that no individual terms change, and the proof is complete.