Let $X$ be a discrete random variable, with probability mass function $\mu(x) = \mathbb{P}(X=x)$. The (Shannon) entropy of $X$ is defined as \[ H(X) = \sum_x \mu(x) \log\frac{1}{\mu(x)}.\] We use the convention that $x\log(1/x)$ is $0$ at $x=0$. This is not meant to be a formal treatment, and my interest is in the case when $X$ takes only finitely many values anyway, so I'm not going to worry about things like convergence of this sum.

The conditional entropy is defined to be \[H(X\vert Y) = H(X,Y) - H(Y) = \sum_{y} \mathbb{P}(Y=y)H(X\vert Y=y).\] By the concavity of the function $x\mapsto x\log(1/x)$, we have the fundamental inequality \[H(X\vert Y)\leq H(X).\] Furthermore, by concavity again, if $X$ takes at most $N$ different values then $H(X) \leq \log N$, with equality if and only if $X$ is uniformly distributed on these $N$ values. These basic facts tell you pretty much everything you need to know about entropy. When actually working with it, however, I find it easier to forget the formal definition above and think of it loosely as follows.

Suppose that I know the distribution law of $X$, and somebody has selected $X$ according to this law. The entropy $H(X)$ measures how many bits of information, on average, someone needs to communicate to me before I know what $X$ is. The following facts are intuitively obvious:

- $H(X)\geq 0$, with equality if and only if $X$ is deterministic, for then I know what $X$ is without any communication needing to take place,
- if $X$ can take at most $N$ different values then $H(X)\leq \log N$, because a trivial way for me to learn $X$ is to be sent its full value, which takes at most $\log N$ bits, and if $X$ is uniformly distributed this is the best possible,
- $H(X\vert Y)\leq H(X)$, because knowing the value of $Y$ can never make it more difficult to know what $X$ is, and
- $H(X\vert Y)+H(Y) = H(X,Y)$, because communicating what $(X,Y)$ is jointly is equivalent to communicating first what $Y$ is and then what $X$ is, having fixed the value of $Y$.
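The four facts above can be checked numerically on a small example. The following sketch (the helper names `entropy`, `marginal`, and `cond_entropy` are my own, not from any library) computes the relevant quantities for a small joint distribution:

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy (in nats) of a dict {outcome: probability}."""
    return sum(p * math.log(1 / p) for p in dist.values() if p > 0)

# An arbitrary joint distribution of (X, Y) on {0,1} x {0,1,2}.
joint = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.25, (1, 2): 0.35}

def marginal(joint, coord):
    m = defaultdict(float)
    for outcome, p in joint.items():
        m[outcome[coord]] += p
    return dict(m)

X, Y = marginal(joint, 0), marginal(joint, 1)

def cond_entropy(joint):
    """H(X | Y) computed as sum_y P(Y=y) H(X | Y=y)."""
    Y = marginal(joint, 1)
    total = 0.0
    for y, py in Y.items():
        slice_ = {x: p / py for (x, yy), p in joint.items() if yy == y}
        total += py * entropy(slice_)
    return total

HX, HY, HXY, HXgY = entropy(X), entropy(Y), entropy(joint), cond_entropy(joint)
print(HXgY <= HX)                      # conditioning cannot increase entropy
print(abs(HXgY + HY - HXY) < 1e-12)    # chain rule H(X|Y) + H(Y) = H(X,Y)
print(HX <= math.log(len(X)) + 1e-12)  # H(X) <= log N
```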

**Kaimanovich-Vershik [KaVe]**

If $X,Y$ are independent random variables with $H(X+Y)\leq H(X)+\log K$, and $Y_1,\ldots,Y_t$ are independent copies of $Y$ (independent also of $X$), then
\[ H(X+Y_1+\cdots +Y_t) \leq H(X) + t\log K.\]
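As a sanity check, the lemma can be verified numerically for explicit distributions. The sketch below (with arbitrarily chosen distributions on small integer supports) computes the exact law of the sums by convolution, takes $\log K = H(X+Y)-H(X)$, and checks the claimed bound for several $t$:

```python
import math
from collections import defaultdict

def entropy(dist):
    return sum(p * math.log(1 / p) for p in dist.values() if p > 0)

def convolve(d1, d2):
    """Distribution of the sum of two independent integer-valued variables."""
    out = defaultdict(float)
    for a, pa in d1.items():
        for b, pb in d2.items():
            out[a + b] += pa * pb
    return dict(out)

# Arbitrary illustrative distributions for X and Y.
X = {0: 0.5, 1: 0.25, 2: 0.25}
Y = {0: 0.6, 3: 0.3, 7: 0.1}

logK = entropy(convolve(X, Y)) - entropy(X)  # so H(X+Y) = H(X) + log K exactly

S = dict(X)
for t in range(1, 6):
    S = convolve(S, Y)                       # law of X + Y_1 + ... + Y_t
    slack = entropy(X) + t * logK - entropy(S)
    print(t, slack >= -1e-9)                 # the lemma predicts slack >= 0
```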
**Petridis [Pe]**

For any finite sets $A,B$ in an abelian group there is a nonempty $A'\subseteq A$ (depending on $B$) such that for all finite sets $C$
\[\lvert A'+B+C \rvert \lvert A\rvert\leq \lvert A+B\rvert \lvert A'+C\rvert.\]
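This can be tested by brute force for small sets of integers. Following the idea of Petridis's proof, the sketch below takes for $A'$ a subset minimising the ratio $\lvert A'+B\rvert/\lvert A'\rvert$, then tests the inequality against a few sets $C$ (all sets here are arbitrary illustrative choices):

```python
from itertools import combinations

def sumset(*sets):
    """Iterated sumset of any number of finite integer sets."""
    out = {0}
    for s in sets:
        out = {a + b for a in out for b in s}
    return out

A = {0, 1, 2, 5}
B = {0, 1, 10}

# Pick A' as a nonempty subset of A minimising |A' + B| / |A'|, as in
# Petridis's proof; the lemma holds for any such minimiser.
candidates = [set(c) for r in range(1, len(A) + 1) for c in combinations(A, r)]
Aprime = min(candidates, key=lambda s: len(sumset(s, B)) / len(s))

tests = [{0}, {0, 1}, {0, 4, 9}, {0, 1, 2, 3, 100}]
for C in tests:
    lhs = len(sumset(Aprime, B, C)) * len(A)
    rhs = len(sumset(A, B)) * len(sumset(Aprime, C))
    print(lhs <= rhs)
```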
The submodularity inequalities, also known as the Shannon inequalities, are those that follow from the observation that $I(X:Y\vert Z)\geq 0$. One might wonder whether this captures all the interesting behaviour of entropy, and for collections of at most three random variables one can show that it does: every valid entropy inequality is a positive linear combination of inequalities of the shape $I(X:Y\vert Z)\geq 0$.
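Submodularity in this form is easy to probe numerically: for random joint distributions, $I(X:Y\vert Z)$ should never be negative (up to floating-point error). A quick randomised sketch, with my own helper `H` for marginal entropies:

```python
import math, random, itertools

def H(joint, coords):
    """Entropy of the marginal of `joint` on the coordinate tuple `coords`."""
    m = {}
    for o, p in joint.items():
        k = tuple(o[i] for i in coords)
        m[k] = m.get(k, 0.0) + p
    return sum(p * math.log(1 / p) for p in m.values() if p > 0)

random.seed(0)
worst = float("inf")
for _ in range(500):
    # A random joint distribution of (X, Y, Z) on a 3x3x3 alphabet.
    outcomes = list(itertools.product(range(3), repeat=3))
    w = [random.random() for _ in outcomes]
    total = sum(w)
    joint = {o: x / total for o, x in zip(outcomes, w)}
    # I(X:Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
    cmi = H(joint, (0, 2)) + H(joint, (1, 2)) - H(joint, (0, 1, 2)) - H(joint, (2,))
    worst = min(worst, cmi)
print(worst >= -1e-12)  # never negative, up to floating point
```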

For four or more variables, however, there are inequalities that do not follow from submodularity. The first was proved by Zhang and Yeung in 1998. To state it we generalise the notion of mutual information in the obvious fashion, so that the mutual information between $X$, $Y$, and $Z$ is how much the mutual information between $X$ and $Y$ decreases once the value of $Z$ is revealed: that is, $I(X:Y:Z)=I(X:Y)-I(X:Y\vert Z)$, and so on. It is not obvious from the intuitive definition, but can be easily checked, that this notion is indeed (as the notation suggests) symmetric in the roles of $X$, $Y$, and $Z$. Note that, unlike the mutual information between two variables, $I(X:Y:Z)$ can be negative.

**Zhang-Yeung inequality [ZhYe]**

\[2I(U:V:Z)\leq I(U:V\vert Z)+I(U:V\vert X)+I(Z:X)+I(U,V : Z).\]
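Since this is a valid entropy inequality, a randomised check should never find a violation. The sketch below (with my own small helpers `H` and `I` for entropies and conditional mutual informations of marginals) tests the inequality as stated, on random joint distributions of four binary variables:

```python
import math, random, itertools

def H(joint, coords):
    """Entropy of the marginal of `joint` on the coordinate tuple `coords`."""
    m = {}
    for o, p in joint.items():
        k = tuple(o[i] for i in coords)
        m[k] = m.get(k, 0.0) + p
    return sum(p * math.log(1 / p) for p in m.values() if p > 0)

def I(joint, a, b, c=()):
    """Conditional mutual information I(a : b | c), coordinates as tuples."""
    return H(joint, a + c) + H(joint, b + c) - H(joint, a + b + c) - H(joint, c)

random.seed(0)
# Coordinates: 0 = U, 1 = V, 2 = Z, 3 = X.
U, V, Z, X = (0,), (1,), (2,), (3,)
for _ in range(200):
    outcomes = list(itertools.product(range(2), repeat=4))
    w = [random.random() for _ in outcomes]
    total = sum(w)
    joint = {o: x / total for o, x in zip(outcomes, w)}
    lhs = 2 * (I(joint, U, V) - I(joint, U, V, Z))   # 2 I(U:V:Z)
    rhs = (I(joint, U, V, Z) + I(joint, U, V, X)
           + I(joint, Z, X) + I(joint, U + V, Z))    # last term is I(U,V : Z)
    assert lhs <= rhs + 1e-9
print("all checks passed")
```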
**MMRV inequality [MMRV]**

If $n\geq 2$ then
\[nI(U:V:Z) \leq \sum_{i=1}^nI(U:V \vert X_i) + \sum_{i=1}^nH(X_i) - H(X_1,\ldots,X_n)+I(U,V: Z).\]
We first observe that \[I(X_i:Z)\geq I(X_i:Z:U:V) =I(U:V)-I(U:V\vert X_i)-I(U:V\vert Z)+I(U:V\vert X_i,Z)\geq I(U:V)-I(U:V\vert X_i)-I(U:V\vert Z),\] and so \[ I(U:V)\leq I(U:V \vert X_i) + I(U:V\vert Z) + I(X_i:Z).\] In particular, summing over $i$, \[nI(U:V)\leq \sum_{i=1}^nI(U:V \vert X_i) + nI(U:V\vert Z)+ \sum_{i=1}^n I(X_i:Z).\] Furthermore, since $\sum_{i=1}^nH(X_i\vert Z)\geq H(X_1,\ldots,X_n\vert Z)=H(X_1,\ldots,X_n)-I(X_1,\ldots,X_n : Z)$, we have \[\sum_{i=1}^n I(X_i:Z)=\sum_{i=1}^nH(X_i)-\sum_{i=1}^nH(X_i\vert Z)\leq \sum_{i=1}^n H(X_i)-H(X_1,\ldots,X_n)+I(X_1,\ldots,X_n : Z).\] Recalling that $I(U:V)-I(U:V\vert Z)=I(U:V:Z)$, we have proved the following submodularity inequality: \[nI(U:V:Z)\leq \sum_{i=1}^nI(U:V \vert X_i) + \sum_{i=1}^nH(X_i)-H(X_1,\ldots,X_n)+I(X_1,\ldots,X_n : Z).\]

Now, if $X_1,\ldots,X_n$ and $Z$ were conditionally independent given $U$ and $V$, then we would have $I(X_1,\ldots,X_n: Z)\leq I(U,V : Z)$. Why? One way to check this is to note that, by this conditional independence, \[H(X_1,\ldots,X_n,Z\vert U,V)=H(X_1,\ldots,X_n,Z,U,V)-H(U,V)=H(X_1,\ldots,X_n\vert U,V)+H(Z\vert U,V) = H(X_1,\ldots,X_n,U,V)+H(Z,U,V)-2H(U,V),\] whence, by submodularity, \[H(X_1,\ldots,X_n,Z)-H(X_1,\ldots,X_n)\geq H(X_1,\ldots,X_n,Z,U,V)-H(X_1,\ldots,X_n,U,V)=H(Z,U,V)-H(U,V),\] and so \[H(X_1,\ldots,X_n)+H(U,V,Z)\leq H(U,V)+H(X_1,\ldots,X_n,Z),\] and hence, expanding the mutual information in terms of individual entropies, \[I(X_1,\ldots,X_n: Z) = H(X_1,\ldots,X_n)+H(Z)-H(X_1,\ldots,X_n,Z) \leq H(U,V)+H(Z)-H(U,V,Z)=I(U,V:Z).\] Substituting this into the inequality above, we recover the MMRV inequality. That is, under this conditional independence assumption, the MMRV inequality is a submodularity inequality.

To deduce the MMRV inequality in general, we observe that no term in the inequality involves both $Z$ and any $X_i$ simultaneously. Let $Z'$ be a new random variable such that $(Z',U,V)$ has the same distribution as $(Z,U,V)$ and such that $X_1,\ldots,X_n$ and $Z'$ are conditionally independent given $U$ and $V$. More precisely, \[\mathbb{P}(Z'=z \vert X_1=x_1,\ldots,X_n=x_n,U=u,V=v)=\mathbb{P}(Z=z \vert U=u, V=v).\] We can now apply the above inequality with $Z$ replaced by $Z'$ and observe that, since each term involves either only $(Z,U,V)$ or only the $X_i$ together with $U$ and $V$, no individual term changes value, and the proof is complete.
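The construction of $Z'$ can be made completely explicit for finitely supported distributions. The sketch below (with a single $X$ in place of $X_1,\ldots,X_n$, for simplicity) builds $Z'$ from a random joint distribution of $(X,U,V,Z)$ via $\mathbb{P}(x,u,v,z')=\mathbb{P}(x,u,v)\,\mathbb{P}(z'\vert u,v)$, then checks that $(Z',U,V)$ has the same law as $(Z,U,V)$ and that $I(X:Z')\leq I(U,V:Z')$:

```python
import math, random, itertools

def marg(joint, coords):
    """Marginal of `joint` on the coordinate tuple `coords`."""
    m = {}
    for o, p in joint.items():
        k = tuple(o[i] for i in coords)
        m[k] = m.get(k, 0.0) + p
    return m

def H(joint, coords):
    return sum(p * math.log(1 / p) for p in marg(joint, coords).values() if p > 0)

def I(joint, a, b):
    """Mutual information I(a : b), coordinates as tuples."""
    return H(joint, a) + H(joint, b) - H(joint, a + b)

random.seed(1)
outcomes = list(itertools.product(range(2), repeat=4))   # (x, u, v, z)
w = [random.random() for _ in outcomes]
total = sum(w)
joint = {o: x / total for o, x in zip(outcomes, w)}

p_xuv = marg(joint, (0, 1, 2))
p_uvz = marg(joint, (1, 2, 3))
p_uv = marg(joint, (1, 2))

# P(x,u,v,z') = P(x,u,v) * P(z'|u,v): X independent of Z' given (U,V).
joint2 = {(x, u, v, z): p_xuv[(x, u, v)] * p_uvz[(u, v, z)] / p_uv[(u, v)]
          for (x, u, v, z) in outcomes}

# (Z', U, V) has the same law as (Z, U, V)...
assert all(abs(marg(joint2, (1, 2, 3))[k] - p_uvz[k]) < 1e-12 for k in p_uvz)
# ...and I(X : Z') <= I(U,V : Z'), the key inequality from the proof.
assert I(joint2, (0,), (3,)) <= I(joint2, (1, 2), (3,)) + 1e-9
print("construction verified")
```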

**[KaVe]** V. Kaimanovich and A. Vershik, Random walks on discrete groups: boundary and entropy, Ann. Probab. 11 (1983), no. 3, 457-490.

**[MMRV]** K. Makarychev, Y. Makarychev, A. Romashchenko, and N. Vereshchagin, A new class of non-Shannon-type inequalities for entropies, Communications in Information and Systems (2002), no. 2, 147-166.

**[Pe]** G. Petridis, New Proofs of Plünnecke-type Estimates for Product Sets in Groups, 2011, arXiv:1101.3507.

**[Ta]** T. Tao, Sumset and inverse sumset theorems for Shannon entropy, 2009, arXiv:0906.4387.

**[Ta2]** T. Tao, An entropy Plünnecke-Ruzsa inequality, 2009, https://terrytao.wordpress.com/2009/10/27/an-entropy-plunnecke-ruzsa-inequality/

**[Ta3]** T. Tao, Special cases of Shannon entropy, 2017, https://terrytao.wordpress.com/2017/03/01/special-cases-of-shannon-entropy/

**[ZhYe]** Z. Zhang and R. W. Yeung, On characterization of entropy function via information inequalities, IEEE Transactions on Information Theory (1998), 1440-1450.