Bounds on families with small VC-dimension

Let $\mathcal{F}$ be a family of subsets of $\{1,\ldots,n\}$. We say that $\mathcal{F}$ shatters a set $X\subset\{1,\ldots,n\}$ if for every $X'\subset X$ there is a set $A\in\mathcal{F}$ such that $A\cap X=X'$. The VC-dimension of $\mathcal{F}$ is the size of the largest set that $\mathcal{F}$ shatters. Roughly speaking, this is one way to measure how much information a family of sets contains: how large a set can the family distinguish all the subsets of? Trivially, the VC-dimension of $\mathcal{F}$ is at most $\log_2\lvert \mathcal{F}\rvert$, since shattering a set of size $d$ requires at least $2^d$ distinct sets in $\mathcal{F}$.
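These definitions are easy to check by brute force on small examples. Here is a Python sketch (function names are my own) computing the VC-dimension directly from the definition:

```python
from itertools import combinations

def shatters(family, X):
    """Does `family` shatter X?  Every subset of X must occur as A & X."""
    return len({A & X for A in family}) == 2 ** len(X)

def vc_dimension(family, ground):
    """Size of the largest shattered subset of `ground` (brute force)."""
    for d in range(len(ground), 0, -1):
        if any(shatters(family, frozenset(X)) for X in combinations(ground, d)):
            return d
    return 0

# the family of all subsets of {1,...,5} of size at most 1:
# it shatters every singleton, but no pair, so its VC-dimension is 1
family = [frozenset()] + [frozenset({i}) for i in range(1, 6)]
print(vc_dimension(family, range(1, 6)))   # → 1
```

Note the trivial bound in action: this family has $6$ members, so its VC-dimension is at most $\log_2 6<3$, and indeed it is $1$.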

VC here stands for Vapnik and Chervonenkis, after the two authors who first introduced this concept and used it in probability theory [VaCh].

The following shifting operation is very useful: when we shift a family of sets by $x$ we remove $x$ from every set that we can, as long as this doesn't collapse two sets into one. Precisely, if $A\in\mathcal{F}$ then the shift of $A$ by $x$ is defined to be \[ S_x(A) = \begin{cases} A\backslash\{x\} &\textrm{if }x\in A\textrm{ and }A\backslash\{x\}\not\in \mathcal{F}\textrm{, and}\\ A&\textrm{otherwise.}\end{cases}\] We say that $A$ is shifted if $S_x(A)\neq A$. Extending the notation in the obvious fashion, $S_x(\mathcal{F})$ is the family that results from shifting every set down by $x$. We note the following facts:

  1. Shifting preserves the size of the family: $\lvert S_x(\mathcal{F})\rvert=\lvert \mathcal{F}\rvert$ for every $x$.
  2. If $S_x(\mathcal{F})$ shatters $X$ then $\mathcal{F}$ shatters $X$: let $X'\subset X$, and suppose $A\cap X=X'$ for some $A\in S_x(\mathcal{F})$. We claim that there is some $B\in \mathcal{F}$ such that $B\cap X=X'$. If $A\in\mathcal{F}$ then we're done; otherwise, it must be the case that $x\not\in A$ and $A\cup \{x\}\in \mathcal{F}$. If $x\not\in X$ then $(A\cup\{x\})\cap X=A\cap X=X'$. If $x\in X$ then let $B\in S_x(\mathcal{F})$ be such that $B\cap X=X'\cup \{x\}$. Since $x\in B$ and $B\in S_x(\mathcal{F})$ we must have $B\backslash\{x\}\in \mathcal{F}$ (or else $B$ would have been shifted away), and $(B\backslash\{x\})\cap X = X'$.
  3. If $S_x(\mathcal{F})=\mathcal{F}$ for every $x$ then $\mathcal{F}$ is closed under subsets. That is, if $A\in \mathcal{F}$ and $A'\subset A$ then $A'\in\mathcal{F}$.
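To make the shifting operation concrete, here is a small Python sketch (names are my own) implementing $S_x$ and checking facts 1 and 3 by repeatedly shifting a family until it stabilises:

```python
from itertools import combinations

def shift(family, x):
    """S_x: delete x from every set containing it, unless the result is
    already in the family (which would collapse two sets into one)."""
    fam = set(family)
    return {A - {x} if (x in A and (A - {x}) not in fam) else A for A in fam}

def fully_shift(family, ground):
    """Shift by every element repeatedly until nothing changes."""
    fam = set(family)
    changed = True
    while changed:
        changed = False
        for x in ground:
            new = shift(fam, x)
            if new != fam:
                fam, changed = new, True
    return fam

def downward_closed(fam):
    """Is the family closed under taking subsets?"""
    return all(frozenset(S) in fam
               for A in fam for r in range(len(A))
               for S in combinations(A, r))

family = {frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})}
stable = fully_shift(family, [1, 2, 3])
assert len(stable) == len(family)   # fact 1: shifting preserves size
assert downward_closed(stable)      # fact 3: a stable family is closed under subsets
print(sorted(map(sorted, stable)))  # → [[], [2], [3]]
```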
In particular, shattering is preserved (one way, at least) under taking shifts. We can therefore assume, for many proofs, that $\mathcal{F}$ is closed under subsets, which is very convenient. (Repeated shifting must terminate, since each non-trivial shift strictly decreases $\sum_{A\in\mathcal{F}}\lvert A\rvert$.) For example, such a family shatters every one of its own members: if $A'\subset A\in\mathcal{F}$ then $A'\in\mathcal{F}$ and $A'\cap A=A'$. Since shifting preserves the size of the family, and every set shattered by the shifted family is also shattered by the original, it follows that any family shatters at least as many sets as it has members. We have proved the following fundamental lemma, due independently to Sauer and Shelah, with a slight strengthening due to Pajor. It was also proved a year earlier by Vapnik and Chervonenkis. Clearly it was an idea whose time had come.
Sauer-Shelah-Pajor [Sa] [Sh] [Pa]
Every family $\mathcal{F}$ shatters at least $\lvert \mathcal{F}\rvert$ different sets. In particular, if $\mathcal{F}\subset 2^{[n]}$ has VC-dimension at most $d$ then \[\lvert \mathcal{F}\rvert \leq \sum_{i=0}^d \binom{n}{i}\leq (cn/d)^d\] for some absolute constant $c>0$.
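The first claim of the lemma (Pajor's form: every family shatters at least $\lvert\mathcal{F}\rvert$ sets) is easy to test numerically. A brute-force sketch on random families over a small ground set (names are my own):

```python
import random
from itertools import combinations

def shattered_sets(family, ground):
    """All subsets X of `ground` shattered by `family`."""
    out = []
    for r in range(len(ground) + 1):
        for X in combinations(ground, r):
            Xf = frozenset(X)
            if len({A & Xf for A in family}) == 2 ** r:
                out.append(Xf)
    return out

random.seed(0)
ground = list(range(6))
subsets = [frozenset(S) for r in range(7) for S in combinations(ground, r)]
for _ in range(100):
    fam = random.sample(subsets, 10)
    # Pajor's form of the lemma: at least |F| shattered sets
    assert len(shattered_sets(fam, ground)) >= len(fam)
```

(The empty set is always shattered by a non-empty family, which is why the count starts at $r=0$.)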
In the rest of this note I'll give an exposition of a proof of the following result of Haussler, giving an improved upper bound on the size of families with small VC-dimension under the additional assumption that this family is well-separated. My interest in this result comes from a recent application in arithmetic combinatorics by Alon, Fox, and Zhao [AlFoZh].
Haussler [Ha]
If $\mathcal{F}\subset 2^{[n]}$ has VC-dimension at most $d$ and $\lvert A\Delta B\rvert \geq k$ for all distinct $A,B\in\mathcal{F}$ then \[\lvert \mathcal{F}\rvert \leq (Cn/k)^d\] for some absolute constant $C>0$.

Shifts and symmetric differences

What effect does shifting have on the symmetric difference $A\Delta B$? Shifting by $x$ can affect at most whether $x$ lies in the symmetric difference. Furthermore, if both or neither of $A$ and $B$ are shifted then the symmetric difference does not change at all. Otherwise, suppose $A$ is shifted, so that $x\in A$ and $S_x(A)=A\backslash\{x\}$, and $B$ is not shifted. If $x\not\in B$ then $S_x(A)\Delta S_x(B)=(A\Delta B)\backslash\{x\}$. If $x\in B$ then $S_x(A)\Delta S_x(B)=A\Delta (B\backslash\{x\})$; on the other hand, $S_x(A)\Delta S_x(B\backslash\{x\})=A\Delta B$ (note that $B\backslash\{x\}\in\mathcal{F}$, since $B$ was not shifted, and $B\backslash\{x\}$ is fixed by $S_x$).

This shows that, for fixed $A$, the vector $(A\Delta B)_{B\in \mathcal{F}}$ does not change after shifting, except that some coordinates may be transposed (swapping the roles of $B$ and $B\backslash\{x\}$) and $x$ may be deleted from some entries; the latter happens only when exactly one of $A$ and $B$ is shifted and $x$ does not lie in the other.
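This bookkeeping can be checked by computer: under a single shift, the multiset of pairwise symmetric-difference sizes is only ever permuted or decremented, never increased overall. A quick randomised sketch (the shift is re-implemented here for self-containment; names are my own):

```python
import random
from itertools import combinations

def shift(family, x):
    """S_x: delete x from every set containing it, unless the result is
    already present in the family."""
    fam = set(family)
    return {A - {x} if (x in A and (A - {x}) not in fam) else A for A in fam}

def pair_distances(fam):
    """Sorted multiset of |A Δ B| over unordered pairs of distinct sets."""
    return sorted(len(A ^ B) for A, B in combinations(list(fam), 2))

random.seed(1)
ground = list(range(6))
subsets = [frozenset(S) for r in range(7) for S in combinations(ground, r)]
for _ in range(100):
    fam = set(random.sample(subsets, 8))
    for x in ground:
        new = shift(fam, x)
        assert len(new) == len(fam)   # no two sets collapse into one
        old_d, new_d = pair_distances(fam), pair_distances(new)
        # each distance is permuted or shrinks by one, so the sorted
        # multiset is dominated coordinatewise by the old one
        assert all(a <= b for a, b in zip(new_d, old_d))
```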

In particular, the number of pairs $A,B\in\mathcal{F}$ with $\lvert A\Delta B\rvert=1$ can only increase: a pair at distance $2$ may drop to distance $1$ when $x$ is deleted, but no pair at distance $1$ can collapse to distance $0$, since if $A$ is shifted and $S_x(A)\Delta S_x(B)=\emptyset$ then $S_x(A)=A\backslash \{x\}=S_x(B)=B$, which contradicts the fact that $A\backslash\{x\}\not\in\mathcal{F}$. It follows that if we let \[ E(\mathcal{F}) = \#\{ \{A,B\}\subset\mathcal{F} : \lvert A\Delta B\rvert =1\} \] then $E(\mathcal{F})\leq E(S_x(\mathcal{F}))$ for every $x$. Repeatedly shifting (which must terminate, since each non-trivial shift strictly decreases $\sum_A\lvert A\rvert$) produces a family $\mathcal{G}$ with $\lvert\mathcal{G}\rvert=\lvert\mathcal{F}\rvert$ which is closed under subsets and, by fact 2 above, still has VC-dimension at most $d$. Every pair in $\mathcal{G}$ at distance $1$ has the shape $\{A\backslash\{x\},A\}$ for some $x\in A$, and every $A\in\mathcal{G}$ is shattered by $\mathcal{G}$, so that $\lvert A\rvert\leq d$. Counting each such pair at its larger member, $E(\mathcal{G})\leq\sum_{A\in\mathcal{G}}\lvert A\rvert\leq d\lvert\mathcal{G}\rvert$. Hence, whenever $\mathcal{F}$ has VC-dimension at most $d$, \[ E(\mathcal{F})\leq d\lvert \mathcal{F}\rvert.\]

Suppose we have any probability distribution on $\mathcal{F}$. Consider $1_{x\in A}$, the random variable which detects whether $x$ lies in the randomly chosen set $A$. Conditioning on the values of $1_{y\in A}$ for all $y\neq x$ amounts to conditioning on $A\backslash\{x\}$, so the expected conditional variance is \[\textrm{Var}(1_{x\in A} \vert \cup_{y\neq x} 1_{y\in A})=\sum_{x\not\in B}\mathbb{P}(A\backslash\{x\}=B)\mathbb{P}(x\in A \vert A\backslash \{x\}=B)\mathbb{P}(x\not\in A \vert A\backslash \{x\}=B)\] \[=\sum_{\substack{A,B\\ A\Delta B=\{x\}}}(\mathbb{P}(A)+\mathbb{P}(B))\frac{\mathbb{P}(A)}{\mathbb{P}(A)+\mathbb{P}(B)}\frac{\mathbb{P}(B)}{\mathbb{P}(A)+\mathbb{P}(B)} =\sum_{\substack{A,B\\ A\Delta B=\{x\}}}\frac{\mathbb{P}(A)\mathbb{P}(B)}{\mathbb{P}(A)+\mathbb{P}(B)}, \] where each unordered pair is counted once. Since $\frac{\mathbb{P}(A)\mathbb{P}(B)}{\mathbb{P}(A)+\mathbb{P}(B)}\leq \min(\mathbb{P}(A),\mathbb{P}(B))$, summing over $x$ gives \[\sum_x \textrm{Var}(1_{x\in A}\vert \cup_{y\neq x}1_{y\in A})\leq \sum_{\substack{\{A,B\}\\ \lvert A\Delta B\rvert=1}}\min(\mathbb{P}(A),\mathbb{P}(B)).\] Writing $\min(\mathbb{P}(A),\mathbb{P}(B))=\int_0^\infty 1_{\mathbb{P}(A)>t}1_{\mathbb{P}(B)>t}\,dt$, and noting that $\mathcal{F}_t=\{A\in\mathcal{F}:\mathbb{P}(A)>t\}$ also has VC-dimension at most $d$, the right-hand side is \[\int_0^\infty E(\mathcal{F}_t)\,dt\leq d\int_0^\infty \lvert\mathcal{F}_t\rvert\,dt= d\sum_{A}\mathbb{P}(A)=d,\] if $\mathcal{F}$ has VC-dimension at most $d$. This result can be immediately generalised, by considering the family of sets created by intersecting each $A\in\mathcal{F}$ with some fixed ground set $I$:
Lemma 1
If the VC-dimension of $\mathcal{F}$ is at most $d$ then, for any probability distribution on $\mathcal{F}$, and any set $I$, \[\sum_{x\in I} \textrm{Var}(1_{x\in A} \vert \cup_{y\in I\backslash \{x\}}1_{y\in A})\leq d.\]
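For the uniform distribution, each pair at distance $1$ contributes exactly $\frac{(1/\lvert\mathcal{F}\rvert)^2}{2/\lvert\mathcal{F}\rvert}=1/(2\lvert\mathcal{F}\rvert)$ to the sum in Lemma 1 (taking $I$ to be the whole ground set), so the lemma can be tested numerically. A brute-force sketch on random families (names are my own):

```python
import random
from itertools import combinations

def vc_dimension(family, ground):
    """Size of the largest shattered subset of `ground` (brute force)."""
    for d in range(len(ground), 0, -1):
        for X in combinations(ground, d):
            Xf = frozenset(X)
            if len({A & Xf for A in family}) == 2 ** d:
                return d
    return 0

def variance_sum_uniform(fam):
    """sum over x of Var(1_{x in A} | other coordinates), A uniform on fam:
    each pair A, B with |A Δ B| = 1 contributes P(A)P(B)/(P(A)+P(B)) = 1/(2|fam|)."""
    fam = list(fam)
    unit_pairs = sum(1 for A, B in combinations(fam, 2) if len(A ^ B) == 1)
    return unit_pairs / (2 * len(fam))

random.seed(3)
ground = list(range(6))
subsets = [frozenset(S) for r in range(7) for S in combinations(ground, r)]
for _ in range(100):
    fam = random.sample(subsets, 12)
    assert variance_sum_uniform(fam) <= vc_dimension(fam, ground)   # Lemma 1
```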

Separated families

Thus far we have only considered results on families with bounded VC-dimension. To prove the result of Haussler we also need some information on families which are well-separated. In Haussler's proof this is provided by the following.
Lemma 2
If $\mathcal{F}\subset 2^{[n]}$ is a family of sets such that $\lvert A\Delta B\rvert \geq k$ for any distinct $A,B\in \mathcal{F}$, choose a set $A\in\mathcal{F}$ uniformly at random, and choose $x$ uniformly at random from a fixed set $I\subset [n]$ of size $m+1$. Then \[\mathbb{E}_{x\in I} \textrm{Var}( 1_{x\in A} \vert \cup_{y\in I\backslash\{x\}} 1_{y\in A})\gg \frac{k}{n-m}\left(1-\frac{\lvert \mathcal{F}_{I\backslash\{x\}}\rvert }{\lvert \mathcal{F}\rvert}\right).\] Here $\mathcal{F}_{J}$ denotes the set of equivalence classes in $\mathcal{F}$, where two sets $A,B$ are equivalent if $A\cap J=B\cap J$.
Let's first see how Lemmas 1 and 2 combine to give the result of Haussler. Lemma 1, applied with the set $I$, shows that $\mathbb{E}_{x\in I}\textrm{Var}(1_{x\in A}\vert \cup_{y\in I\backslash\{x\}}1_{y\in A})\leq d/(m+1)$, so combining this with the lower bound of Lemma 2 implies that, under the assumptions of Haussler's theorem, \[d \gg m\frac{k}{n-m}\left(1-\frac{\lvert \mathcal{F}_{I\backslash\{x\}}\rvert }{\lvert \mathcal{F}\rvert}\right).\] The Sauer-Shelah lemma implies that $\lvert \mathcal{F}_{I\backslash\{x\}}\rvert \leq (cm/d)^d$, since $\lvert I\backslash\{x\}\rvert=m$. Suppose, for the sake of contradiction, that $\lvert \mathcal{F}\rvert \geq (Cn/k)^d$, so that \[d \gg m\frac{k}{n-m}\left(1-(cmk/(Cnd))^d\right)\gg d,\] which is a contradiction, if we choose $m\approx nd/k$ (with a suitably large implied constant) and our constant $C$ appropriately (note that we can suppose that $k\gg d$, or else the Sauer-Shelah lemma immediately gives the result).

It remains to prove Lemma 2. Fix some equivalence class $C\in\mathcal{F}_{I\backslash \{x\}}$. The random variable $1_{x\in A}$, conditioned on $A\in C$, is a Bernoulli random variable with some probability $p_x$ (depending also on $C$), so the variance in question is $p_x(1-p_x)$. Drawing two independent uniform samples $A$ and $B$ from $C$, we have \[\mathbb{E}_{x\in I}2p_x(1-p_x)=\mathbb{E}_{x\in I}\mathbb{P}(1_{x\in A}\neq 1_{x\in B})\geq \frac{k}{n-m}\mathbb{P}(A\neq B)=\frac{k}{n-m}(1-1/\lvert C\rvert).\] The inequality here is true because if $A\neq B$ then, since both agree on $I\backslash\{x\}$, they differ on at least $k$ of the at most $n-m$ elements on which they can possibly disagree. The actual quantity we need to estimate is \[\sum_{C\in \mathcal{F}_{I\backslash\{x\}}}\frac{\lvert C\rvert}{\lvert \mathcal{F}\rvert}\mathbb{E}\,\textrm{Var}(1_{x\in A}\vert A\in C)\gg \frac{k}{(n-m)\lvert \mathcal{F}\rvert}\sum_{C\in\mathcal{F}_{I\backslash \{x\}}}(\lvert C\rvert-1),\] and since $\sum_{C}(\lvert C\rvert-1)=\lvert\mathcal{F}\rvert-\lvert\mathcal{F}_{I\backslash\{x\}}\rvert$, Lemma 2 follows.