# A note on the symmetrization technique

This note addresses the steps in the symmetrization technique in empirical process theory in probability. I will walk through each step and provide explanations. If you have some confusion and questions about this technique, like me, then this might be helpful to you! I encountered the argument perhaps 5 or 6 years ago, and I thought I knew what was going on. Until recently, I realized I missed something (especially the step ${(d)}$ below), when I tried to use it in my own research.

Setup. Consider a finite function class of size ${m}$, ${\mathcal{F}:=\{f_i, 1\leq i\leq m\mid f_i :\mathcal{X}\rightarrow \mathbb{R}\}}$ where ${\mathcal{X}}$ is a set. We are given a series of independent random varaibles ${X_i}$, ${i=1,\dots,n}$, which take value in ${\mathcal{X}}$. Let ${X = (X_i)_{i=1}^n}$. A typical quantity appeared in empirical process theory is

$\displaystyle \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \left( f(X_i) - \mathbb{E}(f(X_i) \right) \right]. \ \ \ \ \ (1)$

And we aim to upper bound it using the so-called Rademacher complexity (to be defined later).

The symmetrization argument. The symmetrization argument uses (1) an independent copy of ${X}$, denoted as ${X'}$, and (2) ${n}$ independent Rademacher random variables, ${\epsilon_i \in \{\pm 1\}, 1\leq i\leq n}$ with ${\mathbb{P}(\epsilon_i =1) = \mathbb{P}(\epsilon_i = -1) = \frac{1}{2}}$. Denote ${\epsilon = (\epsilon_i)_{i=1}^n}$. Let us display the symmetrization argument:

\displaystyle \begin{aligned} & \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n f(X_i)- \mathbb{E}f(X_i) \right) \right] \\ \overset{(a)}{=} & \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n f(X_i)- \mathbb{E}_{X_i'}f(X_i') \right) \right] \\ \overset{(b)}{ =} & \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n \mathbb{E}_{X_i'}(f(X_i)-f(X_i')) \right) \right]\\ \overset{(c)}{\leq} & \mathbb{E}_{X,X'}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right) \right]\\ \overset{(d)}{=} & \mathbb{E}_{X,X',\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n \epsilon_i(f(X_i)-f(X_i')) \right) \right]\\ \overset{(e)}{=} & \mathbb{E}_{X,X',\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i) \right| \right]+ \mathbb{E}_{X,X',\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i') \right| \right]\\ \overset{(f)}{=} & 2 \mathbb{E}_{X,\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i) \right| \right]. \end{aligned} \ \ \ \ \ (2)

The last quantity ${ \mathbb{E}_{X,X',\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i) \right| \right]}$ is called the Rademacher complexity of ${\mathcal{F}}$ w.r.t. ${X}$. In step ${(a)}$ and ${(f)}$, we uses the fact that ${X}$ and ${X'}$ has the same distribution. In step ${(b)}$, we use the fact that ${X}$ and ${X'}$ are independent so that we could take the expectation w.r.t. ${X'}$ first (or consider ${\mathbb{E}_{X'}[\cdot]}$ as ${\mathbb{E}[ \cdot \mid X]}$). In step ${(e)}$, we use the triangle inequality. Next, we explain the steps ${(c)}$ and ${(d)}$.

Step ${(c)}$. Here in the step ${(c)}$, we use the fact that for each ${f}$, we have

$\displaystyle \left(\frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right)\right)\leq \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right).$

Taking the expectation over ${X'}$ gives

$\displaystyle \mathbb{E}_{X'} \left(\frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right)\right) \leq \mathbb{E}_{X'} \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right).$

After taking a superium over the left-hand side of the above over ${f}$ first and then taking an expectation over ${X}$, we conclude the step ${(c)}$.

Step ${(d)}$. Step ${(d)}$ looks puzzling (to me) at first. A naive reasoning of ${(d)}$ is that ${f(X_i) - f(X'_i)}$ has the same distribution as ${\epsilon_i (f(X_i) -f(X'_i))}$. I was not convinced by this argument since we are taking a superium over ${\mathcal{F}}$.

Denote “equal in distribution” as ${\overset{\text{d}}{=}}$. The things we are (or I am) missing is that there are ${m}$ many ${f}$s in ${\mathcal{F}}$. Hence, knowing that for each ${f}$,

$\displaystyle (f(X_i) - f(X'_i))_{1\leq i\leq n} \overset{\text{d}}{=} (\epsilon_i (f(X_i) -f(X'_i)))_{1\leq i\leq n}$

is not enough. Let us first assume ${\epsilon \in \{\pm 1\}^n}$ is fixed. What we really need is to ensure that for any fixed ${\epsilon \in \{\pm 1\}^n}$, the joint distributions of the following two are the same,

$\displaystyle [f_j(X_i)-f_j(X_i')]_{1\leq i\leq n,1\leq j\leq m} \overset{\text{d}}{=} \left[\epsilon_i \left(f_j(X_i)-f_j(X_i')\right)\right]_{1\leq i\leq n,1\leq j\leq m}. \ \ \ \ \ (3)$

To argue (3), first, for each ${i}$, we have

$\displaystyle [f_j(X_i) - f_j(X_i')]_{j=1}^m \overset{\text{d}}{=} [\epsilon_i (f_j(X_i) - f_j(X_i'))]_{j=1}^m,$

by noting that $[\epsilon_i (f_j(X_i) - f_j(X_i'))]_{j=1}^m = \epsilon_i [(f_j(X_i) - f_j(X_i'))]_{j=1}^m.$ Hence, because of independence of ${(X_1,X_1'),\dots, (X_n,X_n')}$, we know that

$\displaystyle [[f_j(X_i)-f_j(X_i')]_{1\leq j\leq m}]_{1\leq i\leq n} \overset{\text{d}}{=} [[\epsilon_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n}.$

This establishes the equality (3) for each fixed ${\epsilon}$. If ${\epsilon \sim \{\pm 1\}^n}$ uniformly, then we see (3) continues to hold by using the law of total probability: for any measurable set ${B}$

\displaystyle \begin{aligned} &\mathbb{P}( [[\epsilon_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B)\\ \overset{(a)}{=}& \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[\epsilon_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B | \epsilon = \bar{\epsilon})\\ \overset{(b)}{=} & \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[\bar{\epsilon}_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B | \epsilon = \bar{\epsilon})\\ \overset{(c)}{=} & \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[\bar{\epsilon}_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B )\\ \overset{(d)}{=} & \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B )\\ =& \mathbb{P}( [[(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B ). \end{aligned} \ \ \ \ \ (4)

Here in ${(a)}$, we use the law of total probability by splitting the event according to the value ${\epsilon}$ take. In ${(b)}$, we replace the value of ${\epsilon}$ by ${\bar{\epsilon}}$ since this is the condition in the conditional probability. In ${(c)}$, we use the fact that ${\epsilon}$ is independent of ${X}$. In ${(d)}$, we use (3) for fixed ${\epsilon}$. Thus, the equality (3) holds for the case ${\epsilon}$ being random. This means for any measurable function ${h: \mathbb{R}^{m\times n} \rightarrow \mathbb{R}}$,

$\displaystyle \mathbb{E}\left[h\left([f_j(X_i)-f_j(X_i')]_{1\leq j\leq m, \;1\leq i\leq n} \right)\right] = \mathbb{E}\left[h\left(\left[\epsilon_i(f_j(X_i)-f_j(X_i'))\right]_{1\leq j\leq m, \;1\leq i\leq n} \right)\right]$.

Step ${(d)}$ in the symmetrization argument now follows by taking an appropriate ${h}$.

Remark. To deal with the case that ${\mathcal{F}}$ is infinite, we may set

$\displaystyle \mathcal{A} =\{\mathcal{G}\mid \mathcal{G}\subset \mathcal{F},\; \mathcal{G} \text{ has finite elements}\},$

and define

\displaystyle \begin{aligned} \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \left( f(X_i) - \mathbb{E}(f(X_i) \right) \right] := \sup_{\mathcal{G}\in \mathcal{A}} \mathbb{E}\left [ \sup_{f\in \mathcal{G}} \frac{1}{n}\sum_{i=1}^n \left( f(X_i) - \mathbb{E}(f(X_i) \right) \right]. \end{aligned} \ \ \ \ \ (5)

Then our previous argument continues to work.