This note addresses the steps of the symmetrization technique in empirical process theory. I will walk through each step and provide explanations. If you have confusion and questions about this technique, as I did, then this might be helpful to you! I first encountered the argument perhaps 5 or 6 years ago and thought I knew what was going on. Only recently, when I tried to use it in my own research, did I realize I had missed something (especially the step discussed below).
Setup. Consider a finite function class $\mathcal{F} = \{f_1, \dots, f_N\}$ of size $N$, where each $f_j : \mathcal{X} \to \mathbb{R}$ and $\mathcal{X}$ is a set. We are given a series of independent random variables $X_1, \dots, X_n$, which take values in $\mathcal{X}$. Let $X = (X_1, \dots, X_n)$. A typical quantity appearing in empirical process theory is
$$
\mathbb{E}\sup_{f \in \mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f(X_i)]\big)\Big|, \tag{1}
$$
and we aim to upper bound it using the so-called Rademacher complexity (to be defined later).
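To make the quantity concrete, here is a minimal Monte Carlo sketch under an assumed toy instance; the uniform distribution, the three hand-picked functions, and their analytic means are all illustrative assumptions, not from the note.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance (an assumption for illustration):
# X_i ~ Unif[0, 1] i.i.d., and a finite class of N = 3 functions.
funcs = [lambda x: x, lambda x: x ** 2, lambda x: np.cos(np.pi * x)]
means = [0.5, 1.0 / 3.0, 0.0]  # E[f(X_i)] for each f, computed analytically

n, n_mc = 100, 2000

def sup_deviation(X):
    """sup_{f in F} |(1/n) sum_i f(X_i) - E f(X_i)| for the toy class."""
    return max(abs(f(X).mean() - m) for f, m in zip(funcs, means))

# Monte Carlo estimate of E sup_f |(1/n) sum_i (f(X_i) - E f(X_i))|
est = np.mean([sup_deviation(rng.uniform(size=n)) for _ in range(n_mc)])
print(est)  # a small number, shrinking as n grows
```

The deviation for each single $f$ is of order $n^{-1/2}$; the point of the theory is to control the supremum over all of $\mathcal{F}$ at once.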
The symmetrization argument. The symmetrization argument uses (1) an independent copy of $X = (X_1, \dots, X_n)$, denoted as $X' = (X_1', \dots, X_n')$, and (2) $n$ independent Rademacher random variables $\varepsilon_1, \dots, \varepsilon_n$, with $\mathbb{P}(\varepsilon_i = 1) = \mathbb{P}(\varepsilon_i = -1) = 1/2$, independent of $(X, X')$. Denote by $\mathbb{E}'$ the expectation with respect to $X'$. Let us display the symmetrization argument:
$$
\begin{aligned}
\mathbb{E}\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f(X_i)]\big)\Big|
&\overset{(a)}{\le} \mathbb{E}\,\mathbb{E}'\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - f(X_i')\big)\Big| \\
&\overset{(b)}{=} \mathbb{E}\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i\big(f(X_i) - f(X_i')\big)\Big| \\
&\overset{(c)}{\le} \mathbb{E}\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\Big| + \mathbb{E}\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i')\Big| \\
&\overset{(d)}{=} 2\,\mathbb{E}\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\Big|. \tag{2}
\end{aligned}
$$
The quantity $\mathbb{E}\sup_{f \in \mathcal{F}}|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)|$ is called the Rademacher complexity of $\mathcal{F}$ w.r.t. $X_1, \dots, X_n$. In steps $(a)$ and $(d)$, we use the fact that $X$ and $X'$ have the same distribution. In step $(b)$, we use the fact that $\varepsilon$ and $(X, X')$ are independent, so that we could take the expectation w.r.t. $\varepsilon$ first (or consider $\mathbb{E}$ as $\mathbb{E}_{X, X'}\mathbb{E}_{\varepsilon}$). In step $(c)$, we use the triangle inequality. Next, we explain the steps $(a)$ and $(b)$.
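The end-to-end bound can be checked by simulation. This is a minimal sketch under an assumed toy instance (uniform $X_i$ and three hand-picked functions; these choices are illustrative assumptions, not from the note):

```python
import numpy as np

rng = np.random.default_rng(1)

# Numerical check of the symmetrization bound under a hypothetical toy
# setup (an assumption for illustration): X_i ~ Unif[0, 1] i.i.d., N = 3.
funcs = [lambda x: x, lambda x: x ** 2, lambda x: np.cos(np.pi * x)]
means = [0.5, 1.0 / 3.0, 0.0]  # analytic values of E f(X_i)

n, n_mc = 100, 4000

lhs_samples, rad_samples = [], []
for _ in range(n_mc):
    X = rng.uniform(size=n)
    eps = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
    # left-hand side: sup_f |(1/n) sum_i (f(X_i) - E f(X_i))|
    lhs_samples.append(max(abs(f(X).mean() - m) for f, m in zip(funcs, means)))
    # Rademacher complexity integrand: sup_f |(1/n) sum_i eps_i f(X_i)|
    rad_samples.append(max(abs(np.mean(eps * f(X))) for f in funcs))

lhs, rad = np.mean(lhs_samples), np.mean(rad_samples)
print(lhs <= 2 * rad)  # the symmetrization bound, checked by simulation
```

Of course, a Monte Carlo check is no proof; it only confirms that nothing is off by a constant or a sign in this toy instance.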
Step $(a)$. Here in the step $(a)$, we use the fact that for each $f \in \mathcal{F}$, we have
$$
\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f(X_i)]\big)\Big|
= \Big|\mathbb{E}'\Big[\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - f(X_i')\big)\Big]\Big|
\le \mathbb{E}'\,\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - f(X_i')\big)\Big|,
$$
where the equality uses $\mathbb{E}[f(X_i)] = \mathbb{E}'[f(X_i')]$ (since $X_i'$ has the same distribution as $X_i$) and the inequality is Jensen's inequality. Bounding $f$ by the supremum inside $\mathbb{E}'$ gives
$$
\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f(X_i)]\big)\Big|
\le \mathbb{E}'\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - f(X_i')\big)\Big|,
$$
whose right-hand side no longer depends on $f$. After taking a supremum over $f \in \mathcal{F}$ on the left-hand side of the above first and then taking an expectation over $X$, we conclude the step $(a)$.
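The exchange in step $(a)$, a supremum of expectations being at most an expectation of suprema, can be illustrated numerically. In this toy sketch, the three Gaussian columns stand in for the $N$ functions (an assumption purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Columns of Z play the role of the N functions (toy assumption):
# sup over columns of |mean| vs. mean of sup over columns of |values|.
Z = rng.normal(size=(100_000, 3))
sup_of_mean = np.abs(Z.mean(axis=0)).max()  # sup_f |E'[...]|
mean_of_sup = np.abs(Z).max(axis=1).mean()  # E' sup_f |...|
print(sup_of_mean <= mean_of_sup)
```

The gap can be large: each column mean is near zero, while the expected maximum of three absolute values is of constant order.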
Step $(b)$. Step $(b)$ looks puzzling (to me) at first. A naive reasoning of $(b)$ is that $\varepsilon_i(f(X_i) - f(X_i'))$ has the same distribution as $f(X_i) - f(X_i')$. I was not convinced by this argument, since we are taking a supremum over $f \in \mathcal{F}$.

Denote "equal in distribution" as $\overset{d}{=}$. The thing we are (or I am) missing is that there are $N$ many $f$'s in $\mathcal{F}$. Hence, knowing that for each $f \in \mathcal{F}$,
$$
\big(\varepsilon_i(f(X_i) - f(X_i'))\big)_{i \le n} \overset{d}{=} \big(f(X_i) - f(X_i')\big)_{i \le n}
$$
is not enough. Let us first assume $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$ is fixed. What we really need is to ensure that for any fixed $\varepsilon \in \{-1, 1\}^n$, the joint distributions of the following two arrays are the same:
$$
\big(f_j(X_i) - f_j(X_i')\big)_{i \le n,\, j \le N} \overset{d}{=} \big(\varepsilon_i(f_j(X_i) - f_j(X_i'))\big)_{i \le n,\, j \le N}. \tag{3}
$$
To argue (3), first, for each $i$, we have
$$
\big(f_j(X_i) - f_j(X_i')\big)_{j \le N} \overset{d}{=} \big(\varepsilon_i(f_j(X_i) - f_j(X_i'))\big)_{j \le N},
$$
by noting that if $\varepsilon_i = 1$ the two sides are identical, while if $\varepsilon_i = -1$ the right-hand side equals $\big(f_j(X_i') - f_j(X_i)\big)_{j \le N}$, which has the same distribution as the left-hand side since $(X_i, X_i') \overset{d}{=} (X_i', X_i)$. Hence, because of independence of the pairs $(X_1, X_1'), \dots, (X_n, X_n')$, we know that the blocks over $i$ can be combined:
$$
\big(f_j(X_i) - f_j(X_i')\big)_{i \le n,\, j \le N} \overset{d}{=} \big(\varepsilon_i(f_j(X_i) - f_j(X_i'))\big)_{i \le n,\, j \le N}.
$$
This establishes the equality (3) for each fixed $\varepsilon$. If $\varepsilon \sim \mathrm{Unif}(\{-1, 1\}^n)$ uniformly, independent of $(X, X')$, then we see (3) continues to hold by using the law of total probability: for any measurable set $A$,
$$
\begin{aligned}
\mathbb{P}\Big(\big(\varepsilon_i(f_j(X_i) - f_j(X_i'))\big)_{i,j} \in A\Big)
&\overset{(i)}{=} \sum_{e \in \{-1,1\}^n} \mathbb{P}\Big(\big(\varepsilon_i(f_j(X_i) - f_j(X_i'))\big)_{i,j} \in A \,\Big|\, \varepsilon = e\Big)\,\mathbb{P}(\varepsilon = e) \\
&\overset{(ii)}{=} \sum_{e \in \{-1,1\}^n} \mathbb{P}\Big(\big(e_i(f_j(X_i) - f_j(X_i'))\big)_{i,j} \in A \,\Big|\, \varepsilon = e\Big)\,\mathbb{P}(\varepsilon = e) \\
&\overset{(iii)}{=} \sum_{e \in \{-1,1\}^n} \mathbb{P}\Big(\big(e_i(f_j(X_i) - f_j(X_i'))\big)_{i,j} \in A\Big)\,\mathbb{P}(\varepsilon = e) \\
&\overset{(iv)}{=} \sum_{e \in \{-1,1\}^n} \mathbb{P}\Big(\big(f_j(X_i) - f_j(X_i')\big)_{i,j} \in A\Big)\,\mathbb{P}(\varepsilon = e) \\
&= \mathbb{P}\Big(\big(f_j(X_i) - f_j(X_i')\big)_{i,j} \in A\Big).
\end{aligned}
$$
Here in $(i)$, we use the law of total probability by splitting the event according to the value $\varepsilon$ takes. In $(ii)$, we replace the value of $\varepsilon$ by $e$, since this is the condition in the conditional probability. In $(iii)$, we use the fact that $\varepsilon$ is independent of $(X, X')$. In $(iv)$, we use (3) for fixed $\varepsilon = e$. Thus, the equality (3) holds for the case of $\varepsilon$ being random. This means that for any measurable function $g$,
$$
\mathbb{E}\,g\Big(\big(\varepsilon_i(f_j(X_i) - f_j(X_i'))\big)_{i \le n,\, j \le N}\Big) = \mathbb{E}\,g\Big(\big(f_j(X_i) - f_j(X_i')\big)_{i \le n,\, j \le N}\Big).
$$
Step $(b)$ in the symmetrization argument now follows by taking an appropriate $g$, namely $g\big((a_{ij})\big) = \max_{j \le N} \big|\frac{1}{n}\sum_{i=1}^n a_{ij}\big|$.
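The claim (3) for a fixed sign vector can also be checked numerically. This is a minimal sketch under an assumed toy setup (uniform $X_i$, two hand-picked functions, an arbitrarily fixed sign vector; all assumptions for illustration): if (3) holds, then $\mathbb{E}\,g(D)$ and $\mathbb{E}\,g(\varepsilon \circ D)$ should agree up to Monte Carlo noise for any measurable $g$, where $D$ collects the differences $f_j(X_i) - f_j(X_i')$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup (assumptions for illustration): X_i, X'_i ~ Unif[0, 1]
# i.i.d., N = 2 functions, and a *fixed* sign vector eps.
funcs = [lambda x: x, lambda x: x ** 2]
n, n_mc = 5, 50_000
eps = np.array([1.0, -1.0, 1.0, -1.0, 1.0])  # fixed signs, chosen arbitrarily

def g(D):
    # a nonlinear functional of the whole array: max_j |(1/n) sum_i D[..., j, i]|
    return np.abs(D.sum(axis=-1)).max(axis=-1) / n

X = rng.uniform(size=(n_mc, n))
Xp = rng.uniform(size=(n_mc, n))
# D has shape (n_mc, N, n): one (N x n) difference array per Monte Carlo draw
D = np.stack([f(X) - f(Xp) for f in funcs], axis=1)

diff = abs(g(D).mean() - g(eps * D).mean())
print(diff)  # should be small: Monte Carlo noise only
```

Note that $g$ here deliberately couples all entries of the array, which is exactly the situation where the naive per-$f$ argument is insufficient.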
Remark. To deal with the case that $\mathcal{F}$ is countably infinite, we may set $\mathcal{F}_N = \{f_1, \dots, f_N\}$ and define
$$
\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f(X_i)]\big)\Big| = \lim_{N \to \infty} \sup_{f \in \mathcal{F}_N}\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f(X_i)]\big)\Big|.
$$
Then our previous argument continues to work: apply it to each finite $\mathcal{F}_N$ and pass to the limit by monotone convergence.