A note on the symmetrization technique

This note addresses the steps of the symmetrization technique in empirical process theory. I will walk through each step and provide explanations. If you have some confusion or questions about this technique, as I did, then this might be helpful to you! I first encountered the argument perhaps 5 or 6 years ago and thought I knew what was going on. Only recently, when I tried to use it in my own research, did I realize I had missed something (especially step {(d)} below).

Setup. Consider a finite function class of size {m}, {\mathcal{F}:=\{f_i, 1\leq i\leq m\mid f_i :\mathcal{X}\rightarrow \mathbb{R}\}}, where {\mathcal{X}} is a set. We are given a sequence of independent random variables {X_i}, {i=1,\dots,n}, taking values in {\mathcal{X}}. Let {X = (X_i)_{i=1}^n}. A typical quantity appearing in empirical process theory is

\displaystyle \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \left( f(X_i) - \mathbb{E} f(X_i) \right) \right]. \ \ \ \ \ (1)

We aim to upper bound it using the so-called Rademacher complexity (to be defined below).

The symmetrization argument. The symmetrization argument uses (1) an independent copy of {X}, denoted by {X'}, and (2) {n} independent Rademacher random variables {\epsilon_i \in \{\pm 1\}, 1\leq i\leq n} with {\mathbb{P}(\epsilon_i =1) = \mathbb{P}(\epsilon_i = -1) = \frac{1}{2}}, independent of {X} and {X'}. Denote {\epsilon = (\epsilon_i)_{i=1}^n}. Let us display the symmetrization argument:

\displaystyle \begin{aligned} & \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n f(X_i)- \mathbb{E}f(X_i) \right) \right] \\ \overset{(a)}{=} & \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n f(X_i)- \mathbb{E}_{X_i'}f(X_i') \right) \right] \\ \overset{(b)}{=} & \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n \mathbb{E}_{X_i'}(f(X_i)-f(X_i')) \right) \right]\\ \overset{(c)}{\leq} & \mathbb{E}_{X,X'}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right) \right]\\ \overset{(d)}{=} & \mathbb{E}_{X,X',\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n \epsilon_i(f(X_i)-f(X_i')) \right) \right]\\ \overset{(e)}{\leq} & \mathbb{E}_{X,X',\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i) \right| \right]+ \mathbb{E}_{X,X',\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i') \right| \right]\\ \overset{(f)}{=} & 2 \mathbb{E}_{X,\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i) \right| \right]. \end{aligned} \ \ \ \ \ (2)

The last quantity {\mathbb{E}_{X,\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i) \right| \right]} is called the Rademacher complexity of {\mathcal{F}} w.r.t. {X}. In steps {(a)} and {(f)}, we use the fact that {X} and {X'} have the same distribution. In step {(b)}, we use the fact that {X} and {X'} are independent, so we may take the expectation w.r.t. {X'} first (or regard {\mathbb{E}_{X'}[\cdot]} as {\mathbb{E}[\cdot \mid X]}). In step {(e)}, we use the triangle inequality. Next, we explain steps {(c)} and {(d)}.
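
To make these quantities concrete, here is a minimal Monte Carlo sketch of (2) for a toy class of {m} threshold functions {f_j(x) = \mathbf{1}\{x\leq t_j\}} on uniform data (all names and parameter choices here are mine, not part of the argument above). It estimates the left-hand side of (2) and twice the Rademacher complexity; the former should come out no larger than the latter, up to simulation noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 50, 8, 2000

# Toy class (hypothetical choice): f_j(x) = 1{x <= t_j} with X_i ~ Uniform(0, 1),
# so E f_j(X_i) = t_j exactly.
thresholds = np.linspace(0.1, 0.9, m)

lhs_vals, rad_vals = [], []
for _ in range(trials):
    x = rng.uniform(size=n)
    fx = (x[None, :] <= thresholds[:, None]).astype(float)   # m x n matrix [f_j(X_i)]
    # sup_f (1/n) sum_i (f(X_i) - E f(X_i)): the left-hand side of (2)
    lhs_vals.append(np.max(fx.mean(axis=1) - thresholds))
    # sup_f (1/n) |sum_i eps_i f(X_i)|: the Rademacher complexity integrand
    eps = rng.choice([-1.0, 1.0], size=n)
    rad_vals.append(np.max(np.abs((eps * fx).mean(axis=1))))

print(f"E sup_f (1/n) sum_i (f(X_i) - E f(X_i)) ~ {np.mean(lhs_vals):.4f}")
print(f"2 x Rademacher complexity               ~ {2 * np.mean(rad_vals):.4f}")
```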

Step {(c)}. In step {(c)}, we use the fact that for each fixed {f}, we have

\displaystyle \left(\frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right)\right)\leq \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right).

Taking the expectation over {X'} gives

\displaystyle \mathbb{E}_{X'} \left(\frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right)\right) \leq \mathbb{E}_{X'} \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right).

After taking a supremum over {f} on the left-hand side of the above (note that the right-hand side does not depend on {f}) and then taking an expectation over {X}, we conclude step {(c)}.
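
The inequality behind step {(c)} is just {\sup_f \mathbb{E}[\cdot] \leq \mathbb{E}[\sup_f \cdot]}. A tiny two-function toy example (again a sketch, with hypothetical choices) shows the gap can be large:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.choice([-1.0, 1.0], size=100000)   # a fair random sign Z

# Two functions of Z: g_1(z) = z and g_2(z) = -z.
g = np.stack([z, -z])                      # 2 x samples

print(np.max(g.mean(axis=1)))              # sup_f E g_f(Z): about 0
print(np.max(g, axis=0).mean())            # E sup_f g_f(Z): exactly 1 here
```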

Step {(d)}. Step {(d)} looked puzzling (to me) at first. A naive justification of {(d)} is that {f(X_i) - f(X'_i)} has the same distribution as {\epsilon_i (f(X_i) -f(X'_i))}. I was not convinced by this argument, since we are taking a supremum over {\mathcal{F}}.

Denote “equal in distribution” by {\overset{\text{d}}{=}}. The thing we are (or I am) missing is that there are {m} functions {f} in {\mathcal{F}}. Hence, knowing that for each {f},

\displaystyle (f(X_i) - f(X'_i))_{1\leq i\leq n} \overset{\text{d}}{=} (\epsilon_i (f(X_i) -f(X'_i)))_{1\leq i\leq n}

is not enough. Let us first assume {\epsilon \in \{\pm 1\}^n} is fixed. What we really need is that, for any such fixed {\epsilon}, the following two arrays have the same joint distribution:

\displaystyle [f_j(X_i)-f_j(X_i')]_{1\leq i\leq n,1\leq j\leq m} \overset{\text{d}}{=} \left[\epsilon_i \left(f_j(X_i)-f_j(X_i')\right)\right]_{1\leq i\leq n,1\leq j\leq m}. \ \ \ \ \ (3)

To argue (3), first, for each {i}, we have

\displaystyle [f_j(X_i) - f_j(X_i')]_{j=1}^m \overset{\text{d}}{=} [\epsilon_i (f_j(X_i) - f_j(X_i'))]_{j=1}^m,

by noting that {[\epsilon_i (f_j(X_i) - f_j(X_i'))]_{j=1}^m = \epsilon_i [f_j(X_i) - f_j(X_i')]_{j=1}^m}: the same sign {\epsilon_i} multiplies every coordinate, and negating every coordinate simultaneously amounts to swapping {X_i} and {X_i'}, which does not change the distribution since {X_i} and {X_i'} are independent with the same distribution. Hence, by the independence of {(X_1,X_1'),\dots, (X_n,X_n')}, we know that

\displaystyle [[f_j(X_i)-f_j(X_i')]_{1\leq j\leq m}]_{1\leq i\leq n} \overset{\text{d}}{=} [[\epsilon_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n}.

This establishes the equality (3) for each fixed {\epsilon}. If {\epsilon} is uniform on {\{\pm 1\}^n} and independent of {(X, X')}, then (3) continues to hold by the law of total probability: for any measurable set {B},

\displaystyle \begin{aligned} &\mathbb{P}( [[\epsilon_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B)\\ \overset{(a)}{=}& \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[\epsilon_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B | \epsilon = \bar{\epsilon})\\ \overset{(b)}{=} & \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[\bar{\epsilon}_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B | \epsilon = \bar{\epsilon})\\ \overset{(c)}{=} & \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[\bar{\epsilon}_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B )\\ \overset{(d)}{=} & \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B )\\ =& \mathbb{P}( [[(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B ). \end{aligned} \ \ \ \ \ (4)

Here, in {(a)} of (4), we use the law of total probability, splitting the event according to the value {\epsilon} takes. In {(b)}, we replace {\epsilon} by {\bar{\epsilon}}, since this is the condition in the conditional probability. In {(c)}, we use the fact that {\epsilon} is independent of {(X, X')}. In {(d)}, we use (3) for fixed {\epsilon}. Thus, the equality (3) also holds when {\epsilon} is random. This means that for any measurable function {h: \mathbb{R}^{m\times n} \rightarrow \mathbb{R}},

\displaystyle \mathbb{E}\left[h\left([f_j(X_i)-f_j(X_i')]_{1\leq j\leq m, \;1\leq i\leq n} \right)\right] = \mathbb{E}\left[h\left(\left[\epsilon_i(f_j(X_i)-f_j(X_i'))\right]_{1\leq j\leq m, \;1\leq i\leq n} \right)\right].

Step {(d)} in the symmetrization argument now follows by taking {h} to be the map {[a_{ji}]_{1\leq j\leq m,\, 1\leq i\leq n} \mapsto \max_{1\leq j\leq m} \frac{1}{n}\sum_{i=1}^n a_{ji}}.
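
As a sanity check of (3) and this choice of {h}, here is a small simulation (a sketch; the threshold class and sample sizes are arbitrary choices of mine) comparing {\mathbb{E}[h(\cdot)]} with and without the random signs. The two estimates should agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, trials = 30, 5, 5000
thresholds = np.linspace(0.2, 0.8, m)      # f_j(x) = 1{x <= t_j}, as before

def h(a):
    # h maps the m x n array [a_ji] to max_j (1/n) sum_i a_ji
    return np.max(a.mean(axis=1))

plain, signed = [], []
for _ in range(trials):
    x, xp = rng.uniform(size=n), rng.uniform(size=n)
    diff = (x[None, :] <= thresholds[:, None]).astype(float) \
         - (xp[None, :] <= thresholds[:, None]).astype(float)   # [f_j(X_i) - f_j(X_i')]
    eps = rng.choice([-1.0, 1.0], size=n)
    plain.append(h(diff))
    signed.append(h(eps * diff))

print(f"E h(without signs) ~ {np.mean(plain):.4f}")
print(f"E h(with signs)    ~ {np.mean(signed):.4f}")
```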

Remark. To deal with the case that {\mathcal{F}} is infinite, we may set

\displaystyle \mathcal{A} =\{\mathcal{G}\mid \mathcal{G}\subset \mathcal{F},\; \mathcal{G} \text{ has finitely many elements}\},

and define

\displaystyle \begin{aligned} \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \left( f(X_i) - \mathbb{E} f(X_i) \right) \right] := \sup_{\mathcal{G}\in \mathcal{A}} \mathbb{E}\left [ \sup_{f\in \mathcal{G}} \frac{1}{n}\sum_{i=1}^n \left( f(X_i) - \mathbb{E} f(X_i) \right) \right]. \end{aligned} \ \ \ \ \ (5)

Then our previous argument continues to work.
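
For intuition on (5), in the threshold example one can approximate the supremum over the infinite class {\{x \mapsto \mathbf{1}\{x\leq t\} \mid t\in[0,1]\}} by suprema over growing finite subclasses {\mathcal{G}}. The sketch below (grid sizes arbitrary) shows the estimates increasing toward the supremum over the full class.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 50, 2000

for m in [2, 8, 32, 128]:                  # finite subclasses G of growing size
    thresholds = np.linspace(0.0, 1.0, m)
    vals = []
    for _ in range(trials):
        x = rng.uniform(size=n)
        emp = (x[None, :] <= thresholds[:, None]).mean(axis=1)
        vals.append(np.max(emp - thresholds))   # sup over the finite subclass G
    print(f"m = {m:4d}: estimate ~ {np.mean(vals):.4f}")
```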
