This note addresses the steps of the symmetrization technique in empirical process theory. I will walk through each step and provide explanations. If you have confusion and questions about this technique, as I did, then this might be helpful to you! I first encountered the argument perhaps 5 or 6 years ago and thought I knew what was going on. Only recently, when I tried to use it in my own research, did I realize I had missed something (especially the step discussed below).
Setup. Consider a finite function class $\mathcal{F} = \{f_1, \dots, f_N\}$ of size $N$, where each $f_j : \mathcal{X} \to \mathbb{R}$ and $\mathcal{X}$ is a set. We are given a series of independent random variables $X_1, \dots, X_n$, which take values in $\mathcal{X}$. Let $X = (X_1, \dots, X_n)$. A typical quantity appearing in empirical process theory is
$$
\mathbb{E}\sup_{f \in \mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f(X_i)]\big)\Big|, \tag{1}
$$
and we aim to upper bound it using the so-called Rademacher complexity (to be defined later).
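To make the quantity concrete, here is a minimal Monte Carlo sketch under an assumed toy instance; the uniform distribution, the three hand-picked functions, and their analytic means are all illustrative assumptions, not from the note.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance (an assumption for illustration):
# X_i ~ Unif[0, 1] i.i.d., and a finite class of N = 3 functions.
funcs = [lambda x: x, lambda x: x ** 2, lambda x: np.cos(np.pi * x)]
means = [0.5, 1.0 / 3.0, 0.0]  # E[f(X_i)] for each f, computed analytically

n, n_mc = 100, 2000

def sup_deviation(X):
    """sup_{f in F} |(1/n) sum_i f(X_i) - E f(X_i)| for the toy class."""
    return max(abs(f(X).mean() - m) for f, m in zip(funcs, means))

# Monte Carlo estimate of E sup_f |(1/n) sum_i (f(X_i) - E f(X_i))|
est = np.mean([sup_deviation(rng.uniform(size=n)) for _ in range(n_mc)])
print(est)  # a small number, shrinking as n grows
```

The deviation for each single $f$ is of order $n^{-1/2}$; the point of the theory is to control the supremum over all of $\mathcal{F}$ at once.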
The symmetrization argument. The symmetrization argument uses (1) an independent copy of $X = (X_1, \dots, X_n)$, denoted as $X' = (X_1', \dots, X_n')$, and (2) $n$ independent Rademacher random variables $\varepsilon_1, \dots, \varepsilon_n$, with $\mathbb{P}(\varepsilon_i = 1) = \mathbb{P}(\varepsilon_i = -1) = 1/2$, independent of $(X, X')$. Denote by $\mathbb{E}'$ the expectation with respect to $X'$. Let us display the symmetrization argument:
$$
\begin{aligned}
\mathbb{E}\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f(X_i)]\big)\Big|
&\overset{(a)}{\le} \mathbb{E}\,\mathbb{E}'\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - f(X_i')\big)\Big| \\
&\overset{(b)}{=} \mathbb{E}\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i\big(f(X_i) - f(X_i')\big)\Big| \\
&\overset{(c)}{\le} \mathbb{E}\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\Big| + \mathbb{E}\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i')\Big| \\
&\overset{(d)}{=} 2\,\mathbb{E}\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\Big|. \tag{2}
\end{aligned}
$$
The quantity $\mathbb{E}\sup_{f \in \mathcal{F}}|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)|$ is called the Rademacher complexity of $\mathcal{F}$ w.r.t. $X_1, \dots, X_n$. In steps $(a)$ and $(d)$, we use the fact that $X$ and $X'$ have the same distribution. In step $(b)$, we use the fact that $\varepsilon$ and $(X, X')$ are independent, so that we could take the expectation w.r.t. $\varepsilon$ first (or consider $\mathbb{E}$ as $\mathbb{E}_{X, X'}\mathbb{E}_{\varepsilon}$). In step $(c)$, we use the triangle inequality. Next, we explain the steps $(a)$ and $(b)$.
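The end-to-end bound can be checked by simulation. This is a minimal sketch under an assumed toy instance (uniform $X_i$ and three hand-picked functions; these choices are illustrative assumptions, not from the note):

```python
import numpy as np

rng = np.random.default_rng(1)

# Numerical check of the symmetrization bound under a hypothetical toy
# setup (an assumption for illustration): X_i ~ Unif[0, 1] i.i.d., N = 3.
funcs = [lambda x: x, lambda x: x ** 2, lambda x: np.cos(np.pi * x)]
means = [0.5, 1.0 / 3.0, 0.0]  # analytic values of E f(X_i)

n, n_mc = 100, 4000

lhs_samples, rad_samples = [], []
for _ in range(n_mc):
    X = rng.uniform(size=n)
    eps = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
    # left-hand side: sup_f |(1/n) sum_i (f(X_i) - E f(X_i))|
    lhs_samples.append(max(abs(f(X).mean() - m) for f, m in zip(funcs, means)))
    # Rademacher complexity integrand: sup_f |(1/n) sum_i eps_i f(X_i)|
    rad_samples.append(max(abs(np.mean(eps * f(X))) for f in funcs))

lhs, rad = np.mean(lhs_samples), np.mean(rad_samples)
print(lhs <= 2 * rad)  # the symmetrization bound, checked by simulation
```

Of course, a Monte Carlo check is no proof; it only confirms that nothing is off by a constant or a sign in this toy instance.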
Step $(a)$. Here in the step $(a)$, we use the fact that for each $f \in \mathcal{F}$, we have
$$
\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f(X_i)]\big)\Big|
= \Big|\mathbb{E}'\Big[\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - f(X_i')\big)\Big]\Big|
\le \mathbb{E}'\,\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - f(X_i')\big)\Big|,
$$
where the equality uses $\mathbb{E}[f(X_i)] = \mathbb{E}'[f(X_i')]$ (since $X_i'$ has the same distribution as $X_i$) and the inequality is Jensen's inequality. Bounding $f$ by the supremum inside $\mathbb{E}'$ gives
$$
\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f(X_i)]\big)\Big|
\le \mathbb{E}'\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - f(X_i')\big)\Big|,
$$
whose right-hand side no longer depends on $f$. After taking a supremum over $f \in \mathcal{F}$ on the left-hand side of the above first and then taking an expectation over $X$, we conclude the step $(a)$.
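The exchange in step $(a)$, a supremum of expectations being at most an expectation of suprema, can be illustrated numerically. In this toy sketch, the three Gaussian columns stand in for the $N$ functions (an assumption purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Columns of Z play the role of the N functions (toy assumption):
# sup over columns of |mean| vs. mean of sup over columns of |values|.
Z = rng.normal(size=(100_000, 3))
sup_of_mean = np.abs(Z.mean(axis=0)).max()  # sup_f |E'[...]|
mean_of_sup = np.abs(Z).max(axis=1).mean()  # E' sup_f |...|
print(sup_of_mean <= mean_of_sup)
```

The gap can be large: each column mean is near zero, while the expected maximum of three absolute values is of constant order.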
Step $(b)$. Step $(b)$ looks puzzling (to me) at first. A naive reasoning of $(b)$ is that $\varepsilon_i(f(X_i) - f(X_i'))$ has the same distribution as $f(X_i) - f(X_i')$. I was not convinced by this argument, since we are taking a supremum over $f \in \mathcal{F}$.

Denote "equal in distribution" as $\overset{d}{=}$. The thing we are (or I am) missing is that there are $N$ many $f$'s in $\mathcal{F}$. Hence, knowing that for each $f \in \mathcal{F}$,
$$
\big(\varepsilon_i(f(X_i) - f(X_i'))\big)_{i \le n} \overset{d}{=} \big(f(X_i) - f(X_i')\big)_{i \le n}
$$
is not enough. Let us first assume $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$ is fixed. What we really need is to ensure that for any fixed $\varepsilon \in \{-1, 1\}^n$, the joint distributions of the following two arrays are the same:
$$
\big(f_j(X_i) - f_j(X_i')\big)_{i \le n,\, j \le N} \overset{d}{=} \big(\varepsilon_i(f_j(X_i) - f_j(X_i'))\big)_{i \le n,\, j \le N}. \tag{3}
$$
To argue (3), first, for each $i$, we have
$$
\big(f_j(X_i) - f_j(X_i')\big)_{j \le N} \overset{d}{=} \big(\varepsilon_i(f_j(X_i) - f_j(X_i'))\big)_{j \le N},
$$
by noting that if $\varepsilon_i = 1$ the two sides are identical, while if $\varepsilon_i = -1$ the right-hand side equals $\big(f_j(X_i') - f_j(X_i)\big)_{j \le N}$, which has the same distribution as the left-hand side since $(X_i, X_i') \overset{d}{=} (X_i', X_i)$. Hence, because of independence of the pairs $(X_1, X_1'), \dots, (X_n, X_n')$, we know that the blocks over $i$ can be combined:
$$
\big(f_j(X_i) - f_j(X_i')\big)_{i \le n,\, j \le N} \overset{d}{=} \big(\varepsilon_i(f_j(X_i) - f_j(X_i'))\big)_{i \le n,\, j \le N}.
$$
This establishes the equality (3) for each fixed $\varepsilon$. If $\varepsilon \sim \mathrm{Unif}(\{-1, 1\}^n)$ uniformly, independent of $(X, X')$, then we see (3) continues to hold by using the law of total probability: for any measurable set $A$,
$$
\begin{aligned}
\mathbb{P}\Big(\big(\varepsilon_i(f_j(X_i) - f_j(X_i'))\big)_{i,j} \in A\Big)
&\overset{(i)}{=} \sum_{e \in \{-1,1\}^n} \mathbb{P}\Big(\big(\varepsilon_i(f_j(X_i) - f_j(X_i'))\big)_{i,j} \in A \,\Big|\, \varepsilon = e\Big)\,\mathbb{P}(\varepsilon = e) \\
&\overset{(ii)}{=} \sum_{e \in \{-1,1\}^n} \mathbb{P}\Big(\big(e_i(f_j(X_i) - f_j(X_i'))\big)_{i,j} \in A \,\Big|\, \varepsilon = e\Big)\,\mathbb{P}(\varepsilon = e) \\
&\overset{(iii)}{=} \sum_{e \in \{-1,1\}^n} \mathbb{P}\Big(\big(e_i(f_j(X_i) - f_j(X_i'))\big)_{i,j} \in A\Big)\,\mathbb{P}(\varepsilon = e) \\
&\overset{(iv)}{=} \sum_{e \in \{-1,1\}^n} \mathbb{P}\Big(\big(f_j(X_i) - f_j(X_i')\big)_{i,j} \in A\Big)\,\mathbb{P}(\varepsilon = e) \\
&= \mathbb{P}\Big(\big(f_j(X_i) - f_j(X_i')\big)_{i,j} \in A\Big).
\end{aligned}
$$
Here in $(i)$, we use the law of total probability by splitting the event according to the value $\varepsilon$ takes. In $(ii)$, we replace the value of $\varepsilon$ by $e$, since this is the condition in the conditional probability. In $(iii)$, we use the fact that $\varepsilon$ is independent of $(X, X')$. In $(iv)$, we use (3) for fixed $\varepsilon = e$. Thus, the equality (3) holds for the case of $\varepsilon$ being random. This means that for any measurable function $g$,
$$
\mathbb{E}\,g\Big(\big(\varepsilon_i(f_j(X_i) - f_j(X_i'))\big)_{i \le n,\, j \le N}\Big) = \mathbb{E}\,g\Big(\big(f_j(X_i) - f_j(X_i')\big)_{i \le n,\, j \le N}\Big).
$$
Step $(b)$ in the symmetrization argument now follows by taking an appropriate $g$, namely $g\big((a_{ij})\big) = \max_{j \le N} \big|\frac{1}{n}\sum_{i=1}^n a_{ij}\big|$.
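The claim (3) for a fixed sign vector can also be checked numerically. This is a minimal sketch under an assumed toy setup (uniform $X_i$, two hand-picked functions, an arbitrarily fixed sign vector; all assumptions for illustration): if (3) holds, then $\mathbb{E}\,g(D)$ and $\mathbb{E}\,g(\varepsilon \circ D)$ should agree up to Monte Carlo noise for any measurable $g$, where $D$ collects the differences $f_j(X_i) - f_j(X_i')$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup (assumptions for illustration): X_i, X'_i ~ Unif[0, 1]
# i.i.d., N = 2 functions, and a *fixed* sign vector eps.
funcs = [lambda x: x, lambda x: x ** 2]
n, n_mc = 5, 50_000
eps = np.array([1.0, -1.0, 1.0, -1.0, 1.0])  # fixed signs, chosen arbitrarily

def g(D):
    # a nonlinear functional of the whole array: max_j |(1/n) sum_i D[..., j, i]|
    return np.abs(D.sum(axis=-1)).max(axis=-1) / n

X = rng.uniform(size=(n_mc, n))
Xp = rng.uniform(size=(n_mc, n))
# D has shape (n_mc, N, n): one (N x n) difference array per Monte Carlo draw
D = np.stack([f(X) - f(Xp) for f in funcs], axis=1)

diff = abs(g(D).mean() - g(eps * D).mean())
print(diff)  # should be small: Monte Carlo noise only
```

Note that $g$ here deliberately couples all entries of the array, which is exactly the situation where the naive per-$f$ argument is insufficient.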
Remark. To deal with the case that $\mathcal{F}$ is countably infinite, we may set $\mathcal{F}_N = \{f_1, \dots, f_N\}$ and define
$$
\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f(X_i)]\big)\Big| = \lim_{N \to \infty} \sup_{f \in \mathcal{F}_N}\Big|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f(X_i)]\big)\Big|.
$$
Then our previous argument continues to work: apply it to each finite $\mathcal{F}_N$ and pass to the limit by monotone convergence.