A note on the symmetrization technique

This note addresses the steps of the symmetrization technique in empirical process theory. I will walk through each step and provide explanations. If you have questions about this technique, as I did, then this note might be helpful to you! I first encountered the argument perhaps five or six years ago and thought I knew what was going on. Only recently, when I tried to use it in my own research, did I realize I had missed something (especially step {(d)} below).

Setup. Consider a finite function class of size {m}, {\mathcal{F}:=\{f_i, 1\leq i\leq m\mid f_i :\mathcal{X}\rightarrow \mathbb{R}\}}, where {\mathcal{X}} is a set. We are given a sequence of independent random variables {X_i}, {i=1,\dots,n}, taking values in {\mathcal{X}}. Let {X = (X_i)_{i=1}^n}. A typical quantity appearing in empirical process theory is

\displaystyle \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \left( f(X_i) - \mathbb{E} f(X_i) \right) \right]. \ \ \ \ \ (1)

We aim to upper bound it by the so-called Rademacher complexity (to be defined later).

The symmetrization argument. The symmetrization argument uses (1) an independent copy of {X}, denoted as {X'}, and (2) {n} independent Rademacher random variables, {\epsilon_i \in \{\pm 1\}, 1\leq i\leq n} with {\mathbb{P}(\epsilon_i =1) = \mathbb{P}(\epsilon_i = -1) = \frac{1}{2}}. Denote {\epsilon = (\epsilon_i)_{i=1}^n}. Let us display the symmetrization argument:

\displaystyle \begin{aligned} & \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n f(X_i)- \mathbb{E}f(X_i) \right) \right] \\ \overset{(a)}{=} & \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n f(X_i)- \mathbb{E}_{X_i'}f(X_i') \right) \right] \\ \overset{(b)}{=} & \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n \mathbb{E}_{X_i'}(f(X_i)-f(X_i')) \right) \right]\\ \overset{(c)}{\leq} & \mathbb{E}_{X,X'}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right) \right]\\ \overset{(d)}{=} & \mathbb{E}_{X,X',\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n \epsilon_i(f(X_i)-f(X_i')) \right) \right]\\ \overset{(e)}{\leq} & \mathbb{E}_{X,X',\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i) \right| \right]+ \mathbb{E}_{X,X',\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i') \right| \right]\\ \overset{(f)}{=} & 2 \mathbb{E}_{X,\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i) \right| \right]. \end{aligned} \ \ \ \ \ (2)

The last quantity {\mathbb{E}_{X,\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i) \right| \right]} is called the Rademacher complexity of {\mathcal{F}} w.r.t. {X}. In steps {(a)} and {(f)}, we use the fact that {X} and {X'} have the same distribution. In step {(b)}, we use the fact that {X} and {X'} are independent, so that we may take the expectation w.r.t. {X'} first (or regard {\mathbb{E}_{X'}[\cdot]} as {\mathbb{E}[ \cdot \mid X]}). In step {(e)}, we use the triangle inequality. Next, we explain steps {(c)} and {(d)}.
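The bound (2) can be sanity-checked numerically. The sketch below is my own illustration, not from the text: it uses a small class {f_j(x) = \sin(jx)} with {X_i} uniform on {[0,1]} (both assumptions of mine) and estimates the left-hand side of (2) and twice the Rademacher complexity by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 50, 5, 2000

# Illustrative finite class: f_j(x) = sin(j * x), X_i uniform on [0, 1].
def apply_class(x):                       # (n,) -> (m, n), row j holds f_j(x_i)
    return np.sin(np.outer(np.arange(1, m + 1), x))

# Population means E f_j(X), estimated once on a large sample.
means = apply_class(rng.uniform(0, 1, 500_000)).mean(axis=1)

lhs = np.empty(trials)    # sup_f (1/n) sum_i (f(X_i) - E f(X_i))
rad = np.empty(trials)    # sup_f (1/n) |sum_i eps_i f(X_i)|
for t in range(trials):
    vals = apply_class(rng.uniform(0, 1, n))          # (m, n)
    eps = rng.choice([-1.0, 1.0], size=n)
    lhs[t] = np.max(vals.mean(axis=1) - means)
    rad[t] = np.max(np.abs(vals @ eps)) / n

print(f"E sup ~ {lhs.mean():.4f} <= 2 * Rademacher ~ {2 * rad.mean():.4f}")
```

On this toy class the estimated left-hand side sits well below twice the estimated Rademacher complexity, consistent with (2).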

Step {(c)}. Here in the step {(c)}, we use the fact that for each {f}, we have

\displaystyle \left(\frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right)\right)\leq \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right).

Taking the expectation over {X'} gives

\displaystyle \mathbb{E}_{X'} \left(\frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right)\right) \leq \mathbb{E}_{X'} \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right).

After taking the supremum of the left-hand side over {f} and then the expectation over {X}, we conclude step {(c)}.

Step {(d)}. Step {(d)} looks puzzling (to me) at first. A naive justification of {(d)} is that {f(X_i) - f(X'_i)} has the same distribution as {\epsilon_i (f(X_i) -f(X'_i))}. I was not convinced by this argument, since we are taking a supremum over {\mathcal{F}}.

Denote “equal in distribution” by {\overset{\text{d}}{=}}. The thing we are (or I am) missing is that there are {m} many {f}s in {\mathcal{F}}. Hence, knowing that for each {f},

\displaystyle (f(X_i) - f(X'_i))_{1\leq i\leq n} \overset{\text{d}}{=} (\epsilon_i (f(X_i) -f(X'_i)))_{1\leq i\leq n}

is not enough. Let us first assume {\epsilon \in \{\pm 1\}^n} is fixed. What we really need is that for any fixed {\epsilon \in \{\pm 1\}^n}, the following two arrays have the same joint distribution:

\displaystyle [f_j(X_i)-f_j(X_i')]_{1\leq i\leq n,1\leq j\leq m} \overset{\text{d}}{=} \left[\epsilon_i \left(f_j(X_i)-f_j(X_i')\right)\right]_{1\leq i\leq n,1\leq j\leq m}. \ \ \ \ \ (3)

To argue (3), first, for each {i}, we have

\displaystyle [f_j(X_i) - f_j(X_i')]_{j=1}^m \overset{\text{d}}{=} [\epsilon_i (f_j(X_i) - f_j(X_i'))]_{j=1}^m,

by noting that {[\epsilon_i (f_j(X_i) - f_j(X_i'))]_{j=1}^m = \epsilon_i [f_j(X_i) - f_j(X_i')]_{j=1}^m} and that {\epsilon_i \in \{\pm 1\}}: if {\epsilon_i = -1}, exchanging the roles of {X_i} and {X_i'} leaves the distribution unchanged. Hence, by the independence of {(X_1,X_1'),\dots, (X_n,X_n')}, we know that

\displaystyle [[f_j(X_i)-f_j(X_i')]_{1\leq j\leq m}]_{1\leq i\leq n} \overset{\text{d}}{=} [[\epsilon_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n}.

This establishes the equality (3) for each fixed {\epsilon}. If {\epsilon} is uniform on {\{\pm 1\}^n}, then (3) continues to hold by the law of total probability: for any measurable set {B},

\displaystyle \begin{aligned} &\mathbb{P}( [[\epsilon_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B)\\ \overset{(a)}{=}& \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[\epsilon_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B | \epsilon = \bar{\epsilon})\\ \overset{(b)}{=} & \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[\bar{\epsilon}_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B | \epsilon = \bar{\epsilon})\\ \overset{(c)}{=} & \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[\bar{\epsilon}_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B )\\ \overset{(d)}{=} & \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B )\\ =& \mathbb{P}( [[(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B ). \end{aligned} \ \ \ \ \ (4)

Here in {(a)}, we use the law of total probability, splitting the event according to the value {\epsilon} takes. In {(b)}, we replace {\epsilon} by {\bar{\epsilon}}, since this is the condition in the conditional probability. In {(c)}, we use the fact that {\epsilon} is independent of {(X,X')}. In {(d)}, we use (3) for fixed {\epsilon}. Thus, equality (3) holds for random {\epsilon}. This means that for any measurable function {h: \mathbb{R}^{m\times n} \rightarrow \mathbb{R}},

\displaystyle \mathbb{E}\left[h\left([f_j(X_i)-f_j(X_i')]_{1\leq j\leq m, \;1\leq i\leq n} \right)\right] = \mathbb{E}\left[h\left(\left[\epsilon_i(f_j(X_i)-f_j(X_i'))\right]_{1\leq j\leq m, \;1\leq i\leq n} \right)\right].

Step {(d)} in the symmetrization argument now follows by taking an appropriate {h}.
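The expectation identity above can also be checked by simulation for a particular {h}. In the sketch below, {h} is the supremum used in step {(d)}; the class {f_j(x)=\sin(jx)} and the Gaussian inputs are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, trials = 20, 4, 50_000

# Illustrative class f_j(x) = sin(j x); X, X' i.i.d. standard normal.
x = rng.normal(size=(trials, n))
xp = rng.normal(size=(trials, n))
diff = np.stack([np.sin(j * x) - np.sin(j * xp) for j in range(1, m + 1)],
                axis=1)                             # (trials, m, n)
eps = rng.choice([-1.0, 1.0], size=(trials, 1, n))  # broadcasts over j

# h = sup over the class of the sum over i, as in step (d).
h_plain = diff.sum(axis=2).max(axis=1)
h_flip = (diff * eps).sum(axis=2).max(axis=1)

print(h_plain.mean(), h_flip.mean())   # agree up to Monte Carlo error
```

The two averages agree up to sampling noise, as the identity for measurable {h} predicts.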

Remark. To deal with the case where {\mathcal{F}} is infinite, we may set

\displaystyle \mathcal{A} =\{\mathcal{G}\mid \mathcal{G}\subset \mathcal{F},\; \mathcal{G} \text{ has finitely many elements}\},

and define

\displaystyle \begin{aligned} \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \left( f(X_i) - \mathbb{E} f(X_i) \right) \right] := \sup_{\mathcal{G}\in \mathcal{A}} \mathbb{E}\left [ \sup_{f\in \mathcal{G}} \frac{1}{n}\sum_{i=1}^n \left( f(X_i) - \mathbb{E} f(X_i) \right) \right]. \end{aligned} \ \ \ \ \ (5)

Then our previous argument continues to work.

Random matrices and their rank

In this post, we study a few probability-one results concerning the rank of random matrices: with probability one,

  1. a random asymmetric matrix has full rank;
  2. a random symmetric matrix has full rank;
  3. {\frac{n(n+1)}{2}} many random rank one positive semidefinite matrices (belonging to {\mathbb{R}^{n\times n}}) are linearly independent.

At first glance, at least the first two look quite natural and should be correct. However, proving them rigorously does require some effort. This post shows that one can use the concept of independence to prove the above assertions. Indeed, all the proofs follow an inductive argument in which independence makes life much easier:

  1. Draw the first sample, it satisfies the property we want to show,
  2. Suppose the first {k} samples satisfy the property we want to show, and draw the next sample. Since it is independent of the previous {k} samples, the first {k} samples can be considered fixed. Then, with some effort, we show that the first {k+1} samples satisfy the property.

Before going into each example, we further note that the probability-{1} results listed above still hold for any probability measure that is absolutely continuous with respect to the measure introduced below and vice versa (so the specific Gaussian assumption is not needed).

1. Random Matrices are full rank

Suppose {A\in \mathbb{R}^{n\times m}} with {A=[a_{ij}]}, where each entry of {A} is drawn independently from the standard Gaussian {N(0,1)}. We prove that {A} has rank {\min\{m,n\}} with probability {1}.

We may assume without loss of generality that {m\leq n}; otherwise, we repeat the following argument for {A^\top}.

The trick is to use the independence structure and to consider the columns of {A} as drawn from left to right. Let us write the {i}-th column of {A} as {a_{\cdot i}}.

For the first column, {a_{\cdot 1}} is linearly independent because it is {0} with probability {0}. For the second column, because it is drawn independently of the first column, we can consider the first column fixed, and the probability that {a_{\cdot 2}} falls into the span of a fixed column is {0} because {a_{\cdot 2}} has a continuous density in {\mathbb{R}^n}. Thus the first two columns are linearly independent.

For general {k\leq m}, since {m\leq n}, the first {k-1} columns span a proper subspace of {\mathbb{R}^n}, so {a_{\cdot k}} falls into that subspace with probability {0} (a proper linear subspace has Lebesgue measure {0} in {\mathbb{R}^n}, and the Gaussian measure is absolutely continuous with respect to the Lebesgue measure and vice versa). Thus the first {k} columns are linearly independent. Proceeding with the argument until {k=m} completes the proof.
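A quick numerical check of this probability-one claim (a sketch of mine, using numpy's rank computation, which works up to a floating-point tolerance):

```python
import numpy as np

rng = np.random.default_rng(2)

# Check the numerical rank of i.i.d. Gaussian matrices of several shapes.
ranks_ok = all(
    np.linalg.matrix_rank(rng.normal(size=(n, m))) == min(n, m)
    for n, m in [(5, 3), (8, 8), (4, 9)]
    for _ in range(100)
)
print("all Gaussian samples had full rank:", ranks_ok)
```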

2. Symmetric Matrices are full rank

Suppose {A\in \mathbb{S}^n\subset \mathbb{R}^{n\times n}} with {A=[a_{ij}]}, where each entry of the upper half of {A}, {a_{ij}, j\geq i}, is drawn independently from the standard Gaussian {N(0,1)} (and {a_{ji}=a_{ij}}). We prove that {A} has rank {n} with probability {1}.

Let us write the {i}-th row of {A} as {a_{i\cdot}\in \mathbb{R}^{1\times n}}. For the first {k} entries of the {i}-th row, we write {a_{i,1:k}\in \mathbb{R}^{1\times k}}. Similarly, {a_{1:k,i}} denotes the first {k} entries of the {i}-th column. We denote the top-left {k\times k} submatrix of {A} by {a_{1:k,1:k}\in \mathbb{R}^{k\times k}}. Similar notation applies to other submatrices.

The idea is to draw the columns of {A} from left to right sequentially and to use the independence structure. Starting from the first column, with probability one, {a_{\cdot 1}} is linearly independent (meaning {a_{\cdot 1}\not =0}).

Now {a_{\cdot 1}} is drawn, and the second column except for its first entry {a_{12}=a_{21}} is independent of {a_{\cdot 1}}. The second column {a_{\cdot 2}} lies in the span of {a_{\cdot 1}} only if {a_{\cdot 2}} equals {\frac{a_{12}}{a_{11}}} times {a_{\cdot 1}} ({a_{11}} is {0} with probability {0}). This happens with probability {0}, since each of the remaining entries of {a_{\cdot 2}} has a continuous density and is drawn independently of {a_{\cdot 1}}.

Now assume we have drawn the first {k} columns of {A}, that they are linearly independent with probability {1}, and that the top-left {k\times k} submatrix of {A} is invertible (the base case is what we just proved). We want to show that after we draw the {(k+1)}-th column, the first {k+1} columns are still linearly independent with probability {1}. Since the first {k} columns are drawn, the first {k} entries of the {(k+1)}-th column {a_{\cdot (k+1)}} are fixed (by symmetry), and we wish to show that the remaining {n-k} entries, which are drawn independently of all previous {k} columns, ensure linear independence of the first {k+1} columns.

Suppose instead that {a_{\cdot (k+1)}} is linearly dependent on the first {k} columns of {A}. Since the top-left {k\times k} submatrix of {A} is invertible, there is only one way to write {a_{\cdot (k+1)}} as a linear combination of the previous columns. More precisely, we have

\displaystyle a_{\cdot (k+1)} = a_{1:n,1:k}a_{1:k, 1:k}^{-1}a_{k+1,1:k}^\top.

This means {a_{\cdot (k+1)}} has to be a fixed vector, which happens with probability {0} because the last {n-k} entries of {a_{\cdot (k+1)}} are drawn independently of {a_{\cdot i}}, {i=1,\dots ,k}. Note this also implies that the top-left {(k+1) \times (k+1)} submatrix of {A} is invertible, because we can repeat the argument simply by regarding {A} as {(k+1) \times (k+1)} (instead of {n\times n}). Thus the induction is complete, and we have proved that {A} indeed has rank {n} with probability {1}.
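The symmetric case can be checked numerically as well; the sampling scheme below mirrors the setup (upper triangle drawn independently, then mirrored):

```python
import numpy as np

rng = np.random.default_rng(3)

def random_symmetric(n):
    """Draw the upper triangle a_ij, j >= i, independently and mirror it."""
    U = np.triu(rng.normal(size=(n, n)))     # upper triangle incl. diagonal
    return U + U.T - np.diag(np.diag(U))     # enforce a_ji = a_ij

full_rank = all(np.linalg.matrix_rank(random_symmetric(6)) == 6
                for _ in range(200))
print("all random symmetric samples were invertible:", full_rank)
```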

3. Rank {1} positive semidefinite matrices are linearly independent

This time, we consider {\mathbf{a}_i}, {i=1,\dots, \frac{n(n+1)}{2}}, drawn independently from the standard normal distribution in {\mathbb{R}^n}. Denote the set of symmetric matrices in {\mathbb{R}^{n\times n}} by {\mathbb{S}^n}. We show that with probability {1} the matrices

\displaystyle A_i = \mathbf{a}_i\mathbf{a}_i^\top, \quad i =1,\dots , \frac{n(n+1)}{2}

are linearly independent in {\mathbb{S}^n}.

Denote the {i}-th standard basis vector in {\mathbb{R}^n} by {e_i}. First, let us consider the following basis (easily verified to be one) of {\mathbb{S}^n},

\displaystyle E_{ii} = e_ie_i^\top, \quad i=1,\dots,n,\quad\text{and}\quad E_{ij} = e_ie_i^\top +e_{j}e_j^\top + e_ie_j^\top +e_je_i^\top, \quad 1\leq i<j\leq n.

We order the basis according to the lexicographic order on {(i,j)}. Suppose {\mathbf{a}} follows the standard normal distribution in {\mathbb{R}^n}. We prove that for any fixed linear subspace {V\subset \mathbb{S}^{n}} with dimension less than {\frac{n(n+1)}{2}}, we have

\displaystyle \mathop{\mathbb P}(\mathbf{a}\mathbf{a} ^\top \in V ) = 0.

We start by applying a fixed invertible linear map {\mathcal{F}:\mathbb{S}^n \rightarrow \mathbb{S}^n} to {V} so that the basis of {\mathcal{F}(V)} consists of the first {\dim(V)} basis vectors of {\{E_{ij}\}_{i\leq j}} (in lexicographic order), and the basis of the orthogonal complement (defined via the trace inner product) of {V}, {V^\perp\subset \mathbb{S}^n}, is mapped to the remaining basis vectors of {\{E_{ij}\}_{i\leq j}} under {\mathcal{F}}. We then only need to prove

\displaystyle \mathop{\mathbb P}(\mathcal{F}(\mathbf{a}\mathbf{a}^\top) \in\mathcal{F}(V)) =0.

We argue by contradiction. Suppose {\mathop{\mathbb P}(\mathcal{F}(\mathbf{a}\mathbf{a}^\top) \in\mathcal{F}(V)) >0}. This implies that there exist infinitely many {\mathbf{a}\in \mathbb{R}^n} such that {\mathcal{F}(\mathbf{a}\mathbf{a}^\top) \in\mathcal{F}(V)}; moreover, each component of {\mathbf{a}} takes infinitely many values. We show such a situation cannot occur. Denote the {\dim (V)}-th element (in lexicographic order) of {\{(i,j)\}_{i\leq j}} by {(i_0,j_0)}. Since {V} has dimension less than {\frac{n(n+1)}{2}}, {\mathcal{F}(V^\perp)} contains a nonzero element, and so

\displaystyle \mathcal{F}(\mathbf{a}\mathbf{a}^\top) \in \mathcal{F}(V) \iff [\mathcal{F}(\mathbf{a}\mathbf{a}^\top)]_{ij}=0,\quad \forall (i,j) > (i_0,j_0) \,\text{in lexicographical order.}

Now fix any {(i_1,j_1)>(i_0,j_0)}. The condition { [\mathcal{F}(\mathbf{a}\mathbf{a}^\top)]_{i_1j_1}=0} means that there is some {F\in \mathbb{S}^n} (depending only on {\mathcal{F}} and {(i_1,j_1)}) such that

\displaystyle \mathbf{tr}(F\mathbf{a}\mathbf{a}^\top) = \sum_{1\leq i,j\leq n} F_{ij} a_ia_j =0,

where {a_i} is the {i}-th component of {\mathbf{a}}. Now, if we fix {a_i}, {i=2,\dots,n}, and only vary {a_1}, then we have

\displaystyle \sum_{1\leq i,j\leq n} F_{ij} a_ia_j =F_{11}a_1^2 +2 (\sum_{1<j\leq n}F_{1j}a_j)a_1 + \sum_{i\not =1,j\not =1}F_{ij}a_ia_j=0,

which holds for infinitely many {a_1 \in \mathbb{R}} only if

\displaystyle F_{11}=0, \quad \sum_{1<j\leq n}F_{1j}a_j=0,\quad\text{and}\quad \sum_{i\not =1,j\not =1}F_{ij}a_ia_j =0.

Thus we can repeat the argument and conclude that {\mathbf{tr}(F\mathbf{a}\mathbf{a}^\top) = \sum_{1\leq i,j\leq n} F_{ij} a_ia_j =0} holds for infinitely many {\mathbf{a}}, with each component taking infinitely many values, only if {F=0}. This contradicts the invertibility of {\mathcal{F}} (which forces the functional {M \mapsto [\mathcal{F}(M)]_{i_1j_1}} to be nonzero), and hence we must have

\displaystyle \mathop{\mathbb P}(\mathcal{F}(\mathbf{a}\mathbf{a}^\top) \in\mathcal{F}(V)) =0.

Proving that the {A_i=\mathbf{a}_i \mathbf{a}_i^\top} are linearly independent (and hence span {\mathbb{S}^n}) with probability {1} is now easy. For each {k}, the {A_i} with {i<k} can be considered fixed once drawn, because {A_k} is drawn independently of them. The probability that {A_k} lies in the span of {A_i}, {i<k}, is {0} by the previous result, and hence {A_k} is linearly independent of all earlier {A_i}. This argument continues to hold up to {k= \frac{n(n+1)}{2}}, because the space spanned by {A_1,\dots, A_{\frac{n(n+1)}{2}-1}} has dimension at most {\frac{n(n+1)}{2}-1}, so the previous result applies. This completes the proof.
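A numerical sketch of this last claim (my own check, for n = 5): vectorize each {A_i = \mathbf{a}_i\mathbf{a}_i^\top} by its upper-triangular entries, which is a linear bijection of {\mathbb{S}^n} with {\mathbb{R}^{n(n+1)/2}}, and verify that the resulting vectors are linearly independent.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
N = n * (n + 1) // 2                                 # 15 = dim of S^5

# Stack the upper-triangular entries of each rank-one matrix a a^T.
a = rng.normal(size=(N, n))
stack = np.stack([np.outer(v, v)[np.triu_indices(n)] for v in a])  # (N, N)
print("rank:", np.linalg.matrix_rank(stack), "expected:", N)
```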

Closure property of conditional expectation under convexity

Suppose we have a random vector {Z\in \mathbb{R}^n} and we know that {Z} only takes value in a convex set {C}, i.e.,

\displaystyle \mathbb{P}(Z\in C) =1.

Previously, we showed in this post that as long as {C} is convex, we will have

\displaystyle \mathbb{E}(Z) \in C,

if {\mathbb{E}(Z)} exists. It is then natural to ask about the conditional expectation: is it true for any reasonable sigma-algebra {\mathcal{G}} that

\displaystyle \mathbb{P}(\mathbb{E}(Z\mid \mathcal{G}) \in C )=1?

The answer is affirmative. Let us first recall our previous theorem.

Theorem 1 For any Borel measurable convex set {C\subset {\mathbb R}^n} and for any probability measure {\mu} on {(\mathbb{R}^n,\mathcal{B})} with

\displaystyle \int_C \mu(dx) =1, \quad \int_{\mathbb{R}^n} \|x\|\mu(dx) <\infty,

we have

\displaystyle \int_{\mathbb{R}^n} x\mu(dx) \in C.

By utilizing the above theorem and regular conditional distribution, we can prove our previous claim.

Theorem 2 Suppose {Z} is a random vector from a probability space {(\Omega, \mathcal{F},\mathbb{P})} to {(\mathbb{R}^n,\mathcal{B})} where {\mathcal{B}} is the usual Borel sigma algebra on {\mathbb{R}^n}. If {C} is Borel measurable and

\displaystyle \mathbb{P}(Z\in C) =1, \mathbb{E}\|Z\| <\infty ,

then we have

\displaystyle \mathbb{P}( \mathbb{E}(Z\mid \mathcal{G})\in C) =1

for any sigma algebra {\mathcal{G} \subset \mathcal{F}}.

Proof: Since {Z} takes values in {\mathbb{R}^n}, there is a family of conditional distributions {\{\mu(\omega,\cdot)\}_{\omega \in \Omega}} such that for almost all {\omega \in \Omega}, we have, for any {A\in \mathcal{B}},

\displaystyle \mathbb{E}(1_{\{Z\in A\}}\mid \mathcal{G})(\omega) = \int_{A} \mu(\omega, dz)

and for any real-valued Borel function {f} with domain {\mathbb{R}^n} and {\mathbb{E}|f(Z)| <+\infty}, we have

\displaystyle \mathbb{E}(f(Z)\mid \mathcal{G}) (\omega)= \int_{\mathbb{R}^n} f(z) \mu(\omega, dz).

These two facts come from the existence of the regular conditional distribution, an important result in measure-theoretic probability theory.

Now, taking {A=C} and {f(z) = z}, we have for almost all {\omega\in \Omega},

\displaystyle \mathbb{E}(1_{\{Z\in C\}}\mid \mathcal{G})(\omega) = \int_{C} \mu(\omega, dz)

and
\displaystyle \mathbb{E}(Z\mid \mathcal{G}) (\omega)= \int_{\mathbb{R}^n} z \mu(\omega, dz).

But since {\mathop{\mathbb P}(Z\in C)=1}, we know that for almost all {\omega \in \Omega},

\displaystyle 1= \mathbb{E}(1_{\{Z\in C\}}\mid \mathcal{G})(\omega) = \int_{C} \mu(\omega, dz).

Thus for almost all {\omega}, we have a probability distribution {\mu(\omega, dz)} on {{\mathbb R}^n} with mean equal to {\mathop{\mathbb E} (Z\mid \mathcal{G} )(\omega)} and with {\int_C \mu (\omega ,dz) =1}. This is exactly the situation of Theorem 1. Applying Theorem 1 to the probability measure {\mu(\omega, dz)}, we find that

\displaystyle \mathbb{E}(Z\mid \mathcal{G})(\omega) = \int _{\mathbb{R}^n} z\mu (\omega, dz) \in C,

for almost all {\omega \in \Omega}. \Box
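Theorem 2 can be illustrated numerically in the simplest nontrivial case, where {\mathcal{G}} is generated by a finite partition, so that {\mathbb{E}(Z\mid\mathcal{G})} is the mean of {Z} over each cell. The unit disk and the quadrant partition below are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 400_000

# Z uniform on the closed unit disk C in R^2.
theta = rng.uniform(0, 2 * np.pi, N)
r = np.sqrt(rng.uniform(0, 1, N))                 # sqrt makes the area uniform
Z = np.column_stack([r * np.cos(theta), r * np.sin(theta)])

# G generated by the four quadrants; E(Z | G) is the per-cell sample mean.
cell = 2 * (Z[:, 0] > 0) + (Z[:, 1] > 0)
cond_means = np.array([Z[cell == q].mean(axis=0) for q in range(4)])
norms = np.linalg.norm(cond_means, axis=1)
print(norms)                                      # all strictly below 1
```

Each conditional mean lands well inside the disk (at distance about 0.6 from the origin, the centroid of a quarter disk), consistent with the theorem.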

Closure property of expectation under convexity

Suppose we have a random vector {Z} and a convex set {C\subset {\mathbb R} ^n} such that

\displaystyle \mathop{\mathbb P}(Z\in C) =1.

If you are doing things with convexity, then you may wonder whether

\displaystyle \mathop{\mathbb E}(Z) \in C.

This is certainly true if {Z} takes only finitely many values in {C} or if {C} is closed. In the first case, you just verify the definition of convexity; in the second case, you may use the strong law of large numbers. But if you draw a picture and think for a while, you might wonder whether these conditions are needed: no matter what values {Z} takes, they cannot leave {C}, so the average should still belong to {C} as long as {C} is convex. In this post, we show that this is indeed the case, and we have the following theorem.

Theorem 1 For any convex set {C\subset {\mathbb R}^n}, and for any random vector {Z} such that

\displaystyle \mathop{\mathbb P}(Z\in C)=1,

its expectation is still in {C}, i.e,

\displaystyle \mathop{\mathbb E}(Z) \in C

as long as the mean exists.

Skip the following remark if you are not familiar with measure theory.

Remark 1 If you are a measure-theoretic person, you might wonder whether {C} should be Borel measurable. The answer is no; the set {C} need not be Borel measurable. To make the point clear, suppose there is an underlying probability space {(\Omega, \mathcal{F},\mathop{\mathbb P})} and {Z} is a random variable from this probability space to {(\mathbb{R}^n, \mathcal{B})}, where {\mathcal{B}} is the Borel sigma-algebra. Then we either add the condition that the event {\{\omega \in \Omega \mid Z\in C \} = F} belongs to {\mathcal{F}}, or we understand {\mathop{\mathbb P}(F)=1} to mean that {F} is a measurable event with respect to the completed probability space {(\Omega, \bar{\mathcal{F}},\bar{\mathop{\mathbb P}})}, overloading the notation {\mathop{\mathbb P}} to mean {\bar{\mathop{\mathbb P}}}.

To prepare for the proof, recall the separating hyperplane theorem for convex sets.

Theorem 2 (Separating Hyperplane Theorem) Suppose {C} and {D} are convex sets in {{\mathbb R}^n} with {C\cap D = \emptyset}. Then there exists a nonzero {a \in {\mathbb R}^{n}} such that

\displaystyle a^Tx \geq a^Ty

for all {x\in C,y\in D}.

Also recall the following little facts about convexity.

  • Any convex set in {{\mathbb R}} is an interval.
  • Any affine subspace of dimension {n-m} in {{\mathbb R}^n} is of the form {\{x\in {\mathbb R}^{n}:Ax=b\}} for some {A\in {\mathbb R}^{m\times n}} and {b \in {\mathbb R}^m}.

We are now ready to prove Theorem 1.

Proof of Theorem 1: We may suppose that {C} has nonempty interior; if not, we can work inside the affine subspace of smallest dimension containing {C}. Suppose {\mathop{\mathbb E} (Z)} is not in {C}. Then, by the separating hyperplane theorem, there exists a nonzero {a} such that

\displaystyle L := a^T\mathop{\mathbb E}(Z) \geq a^Tx, \quad \forall x \in C.

Since {Z\in C} almost surely, we should have

\displaystyle a^TZ \leq L

almost surely. Since {\mathop{\mathbb E}( a^T Z)= a^T\mathop{\mathbb E}( Z) = L}, we see that {a^T Z= L} with probability {1}. Since the intersection of the hyperplane {\{x : a^Tx = L\}} with {C} is still convex, we see that {Z} only takes values in a convex subset of an {(n-1)}-dimensional affine space.

Repeating the above argument, we can decrease the dimension until {n=1}. After a proper translation and rotation, we may say that {Z} takes its values in an interval in {\mathbb{R}}, and we want to argue that the mean of {Z} is in that interval.

This is almost trivial. Suppose first that the interval is bounded. If the interval is closed, then since taking expectations preserves order, i.e., {X\geq Y \implies \mathop{\mathbb E} X\geq \mathop{\mathbb E} Y}, its mean is in the interval. If the interval is half-open and the mean is not in the interval, then {\mathop{\mathbb E} Z} must be the open end of the interval, since expectation preserves order; but this means that {Z} puts full measure on the open end, which contradicts the assumption that {Z} is in the interval with probability one. The case where both ends are open is handled in the same way. If the interval is unbounded in one direction, the previous argument still works, and if it is all of {\mathbb{R}}, then certainly {\mathop{\mathbb E} Z \in\mathbb{R}}. This completes the proof. \Box
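To see the dimension-reduction step of the proof in action, the sketch below (the segment and the Beta-distributed coefficient are arbitrary choices of mine) puts {Z} on a one-dimensional convex set, an open segment in {\mathbb{R}^2}, and checks that the sample mean lands back on the segment:

```python
import numpy as np

rng = np.random.default_rng(6)

# Z supported on the open segment from p to q in R^2 (a convex set with
# empty interior); the proof reduces to such lower-dimensional situations.
p, q = np.array([0.0, 1.0]), np.array([2.0, 0.0])
t = rng.beta(2.0, 5.0, 200_000)                  # coefficient in (0, 1) a.s.
Z = p + np.outer(t, q - p)

mean = Z.mean(axis=0)
# Express mean = p + s (q - p); the claim is 0 < s < 1, i.e. the mean lies
# on the open segment.
s = (mean - p) @ (q - p) / ((q - p) @ (q - p))
print(mean, s)
```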

Solution to (general) Truncated Moment Problem

We are going to solve the truncated moment problem in this post. The theorem we establish is more general than the original problem itself. As the following theorem is a bit abstract, you can skip to Corollary 2 to see what the truncated moment problem is and why it generalizes to the form of Theorem 1.

Theorem 1 Suppose {X} is a random transformation from a probability space {(A,\mathcal{A},\mathop{\mathbb P})} to a measurable space {(B,\mathcal{B})}, where each singleton set of {B} is in {\mathcal{B}}. Let each {f_i} be a real-valued (Borel measurable) function with domain {B}, {i=1,\dots,m}. Given

\displaystyle (\mathbb{E}f_i(X))_{i=1,\dots,m}

and assuming they are all finite, there exists a random variable {Y\in B} such that {Y} takes no more than {m+1} values in {B}, and

\displaystyle (\mathbb{E}f_i(Y))_{i=1,\dots,m} = (\mathbb{E}f_i(X))_{i=1,\dots,m}.

(If you are not familiar with the terms Borel measurable, measurable space, and sigma-algebras {\mathcal{A}}, {\mathcal{B}}, then just ignore them. I put these terms here to make sure that the theorem is rigorous enough.)

Let me parse the theorem for you. Essentially, the theorem says that given {m} many expectations, no matter what the source of randomness is, i.e., what {X} is, we can always find a finitely valued random variable ({Y} in the theorem) that achieves the same expectations.

To have a concrete sense of what is going on, consider the following Corollary of Theorem 1. It is the original truncated moment problem.

Corollary 2 (Truncated Moment Problem) For any real valued random variable {X\in {\mathbb R}} with its first {m} moments all finite, i.e., for all {1\leq i\leq m}

\displaystyle \mathop{\mathbb E}|X|^i < \infty,

there exists a real valued discrete random variable {Y} which takes no more than {m+1} values in {{\mathbb R}} and its first {m} moments are the same as {X}, i.e.,

\displaystyle (\mathbb{E}Y,\mathbb{E}(Y^2),\dots, \mathbb{E}(Y^m) )=(\mathbb{E}X,\mathbb{E}(X^2),\dots, \mathbb{E}(X^m)).

The original truncated moment problem asks whether, given the (uncentered) moments, we can always find a finite discrete random variable that matches all the moments. It should be clear that this is a simple consequence of Theorem 1 by letting {B={\mathbb R}} and {f_i(x) = x^{i}}, {i=1,\dots,m}.
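As a concrete illustration (with {m=2} and {X} exponential, both my own choices), a two-point {Y} already matches the first two moments; the corollary only guarantees that {m+1=3} points suffice.

```python
import numpy as np

# X ~ Exponential(1): E X = 1 and E X^2 = 2.  Place Y at mu +/- sigma with
# probability 1/2 each; this matches both moments.
mu, m2 = 1.0, 2.0
sigma = np.sqrt(m2 - mu**2)                      # 1.0
points = np.array([mu - sigma, mu + sigma])      # [0.0, 2.0]
probs = np.array([0.5, 0.5])

print("E Y   =", probs @ points)                 # 1.0
print("E Y^2 =", probs @ points**2)              # 2.0
```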

There is also a multivariate version of truncated moment problem which can also be regarded as a special case of Theorem 1.

Corollary 3 (Truncated Moment Problem, Multivariate Version) Let {X=(X_1,\dots,X_n)\in \mathbb{R}^n} be a real random vector all of whose moments up to order {k} are finite, i.e.,

\displaystyle \mathop{\mathbb E}(\Pi_{i=1}^n|X_{i}|^{\alpha_i}) <\infty

for all {{1\leq \sum \alpha_i\leq k}}, where each {\alpha_i} is a nonnegative integer. The total number of such moments is {{n+k \choose k}-1}. Then there is a real random vector {Y \in \mathbb{R}^n} that takes no more than {{n+k \choose k}} values, and

\displaystyle (\mathop{\mathbb E}(\Pi_{i=1}^nX_{i}^{\alpha_i}))_{1\leq \sum \alpha_i\leq k} = (\mathop{\mathbb E}(\Pi_{i=1}^nY_{i}^{\alpha_i})) _{1\leq \sum \alpha_i\leq k}.

Though Theorem 1 is quite general and may look scary, it is actually a simple consequence of the following lemma together with a convex hull argument.

Lemma 4 For any convex set {C \subset \mathbb{R}^k} and any random variable {Z} that has a finite mean and takes values only in {C}, i.e.,

\displaystyle \mathop{\mathbb E}(Z) \in \mathbb{R}^k, \mathop{\mathbb P}(Z\in C) =1,

we have

\displaystyle \mathop{\mathbb E} (Z) \in C.

The above lemma is trivially true if {C} is closed or if {Z} takes only finitely many values. But it remains true when {C} is only assumed to be convex, as we showed in this post.

We are now ready to show Theorem 1.

Proof of Theorem 1: Consider the set

\displaystyle L = \{ (f_i(x))_{i=1,\dots,m}\mid x\in B \}.

The convex hull of this set {L} is

\displaystyle \text{conv}(L) = \{ \sum_{j=1}^l \alpha _j a_j\mid \alpha_j \geq 0 ,\sum_{j=1}^l \alpha_j =1, a_j\in L, l \in {\mathbb N}\}.

Now take the random variable {Z=(f_i(X))_{i=1,\dots,m}}, which takes values only in {L\subset \text{conv}(L)}. By Lemma 4, we know that

\displaystyle \mathop{\mathbb E} Z \in \text{conv}(L).

Note that every element in {\text{conv}(L)} has a FINITE representation in terms of {a_j}s!

This means we can find {l\in {\mathbb N}}, {\alpha_j\geq 0, \sum_{j = 1}^l \alpha_j =1} and {a_j \in L, j=1,\dots,l} such that

\displaystyle \sum_{j=1}^l \alpha_ja_j = \mathop{\mathbb E} Z = (\mathop{\mathbb E} f_i(X))_{i=1,\dots,m}.

Since each {a_j = (f_i(x_j))_{i=1,\dots,m}} for some {x_j \in B}, we can simply take the distribution of {Y} to be

\displaystyle \mathop{\mathbb P}(Y = x_j) = \alpha_j, \quad \forall j =1,\dots,l.

Finally, apply Carathéodory's theorem to conclude that we can take {l\leq m+1}. \Box
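The Carathéodory step at the end of the proof is constructive: repeatedly find an affine dependence among the points and shift mass along it until some weight vanishes. A minimal numpy sketch of this reduction (my own implementation, not from the text):

```python
import numpy as np

def caratheodory_reduce(points, weights, tol=1e-12):
    """Reduce a convex combination of points in R^m to at most m+1 points
    with the same weighted sum (i.e. the same m expectations)."""
    P = np.asarray(points, dtype=float)     # shape (l, m)
    w = np.asarray(weights, dtype=float)    # shape (l,), w >= 0, sum w = 1
    while len(w) > P.shape[1] + 1:
        # Find nonzero c with sum_j c_j P_j = 0 and sum_j c_j = 0:
        # a null vector of the (m+1) x l matrix [P^T; 1], which exists
        # whenever l > m + 1.
        A = np.vstack([P.T, np.ones(len(w))])
        c = np.linalg.svd(A)[2][-1]         # last right singular vector
        if not np.any(c > tol):             # sum c = 0, so flip if needed
            c = -c
        # Largest step keeping all weights nonnegative; one weight hits 0.
        t = np.min(w[c > tol] / c[c > tol])
        w = w - t * c                       # moments and total mass preserved
        keep = w > tol
        P, w = P[keep], w[keep]
    return P, w

rng = np.random.default_rng(7)
m, l = 3, 40
P = rng.normal(size=(l, m))                 # 40 points a_j in R^3
w = rng.dirichlet(np.ones(l))               # a convex combination
target = w @ P

P2, w2 = caratheodory_reduce(P, w)
print(len(w2), "points;", "sum preserved:", np.allclose(w2 @ P2, target))
```

Here the points play the role of the {a_j = (f_i(x_j))_{i=1,\dots,m}}, and the surviving {x_j} with their weights {\alpha_j} define the law of {Y}.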