# A variance-bias decomposition of L1 norm

Suppose we have real random variables ${X,X',Y,Y'}$ with ${X,X'\sim F}$ and ${Y,Y'\sim G}$, where ${F}$ and ${G}$ are cumulative distribution functions, ${X,X',Y,Y'}$ are all independent, and ${\mathbb{E}|X|,\mathbb{E}|Y|}$ are finite. We prove the following theorem.

Theorem 1 (Variance-Bias decomposition of ${L1}$ norm) For independent random variables ${X,X',Y,Y'}$ defined above, we have

$\displaystyle 2\mathbb{E}|X-Y| = \mathbb{E}|X-X'| + \mathbb{E}|Y-Y'| +2 \int (G(u)-F(u))^2du.$

Thus

\displaystyle \begin{aligned} 2\mathbb{E}|X-Y| \geq \mathbb{E}|X-X'| + \mathbb{E}|Y-Y'| \end{aligned} \ \ \ \ \ (1)

with equality if and only if ${G=F}$.
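As a sanity check, the identity can be verified exactly for discrete distributions, where both CDFs are step functions and the integral reduces to a finite sum. The sketch below is only an illustration: the two distributions and the helper names (`cdf`, `mean_abs`, `cdf_gap_integral`) are my own choices, not part of the theorem.

```python
from fractions import Fraction as Fr

def cdf(values, probs, u):
    """P(V <= u) for a discrete random variable V."""
    return sum(p for v, p in zip(values, probs) if v <= u)

def mean_abs(xs, ps, ys, qs):
    """E|X - Y| for independent discrete X and Y."""
    return sum(p * q * abs(x - y)
               for x, p in zip(xs, ps) for y, q in zip(ys, qs))

def cdf_gap_integral(xs, ps, ys, qs):
    """Integral of (G - F)^2; both CDFs are constant between support points."""
    grid = sorted(set(xs) | set(ys))
    return sum((cdf(ys, qs, a) - cdf(xs, ps, a)) ** 2 * (b - a)
               for a, b in zip(grid, grid[1:]))

# Hypothetical example: X uniform on {0, 2}, Y on {1, 3, 4}.
xs, ps = [0, 2], [Fr(1, 2), Fr(1, 2)]
ys, qs = [1, 3, 4], [Fr(1, 4), Fr(1, 2), Fr(1, 4)]

lhs = 2 * mean_abs(xs, ps, ys, qs)
rhs = (mean_abs(xs, ps, xs, ps) + mean_abs(ys, qs, ys, qs)
       + 2 * cdf_gap_integral(xs, ps, ys, qs))
print(lhs, rhs)  # both sides equal 4
```

Here the bias term ${2\int (G-F)^2}$ contributes ${15/8}$ of the total ${4}$, so the inequality (1) is strict, as expected for ${F\neq G}$.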

The quantity ${\mathbb{E}|X-X'|}$ is usually referred to as the mean absolute difference, and it measures the spread of a distribution. I don’t know the term for the quantity ${\mathbb{E}|X-Y|}$, but what it measures is the difference between the distributions ${F}$ and ${G}$. I think cross mean difference would be a nice name.

The equality can be considered as an analogue of the well-known variance-bias decomposition of estimators/predictors in statistics. If we think of using ${X}$ to estimate/predict ${Y}$, then twice the expected error (the cross mean difference) in absolute value (${L1}$ norm in more advanced terms) is the sum of two parts: the mean absolute differences of ${X}$ and ${Y}$, i.e., $\mathbb{E}|X-X'|, \mathbb{E}|Y-Y'|$, which can be considered as variance, and the difference between the two distributions, i.e., ${2\int(F-G)^2}$, which can be considered as bias.

There is an analogue in terms of the usual square loss (or ${L2}$ norm) and it is

$\displaystyle 2\mathbb{E}(X-Y)^2 = \mathbb{E}(X-X')^2 + \mathbb{E}(Y-Y')^2 + 2(\mathbb{E}X-\mathbb{E}Y)^2.$

Under this setting, we also see a decomposition of the estimation/prediction error into the variance of ${X}$ and ${Y}$, i.e., ${\mathbb{E}(X-X')^2, \mathbb{E}(Y-Y')^2}$, and the squared difference of the means, which can be considered as bias.
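The square-loss identity can be checked the same way. The following is a minimal sketch with an arbitrary pair of discrete distributions (my choice, purely for illustration); it uses exact rational arithmetic so that the two sides can be compared with `==`.

```python
from fractions import Fraction as Fr

def expect(values, probs):
    """E[V] for a discrete random variable V."""
    return sum(p * v for v, p in zip(values, probs))

def mean_sq_diff(xs, ps, ys, qs):
    """E(X - Y)^2 for independent discrete X and Y."""
    return sum(p * q * (x - y) ** 2
               for x, p in zip(xs, ps) for y, q in zip(ys, qs))

# Hypothetical example: X uniform on {0, 2}, Y on {1, 3, 4}.
xs, ps = [0, 2], [Fr(1, 2), Fr(1, 2)]
ys, qs = [1, 3, 4], [Fr(1, 4), Fr(1, 2), Fr(1, 4)]

lhs2 = 2 * mean_sq_diff(xs, ps, ys, qs)
rhs2 = (mean_sq_diff(xs, ps, xs, ps) + mean_sq_diff(ys, qs, ys, qs)
        + 2 * (expect(xs, ps) - expect(ys, qs)) ** 2)
print(lhs2, rhs2)  # both sides equal 21/2
```

Note that ${\mathbb{E}(X-X')^2 = 2\,\mathrm{Var}(X)}$, which is why these terms play the role of variance.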

The theorem assumes that ${X,Y}$ both have finite first moments. In the case that either ${X}$ or ${Y}$ has no finite first moment, the decomposition and the inequality are still true, as can be seen by inspecting the proof below. But in that case equality in inequality (1) does not necessarily imply that ${F =G}$.

Proof: The trick to establish the equality is to write the quantity ${\mathbb{E}|X-Y|}$ in the following form.

\displaystyle \begin{aligned} 2\mathbb{E}|X-Y| & = 2\mathbb{E}(X-Y) 1_{\{ X\geq Y\}} + 2\mathbb{E}(Y-X) 1_{\{ Y\geq X\}}\\ & =2 \mathbb{E}\int 1_{\{Y\leq u\leq X\}}du + 2 \mathbb{E}\int 1_{\{ X\leq u\leq Y\}}du\\ & =2 \int \mathbb{P}(Y\leq u\leq X)du+ 2\int \mathbb{P}(X\leq u\leq Y)du\\ & =2\int G(u)(1-F(u))du +2\int F(u)(1-G(u))du\\ & = 2\int G(u)(1-F(u)) + F(u)(1-G(u)) du. \end{aligned} \ \ \ \ \ (2)

The third equality is due to Fubini’s theorem and the fourth is because of the independence between ${X}$ and ${Y}$, together with the fact that ${\mathbb{P}(X\geq u) = 1-F(u)}$ for all but countably many ${u}$. Similarly, we have

$\displaystyle \mathbb{E}|X-X'| +\mathbb{E}|Y-Y'| =2 \int F(u)(1-F(u)) + G(u)(1-G(u)) du.$

Thus the difference of ${2\mathbb{E}|X-Y|}$ and ${\mathbb{E}|X-X'|+\mathbb{E}|Y-Y'|}$, which is finite because ${\mathbb{E}|X|,\mathbb{E}|Y|<\infty}$, is

\displaystyle \begin{aligned} &2\mathbb{E}|X-Y|-\mathbb{E}|X-X'|-\mathbb{E}|Y-Y'|\\ =&2\int G(u)(1-F(u)) + F(u)(1-G(u)) du -2 \int F(u)(1-F(u)) + G(u)(1-G(u)) du\\ = &2\int (G(u)-F(u))^2du \geq 0\\ \end{aligned} \ \ \ \ \ (3)

Thus the decomposition in the theorem and the inequality (1) are established. Now we argue that equality holds in inequality (1) if and only if ${F=G}$. If ${F=G}$, then inequality (1) obviously becomes an equality. Conversely, if inequality (1) becomes an equality, then by the last line of (3), we have

$\displaystyle \int (G(u)-F(u))^2 du = 0.$

This means that

$\displaystyle G(u) = F(u)\quad \text{almost everywhere}.$

Since ${G}$ and ${F}$ are right continuous, we have for all ${u\in \mathbb{R}}$,

$\displaystyle G(u) = F(u).$

$\Box$

# Sandwich inequality of optimal gap and distance to solutions of LP

Suppose we have a general optimization problem.

\displaystyle \begin{aligned} & \underset{x \in \mathbb{R}^n}{\text{minimize}} & & f(x)\\ & {\text{subject to}} & & x \in \Omega. \\ \end{aligned} \ \ \ \ \ (1)

Also, suppose problem (1) has a minimum and the minimum can be achieved by a unique minimizer ${x^*}$.

Now if I have a point ${x}$ such that ${f(x) - f(x^*) }$ is very small, how small is the distance ${\|x-x^*\|}$? We might expect that ${f(x) - f(x^*) \rightarrow 0}$ will imply that ${x \rightarrow x^*}$. This is true if ${\Omega}$ is compact and ${f}$ is continuous. But this does not tell us the quantitative relationship between the optimal gap, i.e., ${f(x) -f(x^*)}$, and the distance to the solution, i.e., ${\|x-x^*\|}$.

In this post, I am going to show that for linear programming (LP), the optimal gap and distance to solutions are the same up to a multiplicative constant which only depends on the problem data.

To start, consider an LP in the standard form, i.e.,

\displaystyle \begin{aligned} & \underset{x \in \mathbb{R}^n}{\text{minimize}} & & c^Tx\\ & {\text{subject to}} & & Ax =b \\ & & & x\geq 0. \end{aligned} \ \ \ \ \ (2)

where the decision variable is ${x}$ and the problem data are ${A\in \mathbb{R}^{m\times n},b\in \mathbb{R}^m,c\in \mathbb{R}^n}$. Here ${x\geq 0}$ means each coordinate of ${x}$ is nonnegative.

Denote the solution set of problem (2) by ${X^* = \arg \min\{ c^Tx| Ax = b, x\geq 0\} }$, and the distance to the solution set ${X^*}$ by ${\text{dist}(x,X^*) = \inf_{x^* \in X^*} \|x-x^*\|}$. Note that the norm here is arbitrary, not necessarily the Euclidean norm.

We are now ready to state the theorem.

Theorem 1 (Sandwich inequality of optimal gap and distance to solutions of LP) For problem (2), there exist constants ${C_0, c_0>0}$ which depend only on ${A,b,c}$ and ${\|\cdot\|}$ such that for all feasible ${x}$, i.e., ${Ax= b, x\geq 0}$, and any ${x^*\in X^*}$,

$\displaystyle C_0 \text{dist}(x,X^*)\geq c^Tx - c^Tx^* \geq c_0 \text{dist}(x,X^*).$

The above theorem shows that the role of optimal gap, i.e., ${c^Tx - c^Tx^*}$, and the distance to the solution set, i.e. ${\text{dist}(x,X^*)}$, are the same up to a multiplicative constant. The right inequality of the theorem is usually referred to as linear growth in the optimization literature.

The proof below is constructive and we can in fact take ${c_0 = \frac{\epsilon}{\max\{2B,1\}}}$ where ${ B = \max_{ 1\leq i \leq l} \|x_i\|}$ and ${\epsilon = \min_{ m+1\leq i\leq l, n+1\leq j \leq s} \{ c^Tx_i-c^Tx^*, c^T\gamma_j\}}$ for ${x^*\in X^*}$ and ${C_0 = \|c\|_*}$. Here ${\| y\|_* = \sup \{ u^Ty\mid \|u\|\leq 1\}}$ is the dual norm and $x_i,\gamma_j$ are the extreme points and extreme rays of the feasible region. We assume there are $l$ extreme points and $s$ extreme rays, and that the first $m$ of the $x_i$s and the first $n$ of the $\gamma_j$s are in the optimal set $X^*$ (in this indexing ${m}$ and ${n}$ are counts and are not the dimensions of ${A}$). See the proof for more detail.

The idea of the proof mainly relies on the extreme point and extreme ray representation of the feasible region and the optimal set, i.e., ${\{x : Ax = b, x\geq 0\}}$ and $X^*$.
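To get a feeling for these constants, here is a small numerical sketch on a toy LP of my own choosing: minimize ${c^Tx}$ over the probability simplex in ${\mathbb{R}^3}$ with ${c=(0,1,2)}$, measuring distance in the ${\ell_1}$ norm (whose dual norm is the ${\ell_\infty}$ norm). The feasible region has three extreme points and no extreme rays, and the unique minimizer is the first vertex. All names in the code are illustrative assumptions.

```python
import random

c = (0.0, 1.0, 2.0)
vertices = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
x_star = vertices[0]  # unique minimizer: c is smallest in coordinate 0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm1(u):
    return sum(abs(a) for a in u)

B = max(norm1(v) for v in vertices)                          # = 1
eps = min(dot(c, v) - dot(c, x_star) for v in vertices[1:])  # = 1
c0 = eps / max(2 * B, 1)                                     # = 0.5
C0 = max(abs(a) for a in c)  # dual norm of l1 is l-infinity, = 2

rng = random.Random(0)

def random_feasible():
    """A random point of the simplex {x >= 0, sum(x) = 1}."""
    w = [rng.random() + 1e-9 for _ in range(3)]
    total = sum(w)
    return tuple(a / total for a in w)

tol = 1e-9  # slack for floating-point rounding
sandwich_ok = True
for _ in range(1000):
    x = random_feasible()
    gap = dot(c, x) - dot(c, x_star)
    dist = norm1([a - b for a, b in zip(x, x_star)])
    sandwich_ok = sandwich_ok and (C0 * dist + tol >= gap >= c0 * dist - tol)
print(sandwich_ok)
```

In this example the lower constant is attained along the edge from the first to the second vertex, where the gap equals exactly half the ${\ell_1}$ distance.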

Proof: The feasible region can be written as

$\displaystyle \{x|Ax=b,x\geq 0\} = \{ \sum_{i=1}^l \alpha_i x_i + \sum_{j=1}^s \beta_j \gamma_j | \sum_i \alpha_i = 1, \alpha_i\geq 0, \beta_j\geq 0\}.$

Here ${x_i}$s are extreme points of the feasible region and ${\gamma_j}$s are extreme rays. By scaling the extreme rays, we can assume that ${\|\gamma_j\| =1}$ for all ${j}$.

The optimal set can also be written as

$\displaystyle X^* = \{ \sum_{i=1}^m \alpha_i x_i + \sum_{j=1}^n \beta_j \gamma_j | \sum_i \alpha_i = 1, \alpha_i\geq 0, \beta_j\geq 0\}.$

For notational simplicity, we assume here that the first ${m}$ of the ${x_i}$s and the first ${n}$ of the ${\gamma_j}$s are in the optimal set and the rest of the ${x_i}$s and ${\gamma_j}$s are not.

We denote ${B = \max_{ 1\leq i \leq l} \|x_i\|}$ and ${\epsilon = \min_{ m+1\leq i\leq l, n+1\leq j \leq s} \{ c^Tx_i-c^Tx^*, c^T\gamma_j\}}$ where ${x^*\in X^*}$. Note ${\epsilon>0}$: each extreme point ${x_i}$ not in the optimal set satisfies ${c^Tx_i > c^Tx^*}$, and each extreme ray ${\gamma_j}$ not in the optimal set satisfies ${c^T\gamma_j>0}$ (we have ${c^T\gamma_j\geq 0}$ for every extreme ray, since otherwise the problem would be unbounded).

We first prove the second inequality, i.e., ${c^Tx - c^Tx^* \geq c_0 \text{dist}(x,X^*)}$.

Take an arbitrary feasible ${x}$. It can be written as

$\displaystyle x = \sum_{i=1}^m a_i x_i + \sum_{i=m+1}^l a_i x_i + \sum _{j=1}^n b_j \gamma_j + \sum_{j=n+1}^s b_j \gamma_j$

for some ${a_i \geq 0, \sum_i a_i =1}$ and ${b_j\geq 0}$.

The objective value of ${x}$ is then

$\displaystyle c^Tx = \sum_{i=1}^m a_i c^T x_i + \sum_{i=m+1}^l a_i c^T x_i +\sum_{j=n+1}^s b_j c^T \gamma_j.$

We use the fact that ${c^T\gamma_j=0}$ for all ${j\leq n}$ here.

Subtracting ${c^Tx^*}$ from the above, we have

\displaystyle \begin{aligned} c^Tx -c^Tx^*& = (\sum_{i=1}^m a_i-1)c^T x^* + \sum_{i=m+1}^l a_i c^T x_i +\sum_{j=n+1}^s b_j c^T \gamma_j \\ & = -(\sum_{i=m+1}^l a_i)c^T x^* + \sum_{i=m+1}^l a_i c^T x_i +\sum_{j=n+1}^s b_j c^T \gamma_j \\ & = \sum_{i=m+1}^l a_i (c^T x_i - c^Tx^*) +\sum_{j=n+1}^s b_j c^T \gamma_j \\ &\geq (\sum_{i=m+1}^l a_i) \epsilon + (\sum_{j=n+1}^s b_j )\epsilon \end{aligned} \ \ \ \ \ (3)

The first equality uses ${c^Tx_i = c^Tx^*}$ for ${i\leq m}$, the second equality is due to ${\sum_i a_i =1}$, and the inequality follows from the definition of ${\epsilon}$ and the fact that the ${b_j,a_i}$s are nonnegative.

The distance between ${x}$ and ${X^*}$ is the infimum of

$\displaystyle \| \sum_{i=1}^m( a_i-\alpha_i)x_i + \sum_{i=m+1}^l a_i x_i + \sum _{j=1}^n (b_j-\beta_j) \gamma_j + \sum_{j=n+1}^s b_j \gamma_j \|.$

By taking ${\alpha_i \geq a_i}$ for ${i\leq m}$ with ${\sum_{i=1}^m \alpha_i =1}$ (possible since ${\sum_{i=1}^m a_i\leq 1}$) and ${\beta_j = b_j}$ for ${j\leq n}$, and applying the triangle inequality to the above quantity, we have

\displaystyle \begin{aligned} &\| \sum_{i=1}^m( a_i-\alpha_i)x_i + \sum_{i=m+1}^l a_i x_i + \sum _{j=1}^n (b_j-\beta_j) \gamma_j + \sum_{j=n+1}^s b_j \gamma_j \| \\ &\leq \sum_{i=1}^m( \alpha_i-a_i)\|x_i \|+ \sum_{i=m+1}^l a_i\|x_i \|+\sum_{j=n+1}^s b_j \|\gamma_j \|\\ &\leq \sum_{i=1}^m (\alpha_i -a_i) B + \sum_{i=m+1}^l a_i B + \sum_{j=n+1}^s b_j\\ &= (1-\sum_{i=1}^m a_i)B + (\sum_{i=m+1}^l a_i)B + \sum_{j=n+1}^s b_j\\ &= 2B (\sum_{i=m+1}^l a_i) +\sum_{j=n+1}^s b_j. \end{aligned} \ \ \ \ \ (4)

The first inequality is the triangle inequality together with ${\alpha_i\geq a_i, a_i\geq 0, b_j\geq 0}$ and ${\beta_j = b_j}$. The second inequality applies the definition of ${B}$ and ${\|\gamma_j\|=1}$. The first equality is due to ${\sum_{i=1}^m \alpha_i =1}$ and the second equality is due to ${\sum_i a_i =1}$.

Thus the distance between ${x}$ and ${X^*}$ is bounded above by

$\displaystyle \text{dist}(x,X^*) \leq 2B(\sum_{i=m+1}^l a_i) +\sum_{j=n+1}^s b_j.$

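Since ${c^Tx -c^Tx^* \geq( \sum_{i = m+1}^l a_i + \sum_{j=n+1}^s b_j)\epsilon}$ by our previous argument, and ${2B(\sum_{i=m+1}^l a_i) +\sum_{j=n+1}^s b_j \leq \max\{2B,1\}(\sum_{i = m+1}^l a_i + \sum_{j=n+1}^s b_j)}$, we see that setting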

$\displaystyle c_0 = \frac{\epsilon}{\max\{2B,1\}}$

gives us

$\displaystyle c^Tx - c^Tx^* \geq c_0 \text{dist}(x,X^*).$

We now prove the inequality

$\displaystyle C_0 \text{dist}(x,X^*)\geq c^Tx - c^Tx^*.$

Note that the infimum in ${\text{dist}(x,X^*) = \inf_{x^* \in X^*} \|x-x^*\|}$ is actually achieved by some ${x^*}$. The reason is that we can first pick an ${x'\in X^*}$, then

$\displaystyle \inf_{x^* \in X^*} \|x-x^*\| = \inf_{x^* \in X^*, \|x-x^*\| \leq \|x-x'\|} \|x-x^*\|.$

But the set ${X^* \cap \{x^* | \|x-x^*\| \leq \|x-x'\| \}}$ is bounded and closed (${X^*}$ is closed as it is the set of convex combinations of finitely many points plus conic combinations of finitely many extreme rays), thus the Weierstrass theorem tells us that the infimum is actually achieved by some ${x^*\in X^*}$.

Now take ${x^*}$ such that ${\|x-x^*\| =\text{dist}(x,X^*)}$. We have

$\displaystyle c^Tx -c^Tx^* \leq \|c\|_* \|x-x^*\|=\|c\|_*\text{dist}(x,X^*)$

where ${\|\cdot\|_*}$ is the dual norm of ${\|\cdot\|}$. Thus letting ${C_0 = \|c\|_*}$ finishes the proof. $\Box$

From the proof, we see that one possible choice of ${c_0}$ and ${C_0}$ is ${c_0 = \frac{\epsilon}{\max\{2B,1\}}}$, where ${ B = \max_{ 1\leq i \leq l} \|x_i\|}$ and ${\epsilon = \min_{ m+1\leq i\leq l, n+1\leq j \leq s} \{ c^Tx_i-c^Tx^*, c^T\gamma_j\}}$ for ${x^*\in X^*}$, and ${C_0 = \|c\|_*}$. These are not optimal and can be sharpened. I probably will give a sharper constant in a future post.
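The theorem does not need a unique minimizer. As an illustration, the sketch below uses a toy LP of my own choosing: minimize ${c^Tx}$ over the probability simplex in ${\mathbb{R}^3}$ with ${c=(0,0,1)}$, so that ${X^*}$ is the whole edge between the first two vertices. Distances are Euclidean (the Euclidean norm is its own dual), and the helper `dist_to_segment` is a hypothetical name for the projection onto that edge.

```python
import math
import random

c = (0.0, 0.0, 1.0)
p, q, r = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
# X* is the segment [p, q]; r is the only non-optimal extreme point.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def dist_to_segment(x, p, q):
    """Euclidean distance from x to the segment [p, q]."""
    d = sub(q, p)
    t = dot(sub(x, p), d) / dot(d, d)
    t = min(1.0, max(0.0, t))  # clip the projection to the segment
    proj = tuple(a + t * b for a, b in zip(p, d))
    e = sub(x, proj)
    return math.sqrt(dot(e, e))

B = 1.0                        # every vertex has Euclidean norm 1
eps = dot(c, r) - dot(c, p)    # = 1
c0 = eps / max(2 * B, 1)       # = 0.5
C0 = math.sqrt(dot(c, c))      # = 1

rng = random.Random(1)
tol = 1e-9  # slack for floating-point rounding
sandwich_ok = True
for _ in range(1000):
    w = [rng.random() + 1e-9 for _ in range(3)]
    x = tuple(a / sum(w) for a in w)
    gap = dot(c, x) - dot(c, p)
    dist = dist_to_segment(x, p, q)
    sandwich_ok = sandwich_ok and (C0 * dist + tol >= gap >= c0 * dist - tol)
print(sandwich_ok)
```

For this example a direct computation gives ${\text{dist}(x,X^*) = \sqrt{3/2}\,x_3}$ and gap ${x_3}$, so the true ratio sits strictly between ${c_0 = 1/2}$ and ${C_0 = 1}$, consistent with the remark that the constants are not optimal.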

# Close property of expectation under convexity

Suppose we have a random vector ${Z}$ and a convex set ${C\subset {\mathbb R} ^n}$ such that

$\displaystyle \mathop{\mathbb P}(Z\in C) =1.$

If you are doing things with convexity, then you may wonder whether

$\displaystyle \mathop{\mathbb E}(Z) \in C.$

This is certainly true if ${Z}$ only takes finitely many values in ${C}$ or ${C}$ is closed. In the first case, you just verify the definition of convexity, and in the second case, you may use the strong law of large numbers. But if you draw a picture and think for a while, you might wonder whether these conditions are needed, as it looks like no matter what value ${Z}$ takes, it cannot go out of ${C}$ and the average should still belong to ${C}$ as long as ${C}$ is convex. In this post, we are going to show that this is indeed the case, which gives the following theorem.

Theorem 1 For any convex set ${C\subset {\mathbb R}^n}$, and for any random vector ${Z}$ such that

$\displaystyle \mathop{\mathbb P}(Z\in C)=1,$

its expectation is still in ${C}$, i.e.,

$\displaystyle \mathop{\mathbb E}(Z) \in C$

as long as the mean exists.
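A quick example where neither easy case applies: take ${C=[0,1)}$, which is convex but not closed, and let ${Z = 1-2^{-k}}$ with probability ${2^{-k}}$ for ${k=1,2,\dots}$, so ${Z}$ takes infinitely many values approaching the excluded endpoint. A direct computation gives ${\mathop{\mathbb E} Z = 1 - \sum_k 4^{-k} = 2/3 \in C}$. The sketch below (my own illustration, with the truncation level `K` an arbitrary choice) brackets the mean using exact partial sums.

```python
from fractions import Fraction as Fr

K = 50  # truncation level (arbitrary)
probs = [Fr(1, 2 ** k) for k in range(1, K + 1)]
values = [1 - Fr(1, 2 ** k) for k in range(1, K + 1)]

partial_mean = sum(p * v for p, v in zip(probs, values))
tail_mass = 1 - sum(probs)  # probability of the omitted values, = 2^(-K)

# The omitted values lie in [0, 1), so the true mean is bracketed by:
lower = partial_mean
upper = partial_mean + tail_mass
print(float(lower), float(upper))
```

Both bounds lie in ${[0,1)}$ and enclose ${2/3}$, in agreement with the theorem.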

Skip the following remark if you are not familiar with measure theory.

Remark 1 If you are a measure theoretic person, you might wonder whether ${C}$ should be Borel measurable. The answer is no. The set ${C}$ need not be Borel measurable. To make the point clear, suppose there is an underlying probability space ${(\Omega, \mathcal{F},\mathop{\mathbb P})}$ and ${Z}$ is a random variable from this probability space to ${(\mathbb{R}^n, \mathcal{B})}$ where ${\mathcal{B}}$ is the Borel sigma-algebra. Then we can either add the condition that the event ${F = \{\omega \in \Omega \mid Z(\omega)\in C \}}$ belongs to ${\mathcal{F}}$, or understand ${\mathop{\mathbb P}(F)=1}$ as saying that ${F}$ is a measurable event with respect to the completed probability space ${(\Omega, \bar{\mathcal{F}},\bar{\mathop{\mathbb P}})}$, overloading the notation ${\mathop{\mathbb P}}$ to mean ${\bar{\mathop{\mathbb P}}}$. Here the completion of ${(\Omega, \mathcal{F},\mathop{\mathbb P})}$ is taken with respect to the probability measure ${\mathop{\mathbb P}}$.

To prepare for the proof, recall the separating hyperplane theorem for convex sets.

Theorem 2 (Separating Hyperplane theorem) Suppose ${C}$ and ${D}$ are convex sets in ${{\mathbb R}^n}$ and ${C\cap D = \emptyset}$, then there exists a nonzero ${a \in {\mathbb R}^{n}}$ such that

$\displaystyle a^Tx \geq a^Ty$

for all ${x\in C,y\in D}$.

Also recall the following little facts about convexity.

• Any convex set in ${{\mathbb R}}$ is always an interval.
• Any affine space of ${n-m}$ dimension in ${{\mathbb R}^n}$ is of the form ${\{x\in {\mathbb R}^{n}:Ax=b\}}$ for some ${A\in {\mathbb R}^{m\times n}}$ and ${b \in {\mathbb R}^m}$.

We are now ready to prove Theorem 1.

Proof of Theorem 1: We may suppose that ${C}$ has nonempty interior, since otherwise we can work in the affine space of smallest dimension containing ${C}$. Suppose ${\mathop{\mathbb E} (Z)}$ is not in ${C}$. Then by the separating hyperplane theorem, applied to the convex sets ${\{\mathop{\mathbb E}(Z)\}}$ and ${C}$, there exists a nonzero $a$ and a number ${L}$ such that

$\displaystyle L= a^T\mathop{\mathbb E}(Z) \geq a^Tx, \forall x \in C.$

Since ${Z\in C}$ almost surely, we should have

$\displaystyle a^TZ \leq L$

almost surely. Since ${\mathop{\mathbb E}( a^T Z)= a^T\mathop{\mathbb E}( Z) = L}$, we see that ${a^T Z= L}$ with probability ${1}$. Since the intersection of the hyperplane ${\{x : a^Tx = L\}}$ and ${C}$ is still convex, we see that ${Z}$ only takes values in a convex set in an ${(n-1)}$-dimensional affine space.

Repeating the above argument, we can decrease the dimension until ${n=1}$. After a proper translation and rotation, we can say that ${Z}$ takes its values in an interval in ${\mathbb{R}}$ and we want to argue that the mean of ${Z}$ is always in the interval.

This is almost trivial. Suppose the interval is bounded. If the interval is closed, then since taking expectation preserves order, i.e., ${X\geq Y \implies \mathop{\mathbb E} X\geq \mathop{\mathbb E} Y}$, its mean is in the interval. If the interval is half open and half closed and the mean is not in the interval, then ${\mathop{\mathbb E} Z}$ must be the open end of the interval since expectation preserves order, but this means that ${Z}$ equals the open end with probability one, which contradicts the assumption that ${Z}$ is in the interval with probability one. The case where both ends are open is handled in the same way. If the interval is unbounded in one direction, then the previous argument still works, and if it is just ${\mathbb{R}}$, then for sure ${\mathop{\mathbb E} Z \in\mathbb{R}}$. This completes the proof. $\Box$

# Solution to (general) Truncated Moment Problem

We are going to solve the truncated moment problem in this post. The theorem we are going to establish is more general than the original problem itself. The following theorem is a bit abstract; you can skip to Corollary 2 to see what the truncated moment problem is and why it has a generalization in the form of Theorem 1.

Theorem 1 Suppose ${X}$ is a random transformation from a probability space ${(A,\mathcal{A},\mathop{\mathbb P})}$ to a measurable space ${(B,\mathcal{B})}$ where each singleton set of $B$ is in $\mathcal{B}$. Let each ${f_i}$, ${i=1,\dots,m}$, be a real valued (Borel measurable) function with domain ${B}$. Given

$\displaystyle (\mathbb{E}f_i(X))_{i=1,\dots,m}$

and they are all finite, there exists a random variable ${Y\in B}$ such that ${Y}$ takes no more than ${m+1}$ values in ${B}$, and

$\displaystyle (\mathbb{E}f_i(Y))_{i=1,\dots,m} = (\mathbb{E}f_i(X))_{i=1,\dots,m}.$

(If you are not familiar with the terms Borel measurable, measurable space and sigma-algebras $\mathcal{A}, \mathcal{B}$, then just ignore these. I put these terms here just to make sure that the theorem is rigorous enough.)

Let me parse the theorem for you. Essentially, the theorem is trying to say that given ${m}$ many expectations, no matter what kind of source the randomness comes from, i.e., what ${X}$ is, we can always find a finite valued random variable (which is ${Y}$ in the theorem) that achieves the same expectation.

To have a concrete sense of what is going on, consider the following Corollary of Theorem 1. It is the original truncated moment problem.

Corollary 2 (Truncated Moment Problem) For any real valued random variable ${X\in {\mathbb R}}$ with its first ${m}$ moments all finite, i.e., for all ${1\leq i\leq m}$

$\displaystyle \mathop{\mathbb E}|X|^i < \infty,$

there exists a real valued discrete random variable ${Y}$ which takes no more than ${m+1}$ values in ${{\mathbb R}}$ and its first ${m}$ moments are the same as ${X}$, i.e.,

$\displaystyle (\mathbb{E}Y,\mathbb{E}(Y^2),\dots, \mathbb{E}(Y^m) )=(\mathbb{E}X,\mathbb{E}(X^2),\dots, \mathbb{E}(X^m)).$

The original truncated moment problem asks: given the (uncentered) moments, can we always find a finite discrete random variable that matches all of them? It should be clear that this is a simple consequence of Theorem 1 by letting ${B={\mathbb R}}$ and ${f_i(x) = x^{i}, i=1,\dots,m}$.
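A concrete instance, given only as an illustration: the first four moments of a standard normal ${X}$ are ${(0,1,0,3)}$, and they are matched by the three-point random variable ${Y}$ taking the values ${-\sqrt{3},0,\sqrt{3}}$ with probabilities ${1/6,2/3,1/6}$ (the nodes and weights of the three-point Gauss-Hermite rule). Here ${m=4}$, so Corollary 2 promises at most ${5}$ values, and this particular ${Y}$ needs only ${3}$.

```python
import math

values = [-math.sqrt(3), 0.0, math.sqrt(3)]
probs = [1 / 6, 2 / 3, 1 / 6]

# E X^k for k = 1..4 when X is standard normal (known closed forms).
normal_moments = [0.0, 1.0, 0.0, 3.0]

# E Y^k for k = 1..4 for the three-point variable Y.
y_moments = [sum(p * v ** k for v, p in zip(values, probs))
             for k in range(1, 5)]

matches = all(abs(a - b) < 1e-12 for a, b in zip(y_moments, normal_moments))
print(matches)
```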

There is also a multivariate version of truncated moment problem which can also be regarded as a special case of Theorem 1.

Corollary 3 (Truncated Moment Problem, Multivariate Version) Suppose ${X=(X_1,\dots,X_n)\in \mathbb{R}^n}$ is a real random vector whose moments up to order ${k}$ are all finite, i.e.,

$\displaystyle \mathop{\mathbb E}(\Pi_{i=1}^n|X_{i}|^{\alpha_i}) <\infty$

for any ${{1\leq \sum \alpha_i\leq k}}$. Each ${\alpha_i}$ here is a nonnegative integer. The total number of moments in this case is ${{n+k \choose k}-1}$ (the number of monomials of degree at most ${k}$ in ${n}$ variables, excluding the constant one). Then there is a real random vector ${Y \in \mathbb{R}^n}$ such that it takes no more than ${{n+k \choose k}}$ values, and

$\displaystyle (\mathop{\mathbb E}(\Pi_{i=1}^nX_{i}^{\alpha_i}))_{1\leq \sum \alpha_i\leq k} = (\mathop{\mathbb E}(\Pi_{i=1}^nY_{i}^{\alpha_i})) _{1\leq \sum \alpha_i\leq k}.$
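The number of moments involved can be checked by enumerating the exponent vectors ${\alpha}$ directly. The sketch below is a brute-force count (the function name `count_moments` and the test pairs are my own); it compares the count with ${n+k \choose k}$, which is the number of monomials of degree at most ${k}$ including the constant monomial ${\alpha = 0}$.

```python
from itertools import product
from math import comb

def count_moments(n, k):
    """Number of exponent vectors alpha in N^n with 1 <= sum(alpha) <= k."""
    return sum(1 for alpha in product(range(k + 1), repeat=n)
               if 1 <= sum(alpha) <= k)

pairs = [(1, 3), (2, 2), (3, 2), (2, 4)]
counts = {(n, k): count_moments(n, k) for n, k in pairs}
for (n, k), cnt in counts.items():
    print(n, k, cnt, comb(n + k, k))
```

For example, with ${n=2}$ and ${k=2}$ the moments are those of ${X_1, X_2, X_1^2, X_1X_2, X_2^2}$, five in total, while ${{4 \choose 2} = 6}$ also counts the constant monomial.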

Though the form of Theorem 1 is quite general and looks scary, it is actually a simple consequence of the following lemma and the use of convex hull.

Lemma 4 For any convex set ${C \subset \mathbb{R}^k}$, and any random variable ${Z}$ which has finite mean and takes values only in ${C}$, i.e.,

$\displaystyle \mathop{\mathbb E}(Z) \in \mathbb{R}^k, \mathop{\mathbb P}(Z\in C) =1,$

we have

$\displaystyle \mathop{\mathbb E} (Z) \in C.$

The above lemma is trivially true if ${C}$ is closed or $Z$ takes only finitely many values. But it remains true when ${C}$ is only assumed to be convex. We will show it in this post.

We are now ready to show Theorem 1.

Proof of Theorem 1: Consider the set

$\displaystyle L = \{ (f_i(x))_{i=1,\dots,m}\mid x\in B \},$

The convex hull of this set ${L}$ is

$\displaystyle \text{conv}(L) = \{ \sum_{j=1}^l \alpha _j a_j\mid \alpha_j \geq 0 ,\sum_{j=1}^l \alpha_j =1, a_j\in L, l \in {\mathbb N}\}.$

Now take the random variable ${Z=(f_i(X))_{i=1,\dots,m}}$, which takes values only in ${L\subset \text{conv}(L)}$ and has finite mean by assumption. By Lemma 4, we know that

$\displaystyle \mathop{\mathbb E} Z \in \text{conv}(L).$

Note that every element in ${\text{conv}(L)}$ has a FINITE representation in terms of ${a_j}$s!

This means we can find ${l\in {\mathbb N}}$, ${\alpha_j\geq 0, \sum_{j = 1}^l \alpha_j =1}$ and ${a_j \in L, j=1,\dots,l}$ such that

$\displaystyle \sum_{j=1}^l \alpha_ja_j = \mathop{\mathbb E} Z = (\mathop{\mathbb E} f_i(X))_{i=1,\dots,m}.$

Since each ${a_j = (f_i(x_j))_{i=1,\dots,m}}$ for some ${x_j \in B}$, we can simply take the distribution of ${Y}$ to be

$\displaystyle \mathop{\mathbb P}(Y = x_j) = \alpha_j, \forall j =1,\dots,l.$

Finally, since ${\text{conv}(L)\subset \mathbb{R}^m}$, the theorem of Caratheodory says that the representation above can be chosen with ${l\leq m+1}$. $\Box$
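The Caratheodory step can be made constructive when the starting distribution already has finite support. The sketch below is one standard way to do the reduction and is not taken from the proof above: as long as there are more than ${m+1}$ atoms, the vectors ${(a_j,1)\in\mathbb{R}^{m+1}}$ are linearly dependent, so we can move weight along a null vector until some atom's weight hits zero, without changing any of the ${m}$ expectations. The helper names `null_vector` and `reduce_atoms`, and the example distribution, are my own.

```python
from fractions import Fraction as Fr

def null_vector(rows, ncols):
    """Nonzero v with rows @ v = 0 (Gauss-Jordan); needs ncols > len(rows)."""
    rows = [list(r) for r in rows]
    pivots = []  # pivots[i] is the pivot column of row i
    for col in range(ncols):
        r = len(pivots)
        if r == len(rows):
            break
        p = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if p is None:
            continue
        rows[r], rows[p] = rows[p], rows[r]
        piv = rows[r][col]
        rows[r] = [v / piv for v in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                f = rows[i][col]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        pivots.append(col)
    free = next(c for c in range(ncols) if c not in pivots)
    v = [Fr(0)] * ncols
    v[free] = Fr(1)
    for i, pc in enumerate(pivots):
        v[pc] = -rows[i][free]
    return v

def reduce_atoms(points, weights, fs):
    """Drop atoms until at most len(fs) + 1 remain, keeping every E f_i fixed."""
    points, weights = list(points), [Fr(w) for w in weights]
    while len(points) > len(fs) + 1:
        # Rows: values of each f_i at the atoms, plus a row of ones.
        M = [[Fr(f(x)) for x in points] for f in fs]
        M.append([Fr(1)] * len(points))
        lam = null_vector(M, len(points))  # sum(lam) = 0, so some lam_j > 0
        t = min(w / l for w, l in zip(weights, lam) if l > 0)
        weights = [w - t * l for w, l in zip(weights, lam)]
        keep = [j for j, w in enumerate(weights) if w > 0]
        points = [points[j] for j in keep]
        weights = [weights[j] for j in keep]
    return points, weights

# Hypothetical example: X uniform on {0,...,5}; match E X and E X^2 (m = 2).
fs = [lambda x: x, lambda x: x * x]
points, weights = reduce_atoms(range(6), [Fr(1, 6)] * 6, fs)
print(points, weights)
```

Exact rational arithmetic is used so that weights reach zero exactly; with floating point one would need a tolerance when deciding which atoms to drop.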