Testing homogeneity of data for location-scale family

Suppose you have a sequence of independent data {X_1,\dots, X_n\in \mathbb{R}^d}. How do you test that the {X_i}s all come from the same distribution, i.e., how do you test the homogeneity of the data?

To make the problem more precise, suppose we have a distribution family indexed by {\theta \in \mathbb{R}^s}, i.e., a set

\displaystyle \Omega = \{ F_\theta\mid F_\theta \;\text{is a cumulative distribution function,} \;\theta \in \mathbb{R}^s\},

and each {X_i} follows the distribution {F_{\theta_i}} for some {\theta_i}. Our problem is

Is {F_{\theta_1} =F_{\theta_2}=\dots =F_{\theta_n}}?

If we have that {\theta \not =\bar{\theta} \implies F_{\theta}\not= F_{\bar{\theta}}} (known as the identifiability of {\theta}), then our question becomes

Is {\theta_1 =\theta_2=\dots =\theta_n}?

Now suppose further that each {F_\theta\in \Omega} has a density {f_\theta} (so that we can write down the likelihood). The likelihood of seeing the independent sequence {X_1,\dots, X_n} is

\displaystyle \mathcal{L}_{(X_1,\dots,X_n) }{(\theta_1,\dots,\theta_n)}= \prod_{i=1}^nf_{\theta_i}(X_i).

To test our question in a statistical way, we use hypothesis testing. Our null hypothesis is

\displaystyle \begin{aligned} H_0:\quad \theta_1=\theta_2=\cdots = \theta_n, \end{aligned} \ \ \ \ \ (1)

and our alternative hypothesis is

\displaystyle \begin{aligned} H_1:\quad \theta_i\not= \theta_j \quad \text{for some}\; i\not=j. \end{aligned} \ \ \ \ \ (2)

Further denote the parameter space of the null by {\Theta_0=\{ (\theta_1,\dots,\theta_n)\mid \theta_1=\dots =\theta_n\}} and the parameter space of the alternative by {\Theta_1=\mathbb{R}^{sn} \setminus \Theta_0}. A popular and natural approach is the likelihood ratio test. The test statistic, called the likelihood ratio, is

\displaystyle \begin{aligned} R = \frac{\sup_{(\theta_1,\dots, \theta_n)\in \Theta_0}\mathcal{L}_{(X_1,\dots, X_n)}(\theta_1,\dots, \theta_n)}{\sup_{(\theta_1,\dots, \theta_n)\in (\Theta_0\cup \Theta_1)}\mathcal{L}_{(X_1,\dots, X_n)}(\theta_1,\dots, \theta_n)}. \end{aligned} \ \ \ \ \ (3)

Intuitively, if the null hypothesis is indeed true, i.e., there is some {\theta^*} such that {\theta_1=\dots =\theta_n=\theta^*} and each {X_i} follows { f_{\theta^*}}, then this ratio should be large, which gives us confidence in the null hypothesis. Hence we should reject the null hypothesis when {R} is small. Thus, if we want a significance level {\alpha} test of the null hypothesis, we should reject the null hypothesis when {R\leq c}, where {c} satisfies

\displaystyle \begin{aligned} \mathbb{P}(R<c\mid H_0)\leq \alpha. \end{aligned} \ \ \ \ \ (4)

However, the main issue is that we don't know the distribution of {R} under {H_0}, even if we know how to sample from each {f_\theta} and know the functional form of {f_\theta} for each {\theta}. The reason is that {H_0} does not specify which {\theta^*} (the common value {\theta^*=\theta_1=\dots =\theta_n}) generates the data. So the distribution of {R} may depend on {\theta^*} as well, and what we really need for {c} is

\displaystyle \begin{aligned} \sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*)\leq \alpha. \end{aligned} \ \ \ \ \ (5)

Thus even if we only want to approximate {\sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*)} through computational methods, we would have to simulate for each {\theta^* \in \mathbb{R}^s}. As {\Theta_0} can be rather large (in fact as large as {\mathbb{R}^s}), this approximation can be very time consuming.

Fortunately, if {\Omega} is a so-called location-scale family, the distribution of {R} is independent of {\theta^*} and we are free to choose whichever {\theta^*} we like. Let us define the location-scale family, then state the theorem and prove it.

Definition 1 Suppose we have a family of probability densities {\Omega} on {\mathbb{R}^d} indexed by {\theta =(\mu,\Sigma)} where {\mu \in \mathbb{R}^d} and {\Sigma\in \mathbb{S}_{++}^{d}}, the set of symmetric positive definite matrices in {\mathbb{R}^{d \times d}}. The family {\Omega} is a location-scale family if there is a family member {f} (called the pivot) such that for any other {f_\theta} with {\theta =(\mu,\Sigma)},

\displaystyle f_\theta (x) = f(\Sigma^{-1}(x-\mu))\frac{1}{\det(\Sigma)}.

Thus if {Z\in \mathbb{R}^d} follows {f}, then {X=\Sigma Z+\mu} has probability density {f(\Sigma^{-1}(x-\mu))\frac{1}{\det(\Sigma)}}. Indeed, for any Borel set {B\subset \mathbb{R}^d }

\displaystyle \begin{aligned} \mathop{\mathbb P}(X\in B ) &=\mathop{\mathbb P}(\Sigma Z+\mu\in B)\\ & = \mathop{\mathbb P}(Z\in \Sigma ^{-1}(B-\mu)) \\ & = \int_{ \Sigma ^{-1}(B-\mu )} f(z) dz \\ & =\int_B f(\Sigma^{-1}(x-\mu))\frac{1}{\det(\Sigma)}dx \end{aligned} \ \ \ \ \ (6)

where we use the change of variable {x=\Sigma z+\mu} in the last equality; this shows that {X} has density {f(\Sigma^{-1}(x-\mu))\frac{1}{\det(\Sigma)}}. We are now ready to state the theorem and prove it.

Theorem 2 Suppose our family of distributions {\Omega} is a location-scale family. Then under the null hypothesis there is a {\theta^*= (\mu^*,\Sigma^*)} such that each {X_i} follows {f_{\theta^*}(x) = \frac{1}{\det(\Sigma^*)}f((\Sigma^*)^{-1}(x-\mu^*))}, and the distribution of {R} does not depend on {\theta^*}.

Since the distribution of {R} is independent of {\theta^*} under the null, this means that for any {\theta} in the parameter space and any {c},

\displaystyle \begin{aligned} \sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*) = \mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta). \end{aligned} \ \ \ \ \ (7)

Thus, as long as {\Omega} is a location-scale family, we can choose any family member of {\Omega} to sample the {X_i} and approximate the distribution of {R} by the resulting empirical distribution!
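
Before turning to the proof, here is a minimal Monte Carlo sketch in Python of this recipe (my own illustration, not part of the original argument). To keep both suprema finite in closed form it uses the simpler location-only Gaussian family {X_i\sim N(\mu_i,1)} rather than a full location-scale family (for a Gaussian scale parameter with a single observation per {\theta_i}, the supremum in the denominator is infinite); the same change-of-variables argument shows that the null distribution of {R} does not depend on the common mean, which the simulation illustrates by comparing two very different choices of {\theta^*}. All function names and simulation sizes are my own choices.

import numpy as np

rng = np.random.default_rng(0)

def simulate_R(mu_star, n=10, reps=20000):
    # Under H0 all X_i ~ N(mu_star, 1).  For this location family the
    # likelihood ratio has the closed form R = exp(-0.5 * sum((X_i - Xbar)^2)):
    # the numerator is maximized at the common sample mean Xbar, while in the
    # denominator each mu_i is estimated by X_i itself.
    X = rng.normal(mu_star, 1.0, size=(reps, n))
    return np.exp(-0.5 * ((X - X.mean(axis=1, keepdims=True)) ** 2).sum(axis=1))

R0 = simulate_R(mu_star=0.0)   # one choice of theta*
R5 = simulate_R(mu_star=5.0)   # a very different theta*

# The empirical quantiles agree up to Monte Carlo error, illustrating that the
# null distribution of R does not depend on theta*.
print(np.quantile(R0, [0.01, 0.05, 0.10]))
print(np.quantile(R5, [0.01, 0.05, 0.10]))

# A level-alpha rejection threshold c can therefore be read off from a
# simulation under any convenient theta*.
alpha = 0.05
print("reject H0 when R <=", np.quantile(R0, alpha))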

Proof:  We need to show that the ratio {R} has a distribution independent of {\theta^*}. Since {X_i \sim \frac{1}{\det(\Sigma^*)}f((\Sigma^*)^{-1}(x-\mu^*))} and {\Omega} is a location-scale family, we can assume the data are generated via {Z_i}, where {Z_i} follows the pivot {f} and {X_i = \Sigma^*Z_i+\mu^*}. Then the likelihood of {\theta_1,\dots, \theta_n} is

\displaystyle \begin{aligned} \mathcal{L}_{(X_1,\dots,X_n) }(\theta_1,\dots,\theta_n) &=\prod_{i=1}^n\frac{1}{\det(\Sigma_i)} f(\Sigma_i^{-1}(X_i-\mu_i))\\ & = \prod_{i=1}^n\frac{1}{\det(\Sigma_i)} f(\Sigma_i^{-1}(\Sigma^*Z_i+\mu^*-\mu_i))\\ &= \prod_{i=1}^n\frac{1}{\det(\Sigma_i)} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*)))\\ & = \bigl(\det((\Sigma^*)^{-1})\bigr)^n\prod_{i=1}^n\frac{1}{\det(\Sigma_i(\Sigma^*)^{-1})} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*))).\\ \end{aligned} \ \ \ \ \ (8)

Thus the likelihood ratio {R} reduces to

\displaystyle \begin{aligned} R &= \frac{\sup_{(\mu,\Sigma)\in \mathbb{R}^d\times \mathbb{S}^d_{++}} \prod_{i=1}^n\frac{1}{\det(\Sigma(\Sigma^*)^{-1})} f(\Sigma^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu-\mu^*)))}{\sup_{(\mu_i,\Sigma_i)\in \mathbb{R}^d\times \mathbb{S}^d_{++},i=1,\dots,n} \prod_{i=1}^n\frac{1}{\det(\Sigma_i(\Sigma^*)^{-1})} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*)))}\\ &=\frac{\sup_{(\mu,\Sigma)\in \mathbb{R}^d\times \mathbb{S}^d_{++}} \prod_{i=1}^n\frac{1}{\det(\Sigma(\Sigma^*)^{-1})} f(\Sigma^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu-\mu^*))) }{\prod_{i=1}^n\sup_{(\mu_i,\Sigma_i)\in \mathbb{R}^d\times \mathbb{S}^d_{++}} \frac{1}{\det(\Sigma_i(\Sigma^*)^{-1})} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*)))},\\ \end{aligned} \ \ \ \ \ (9)

where the common factor {\bigl(\det((\Sigma^*)^{-1})\bigr)^n} from (8) cancels between the numerator and the denominator. Now let's define {\hat{\Sigma} = (\Sigma^*)^{-1}\Sigma}, {\hat{\mu} = (\Sigma^*)^{-1}(\mu-\mu^*) }, {\hat{\mu}_i = (\Sigma^*)^{-1}(\mu_i -\mu^*) } and {\hat{\Sigma}_i= (\Sigma^*)^{-1}\Sigma_i }. Note that since {\mu,\Sigma} and {\mu_i,\Sigma_i} can vary over the whole space {\mathbb{R}^d\times \mathbb{S}^d_{++}}, so can {\hat{\mu},\hat{\Sigma}}, {\hat{\mu}_i} and {\hat{\Sigma}_i}. Equation (9) can then be rewritten as

\displaystyle \begin{aligned} R &= \frac{\sup_{(\hat{\mu},\hat{\Sigma})\in \mathbb{R}^d\times \mathbb{S}^d_{++}} \prod_{i=1}^n\frac{1}{\det(\hat{\Sigma})} f((\hat{\Sigma})^{-1}(Z_i-\hat{\mu})) }{\prod_{i=1}^n\sup_{(\hat{\mu}_i,\hat{\Sigma}_i)\in \mathbb{R}^d\times \mathbb{S}^d_{++}} \frac{1}{\det(\hat{\Sigma}_i)} f((\hat{\Sigma}_i)^{-1}(Z_i -\hat{\mu}_i))}.\\ \end{aligned} \ \ \ \ \ (10)

As we just argued, {\hat{\mu},\hat{\Sigma}}, {\hat{\mu}_i} and {\hat{\Sigma}_i} can vary over the whole space without any restriction, so neither the numerator nor the denominator depends on the choice of {\mu^*} and {\Sigma^*} at all. Hence {R} is a function of {Z_1,\dots,Z_n} alone, and the joint distribution of the {Z_i}s does not involve {\theta^*}, so the theorem is proved. \Box

Chain rule of convex function

This post studies a specific chain rule for compositions of convex functions. Specifically, we have the following theorem.

Theorem 1 For a continuously differentiable, increasing and convex function {\phi: \mathbb{R} \rightarrow \mathbb{R}}, a convex function {h: U \rightarrow \mathbb{R}} where {U\subset \mathbb{R}^n} is a convex set, and an {x\in U}, if {\phi'(h(x))>0} or {x\in \mbox{int}(U)}, then

\displaystyle \begin{aligned} \partial (\phi \circ h) (x) = \phi' (h(x)) \cdot[ \partial h (x)], \end{aligned} \ \ \ \ \ (1)

where {\partial } is the operator of taking subdifferentials of a function, i.e., {\partial h (x) = \{ g\mid h(y)\geq h(x) +\langle{g},{y-x}\rangle,\forall y\in U\}} for any {x\in U}, and {\mbox{int}(U)} is the interior of {U} with respect to the standard topology in {\mathbb{R}^n}.

A counterexample. We note that if the condition fails, the equality may not hold. For example, let {\phi(x) =1} for all {x\in \mathbb{R}} and {h(x) = 1} for all {x\in [0,1]}. Then {0} is not in the interior of {[0,1]}, {\phi'(h(0)) = 0}, and {\partial h(0) =(-\infty,0]}. However, in this case {\partial (\phi\circ h)(0)= (-\infty,0]} while {\phi' (h(0)) \cdot[ \partial h (0)] =\{0\}}. Thus, the equality fails.

It should be noted that if {U} is open and {h} is also differentiable, then the above reduces to the common chain rule of smooth functions.
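
For a quick sanity check of the formula (this worked example is my own addition, not from the original post), take {U=\mathbb{R}}, {h(x) = |x|} and {\phi(t) = e^t}, so that {\phi} is smooth, increasing and convex while {h} is not differentiable at {0}. Then

\displaystyle \partial h(0) = [-1,1], \qquad \partial (\phi\circ h)(0) = \partial \bigl(e^{|\cdot|}\bigr)(0) = [-1,1] = e^0\cdot[-1,1] = \phi'(h(0))\cdot[\partial h(0)],

which is exactly what Theorem 1 predicts (here both sufficient conditions hold: {\phi'(h(0)) = 1>0} and {0\in \mbox{int}(U)}).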

Proof: We first prove that {\partial (\phi \circ h) (x) \supset \phi' (h(x)) \cdot[ \partial h (x)]}. For any {g \in \partial h(x)} and any {y\in U}, we have

\displaystyle \begin{aligned} \phi (h(y)) &\overset{(a)}{\geq} \phi(h(x)) + \phi' (h(x))(h(y)-h(x))\\ & \overset{(b)}{\geq} \phi (h(x)) + \phi '(h(x)) \langle g,y-x\rangle \\ & = \phi (h(x)) + \langle{\phi' (h(x))g},{y-x}\rangle \end{aligned} \ \ \ \ \ (2)

where {(a)} is the gradient inequality of the convex differentiable function {\phi} at {h(x)}, and {(b)} uses the definition of the subdifferential of {h} at {x}. We also use the fact that {\phi'(h(x))\geq 0} in inequality {(b)}, as {\phi} is increasing. Hence {\phi'(h(x))g \in \partial(\phi\circ h)(x)}.

Now we prove the other direction. Without loss of generality (after a translation), suppose the point of interest is {0\in U} and that {\partial (\phi \circ h)(0)} is nonempty. Let {g\in \partial (\phi \circ h)(0)}; we wish to show that {g} is in the set { \phi' (h(0)) \cdot[ \partial h (0)]}. First, according to the definition of the subdifferential, we have

\displaystyle \begin{aligned} (\phi \circ h) (x)\geq (\phi\circ h)(0) + \langle{ g},x\rangle, \forall x \in U \end{aligned} \ \ \ \ \ (3)

This gives

\displaystyle \begin{aligned} (\phi \circ h) (\gamma x)\geq (\phi\circ h)(0) + \langle{ g} ,{\gamma x}\rangle, \forall x \in U, \gamma \in [0,1]. \end{aligned} \ \ \ \ \ (4)

Rearranging the above inequality gives

\displaystyle \begin{aligned} &\frac{(\phi \circ h) (\gamma x)- (\phi\circ h)(0)}{\gamma}\geq \langle{ g},{ x}\rangle \\ \implies & \phi'(s)\cdot \frac{h(\gamma x)-h(0)}{\gamma}\geq \langle{g},{x}\rangle \end{aligned} \ \ \ \ \ (5)

for some {s} between {h(\gamma x)} and {h(0)}, by the mean value theorem. Now let {f(\gamma )=h(\gamma x)}. By Lemma 1 in this post, {l(\gamma) =\frac{h(\gamma x)-h(0)}{\gamma}} is nondecreasing in {\gamma}; moreover, if {f} is right continuous at {0}, then {s\rightarrow h(0)} as {\gamma \downarrow 0} and, since {\phi} is continuously differentiable, {\phi'(s)\rightarrow \phi'(h(0))}. In that case we have

\displaystyle \begin{aligned} \phi'(h(0)) (h(x)-h(0))\geq \phi'(h(0))\cdot \lim_{\gamma \downarrow 0} \frac{h(\gamma x)-h(0)}{\gamma}\geq \langle{g},{x}\rangle,\forall x\in U. \end{aligned} \ \ \ \ \ (6)

If {\phi' (h(0))>0}, then dividing both sides of the above inequality by {\phi'(h(0))} gives

\displaystyle \begin{aligned} h(x)-h(0)\geq \langle{\frac{g}{\phi'(h(0))}},{x}\rangle,\forall x\in U. \end{aligned} \ \ \ \ \ (7) 

This shows that {\frac{g}{\phi'(h(0))}} is indeed a member of {\partial h(0)}, and thus {\partial (\phi \circ h) (0) \subset \phi' (h(0)) \cdot[ \partial h (0)]}. In this case, it only remains to verify why {f(\gamma ) = h(\gamma x) } must be right continuous at {0}.

If {0\in \mbox{int}(U)}, then {h} is continuous at {0}, and so {f} is right continuous at {0}, by a standard result in convex analysis. If in addition {\phi' (h(0))>0}, then we are done by inequality (7). If {\phi' (h(0))=0}, then using inequality (6), we have

\displaystyle \begin{aligned} 0 \geq \langle{g},{x}\rangle, \forall x\in U \end{aligned} \ \ \ \ \ (8)

Since {0 \in \mbox{int}(U)}, we may take {x} in inequality (8) to be small positive and negative multiples of the {n} standard basis vectors of {\mathbb{R}^n}. This shows {g =0}, and {g} indeed belongs to the set {\phi' (h(0)) \cdot[ \partial h (0)] = \{0\}}, as {\partial h(0)\not=\emptyset} for {0\in \mbox{int}(U)} by a standard convex analysis result.

Thus, in the remaining case where {\phi'(h(0))>0} but {0} is possibly not interior, our task is to argue that {f(\gamma ) = h(\gamma x)} is indeed right continuous at {0}. Using Lemma 4 in this post, we know the limit {\lim_{\gamma \downarrow 0 }f(\gamma) = f(0^+)} exists and {f(0^+)\leq f(0)}. Now if {f(0^+) = f(0) = h(0)}, then {f} is indeed right continuous at {0} and our conclusion holds. So we may assume {f(0^+) <f(0) =h(0)}. But in this case {l(\gamma) = \frac{h(\gamma x)-h(0)}{\gamma} = \frac{f(\gamma)-f(0)}{\gamma}} tends to negative infinity as {\gamma \downarrow 0}. Recalling inequality (5), we have

\displaystyle \phi'(s)l(\gamma )\geq \langle{g},{x}\rangle

where {s} is between {h(0)} and {f(\gamma )=h(\gamma x)}. We claim that as {\gamma\downarrow 0}, {\phi'(s)} approaches a positive number. If this claim is true, then from the above inequality, we will have

\displaystyle -\infty \geq \langle{g},{x}\rangle

which cannot hold. Thus we must have {f} right continuous at {0}.

Finally, we prove our claim that {\phi'(s)} approaches a positive number when {f(0^+) <f(0)}. Using the mean value theorem, we have, for some {s_0 \in [ f(0^+), f(0)]},

\displaystyle \begin{aligned} \phi'(s_0)(f(0^+)-f(0)) & = \phi(f(0^+)) -\phi(f(0))\\ & = \lim_{ \gamma \downarrow 0 } \phi(f(\gamma))-\phi(f(0))\\ & = \lim_{\gamma \downarrow 0}\phi'(s)(f(\gamma)-f(0))\\ & = (f(0^+)-f(0))\lim_{\gamma\downarrow 0}\phi'(s). \end{aligned} \ \ \ \ \ (9)

Now, cancelling the term {f(0^+) -f(0)<0} above, we see that {\phi'(s_0) = \lim_{\gamma\downarrow 0}\phi'(s)}. We claim {\phi'(s_0)>0}. Indeed, if {\phi'(s_0)=0}, then {\phi(f(0^+)) -\phi(f(0)) = \phi'(s_0)(f(0^+)-f(0)) =0}, and because {\phi} is increasing, {\phi} is constant on {[f(0^+),f(0)]}. This contradicts our assumption that {\phi'(f(0))=\phi'(h(0))>0}, and our proof is complete. \Box

Special Properties of 1-D convex functions

We discuss a few special properties of one-dimensional convex functions. The first concerns the slope of a one-dimensional convex function.

Lemma 1 Let {f:[0,1]\rightarrow \mathbb{R}} be a (strictly) convex function. Then the function {g(x) = \frac{f(x)-f(0)}{x}, x\in(0,1]} is non-decreasing (strictly increasing).

Proof: By the definition of convex function, we have for any {0<s<t\leq 1}

\displaystyle f(s) = f(\frac{s}{t} \times t + 0 \times \frac{t-s}{t}) \leq \frac{s}{t}f(t) +\frac{t-s}{t}f(0).

Rearranging the above inequality gives

\displaystyle \frac{f(s) -f(0)}{s}\leq \frac{f(t)-f(0)}{t}.

The case for strict convexity is simply replacing the above {\leq} by {<}. \Box

The second asserts that the slopes of chords increase as we move to the right.

Lemma 2 Let {f:I\rightarrow \mathbb{R}} be convex, where {I} is any connected set in {\mathbb{R}}, i.e., an interval which can be open, closed, or half open and half closed. Then for any {x<y<z\in I}, we have

\displaystyle \frac{f(x)-f(y)}{x-y}\leq \frac{f(y)-f(z)}{y-z}

Proof: This is just a bit of algebraic manipulation.

\displaystyle \begin{aligned} & \frac{f(x)-f(y)}{x-y}\leq \frac{f(y)-f(z)}{y-z}\\ \iff& \frac{y-z}{x-y}f(x) +f(z)\geq (1+ \frac{y-z}{x-y})f(y)\\ \iff & \frac{y-z}{x-y}f(x) +f(z)\geq \frac{x-z}{x-y}f(y)\\ \iff & \frac{y-z}{x-z}f(x) +\frac{x-y}{x-z}f(z)\geq f(y)\\ \iff & \frac{z-y}{z-x}f(x) +\frac{y-x}{z-x}f(z)\geq f(\frac{z-y}{z-x}x +\frac{y-x}{z-x}z)\\ \end{aligned} \ \ \ \ \ (1)

where the last one holds due to the definition of convexity. \Box

The next lemma shows that a convex function on an interval of the real line must be either always increasing, always decreasing, or first decreasing and then increasing.

Lemma 3 Let {f:I \rightarrow \mathbb{R}} be convex, where {I} is an interval in {\mathbb{R}}. The interval {I} can be of any one of the forms {[a,b],(a,b),(a,b],[a,b)} where {-\infty \leq a<b\leq \infty}. Then {f} is either always increasing, always decreasing, or first decreasing and then increasing.

Proof: We may suppose {f} is not constant, since in that case the assertion is true. Now suppose that {f} is neither always increasing nor always decreasing. Then there exist {x<y <z}, all in {I}, such that

\displaystyle f(x) > f(y) , f(y)< f(z)

or

\displaystyle f(x) <f(y), f(y) > f(z).

The latter case is not possible because it implies that

\displaystyle \frac{f(x)-f(y)}{x-y}> 0 > \frac{f(y) - f(z)}{y-z}

which contradicts the convexity of {f}. Thus only the first case is possible.

Now if {x,z} are interior points of {I}, then {f} is continuous on the interior of {I} and so is continuous on {[x,z]}. This means that {f} attains a minimum on {[x,z]} at some {x_0\in (x,z)}, as {f(y)<\min \{f(x),f(z)\}}. Now for any {x_1<x_2<x<x_3<x_4<x_0}, by convexity and the monotonicity of slopes, we have

\displaystyle \begin{aligned} \frac{f(x_1)-f(x_2)}{x_1-x_2}\leq \frac{f(x_2)-f(x)}{x_2-x}\leq \frac{f(x)-f(x_3)}{x-x_3}\leq \frac{f(x_3)-f(x_4)}{x_3-x_4}\leq \frac{f(x_4)-f(x_0)}{x_4-x_0}\leq 0. \end{aligned} \ \ \ \ \ (2)

which implies {f(x_1)\geq f(x_2)\geq f(x)\geq f(x_3)\geq f(x_4)}. Thus {f} is decreasing on {I\cap (-\infty,x_0]}. Similarly, using the monotonicity of slopes, {f} is increasing on {I\cap(x_0,\infty)}. Now suppose either {x} or {z} is an end point of {I}. We take the midpoints {a_1 = \frac{x+y}{2},b_1 = \frac{y+z}{2}}. Then there are only three possible cases:

  1. {f(x)\geq f(a_1)\geq f(y)\geq f(b_1)\leq f(z)}
  2. {f(x)\geq f(a_1)\geq f(y)\leq f(b_1)\leq f(z)}
  3. {f(x)\geq f(a_1)\leq f(y)\leq f(b_1)\leq f(z)}

The second case is ideal, as {a_1} and {b_1} are interior points and we can proceed with our previous argument to show {f} is first decreasing and then increasing. In the first case, we can take {b_2 = \frac{b_1+z}{2}} and ask for the relations among {f(y),f(b_1),f(b_2)} and {f(z)}. The only non-ideal case is {f(b_1)\geq f(b_2)\leq f(z)} (the other case gives three interior points with {f(y)\geq f(b_1)\leq f(b_2)}, and we can employ our previous argument). We can then further take {b_3 = \frac{b_2+z}{2}}. Again there will be only one non-ideal case, namely {f(b_1)\geq f(b_2)\geq f(b_3)}.

Continuing this process, we either stop at some step and obtain a pattern {f(b_{i-1})\geq f(b_i)\leq f(b_{i+1})} at three interior points (so that the previous argument applies), or the sequence {b_i} satisfies {f(b_i)\geq f(b_{i+1})} for all {i} and {\lim_{i\rightarrow \infty} b_i=z}. If {z} is an interior point, then continuity of {f} implies that {\lim_{i\rightarrow \infty}f(b_i) = f(z)}. But this is not possible because {f(z)> f(y)\geq f(b_i)}. Thus {z} must be an end point of {I}. Moreover, {f} is actually decreasing on {I\setminus\{z\}}. Indeed, for any {x_1<x_2} with {x_1,x_2\in I\setminus\{z\}}, there is a {b_i} such that {x_2<b_i}, and so

\displaystyle \frac{f(x_1)-f(x_2)}{x_1-x_2}\leq \frac{f(x_2)-f(b_i)}{x_2-b_i}\leq \frac{f(b_i)-f(b_{i+1})}{b_i-b_{i+1}}\leq 0.

Since {f(z)>f(y)}, we see that {f} is indeed decreasing on {I\setminus\{z\}} and jumps up at the end point {z}, which is again the pattern of first decreasing and then increasing.

The third case is similar: we will either have {f} increasing except for a jump down at the end point {x} (again first decreasing and then increasing), or {f} simply first decreases on some interval and then increases. \Box

Utilizing the above lemma, we can say 1) a one-dimensional convex function is almost everywhere differentiable, and 2) something about the boundary behavior of {f} on a finite interval.

Lemma 4 Suppose {f:[a,b] \rightarrow \mathbb{R}} is convex for some {a\in \mathbb{R},b \in \mathbb{R}}. Then the one-sided limits {f(a^+)=\lim _{x\downarrow a}f(x)} and {f(b^-)=\lim_{x\uparrow b}f(x)} exist, and {f(a) \geq f(a^+)}, {f(b)\geq f(b^-)}.

The above lemma shows that a finite convex function on a closed bounded interval is uniformly continuous on the open interval, as it can be continuously extended to the boundary. However, this property does not hold in higher dimensions: consider {f(x,y) = \frac{x^2}{y}} on the convex set {\{(x,y)\mid x^2 \leq y,\, 0<y <1\}} and the point {(0,0)}.
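
To spell out why this two-dimensional example fails (a short computation added for completeness): approaching {(0,0)} along the boundary curve {y=x^2} gives

\displaystyle f(x,x^2) = \frac{x^2}{x^2} = 1, \qquad \text{while} \qquad f(0,y) = 0 \quad \text{for all } 0<y<1,

so {f}, although convex and bounded by {1} on its domain, has no continuous extension to the boundary point {(0,0)} and is not uniformly continuous near it.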

LP Conditioning

We consider a global conditioning property of linear programming (LP) here. Let's first define the problem. The standard form of an LP is

\displaystyle \begin{aligned} & \underset{x \in \mathbb{R}^n}{\text{minimize}} & & c^Tx\\ & {\text{subject to}} & & Ax =b ,\\ & & & x\geq 0 \end{aligned} \ \ \ \ \ (1)

where {A \in \mathbb{R}^{m\times n}, c\in \mathbb{R}^n} and {b \in\mathbb{R}^m}. The dual of the standard form is

\displaystyle \begin{aligned} & \underset{y \in \mathbb{R}^m}{\text{maximize}} & & b^Ty\\ & {\text{subject to}} & & A^Ty\leq c. \\ \end{aligned} \ \ \ \ \ (2)

Now suppose we add a perturbation {u\in \mathbb{R}^{n}} to the {c} term of Program (2); we obtain

\displaystyle \begin{aligned} & \underset{y \in \mathbb{R}^m}{\text{maximize}} & & b^Ty\\ & {\text{subject to}} & & A^Ty\leq c+u.\\ \end{aligned} \ \ \ \ \ (3)

Let {v(u)} be the optimal value of Program (3) and {y(u)} be an optimal solution to the same program. We would like to ask the following question:

\displaystyle \begin{aligned} \textbf{Q}&: \text{Is}\; v(u)\; \text{Lipschitz continuous?} \;\text{Is} \;y(u)\;\text{Lipschitz}? \end{aligned} \ \ \ \ \ (4)

Such questions concern the conditioning of Program (2): under a perturbation, how do the solution and the optimal value change? In general, we hope that Programs (1) and (2) are stable in the sense that a small perturbation in the data, i.e., {A,b,c}, only induces small changes in the optimal value and solution. The reason is that in real applications there might be some error in the encoding of {b,c} or even {A}. If Programs (1) and (2) are not stable with respect to such perturbations, then either more effort needs to be spent on encoding the data or the LP is not worth solving. At first sight, it is actually not even clear

  1. whether {v(u)} will always be finite;
  2. whether {y(u)} is actually unique. If {y(u)} is not unique, then the Lipschitz property of {y(u)} does not even make sense.

To address the first issue, we make our first assumption that

A1: Program (2) has a solution and the optimal value is finite.

Under A1, by strong duality of LP, we have that Program (1) also has a solution and the optimal value is the same as the optimal value of Program (2). This, in particular, implies that (1) is feasible. We now characterize the region where {v(u)} is finite. Let the feasible region of Program (1) be denoted as

\displaystyle \mathcal{F}_p :\,= \{x\mid Ax = b,x\geq 0\},

the set of vertices of {\mathcal{F}_p} be {\mathcal{V}_p=\{x_i\}_{i=1}^d} and the set of extreme rays be {\mathcal{R}_p=\{\gamma_j\}_{j=1}^l}. Using our assumption A1, we know

\displaystyle \mathcal{F}_p \not = \emptyset.

Now according to the Resolution Theorem of the primal polyhedron, we have

\displaystyle \mathcal{F}_p = \{ x\mid x = \sum_{i=1}^d \alpha _ix_i + \sum_{j=1}^l\beta_j \gamma_j, x_i \in \mathcal{V}_p, \gamma _j \in \mathcal{R}_p,\sum_{i=1}^d \alpha_i =1, \alpha_i\geq 0, \beta_j \geq 0, \forall i,j\}.

Using this representation, the fact that the primal counterpart of Program (3) is

\displaystyle \begin{aligned} & \underset{x \in \mathbb{R}^n}{\text{minimize}} & & (c+u)^Tx\\ & {\text{subject to}} & & Ax =b \\ & & & x\geq 0, \end{aligned} \ \ \ \ \ (5)

and the fact that strong duality continues to hold for (3) and (5), since (5) is feasible under A1, we find that the value {v(u)} is

\displaystyle v(u) = \min \{ c^Tx_i +u^Tx_i\mid x_ i \in \mathcal{V}_p\} + \min\Big\{\sum_{j=1}^l\beta_j( c^T\gamma_j +u^T\gamma_j) \ \Big|\ \beta_j\geq 0,\ \forall j\Big\}.

Thus it is immediate that {v(u)} is finite on the region

\displaystyle \begin{aligned} \mathcal{F}_u = \{ u\mid \gamma_j ^Tu\geq 0, \forall \gamma_j \in \mathcal{R}_p\}. \end{aligned} \ \ \ \ \ (6)

In particular, if {\mathcal{F}_p} is bounded, then {v(u)} is always finite.

To address the issue of whether {y(u)} is unique, we make the following nondegeneracy assumption on vertices.

A2: All the basic feasible solutions of (1) are non-degenerate, i.e., every vertex of the feasible region of Program (1) has exactly {m} positive entries.

Under this condition and the definition of a basic feasible solution, we know that if {x} is a basic feasible solution, then {A_{\cdot V(x)}\in \mathbb{R}^{m\times m}}, where {V(x)=\{i\mid x_i \text{ is nonzero}\}}, has linearly independent columns. Here {A_{\cdot V(x)}} is the submatrix of {A} having columns with indices in {V(x)}. By the complementarity of LPs (3) and (5), we thus know that if {x} is a basic feasible solution and it is a solution to Program (5), then

\displaystyle y(u) = (A_{\cdot V(x)})^{-T}(c+u)_{V(x)}

is the unique solution to Program (3), where {(A_{\cdot V(x)})^{-T} =((A_{\cdot V(x)})^{T})^{-1}} and {(c+u)_{V(x)}} denotes the subvector of {c+u} with indices in {V(x)}. To summarize, what we have shown is that under A1 and A2,

\displaystyle v(u) = \min \{ c^Tx_i +u^Tx_i, x_ i \in \mathcal{V}_p\}, u \in \mathcal{F}_u

and

\displaystyle y(u) = (A_{\cdot V(x)})^{-T}(c+u)_{V(x)}

where {x = \arg\min_{x_i}\{ c^Tx_i +u^Tx_i, x_ i \in \mathcal{V}_p\}}.

By defining

\displaystyle \Omega_i = \{ u \mid c^Tx_i +u^Tx_i \leq c^Tx_j +u^Tx_j,\forall j\not =i \}, \forall\, 1\leq i\leq d

which is a polyhedron, we may write the above compactly as

\displaystyle v(u) = c^Tx_i +u^Tx_i, \quad y(u) = (A_{\cdot V(x_i)})^{-T}(c+u)_{V(x_i)}, \quad u \in \Omega_i.

Then it is easy to see that {v(u)} satisfies the Lipschitz condition

\displaystyle |v(u_1)-v(u_2)| \leq \max\{\|x_i\|_*\mid x_i \in \mathcal{V}_p\} \|u_1-u_2\|

where {\|\cdot\|_*} is the dual norm of {\|\cdot\|}. We note {\|\cdot \|} here is arbitrary.

Similarly, by using the theorem in this post, we see that {y(u)} is also Lipschitz continuous, with the Lipschitz bound

\displaystyle \| y(u_1) - y(u_2)\|_\alpha \leq\max \{ \|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta }, x \in \mathcal{V}_p\}\|u_1-u_2\|_{\beta }

where {\|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta } = \max_{\|z\|_{\beta }=1}\|(A_{\cdot V(x)})^{-T}z\|_\alpha} and {\|\cdot\|_\alpha, \|\cdot\|_\beta} are arbitrary norms. We summarize the above discussion in the following theorem.

Theorem 1 Recall that {\mathcal{V}_p} is the set of extreme points of the feasible region of (1), {\mathcal{R}_p} is the set of extreme rays of the feasible region of (1), and {\mathcal{F}_u = \{ u\mid \gamma_j ^Tu\geq 0, \forall \gamma_j \in \mathcal{R}_p\}}. Under assumptions A1 and A2, we have for all {u_1,u_2\in \mathcal{F}_u},

\displaystyle |v(u_1)-v(u_2)| \leq \max\{\|x_i\|_*\mid x_i \in \mathcal{V}_p\} \|u_1-u_2\|

and

\displaystyle \| y(u_1) - y(u_2)\|_\alpha \leq\max \{ \|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta }, x \in \mathcal{V}_p\}\|u_1-u_2\|_\beta

where {\|\cdot\|,\|\cdot\|_{\alpha},\|\cdot\|_\beta } are arbitrary norms. Here {A_{\cdot V(x)}} is the submatrix of {A} having columns with indices in {V(x)=\{i\mid x_i \text{ is nonzero}\}}.

How might one bound the Lipschitz constants

\displaystyle \max\{\|x_i\|_*\mid x_i \in \mathcal{V}_p\} \text{ and } \max \{ \|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta }\mid x \in \mathcal{V}_p\}?

The former might be done by bounding the diameter of the feasible region {\mathcal{F}_p} of Program (1), and the latter might be done through some restricted isometry property of {A}.
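
As a small numerical illustration of the Lipschitz behaviour of {v(u)} (a sketch with my own randomly generated data, not part of the discussion above), the following Python snippet solves the perturbed dual (3) with scipy for pairs of perturbations and probes the ratio {|v(u_1)-v(u_2)|/\|u_1-u_2\|}. The construction forces the primal feasible region to be bounded, so {v(u)} is finite for every {u}; the printed bound is only a crude stand-in for the constant {\max\{\|x_i\|_*\mid x_i\in\mathcal{V}_p\}} of Theorem 1.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)

n, m = 6, 3
# Build data so that the primal feasible region {Ax = b, x >= 0} is nonempty
# and bounded: the first row of A is all ones, forcing sum(x) = b[0].
A = np.vstack([np.ones(n), rng.normal(size=(m - 1, n))])
x0 = rng.uniform(0.5, 1.5, size=n)          # a strictly positive feasible point
b = A @ x0
c = rng.normal(size=n)

def dual_value(u):
    # Perturbed dual (3): maximize b^T y subject to A^T y <= c + u (y free).
    res = linprog(-b, A_ub=A.T, b_ub=c + u, bounds=[(None, None)] * m,
                  method="highs")
    assert res.success
    return -res.fun

# Probe the Lipschitz behaviour of v(u) along random perturbation pairs.
ratios = []
for _ in range(200):
    u1, u2 = rng.normal(scale=0.5, size=(2, n))
    ratios.append(abs(dual_value(u1) - dual_value(u2)) / np.linalg.norm(u1 - u2))
print("largest observed |v(u1)-v(u2)| / ||u1-u2||_2 :", max(ratios))
# Every feasible x satisfies sum(x) = b[0] and x >= 0, so ||x||_2 <= b[0] is a
# crude upper bound on max ||x_i||_2 over the primal vertices.
print("crude bound b[0] =", b[0])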

Lipschitz constant of piecewise linear (vector valued) functions

Suppose we have a collection of sets {\{\Omega_i \subset \mathbb{R}^{n}\}_{i=1}^d} and a set of {d} linear functions {\{f_i(x) = A_ix +c_i\mid x\in \Omega_i, i=1,\dots, d\}}, where each {\Omega_i} is a closed polyhedron in {\mathbb{R}^{n}}, i.e., {\Omega_i = \{x\mid a_{i_j}^Tx + b_{i_j} \leq 0,\forall j\leq k(i)\}} for some {a_{i_j} \in \mathbb{R}^{n},b_{i_j}\in \mathbb{R},k(i)\in \mathbb{N}}, and {A_i\in \mathbb{R}^{m\times n}} and {c_i \in \mathbb{R}^{m}}. Moreover, for any {i\not =j} and {x\in \Omega_i \cap \Omega_j}, we have {f_i(x) = f_j(x)}. Then we can define a function {f} on {\Omega =\cup_{i=1}^d \Omega_i} such that

\displaystyle f(x) = f_i(x) \quad \text{if} \; x\in \Omega_i.

We further assume {\Omega} is convex. We now state a theorem concerning the Lipschitz constant of {f}.

Theorem 1 Under the above setting, given arbitrary norms {\|\cdot\|_\alpha,\|\cdot\|_\beta }, {f} is {\max_{i\leq d}\{\|A_i\|_{\alpha,\beta}\}}-Lipschitz continuous with respect to {\|\cdot\|_\alpha} and {\|\cdot \|_\beta }, i.e., for all {x_1,x_2 \in \Omega},

\displaystyle \begin{aligned} \| f(x_2) - f(x_1)\|_\alpha \leq \max_{i\leq d}\{\|A_i\|_{\alpha,\beta }\}\|x_2-x_1\|_\beta . \end{aligned} \ \ \ \ \ (1)

Here {\|A_i\|} is {\|A_i\|_{\alpha,\beta } = \max_{\|x\|_\beta \leq 1}\|A_ix\|_{\alpha}}.
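
Before the proof, here is a quick numerical sanity check of the theorem in Python (my own toy example, with the Euclidean norm playing the role of both {\|\cdot\|_\alpha} and {\|\cdot\|_\beta}). Two linear pieces are glued along the hyperplane {\{x_1=0\}}, and sampling random pairs never produces a difference quotient above {\max_i\|A_i\|_{\alpha,\beta}}.

import numpy as np

rng = np.random.default_rng(2)

# Two pieces on the half-planes {x1 <= 0} and {x1 >= 0}; the two linear maps
# share their second column, so they agree on the common boundary {x1 = 0},
# as the theorem requires for f to be well defined.
A1 = np.array([[-1.0, 2.0], [0.0, 1.0]])
A2 = np.array([[ 3.0, 2.0], [1.0, 1.0]])

def f(x):
    return (A1 if x[0] <= 0 else A2) @ x

L = max(np.linalg.norm(A1, 2), np.linalg.norm(A2, 2))   # spectral (operator 2-) norms

worst = 0.0
for _ in range(100000):
    x, y = rng.normal(size=(2, 2))
    worst = max(worst, np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y))
print("empirical Lipschitz estimate:", worst, "  theoretical bound:", L)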

Proof: Let {g(t) = x_1 + t(x_2 - x_1)}, {L = \max_{i\leq d}\{\|A_i\|_{\alpha,\beta }\}\|x_2-x_1\|_\beta } and {h= f\circ g}. We see that inequality (1) is equivalent to

\displaystyle \begin{aligned} \|h(1)- h(0)\|_\alpha \leq L. \end{aligned} \ \ \ \ \ (2)

Using the property that {f} is well defined on {\Omega_i\cap \Omega_j} for any {i,j}, we have that there exist {0=t_0\leq t_1\leq \dots \leq t_K\leq t_{K+1}=1} and corresponding indices {i_0,i_1,\dots ,i_{K+1}} such that

\displaystyle \begin{aligned} h(t_l)= A_{i_l}x^{(l)}+c_{i_l} = A_{i_{l-1}}x^{(l)}+c_{i_{l-1}} \end{aligned} \ \ \ \ \ (3)

for all {1\leq l\leq K+1}, where {x^{(l)} = g(t_l)}. We also let {x^{(0)} =g(t_0) =x_1} and {x^{(K+1)}=g(t_{K+1}) =x_2}. These {x^{(l)}}s have the property that

\displaystyle \begin{aligned} \sum_{l=0}^{K}\|x^{(l+1)} -x^{(l)}\|_\beta = \sum_{l=0}^{K}(t_{l+1}-t_l)\|x_2 -x_1\|_\beta = \|x_2-x_1\|_\beta . \end{aligned} \ \ \ \ \ (4)

Thus we can bound the term {\|h(1)- h(0)\|_\alpha } by

\displaystyle \begin{aligned} \|h(1) - h(0)\|_\alpha &\leq \sum_{l=0}^{K} \|h(t_{l+1}) -h(t_{l})\|_\alpha \\ & = \sum_{l=0}^{K} \|A_{i_{l+1}}x^{(l+1)} +c_{i_{l+1}} - A_{i_l}x^{(l)} -c_{i_l}\|_\alpha \\ &\overset{(i)}{=} \sum_{l=0}^{K} \|A_{i_{l}}x^{(l+1)} +c_{i_{l}} - A_{i_l}x^{(l)} -c_{i_l}\|_\alpha \\ & = \sum_{l=0}^{K} \|A_{i_{l}}(x^{(l+1)} - x^{(l)}) \|_\alpha \\ &\overset{(ii)}{\leq}\sum_{l=0}^{K}\|A_{i_l}\|_{\alpha,\beta } \|x^{(l+1)}-x^{(l)}\|_\beta \\ & \leq \max_{i\leq d}\{\|A_i\|_{\alpha,\beta }\}\sum_{l=0}^{K} \|x^{(l+1)}-x^{(l)}\|_\beta \\ & \overset{(iii)}{=}\max_{i\leq d}\{\|A_i\|_{\alpha,\beta }\} \|x_2-x_1\|_\beta \\ & = L, \end{aligned} \ \ \ \ \ (5)

where {(i)} is due to equality (3), {(ii)} is due to the definition of the operator norm, and {(iii)} uses equality (4). This proves inequality (1). \Box

Closure property of conditional expectation under convexity

Suppose we have a random vector {Z\in \mathbb{R}^n} and we know that {Z} only takes values in a convex set {C}, i.e.,

\displaystyle \mathbb{P}(Z\in C) =1.

Previously, we showed in this post that as long as {C} is convex, we will have

\displaystyle \mathbb{E}(Z) \in C,

if {\mathbb{E}(Z)} exists. It is then natural to ask about the conditional expectation: is it true for any reasonable sigma-algebra {\mathcal{G}} that

\displaystyle \mathbb{P}(\mathbb{E}(Z\mid \mathcal{G}) \in C )=1?

The answer is affirmative. Let us first recall our previous theorem.

Theorem 1 For any Borel measurable convex set {C\subset {\mathbb R}^n}, and for any probability measure {\mu} on {(\mathbb{R}^n,\mathcal{B})} with

\displaystyle \int_C \mu(dx) =1, \quad \int_{\mathbb{R}^n} \|x\|\mu(dx) <\infty,

we have

\displaystyle \int_{\mathbb{R}^n} x\mu(dx) \in C.

By utilizing the above theorem and regular conditional distribution, we can prove our previous claim.

Theorem 2 Suppose {Z} is a random vector from a probability space {(\Omega, \mathcal{F},\mathbb{P})} to {(\mathbb{R}^n,\mathcal{B})}, where {\mathcal{B}} is the usual Borel sigma algebra on {\mathbb{R}^n}. If {C} is a Borel measurable convex set and

\displaystyle \mathbb{P}(Z\in C) =1, \mathbb{E}\|Z\| <\infty ,

then we have

\displaystyle \mathbb{P}( \mathbb{E}(Z\mid \mathcal{G})\in C) =1

for any sigma algebra {\mathcal{G} \subset \mathcal{F}}.

Proof: Since {Z} takes values in {\mathbb{R}^n}, we know there is a family of conditional distributions {\{\mu(\omega,\cdot)\}_{\omega \in \Omega}} such that for almost all {\omega \in \Omega}, we have for any {A\in \mathcal{B}}

\displaystyle \mathbb{E}(1_{\{Z\in A\}}\mid \mathcal{G})(\omega) = \int_{A} \mu(\omega, dz)

and for any real valued Borel function {f} with domain {\mathbb{R}^n} and {\mathbb{E}|f(Z)|<\infty}, we have

\displaystyle \mathbb{E}(f(Z)\mid \mathcal{G}) (\omega)= \int_{\mathbb{R}^n} f(z) \mu(\omega, dz).

The above two facts come from the existence of the regular conditional distribution, which is an important result in measure-theoretic probability theory.

Now taking {A=C} and {f(Z) = Z} (applied coordinatewise), we have for almost all {\omega\in \Omega},

\displaystyle \mathbb{E}(1_{\{Z\in C\}}\mid \mathcal{G})(\omega) = \int_{C} \mu(\omega, dz)

and

\displaystyle \mathbb{E}(Z\mid \mathcal{G}) (\omega)= \int_{\mathbb{R}^n} z \mu(\omega, dz).

But since {\mathop{\mathbb P}(Z\in C)=1}, we know that for almost all {\omega \in \Omega},

\displaystyle 1= \mathbb{E}(1_{\{Z\in C\}}\mid \mathcal{G})(\omega) = \int_{C} \mu(\omega, dz).

Thus for almost all {\omega} we have a probability distribution {\mu(\omega, dz)} on {{\mathbb R}^n} with mean equal to {\mathop{\mathbb E} (Z\mid \mathcal{G} )(\omega)} and {\int_C \mu (\omega ,dz) =1}. This is exactly the situation of Theorem 1. Thus, by applying Theorem 1 to the probability measure {\mu(\omega, dz)}, we find that

\displaystyle \mathbb{E}(Z\mid \mathcal{G})(\omega) = \int _{\mathbb{R}^n} z\mu (\omega, dz) \in C,

for almost all {\omega \in \Omega}. \Box
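
As an informal numerical illustration of Theorem 2 (entirely my own toy setup, not part of the proof): take {Z} uniform on the unit disk {C\subset \mathbb{R}^2} and {\mathcal{G}} generated by the sign of the first coordinate. Then {\mathbb{E}(Z\mid \mathcal{G})} takes only two values, the averages over the two half-disks, and both land inside the disk.

import numpy as np

rng = np.random.default_rng(3)

# Sample Z uniformly on the unit disk (a convex set C) by rejection sampling.
pts = rng.uniform(-1, 1, size=(200000, 2))
Z = pts[(pts ** 2).sum(axis=1) < 1]

# G is generated by the sign of the first coordinate, so E(Z | G) takes two
# values: the average over each half-disk.
left = Z[Z[:, 0] < 0].mean(axis=0)
right = Z[Z[:, 0] >= 0].mean(axis=0)
for v in (left, right):
    print(v, "norm =", np.linalg.norm(v), "(inside the disk, as Theorem 2 asserts)")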

A variance-bias decomposition of L1 norm

Suppose we have real random variables {X,X',Y,Y'} with {X,X'\sim F} and {Y,Y'\sim G}, where {F} and {G} are cumulative distribution functions, {X,X',Y,Y'} are all independent, and {\mathbb{E}|X|,\mathbb{E}|Y|} are finite. We prove the following theorem.

Theorem 1 (Variance-Bias decomposition of {L1} norm) For independent random variables {X,X',Y,Y'} defined above, we have

\displaystyle 2\mathbb{E}|X-Y| = \mathbb{E}|X-X'| + \mathbb{E}|Y-Y'| +2 \int (G(u)-F(u))^2du.

Thus

\displaystyle \begin{aligned} 2\mathbb{E}|X-Y| \geq \mathbb{E}|X-X'| + \mathbb{E}|Y-Y'| \end{aligned} \ \ \ \ \ (1)

with equality if and only if {G=F}.

The quantity {\mathbb{E}|X-X'|} is usually referred to as the mean absolute difference, and it measures the spread of a distribution. I don't know a standard term for the quantity {\mathbb{E}|X-Y|}, but what it measures is the difference between the distributions {F} and {G}. I think cross mean difference would be a nice name.

The equality can be considered as an analogue of the well-known variance-bias decomposition of estimators/predictors in statistics. If we think of using {X} to estimate/predict {Y}, then the expected error (the cross mean difference) in terms of absolute value (the {L1} norm, in more advanced terms) is the sum of the mean absolute differences of {X} and {Y}, i.e., {\mathbb{E}|X-X'|} and {\mathbb{E}|Y-Y'|}, which can be considered as the variance, plus the difference between the two distributions, i.e., {\int(F-G)^2}, which can be considered as the bias.

There is an analogue in terms of the usual square loss (or {L2} norm) and it is

\displaystyle 2\mathbb{E}(X-Y)^2 = \mathbb{E}(X-X')^2 + \mathbb{E}(Y-Y')^2 + 2(\mathbb{E}X-\mathbb{E}Y)^2.

Under this setting, we again see a decomposition of the estimation/prediction error into the variances of {X} and {Y}, i.e., {\mathbb{E}(X-X')^2, \mathbb{E}(Y-Y')^2}, plus the difference of the means, which can be considered as the bias.

The theorem assumes {X} and {Y} both have finite first moments. In the case where either {X} or {Y} has no finite first moment, the decomposition and the inequality still hold, as can be seen by inspecting the proof below. However, in that case equality in (1) does not necessarily imply that {F =G}.
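
Before the proof, here is a small Monte Carlo check of the decomposition in Python (my own choice of distributions: {F} is {N(0,1)} and {G} is {N(1,4)}; the integral term is computed from the two normal CDFs on a grid). Both sides agree up to Monte Carlo and discretization error.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
N = 400000

# X, X' ~ F = N(0,1) and Y, Y' ~ G = N(1, 2^2), all four independent.
X, Xp = rng.normal(0.0, 1.0, size=(2, N))
Y, Yp = rng.normal(1.0, 2.0, size=(2, N))

lhs = 2 * np.mean(np.abs(X - Y))
mad_X = np.mean(np.abs(X - Xp))            # E|X - X'|
mad_Y = np.mean(np.abs(Y - Yp))            # E|Y - Y'|

# Bias term: 2 * integral of (G(u) - F(u))^2 du, via a simple Riemann sum.
u = np.linspace(-15.0, 15.0, 20001)
bias = 2 * np.sum((norm.cdf(u, 1.0, 2.0) - norm.cdf(u, 0.0, 1.0)) ** 2) * (u[1] - u[0])

print("2 E|X-Y|                :", lhs)
print("E|X-X'| + E|Y-Y'| + bias:", mad_X + mad_Y + bias)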

Proof: The trick to establish the equality is to write the quantity {\mathbb{E}|X-Y|} in the following form.

\displaystyle \begin{aligned} 2\mathbb{E}|X-Y| & = 2\mathbb{E}(X-Y) 1_{\{ X\geq Y\}} + 2\mathbb{E}(Y-X) 1_{\{ Y\geq X\}}\\ & =2 \mathbb{E}\int 1_{\{Y\leq u\leq X\}}du + 2 \mathbb{E}\int 1_{\{ X\leq u\leq Y\}}du\\ & =2 \int \mathbb{P}(Y\leq u\leq X)du+ 2\int \mathbb{P}(X\leq u\leq Y)du\\ & =2\int G(u)(1-F(u))du +2\int F(u)(1-G(u))du\\ & = 2\int G(u)(1-F(u)) + F(u)(1-G(u)) du. \end{aligned} \ \ \ \ \ (2)

The third equality is due to Fubini’s theorem and the fourth is because of the independence between {X} and {Y}. Similarly, we have

\displaystyle \mathbb{E}|X-X'| +\mathbb{E}|Y-Y'| =2 \int F(u)(1-F(u)) + G(u)(1-G(u)) du.

Thus the difference between {2\mathbb{E}|X-Y|} and {\mathbb{E}|X-X'|+\mathbb{E}|Y-Y'|}, which is finite because {\mathbb{E}|X|,\mathbb{E}|Y|<\infty}, is

\displaystyle \begin{aligned} &2\mathbb{E}|X-Y|-\mathbb{E}|X-X'|-\mathbb{E}|Y-Y'|\\ =&2\int G(u)(1-F(u)) + F(u)(1-G(u)) du -2 \int F(u)(1-F(u)) + G(u)(1-G(u)) du\\ = &2\int (G(u)-F(u))^2du \geq 0\\ \end{aligned} \ \ \ \ \ (3)

Thus the decomposition in the theorem and inequality (1) are established. Now we argue that equality in (1) holds if and only if {F=G}. If {F=G}, then inequality (1) obviously becomes an equality. Conversely, if inequality (1) becomes an equality, then by the last line of (3), we have

\displaystyle \int (G(u)-F(u))^2 du = 0.

This means that

\displaystyle G(u) = F(u)\quad \text{almost everywhere}.

But {G} and {F} are right continuous, so we have for all {u\in \mathbb{R}},

\displaystyle G(u) = F(u).

\Box

Sandwich inequality of optimal gap and distance to solutions of LP

Suppose we have a general optimization problem.

\displaystyle \begin{aligned} & \underset{x \in \mathbb{R}^n}{\text{minimize}} & & f(x)\\ & {\text{subject to}} & & x \in \Omega. \\ \end{aligned} \ \ \ \ \ (1)

Also suppose problem (1) has a minimum and the minimum is achieved by a unique minimizer {x^*}.

Now if I have a point {x} such that {f(x) - f(x^*) } is very small, how small is the distance {\|x-x^*\|}? We might expect that {f(x) - f(x^*) \rightarrow 0} implies {x \rightarrow x^*}. This is true if {\Omega} is compact and {f} is continuous. But this does not tell us the quantitative relationship between the optimal gap, i.e., {f(x) -f(x^*)}, and the distance to the solution, i.e., {\|x-x^*\|}.

In this post, I am going to show that for a linear programming (LP) problem, the optimal gap and the distance to the solution set are the same up to a multiplicative constant which depends only on the problem data.

To start, consider an LP in the standard form, i.e.,

\displaystyle \begin{aligned} & \underset{x \in \mathbb{R}^n}{\text{minimize}} & & c^Tx\\ & {\text{subject to}} & & Ax =b \\ & & & x\geq 0. \end{aligned} \ \ \ \ \ (2)

where the decision variable is {x} and the problem data are {A\in \mathbb{R}^{m\times n},b\in \mathbb{R}^m,c\in \mathbb{R}^n}. Here {x\geq 0} means each coordinate of {x} is nonnegative.

Denote the solution set of problem (2) by {X^* = \arg \min\{ c^Tx\mid Ax = b, x\geq 0\} }, and the distance to the solution set {X^*} by {\text{dist}(x,X^*) = \inf_{x^* \in X^*} \|x-x^*\|}. Note that the norm here is arbitrary, not necessarily the Euclidean norm.

We are now ready to state the theorem.

Theorem 1 (Sandwich inequality of optimal gap and distance to solutions of LP) For problem (2), there exist constants {C_0, c_0>0} which depend only on {A,b,c} and {\|\cdot\|} such that for all feasible {x}, i.e., {Ax= b, x\geq 0},

\displaystyle C_0 \text{dist}(x,X^*)\geq c^Tx - c^Tx^* \geq c_0 \text{dist}(x,X^*).

The above theorem shows that the optimal gap, i.e., {c^Tx - c^Tx^*}, and the distance to the solution set, i.e., {\text{dist}(x,X^*)}, are the same up to a multiplicative constant. The right inequality of the theorem is usually referred to as linear growth in the optimization literature.

The proof below is constructive and we can in fact take {c_0 = \frac{\epsilon}{\max\{2B,1\}}}, where { B = \max_{ 1\leq i \leq l} \|x_i\|} and {\epsilon = \min_{ m+1\leq i\leq l, n+1\leq j \leq s} \{ c^Tx_i-c^Tx^*, c^T\gamma_j\}} for {x^*\in X^*}, and {C_0 = \|c\|_*}. Here {\| y\|_* = \sup \{ u^Ty\mid \|u\|\leq 1\}} is the dual norm and {x_i,\gamma_j} are the extreme points and extreme rays. We assume there are {l} many extreme points and {s} many extreme rays, with the first {m} of the {x_i}s and the first {n} of the {\gamma_j}s lying in the optimal set {X^*}. See the proof for more detail.

The proof mainly relies on the extreme point and extreme ray representation of the feasible region {\{x \mid Ax = b, x\geq 0\}} and of the optimal set {X^*}.
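
Before the proof, here is a small numerical sketch in Python of the sandwich inequality on a toy instance (my own construction): the LP minimizes {c^Tx} over the probability simplex {\{x\geq 0, \mathbf{1}^Tx = 1\}}, whose extreme points are the standard basis vectors and which has no extreme rays, so the constants {B}, {\epsilon}, {c_0} and {C_0} from the proof are explicit (here with the Euclidean norm, so {\|\cdot\|_*=\|\cdot\|_2}). The sketch assumes the minimizer is unique, which holds with probability one for random {c}.

import numpy as np

rng = np.random.default_rng(5)

# Toy LP in standard form: minimize c^T x over the probability simplex
# {x >= 0, sum(x) = 1}.  Its extreme points are e_1, ..., e_n and there are
# no extreme rays, so the constants from the proof are explicit.
n = 5
c = rng.normal(size=n)
j_star = int(np.argmin(c))            # assume the minimizer is unique
x_star = np.eye(n)[j_star]

eps = np.min(np.delete(c, j_star)) - c[j_star]   # gap to the second-best vertex
B = 1.0                                          # max Euclidean norm of a vertex
c0 = eps / max(2 * B, 1.0)
C0 = np.linalg.norm(c)                           # dual norm of c (2-norm here)

ok = True
for _ in range(10000):
    x = rng.dirichlet(np.ones(n))                # a random feasible point
    gap = c @ x - c @ x_star
    dist = np.linalg.norm(x - x_star)            # X* = {x_star}, so dist is explicit
    ok &= (C0 * dist + 1e-12 >= gap >= c0 * dist - 1e-12)
print("sandwich inequality held on all samples:", bool(ok))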

Proof: The feasible region can be written as

\displaystyle \{x|Ax=b,x\geq 0\} = \{ \sum_{i=1}^l \alpha_i x_i + \sum_{j=1}^s \beta_j \gamma_j | \sum_i \alpha_i = 1, \alpha_i\geq 0, \beta_j\geq 0\}.

Here {x_i}s are extreme points of the feasible region and {\gamma_j}s are extreme rays. By scaling the extreme rays, we can assume that {\|\gamma_j\| =1} for all {j}.

The optimal set can also be written as

\displaystyle X^* = \{ \sum_{i=1}^m \alpha_i x_i + \sum_{j=1}^n \beta_j \gamma_j | \sum_i \alpha_i = 1, \alpha_i\geq 0, \beta_j\geq 0\}.

For notational simplicity, we assume here that the first {m} of the {x_i}s and the first {n} of the {\gamma_j}s are in the optimal set and the rest are not.

We denote {B = \max_{ 1\leq i \leq l} \|x_i\|} and {\epsilon = \min_{ m+1\leq i\leq l, n+1\leq j \leq s} \{ c^Tx_i-c^Tx^*, c^T\gamma_j\}} where {x^*\in X^*}. Note {\epsilon>0}, since the {x_i}s not in the optimal set satisfy {c^Tx_i>c^Tx^*} and the {\gamma_j}s not in the optimal set have positive inner product with {c}.

We first prove the second inequality, i.e., {c^Tx - c^Tx^* \geq c_0 \text{dist}(x,X^*)}.

Now take an arbitrary feasible {x}; it can be written as

\displaystyle x = \sum_{i=1}^m a_i x_i + \sum_{i=m+1}^l a_i x_i + \sum _{j=1}^n b_j \gamma_j + \sum_{j=n+1}^s b_j \gamma_j

for some {a_i \geq 0, \sum_i a_i =1} and {b_j\geq 0}.

The objective value of {x} is then

\displaystyle c^Tx = \sum_{i=1}^m a_i c^T x_i + \sum_{i=m+1}^l a_i c^T x_i +\sum_{j=n+1}^s b_j c^T \gamma_j.

We use the fact that {c^T\gamma_j=0} for all {j\leq n} here.

Subtracting {c^Tx^*} from the above, we have

\displaystyle \begin{aligned} c^Tx -c^Tx^*& = (\sum_{i=1}^m a_i-1)c^T x^* + \sum_{i=m+1}^l a_i c^T x_i +\sum_{j=n+1}^s b_j c^T \gamma_j \\ & = (-\sum_{i=m+1}^l a_i)c^T x^* + \sum_{i=m+1}^l a_i c^T x_i +\sum_{j=n+1}^s b_j c^T \gamma_j \\ &\geq (\sum_{i=m+1}^l a_i) \epsilon + (\sum_{j=n+1}^s b_j )\epsilon \end{aligned} \ \ \ \ \ (3)

The first equality uses {c^Tx_i = c^Tx^*} for {i\leq m}, the second equality is due to {\sum_i a_i =1}, and the inequality follows from the definition of {\epsilon} and the nonnegativity of the {a_i}s and {b_j}s.

The distance between {x} and {X^*} is the infimum of

\displaystyle \| \sum_{i=1}^m( a_i-\alpha_i)x_i + \sum_{i=m+1}^l a_i x_i + \sum _{j=1}^n (b_j-\beta_j) \gamma_j + \sum_{j=n+1}^s b_j \gamma_j \|.

By taking {\alpha_i \geq a_i} with {\sum_{i=1}^m \alpha_i =1} and {\beta_j = b_j} for {j\leq n}, and applying the triangle inequality to the above quantity, we have

\displaystyle \begin{aligned} &\| \sum_{i=1}^m( a_i-\alpha_i)x_i + \sum_{i=m+1}^l a_i x_i + \sum _{j=1}^n (b_j-\beta_j) \gamma_j + \sum_{j=n+1}^s b_j \gamma_j \| \\ &\leq \sum_{i=1}^m( \alpha_i-a_i)\|x_i \|+ \sum_{i=m+1}^l a_i\|x_i \|+\sum_{j=n+1}^s b_j \|\gamma_j \|\\ &\leq \sum_{i=1}^m (\alpha_i -a_i) B + \sum_{i=m+1}^l a_i B + \sum_{j=n+1}^s b_j\\ &= (1-\sum_{i=1}^m a_i)B + (\sum_{i=m+1}^l a_i)B + \sum_{j=n+1}^s b_j\\ &= 2B (\sum_{i=m+1}^l a_i) +\sum_{j=n+1}^s b_j. \end{aligned} \ \ \ \ \ (4)

The first inequality is the triangle inequality together with {\alpha_i\geq a_i, a_i\geq 0, b_j\geq 0}. The second inequality applies the definition of {B} and {\|\gamma_j\|=1}. The first equality is due to {\sum_i \alpha_i =1} and the second equality is due to {\sum_i a_i =1}.

Thus the distance between {x} and {X^*} is bounded above by

\displaystyle \text{dist}(x,X^*) \leq 2B(\sum_{i=m+1}^l a_i) +\sum_{j=n+1}^s b_j.

Since {c^Tx -c^Tx^* \geq( \sum_{i = m+1}^l a_i + \sum_{j=n+1}^s b_j)\epsilon} by our previous argument, we see that setting

\displaystyle c_0 = \frac{\epsilon}{\max\{2B,1\}}

should give us

\displaystyle c^Tx - c^Tx^* \geq c_0 \text{dist}(x,X^*).

We now prove the inequality

\displaystyle C_0 \text{dist}(x,X^*)\geq c^Tx - c^Tx^*.

Note that the infimum in {\text{dist}(x,X^*) = \inf_{x^* \in X^*} \|x-x^*\|} is actually achieved by some {x^*}. The reason is that we can first pick an {x'\in X^*}; then

\displaystyle \inf_{x^* \in X^*} \|x-x^*\| = \inf_{x^* \in X^*, \|x-x^*\| \leq \|x-x'\|} \|x-x^*\|.

But the set {X^* \cap \{x^* \mid \|x-x^*\| \leq \|x-x'\| \}} is bounded and closed ({X^*} is closed, as it is the set of convex combinations of finitely many points plus conic combinations of finitely many extreme rays), so the Weierstrass theorem tells us that the infimum is actually achieved by some {x^*\in X^*}.

Now take {x^*} such that {\|x-x^*\| =\text{dist}(x,X^*)}. We have

\displaystyle c^Tx -c^Tx^* \leq \|c\|_* \|x-x^*\|=\|c\|_*\text{dist}(x,X^*)

where {\|\cdot\|_*} is the dual norm of {\|\cdot\|}. Thus letting {C_0 = \|c\|_*} finishes the proof. \Box

From the proof, we see that two possible choices of {c_0} and {C_0} are {c_0 = \frac{\epsilon}{\max\{2B,1\}}}, where { B = \max_{ 1\leq i \leq l} \|x_i\|} and {\epsilon = \min_{ m+1\leq i\leq l, n+1\leq j \leq s} \{ c^Tx_i-c^Tx^*, c^T\gamma_j\}} for {x^*\in X^*}, and {C_0 = \|c\|_*}. These are not optimal and can be sharpened; I will probably give sharper constants in a future post.

Closure property of expectation under convexity

Suppose we have a random vector {Z} and a convex set {C\subset {\mathbb R} ^n} such that

\displaystyle \mathop{\mathbb P}(Z\in C) =1.

If you are doing things with convexity, then you may wonder whether

\displaystyle \mathop{\mathbb E}(Z) \in C.

This is certainly true if {Z} only takes finitely many values in {C} or if {C} is closed. In the first case, you just verify the definition of convexity; in the second case, you may use the strong law of large numbers. But if you draw a picture and think for a while, you might wonder whether these conditions are needed, as it looks like no matter what values {Z} takes, it cannot go outside of {C}, and the average should still belong to {C} as long as {C} is convex. In this post, we are going to show that this is indeed the case, and we have the following theorem.

Theorem 1 For any convex set {C\subset {\mathbb R}^n}, and for any random vector {Z} such that

\displaystyle \mathop{\mathbb P}(Z\in C)=1,

its expectation is still in {C}, i.e.,

\displaystyle \mathop{\mathbb E}(Z) \in C

as long as the mean exists.

Skip the following remark if you do not know or are not familiar with measure theory.

Remark 1 If you are a measure theoretic person, you might wonder whether {C} should be Borel measurable. The answer is no: the set {C} need not be Borel measurable. To make the point clear, suppose there is an underlying probability space {(\Omega, \mathcal{F},\mathop{\mathbb P})} and {Z} is a random variable from this probability space to {(\mathbb{R}^n, \mathcal{B})}, where {\mathcal{B}} is the Borel sigma-algebra. Then we can either add the condition that the event {\{\omega \in \Omega \mid Z\in C \} = F} belongs to {\mathcal{F}}, or understand {\mathop{\mathbb P}(F)=1} as saying that {F} is a measurable event with respect to the completed measure space {(\Omega, \bar{\mathcal{F}},\bar{\mathop{\mathbb P}})}, where we overload the notation {\mathop{\mathbb P}} to mean {\bar{\mathop{\mathbb P}}}. Here {(\Omega, \bar{\mathcal{F}},\bar{\mathop{\mathbb P}})} denotes the completion of the probability space {(\Omega, \mathcal{F},\mathop{\mathbb P})} with respect to {\mathop{\mathbb P}}.

As preparation for the proof, recall the separating hyperplane theorem for convex sets.

Theorem 2 (Separating Hyperplane theorem) Suppose {C} and {D} are convex sets in {{\mathbb R}^n} with {C\cap D = \emptyset}. Then there exists a nonzero {a \in {\mathbb R}^{n}} such that

\displaystyle a^Tx \geq a^Ty

for all {x\in C,y\in D}.

Also recall the following little facts about convexity.

  • Any convex set in {{\mathbb R}} is always an interval.
  • Any affine space of dimension {n-m} in {{\mathbb R}^n} is of the form {\{x\in {\mathbb R}^{n}:Ax=b\}} for some {A\in {\mathbb R}^{m\times n}} and {b \in {\mathbb R}^m}.

We are now ready to prove Theorem 1.

Proof of Theorem 1: We may suppose that {C} has nonempty interior, since if it does not, we can replace {\mathbb{R}^n} by the affine plane of smallest dimension containing {C}. Suppose {\mathop{\mathbb E}(Z)} is not in {C}. Then by the separating hyperplane theorem, there exists a nonzero {a} such that

\displaystyle L= a^T\mathop{\mathbb E}(Z) \geq a^Tx, \forall x \in C.

Since {Z\in C} almost surely, we should have

\displaystyle a^TZ \leq L

almost surely. Since {\mathop{\mathbb E}( a^T Z)= a^T\mathop{\mathbb E}( Z)}, we see that {a^T Z= L} with probability {1}. Since the intersection of the hyperplane {\{x\mid a^Tx = L\}} and {C} is still convex, we see that {Z} only takes values in a convex subset of an {(n-1)}-dimensional affine space.

Repeating the above argument, we can decrease the dimension until {n=1}. After a proper translation and rotation, we may say that {Z} takes its values in an interval in {\mathbb{R}}, and we want to argue that the mean of {Z} is always in the interval.

This is almost trivial. Suppose the interval is bounded. If the interval is closed, then since taking expectations preserves order, i.e., {X\geq Y \implies \mathop{\mathbb E} X\geq \mathop{\mathbb E} Y}, the mean must be in the interval. If the interval is half open and half closed and the mean is not in the interval, then {\mathop{\mathbb E} Z} must be the open end of the interval, since expectation preserves order; but this forces {Z} to put full measure on the open end, which contradicts the assumption that {Z} is in the interval with probability one. The case where both ends are open is handled in the same way. If the interval is unbounded in one direction, the previous argument still works, and if it is all of {\mathbb{R}}, then certainly {\mathop{\mathbb E} Z \in\mathbb{R}}. This completes the proof. \Box

Solution to (general) Truncated Moment Problem

We are going to solve the truncated moment problem in this post. The theorem we are going to establish is more general than the original problem itself. The following theorem is a bit abstract; you can skip to Corollary 2 to see what the truncated moment problem is and why it admits a generalization in the form of Theorem 1.

Theorem 1 Suppose {X} is a random transformation from a probability space {(A,\mathcal{A},\mathop{\mathbb P})} to a measurable space {(B,\mathcal{B})} where each singleton set of {B} is in {\mathcal{B}}. Let each {f_i}, {i=1,\dots,m}, be a real valued (measurable) function with domain {B}. Given

\displaystyle (\mathbb{E}f_i(X))_{i=1,\dots,m}

and they are all finite, there exists a random variable {Y\in B} such that {Y} takes no more than {m+1} values in {B}, and

\displaystyle (\mathbb{E}f_i(Y))_{i=1,\dots,m} = (\mathbb{E}f_i(X))_{i=1,\dots,m}.

(If you are not familiar with the terms Borel measurable, measurable space and the sigma-algebras {\mathcal{A}}, {\mathcal{B}}, then just ignore them. I put these terms here just to make sure that the theorem is rigorous enough.)

Let me parse the theorem for you. Essentially, the theorem says that given {m} many expectations, no matter what kind of source the randomness comes from, i.e., no matter what {X} is, we can always find a finitely supported random variable (which is {Y} in the theorem) that achieves the same expectations.

To have a concrete sense of what is going on, consider the following Corollary of Theorem 1. It is the original truncated moment problem.

Corollary 2 (Truncated Moment Problem) For any real valued random variable {X\in {\mathbb R}} with its first {m} moments all finite, i.e., for all {1\leq i\leq m}

\displaystyle \mathop{\mathbb E}|X|^i < \infty,

there exists a real valued discrete random variable {Y} which takes no more than {m+1} values in {{\mathbb R}} and its first {m} moments are the same as {X}, i.e.,

\displaystyle (\mathbb{E}Y,\mathbb{E}(Y^2),\dots, \mathbb{E}(Y^m) )=(\mathbb{E}X,\mathbb{E}(X^2),\dots, \mathbb{E}(X^m)).

The original truncated moment problem asks: given the (uncentered) moments, can we always find a finitely supported discrete random variable that matches all the moments? It should be clear that this is a simple consequence of Theorem 1 by letting {B={\mathbb R}} and {f_i(x) = x^{i}, i=1,\dots,m}.
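
To see Corollary 2 in action, here is a small Python sketch (my own construction; the data, the set of candidate atoms and the solver choice are assumptions of the sketch, not part of the theorem). It draws a sample playing the role of {X}, then looks for nonnegative weights on the observed points matching the first {m} empirical moments by solving a feasibility LP; a basic feasible solution of that LP has at most {m+1} nonzero weights, which is exactly the Caratheodory count appearing in the proof below (whether the returned solution is basic depends on the solver, hence this is only a sketch).

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
m = 4                                   # number of moments to match

# The empirical distribution of this Exp(1) sample plays the role of the law
# of X in Corollary 2.
x = rng.exponential(1.0, size=2000)
moments = [np.mean(x ** i) for i in range(1, m + 1)]

# Look for weights p >= 0 on the observed points with sum(p) = 1 matching the
# same m moments.  The uniform weights are feasible, and a basic feasible
# solution of this LP has at most m+1 nonzero weights (m+1 equality rows).
A_eq = np.vstack([np.ones_like(x)] + [x ** i for i in range(1, m + 1)])
b_eq = np.array([1.0] + moments)
res = linprog(np.zeros(x.size), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * x.size, method="highs-ds")
assert res.success
atoms = x[res.x > 1e-10]
weights = res.x[res.x > 1e-10]
print("number of atoms:", atoms.size, "(at most", m + 1, "by Corollary 2)")
print("target moments :", np.round(moments, 4))
print("matched moments:", np.round([weights @ atoms ** i for i in range(1, m + 1)], 4))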

There is also a multivariate version of the truncated moment problem, which can also be regarded as a special case of Theorem 1.

Corollary 3 (Truncated Moment Problem, Multivariate Version) For any real random vector {X=(X_1,\dots,X_n)\in \mathbb{R}^n} with all of its moments of order up to {k} finite, i.e.,

\displaystyle \mathop{\mathbb E}(\Pi_{i=1}^n|X_{i}|^{\alpha_i}) <\infty

for any {{1\leq \sum \alpha_i\leq k}}. Each {\alpha_i} here is a nonnegative integer. The total number of moments in this case is {n+k \choose k}. Then there is a real random vector {Y \in \mathbb{R}^n} such that it takes no more than {{n+k \choose k}+1} values, and

\displaystyle (\mathop{\mathbb E}(\Pi_{i=1}^nX_{i}^{\alpha_i}))_{1\leq \sum \alpha_i\leq k} = (\mathop{\mathbb E}(\Pi_{i=1}^nY_{i}^{\alpha_i})) _{1\leq \sum \alpha_i\leq k}.

Though the form of Theorem 1 is quite general and looks scary, it is actually a simple consequence of the following lemma and the use of convex hulls.

Lemma 4 For any convex set {C \subset \mathbb{R}^k} and any random variable {Z} which has a finite mean and takes values only in {C}, i.e.,

\displaystyle \mathop{\mathbb E}(Z) \in \mathbb{R}^k, \mathop{\mathbb P}(Z\in C) =1,

we have

\displaystyle \mathop{\mathbb E} (Z) \in C.

The above lemma is trivially true if {C} is closed or {Z} takes only finitely many values. But it remains true when {C} is only assumed to be convex. We show this in this post.

We are now ready to show Theorem 1.

Proof of Theorem 1: Consider the set

\displaystyle L = \{ (f_i(x))_{i=1,\dots,m}\mid x\in B \},

The convex hull of this set {L} is

\displaystyle \text{conv}(L) = \{ \sum_{j=1}^l \alpha _j a_j\mid \alpha_j \geq 0 ,\sum_{j=1}^l \alpha_j =1, a_j\in L, l \in {\mathbb N}\}.

Now take the random variable {Z=(f_i(X))_{i=1,\dots,m}}, which takes values only in {L\subset \text{conv}(L)}. By Lemma 4, we know that

\displaystyle \mathop{\mathbb E} Z \in \text{conv}(L).

Note that every element in {\text{conv}(L)} has a FINITE representation in terms of {a_j}s!

This means we can find {l\in {\mathbb N}}, {\alpha_j\geq 0, \sum_{j = 1}^l \alpha_j =1} and {a_j \in L, j=1,\dots,l} such that

\displaystyle \sum_{j=1}^l \alpha_ja_j = \mathop{\mathbb E} Z = (\mathop{\mathbb E} f_i(X))_{i=1,\dots,m}.

Since each {a_j = (f_i(x_j))_{i=1,\dots,m}} for some {x_j \in B}, we can simply take the distribution of {Y} to be

\displaystyle \mathop{\mathbb P}(Y = x_j) = \alpha_j, \forall j =1,\dots,l.

Finally, apply Caratheodory's theorem to conclude that the representation can be chosen with {l\leq m+1}. \Box