# A singular value inequality for nonnegative matrices

This post concerns a question regarding nonnegative matrices, i.e., matrices with all entries nonnegative:

For two nonnegative matrices ${A,B \in {\mathbb R}^{m\times n}}$, if ${A-B\geq 0}$, i.e., ${A-B}$ is nonnegative as well, is there any relation with their singular values?

As we shall see, the answer is yes: the largest singular value of ${A}$, denoted as ${\sigma_1(A)}$, is at least the largest singular value of ${B}$, ${\sigma_1(B)}$:

$\displaystyle \sigma_1(A)\geq \sigma_1(B).$

Let us first consider the simpler case when ${A,B\in {\mathbb R}^{n\times n}}$ are symmetric, so that ${\sigma_1(A)=\max\{|\lambda_1(A)|,|\lambda_n(A)|\}}$ and ${\sigma_1(B) =\max\{|\lambda_1(B)|,|\lambda_n(B)|\}}$. Here for any symmetric matrix ${C}$, we denote its eigenvalues as ${\lambda_1(C)\geq \dots \geq \lambda_n(C)}$.

Lemma 1
If ${A,B,A-B\in {\mathbb R}^{n\times n}}$ are all nonnegative and symmetric, then

$\displaystyle \sigma_1(A)\geq \sigma_1(B).$

Proof:
To prove the lemma, we first recall the Perron-Frobenius theorem, which states that for a nonnegative matrix the largest eigenvalue in magnitude is itself nonnegative, and that this eigenvalue admits an eigenvector which is entrywise nonnegative as well.

Using this theorem, we can pick a nonnegative unit-norm eigenvector ${v_B}$ corresponding to the eigenvalue ${\lambda_1(B)}$, which is both nonnegative and largest in magnitude. Next, multiplying ${A-B}$ by ${v_B^\top}$ on the left and by ${v_B}$ on the right, we have

$\displaystyle 0\overset{(a)}{\leq} v_B^\top Av_B - v_B^\top Bv_B \overset{(b)}{=} v_B^\top Av_B-\lambda_1(B)\overset{(c)}{\leq} \lambda_1(A)-\lambda_1(B)\overset{(d)}{=}\sigma_1(A)-\sigma_1(B).$

Here step ${(a)}$ is because ${A-B}$ is nonnegative and ${v_B}$ is nonnegative.
The step ${(b)}$ is because ${v_B}$ has unit norm and ${Bv_B =\lambda_1(B)v_B}$. The step ${(c)}$ is because ${\lambda_1(A) =\max_{\|v\|= 1}v^\top Av}$. The step ${(d)}$ is because ${A}$ and ${B}$ are symmetric and nonnegative, so by the Perron-Frobenius theorem the largest eigenvalue is indeed the largest singular value. $\Box$

To prove the general rectangular case, we use a dilation argument.

Theorem 2
If ${A,B,A-B\in {\mathbb R}^{m\times n}}$ are all nonnegative, then

$\displaystyle \sigma_1(A)\geq \sigma_1(B).$

Proof:
Consider the symmetric matrices ${\tilde{A}}$ and ${\tilde{B}}$ in ${{\mathbb R}^{(n+m)\times (n+m)}}$:

$\displaystyle \tilde{A} = \begin{bmatrix} 0 & A \\ A^\top & 0 \end{bmatrix},\quad \text{and} \quad \tilde{B} = \begin{bmatrix} 0 & B \\ B^\top & 0 \end{bmatrix}.$

Note that ${\tilde{A}}$ has the same largest singular value as ${A}$, and ${\tilde{B}}$ has the same largest singular value as ${B}$ (the nonzero eigenvalues of the dilation ${\tilde{A}}$ are ${\pm}$ the nonzero singular values of ${A}$).
Now since ${A,B,A-B\geq 0}$, we also have ${\tilde{A},\tilde{B},\tilde{A}-\tilde{B}\geq 0}$. Applying Lemma 1 to ${\tilde{A}}$ and ${\tilde{B}}$ proves the theorem.
$\Box$

How about the second singular values of ${A}$ and ${B}$? We do not have ${\sigma_2(A)\geq \sigma_2(B)}$ in general, as can be seen by considering

$\displaystyle A = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad B = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$
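These claims are easy to check numerically. Below is a small sanity check (a sketch assuming `numpy` is available, not part of the proof): it tests Theorem 2 on random nonnegative pairs and evaluates the counterexample above, where ${\sigma_2(A)=0<1=\sigma_2(B)}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sanity check of Theorem 2: if A, B and A - B are entrywise nonnegative,
# then sigma_1(A) >= sigma_1(B).
for _ in range(1000):
    m, n = rng.integers(2, 6, size=2)
    B = rng.random((m, n))           # nonnegative B
    A = B + rng.random((m, n))       # A = B + (nonnegative), so A - B >= 0
    s1A = np.linalg.svd(A, compute_uv=False)[0]
    s1B = np.linalg.svd(B, compute_uv=False)[0]
    assert s1A >= s1B - 1e-12

# The counterexample above for the second singular value:
A = np.ones((2, 2))                  # singular values (2, 0)
B = np.eye(2)                        # singular values (1, 1)
s2A = np.linalg.svd(A, compute_uv=False)[1]
s2B = np.linalg.svd(B, compute_uv=False)[1]
assert s2A < s2B                     # sigma_2(A) = 0 < 1 = sigma_2(B)
```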

# Random matrices and their rank

In this post, we study a few probability-one results concerning the rank of random matrices: with probability one,

1. a random asymmetric matrix has full rank;
2. a random symmetric matrix has full rank;
3. ${\frac{n(n+1)}{2}}$ many random rank one positive semidefinite matrices (belonging to ${\mathbb{R}^{n\times n}}$) are linearly independent.

At least the first two look quite natural and should be correct at first glance. However, rigorously proving them does require some effort. This post shows that one can use the concept of independence to prove the above assertions. Indeed, all the proofs follow an inductive argument where independence makes life much easier:

1. Draw the first sample, it satisfies the property we want to show,
2. Suppose the first ${k}$ samples satisfy the property we want to show, and draw the next sample. Since it is independent of the previous ${k}$ samples, the first ${k}$ samples can be considered as fixed. Then, with some effort, we show that the first ${k+1}$ samples satisfy the property as well.

Before going into each example, we further note that the probability-${1}$ results listed above still hold for any probability measure that is absolutely continuous with respect to the Gaussian measure introduced below and vice versa, so the specific Gaussian assumption is not essential.

1. Random Matrices are full rank

Suppose ${A\in \mathbb{R}^{n\times m}}$ with ${A=[a_{ij}]}$, where each entry of ${A}$ is drawn independently from the standard Gaussian ${N(0,1)}$. We prove that ${A}$ has rank ${\min\{m,n\}}$ with probability ${1}$.

To start, we may assume without loss of generality that ${m\leq n}$; otherwise, we just repeat the following argument for ${A^\top}$.

The trick is to use the independence structure and to view the columns of ${A}$ as drawn from left to right. Let us write the ${i}$-th column of ${A}$ as ${a_{\cdot i}}$.

For the first column, we know ${a_{\cdot 1}}$ is linearly independent (i.e., nonzero) because it equals ${0}$ with probability ${0}$. For the second column, because it is drawn independently of the first column, we can consider the first column as fixed, and the probability that ${a_{\cdot 2}}$ falls into the span of a fixed vector is ${0}$ because ${a_{\cdot 2}}$ has a continuous density in ${\mathbb{R}^n}$. Thus the first two columns are linearly independent.

For general ${k\leq m}$, because ${m\leq n}$, the first ${k-1}$ columns span a proper subspace of ${\mathbb{R}^n}$, and so ${a_{\cdot k}}$ falls into that subspace with probability ${0}$ (a proper linear subspace has Lebesgue measure ${0}$ in ${\mathbb{R}^n}$, and the Gaussian measure is absolutely continuous with respect to the Lebesgue measure and vice versa). Thus the first ${k}$ columns are linearly independent. Proceeding with the argument until ${k=m}$ completes the proof.
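As a quick numerical illustration (not a proof), one can sample such matrices and check the rank; this sketch assumes `numpy` is available:

```python
import numpy as np

# Empirically, every sampled Gaussian matrix should have full rank,
# since rank deficiency is a probability-zero event.
rng = np.random.default_rng(1)
for _ in range(100):
    n, m = rng.integers(1, 8, size=2)
    A = rng.standard_normal((n, m))
    assert np.linalg.matrix_rank(A) == min(n, m)
```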

2. Symmetric Matrices are full rank

Suppose ${A\in \mathbb{S}^n\subset \mathbb{R}^{n\times n}}$ with ${A=[a_{ij}]}$, where each entry of the upper half of ${A}$, ${a_{ij},j\geq i}$, is drawn independently from the standard Gaussian ${N(0,1)}$. We prove that ${A}$ has rank ${n}$ with probability ${1}$.

Let’s write the ${i}$-th row of ${A}$ by ${a_{i\cdot}\in \mathbb{R}^{1\times n}}$. For the first ${k}$ entries of the ${i}$-th row, we write ${a_{i,1:k}\in \mathbb{R}^{1\times k}}$. Similarly, ${a_{1:k,i}}$ denotes the first ${k}$ entries of the ${i}$-th column. For the top left ${k\times k}$ submatrix of ${A}$, we denote it by ${a_{1:k,1:k}\in \mathbb{R}^{k\times k}}$. Similar notation applies to other submatrices.

The idea is to draw the columns of ${A}$ from left to right sequentially and use the independence structure. Starting from the first column: with probability one, ${a_{\cdot 1}}$ is linearly independent (meaning ${a_{\cdot 1}\not =0}$).

Now ${a_{\cdot 1}}$ is drawn, and the second column except its first entry (which equals ${a_{12}=a_{21}}$ by symmetry) is independent of ${a_{\cdot 1}}$. The column ${a_{\cdot 2}}$ lies in the span of ${a_{\cdot 1}}$ only if ${a_{\cdot 2}}$ is exactly the ${\frac{a_{12}}{a_{11}}}$ multiple of ${a_{\cdot 1}}$ (note ${a_{11}}$ equals ${0}$ with probability ${0}$). This happens with probability ${0}$, since each remaining entry of ${a_{\cdot2}}$ has a continuous probability density and is drawn independently of ${a_{\cdot 1}}$.

Now assume we have drawn the first ${k}$ columns of ${A}$, that they are linearly independent with probability ${1}$, and that the top left ${k\times k}$ submatrix of ${A}$ is invertible (the base case is what we just proved). We want to show that after we draw the ${(k+1)}$-th column, the first ${k+1}$ columns are still linearly independent with probability ${1}$. Since the first ${k}$ columns are drawn, the first ${k}$ entries of the ${(k+1)}$-th column ${a_{\cdot (k+1)}}$ are fixed by symmetry; we wish to show that the remaining ${n-k}$ entries, which are drawn independently of all previous ${k}$ columns, ensure linear independence of the first ${k+1}$ columns.

Suppose instead that ${a_{\cdot (k+1)}}$ is linearly dependent on the first ${k}$ columns of ${A}$. Since the top left ${k\times k}$ submatrix of ${A}$ is invertible, there is only one way to write ${a_{\cdot (k+1)}}$ as a linear combination of the previous columns. More precisely, we have

$\displaystyle a_{\cdot (k+1)} = a_{1:n,1:k}a_{1:k, 1:k}^{-1}a_{k+1,1:k}^\top.$

This means ${a_{\cdot (k+1)}}$ has to be a fixed vector, which happens with probability ${0}$ because the last ${n-k}$ entries of ${a_{\cdot (k+1)}}$ are drawn independently of ${a_{\cdot i}}$, ${i=1,\dots ,k}$. Note this implies that the first ${k+1}$ columns of ${A}$ are linearly independent and that the top left ${(k+1) \times (k+1)}$ submatrix of ${A}$ is invertible, because we can repeat the argument simply by regarding ${A}$ as ${(k+1) \times (k+1)}$ (instead of ${n\times n}$). Thus the induction is complete, and we have proved that ${A}$ indeed has rank ${n}$ with probability ${1}$.
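The symmetric case can be checked numerically in the same spirit (a sketch assuming `numpy`): fill the upper triangle with independent ${N(0,1)}$ entries, mirror it, and verify full rank.

```python
import numpy as np

# Symmetric case: the upper half is drawn independently, the lower half
# is determined by symmetry; the sample should always have full rank.
rng = np.random.default_rng(3)
for _ in range(100):
    n = int(rng.integers(1, 8))
    U = np.triu(rng.standard_normal((n, n)))  # upper triangle incl. diagonal
    A = U + np.triu(U, 1).T                   # symmetrize
    assert np.linalg.matrix_rank(A) == n
```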

3. Rank ${1}$ positive semidefinite matrices are linearly independent

This time, we consider ${\mathbf{a}_i}$, ${i=1,\dots, \frac{n(n+1)}{2}}$, drawn independently from the standard normal distribution in ${\mathbb{R}^n}$. Denote the set of symmetric matrices in ${\mathbb{R}^{n\times n}}$ as ${\mathbb{S}^n}$. We show that with probability ${1}$ the matrices

$\displaystyle A_i = \mathbf{a}_i\mathbf{a}_i^\top, \quad i =1,\dots , \frac{n(n+1)}{2}$

are linearly independent in ${\mathbb{S}^n}$.

Denote the ${i}$-th standard basis vector in ${\mathbb{R}^n}$ as ${e_i}$. First, let us consider the following basis of ${\mathbb{S}^n}$ (the basis property can be verified easily):

$\displaystyle E_{ii} = e_ie_i^\top, \quad i=1,\dots,n,\quad\text{and}\quad E_{ij} = e_ie_i^\top +e_{j}e_j^\top + e_ie_j^\top +e_je_i^\top, \quad 1\leq i<j\leq n.$
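The basis property can also be verified numerically (a sketch assuming `numpy`): vectorize each ${E_{ij}}$, with ${E_{ij}=(e_i+e_j)(e_i+e_j)^\top}$ for ${i<j}$, and check that the resulting ${\frac{n(n+1)}{2}}$ vectors are linearly independent.

```python
import numpy as np
from itertools import combinations_with_replacement

# Check that the E_ij form a basis of S^n for n = 4: stack their
# vectorizations and verify the rank equals n(n+1)/2 = 10.
n = 4
e = np.eye(n)
basis = []
for i, j in combinations_with_replacement(range(n), 2):
    if i == j:
        E = np.outer(e[i], e[i])
    else:
        v = e[i] + e[j]
        E = np.outer(v, v)  # = e_i e_i^T + e_j e_j^T + e_i e_j^T + e_j e_i^T
    basis.append(E.ravel())
M = np.stack(basis)         # shape (n(n+1)/2, n^2)
assert np.linalg.matrix_rank(M) == n * (n + 1) // 2
```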

We order the basis according to the lexicographical order on ${(i,j)}$. Suppose ${\mathbf{a}}$ follows the standard normal distribution in ${\mathbb{R}^n}$; we prove that for any fixed linear subspace ${V\subset \mathbb{S}^{n}}$ with dimension less than ${\frac{n(n+1)}{2}}$, we have

$\displaystyle \mathop{\mathbb P}(\mathbf{a}\mathbf{a} ^\top \in V ) = 0.$

We start by applying a fixed invertible linear map ${\mathcal{F}:\mathbb{S}^n \rightarrow \mathbb{S}^n}$ to ${V}$ so that the basis of ${\mathcal{F}(V)}$ is the first ${\dim(V)}$ basis vectors of ${\{E_{ij}\}_{i\leq j}}$ (in lexicographical order), and the basis of the orthogonal complement of ${V}$ (defined via the trace inner product), ${V^\perp\subset \mathbb{S}^n}$, is mapped to the remaining basis vectors of ${\{E_{ij}\}_{i\leq j}}$ under ${\mathcal{F}}$. We then only need to prove

$\displaystyle \mathop{\mathbb P}(\mathcal{F}(\mathbf{a}\mathbf{a}^\top) \in\mathcal{F}(V)) =0.$

We prove this by contradiction. Suppose ${\mathop{\mathbb P}(\mathcal{F}(\mathbf{a}\mathbf{a}^\top) \in\mathcal{F}(V)) >0}$. This implies that there exist infinitely many ${\mathbf{a}\in \mathbb{R}^n}$ such that ${\mathcal{F}(\mathbf{a}\mathbf{a}^\top) \in\mathcal{F}(V)}$, and moreover that each component of ${\mathbf{a}}$ takes infinitely many values. We show this situation cannot occur. Denote the ${\dim (V)}$-th element of ${\{(i,j)\}_{i\leq j}}$ in lexicographical order as ${(i_0,j_0)}$. Since ${V}$ has dimension less than ${\frac{n(n+1)}{2}}$, ${\mathcal{F}(V^\perp)}$ contains a nonzero element, and so

$\displaystyle \mathcal{F}(\mathbf{a}\mathbf{a}^\top) \in \mathcal{F}(V) \iff [\mathcal{F}(\mathbf{a}\mathbf{a}^\top)]_{ij}=0,\quad \forall (i,j) > (i_0,j_0) \,\text{in lexicographical order.}$

Now fix an ${(i_1,j_1)>(i_0,j_0)}$. The condition ${ [\mathcal{F}(\mathbf{a}\mathbf{a}^\top)]_{i_1j_1}=0}$ means that there is some ${F\in \mathbb{S}^n}$ (depending only on ${V}$ and ${(i_1,j_1)}$) such that

$\displaystyle \mathbf{tr}(F\mathbf{a}\mathbf{a}^\top) = \sum_{1\leq i,j\leq n} F_{ij} a_ia_j =0,$

where ${a_i}$ is the ${i}$-th component of ${\mathbf{a}}$. Now if we fix ${a_i}$, ${i=2,\dots,n}$, and only vary ${a_1}$, then we have

$\displaystyle \sum_{1\leq i,j\leq n} F_{ij} a_ia_j =F_{11}a_1^2 +2 \Big(\sum_{j=2}^{n} F_{1j}a_j\Big)a_1 + \sum_{2\leq i,j\leq n} F_{ij}a_ia_j = 0,$

which, as a quadratic in ${a_1}$, can hold for infinitely many ${a_1 \in \mathbb{R}}$ only if

$\displaystyle F_{11}=0, \quad \sum_{j=2}^{n} F_{1j}a_j=0, \quad \sum_{2\leq i,j\leq n} F_{ij}a_ia_j = 0.$

Thus we can repeat the argument as before and conclude that ${\mathbf{tr}(F\mathbf{a}\mathbf{a}^\top) = \sum_{1\leq i,j\leq n} F_{ij} a_ia_j =0}$ holds for infinitely many ${\mathbf{a}}$, with each component taking infinitely many values, only if ${F=0}$. This contradicts the fact that ${\mathcal{F}}$ is invertible, and hence we must have

$\displaystyle \mathop{\mathbb P}(\mathcal{F}(\mathbf{a}\mathbf{a}^\top) \in\mathcal{F}(V)) =0.$

Now proving that the ${A_i=\mathbf{a}_i \mathbf{a}_i^\top}$ span ${\mathbb{S}^n}$ with probability ${1}$ is easy. For each ${k}$, the already drawn ${A_i}$ with ${i<k}$ can be considered as fixed, because ${A_k}$ is drawn independently of them. But the probability that ${A_k}$ lies in the span of ${A_i}$, ${i<k}$, is ${0}$ by the previous result, and hence ${A_k}$ is linearly independent of all ${A_i}$, ${i<k}$. This argument continues to hold up to ${k= \frac{n(n+1)}{2}}$ because the space spanned by ${A_1,\dots, A_{\frac{n(n+1)}{2}-1}}$ has dimension at most ${\frac{n(n+1)}{2}-1}$, so the previous result applies. This completes the proof.
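The full statement is also easy to test numerically; this sketch (assuming `numpy`) draws ${\frac{n(n+1)}{2}}$ Gaussian vectors for ${n=5}$, vectorizes the rank-one matrices ${\mathbf{a}_i\mathbf{a}_i^\top}$, and checks that they are linearly independent:

```python
import numpy as np

# n(n+1)/2 = 15 independent rank-one matrices a a^T in S^5, which has
# dimension 15: stacking their vectorizations should give full rank.
rng = np.random.default_rng(4)
n = 5
k = n * (n + 1) // 2
vecs = []
for _ in range(k):
    a = rng.standard_normal(n)
    vecs.append(np.outer(a, a).ravel())
M = np.stack(vecs)                      # shape (15, 25)
assert np.linalg.matrix_rank(M) == k    # linearly independent
```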

# Testing homogeneity of data for location-scale family

Suppose you have a sequence of independent data points ${X_1,\dots, X_n\in \mathbb{R}^d}$; how do you test that the ${X_i}$ all come from the same distribution, i.e., how do you test the homogeneity of the data?

To make the problem more precise, suppose we have a distribution family indexed by ${\theta \in \mathbb{R}^s}$, i.e., a set

$\displaystyle \Omega = \{ F_\theta\mid F_\theta \;\text{is a cumulative distribution function,} \;\theta \in \mathbb{R}^s\},$

and each ${X_i}$ follows the distribution ${F_{\theta_i}}$ for some ${\theta_i}$. Our problem is

Is ${F_{\theta_1} =F_{\theta_2}=\dots =F_{\theta_n}}$?

If we have that ${\theta \not =\bar{\theta} \implies F_{\theta}\not= F_{\bar{\theta}}}$ (known as the identifiability of ${\theta}$), then our question becomes

Is ${\theta_1 =\theta_2=\dots =\theta_n}$?

Now suppose further that each ${F_\theta\in \Omega}$ has a density ${f_\theta}$ (so that we can write down the likelihood), the likelihood of seeing the independent sequence ${X_1,\dots, X_n}$ is

$\displaystyle \mathcal{L}_{(X_1,\dots,X_n) }{(\theta_1,\dots,\theta_n)}= \prod_{i=1}^nf_{\theta_i}(X_i).$

To test our question in a statistical way, we use hypothesis testing. Our null hypothesis is

\displaystyle \begin{aligned} H_0:\quad \theta_1=\theta_2=\cdots = \theta_n, \end{aligned} \ \ \ \ \ (1)

and our alternative hypothesis is

\displaystyle \begin{aligned} H_1:\quad \theta_i\not= \theta_j \quad \text{for some}\; i\not=j. \end{aligned} \ \ \ \ \ (2)

Further denote the space of the null as ${\Theta_0=\{ (\theta_1,\dots,\theta_n)\mid \theta_1=\dots =\theta_n\}}$ and the space of the alternative as ${\Theta_1=\mathbb{R}^{sn} \setminus \Theta_0}$. A popular and natural approach is the likelihood ratio test. We construct the test statistic, called the likelihood ratio, as

\displaystyle \begin{aligned} R = \frac{\sup_{(\theta_1,\dots, \theta_n)\in \Theta_0}\mathcal{L}_{(X_1,\dots, X_n)}(\theta_1,\dots, \theta_n)}{\sup_{(\theta_1,\dots, \theta_n)\in (\Theta_0\cup \Theta_1)}\mathcal{L}_{(X_1,\dots, X_n)}(\theta_1,\dots, \theta_n)}. \end{aligned} \ \ \ \ \ (3)

Intuitively, if our null hypothesis is indeed true, i.e., there is some ${\theta^*}$ such that ${\theta_1=\dots =\theta_n=\theta^*}$ and each ${X_i}$ follows ${ f_{\theta^*}}$, then this ratio should be large, giving us confidence that the null hypothesis is true. This means we should reject the null hypothesis if we find ${R}$ is small. Thus, if we want a significance level ${\alpha}$ test of our null hypothesis, we should reject the null hypothesis when ${R\leq c}$, where ${c}$ satisfies

\displaystyle \begin{aligned} \mathbb{P}(R\leq c\mid H_0) = \alpha. \end{aligned} \ \ \ \ \ (4)

However, the main issue is that we don't know the distribution of ${R}$ under ${H_0}$, even if we know how to sample from each ${f_\theta}$ and know the functional form of ${f_\theta}$ for each ${\theta}$. The reason is that ${H_0}$ does not specify which ${\theta^*}$ (where ${\theta^*=\theta_1=\dots =\theta_n}$) generates the data. So the distribution of ${R}$ may depend on ${\theta^*}$ as well, and what we really need for ${c}$ is

\displaystyle \begin{aligned} \sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}_{\theta^*}(R\leq c) \leq \alpha, \end{aligned} \ \ \ \ \ (5)

where ${\mathbb{P}_{\theta^*}}$ denotes the probability when the data are generated under ${\theta^*}$. Thus even if we want to approximate ${\sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}_{\theta^*}(R\leq c)}$ through computational methods, we would have to simulate for each ${\theta^* \in \mathbb{R}^s}$. As ${\Theta_0}$ could be rather large (in fact, as large as ${\mathbb{R}^s}$), the approximation can be time consuming as well.

Fortunately, if ${\Omega}$ is a so-called location-scale family, the distribution of ${R}$ turns out to be independent of ${\theta^*}$, and we are free to choose whichever ${\theta^*}$ we like. Let us first define the location-scale family, then state the theorem and prove it.

Definition 1 Suppose we have a family of probability densities ${\Omega}$ on ${\mathbb{R}^d}$ indexed by ${\theta =(\mu,\Sigma)}$, where ${\mu \in \mathbb{R}^d}$ and ${\Sigma\in GL(\mathbb{R},d)}$, the set of invertible matrices in ${\mathbb{R}^{d \times d}}$. The family ${\Omega}$ is a location-scale family if there is a family member ${f}$ (called the pivot) such that every other ${f_\theta}$ with ${\theta =(\mu,\Sigma)}$ satisfies

$\displaystyle f_\theta (x) = f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}.$

Thus if ${Z\in \mathbb{R}^d}$ follows ${f}$, then ${X=\Sigma Z+\mu}$ has probability density ${f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}}$. Indeed, for any Borel set ${B\subset \mathbb{R}^d }$

\displaystyle \begin{aligned} \mathop{\mathbb P}(X\in B ) &=\mathop{\mathbb P}(\Sigma Z+\mu\in B)\\ & = \mathop{\mathbb P}(Z\in \Sigma ^{-1}(B-\mu)) \\ & = \int_{ \Sigma ^{-1}(B-\mu )} f(z) dz \\ & =\int_B f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}dx \end{aligned} \ \ \ \ \ (6)

where we use the change of variable ${x=\Sigma z+\mu}$ in the last equality, which shows that ${X}$ follows ${f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}}$. We are now ready to state the theorem and prove it.

Theorem 2 Suppose our family of distributions ${\Omega}$ is a location-scale family. Then, under the null hypothesis, there is a ${\theta^*= (\mu^*,\Sigma^*)}$ such that each ${X_i}$ follows ${f_{\theta^*}(x) = \frac{1}{|\det(\Sigma^*)|}f((\Sigma^*)^{-1}(x-\mu^*))}$, and the distribution of ${R}$ is independent of ${\theta^*}$.

Since the distribution of ${R}$ is independent of ${\theta^*}$ under the null, for any fixed ${\theta =(\mu,\Sigma)\in \mathbb{R}^d\times GL(\mathbb{R},d)}$ and any ${c}$,

\displaystyle \begin{aligned} \sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}_{\theta^*}(R\leq c) = \mathbb{P}_{\theta}(R\leq c). \end{aligned} \ \ \ \ \ (7)

Thus we can choose any family member of ${\Omega}$ to sample the ${X_i}$ and approximate the distribution of ${R}$ by the empirical distribution, as long as ${\Omega}$ is a location-scale family!

Proof: We need to show that the ratio ${R}$ has a distribution independent of ${\theta^*}$. Since ${X_i \sim \frac{1}{|\det(\Sigma^*)|}f((\Sigma^*)^{-1}(x-\mu^*))}$ and ${\Omega}$ is a location-scale family, we can assume the data are generated via ${Z_i}$, where ${Z_i}$ follows the pivot ${f}$ and ${X_i = \Sigma^*Z_i+\mu^*}$. Then the likelihood of ${\theta_1,\dots, \theta_n}$ is

\displaystyle \begin{aligned} \mathcal{L}_{(X_1,\dots,X_n) }(\theta_1,\dots,\theta_n) &=\prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}(X_i-\mu_i))\\ & = \prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}(\Sigma^*Z_i+\mu^*-\mu_i))\\ &= \prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*)))\\ & = \bigl(|\det((\Sigma^*)^{-1})|\bigr)^n\prod_{i=1}^n\frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*))). \end{aligned} \ \ \ \ \ (8)

Thus the likelihood ratio ${R}$ reduces to

\displaystyle \begin{aligned} R &= \frac{\sup_{(\mu,\Sigma)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{|\det(\Sigma(\Sigma^*)^{-1})|} f(\Sigma^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu-\mu^*)))}{\sup_{(\mu_i,\Sigma_i)\in \mathbb{R}^d\times GL(\mathbb{R},d),\,i=1,\dots,n} \prod_{i=1}^n\frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*)))}\\ &=\frac{\sup_{(\mu,\Sigma)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{|\det(\Sigma(\Sigma^*)^{-1})|} f(\Sigma^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu-\mu^*))) }{\prod_{i=1}^n\sup_{(\mu_i,\Sigma_i)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*)))}. \end{aligned} \ \ \ \ \ (9)

Now let's define ${\hat{\Sigma} = (\Sigma^*)^{-1}\Sigma}$, ${\hat{\mu} = (\Sigma^*)^{-1}(\mu-\mu^*) }$, ${\hat{\mu}_i = (\Sigma^*)^{-1}(\mu_i -\mu^*) }$ and ${\hat{\Sigma}_i= (\Sigma^*)^{-1}\Sigma_i }$. Note that since ${\mu,\Sigma}$ and ${\mu_i,\Sigma_i}$ can vary over all of ${\mathbb{R}^d\times GL(\mathbb{R},d)}$, so can ${\hat{\mu},\hat{\Sigma}}$, ${\hat{\mu}_i}$ and ${\hat{\Sigma}_i}$. The equality (9) can then be rewritten as

\displaystyle \begin{aligned} R &= \frac{\sup_{(\hat{\mu},\hat{\Sigma})\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{|\det(\hat{\Sigma})|} f((\hat{\Sigma})^{-1}(Z_i-\hat{\mu})) }{\prod_{i=1}^n\sup_{(\hat{\mu}_i,\hat{\Sigma}_i)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \frac{1}{|\det(\hat{\Sigma}_i)|} f((\hat{\Sigma}_i)^{-1}(Z_i -\hat{\mu}_i))}. \end{aligned} \ \ \ \ \ (10)

As we just argued, ${\hat{\mu},\hat{\Sigma}}$, ${\hat{\mu}_i}$ and ${\hat{\Sigma}_i}$ can vary over the whole space without any restriction, so the suprema in the numerator and denominator do not depend on the choice of ${\mu^*}$ and ${\Sigma^*}$ at all: the distribution of ${R}$ is determined by the ${Z_i}$ alone. So our theorem is proved. $\Box$
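The location-scale change of variables in Definition 1 can be illustrated with a quick Monte Carlo experiment (a sketch assuming `numpy`; the one-dimensional Gaussian pivot is chosen purely for illustration): if ${Z}$ follows the standard normal pivot and ${X=\sigma Z+\mu}$, the empirical law of ${X}$ should match the transformed density, i.e., ${N(\mu,\sigma^2)}$.

```python
import numpy as np
from math import erf, sqrt

# If Z ~ f = N(0,1) and X = sigma * Z + mu, then X has density
# f((x - mu)/sigma) / sigma, i.e. X ~ N(mu, sigma^2). Compare the
# empirical CDF of X at a point with the transformed-density CDF.
rng = np.random.default_rng(5)
mu, sigma = 2.0, 3.0
Z = rng.standard_normal(200_000)
X = sigma * Z + mu

t = 4.0
empirical = np.mean(X <= t)
theoretical = 0.5 * (1 + erf((t - mu) / (sigma * sqrt(2))))  # N(mu, sigma^2) CDF
assert abs(empirical - theoretical) < 0.01
```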

# Chain rule of convex function

This post studies a specific chain rule of composition of convex functions. Specifically, we have the following theorem.

Theorem 1 For a continuously differentiable, increasing, convex function ${\phi: \mathbb{R} \rightarrow \mathbb{R}}$, a convex function ${h: U \rightarrow \mathbb{R}}$ where ${U\subset \mathbb{R}^n}$ is a convex set, and an ${x\in U}$, if ${\phi'(h(x))>0}$ or ${x\in \mbox{int}(U)}$, then

\displaystyle \begin{aligned} \partial (\phi \circ h) (x) = \phi' (h(x)) \cdot[ \partial h (x)], \end{aligned} \ \ \ \ \ (1)

where ${\partial }$ is the operator of taking subdifferentials of a function, i.e., ${\partial h (x) = \{ g\mid h(y)\geq h(x) +\langle{g},{y-x}\rangle,\forall y\in U\}}$ for any ${x\in U}$, and ${\mbox{int}(U)}$ is the interior of ${U}$ with respect to the standard topology in ${\mathbb{R}^n}$.

A negative example. We note that if our condition fails, the equality may not hold. For example, let ${\phi(x) =1}$ for all ${x\in \mathbb{R}}$ and ${h(x) = 1 }$ on ${[0,1]}$. Then ${0}$ is not in the interior of ${[0,1]}$ and ${\phi'(h(0)) = 0}$, while ${\partial h(0) =(-\infty,0]}$. However, in this case ${\partial (\phi\circ h)(0)= (-\infty,0]}$ and ${\phi' (h(0)) \cdot[ \partial h (0)] =\{0\}}$. Thus, the equality fails.

It should be noted that if ${U}$ is open and ${h}$ is also differentiable, then the above reduces to the common chain rule of smooth functions.
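For a concrete numerical check of Theorem 1 (a sketch assuming `numpy`; the specific pair of functions is chosen for illustration), take ${h(x)=|x|}$ on ${\mathbb{R}}$ and ${\phi=\exp}$, which is continuously differentiable, increasing and convex. Then ${\partial h(0)=[-1,1]}$ and ${\phi'(h(0))=e^0=1}$, so the chain rule predicts ${\partial(\phi\circ h)(0)=[-1,1]}$; the code tests membership via the subgradient inequality on a grid:

```python
import numpy as np

# h(x) = |x|, phi = exp: partial h(0) = [-1, 1] and phi'(h(0)) = 1,
# so partial (phi o h)(0) should again be [-1, 1].
ys = np.linspace(-2, 2, 4001)

def is_subgradient(g):
    # g in partial (phi o h)(0)  iff  exp(|y|) >= exp(0) + g * y for all y
    return bool(np.all(np.exp(np.abs(ys)) >= 1 + g * ys - 1e-12))

assert all(is_subgradient(g) for g in [-1.0, -0.5, 0.0, 0.5, 1.0])
assert not is_subgradient(1.1) and not is_subgradient(-1.1)
```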

Proof: We first prove that ${\partial (\phi \circ h) (x) \supset \phi' (h(x)) \cdot[ \partial h (x)]}$. For all ${y\in U}$ and ${g \in \partial h(x)}$, we have

\displaystyle \begin{aligned} \phi (h(y)) &\overset{(a)}{\geq} \phi(h(x)) + \phi' (h(x))(h(y)-h(x))\\ & \overset{(b)}{\geq} \phi (h(x)) + \phi '(h(x)) \langle g,y-x\rangle \\ & = \phi (h(x)) + \langle{\phi' (h(x))g},{y-x}\rangle \end{aligned} \ \ \ \ \ (2)

where ${(a)}$ and ${(b)}$ follow from the subgradient inequality for ${\phi}$ at ${h(x)}$ and for ${h}$ at ${x}$, respectively. We also use the fact that ${\phi'(h(x))\geq 0}$ in the inequality ${(b)}$, as ${\phi}$ is increasing.

Now we prove the other direction. Without loss of generality, suppose ${0\in U}$ and that ${\partial (\phi \circ h)(0)}$ is not empty. Let ${g\in \partial (\phi \circ h)(0)}$; we wish to show that ${g}$ is in the set ${ \phi' (h(0)) \cdot[ \partial h (0)]}$. First, according to the definition of the subdifferential, we have

\displaystyle \begin{aligned} (\phi \circ h) (x)\geq (\phi\circ h)(0) + \langle{ g},x\rangle, \forall x \in U \end{aligned} \ \ \ \ \ (3)

This gives

\displaystyle \begin{aligned} (\phi \circ h) (\gamma x)\geq (\phi\circ h)(0) + \langle{ g} ,{\gamma x}\rangle, \forall x \in U, \gamma \in [0,1]. \end{aligned} \ \ \ \ \ (4)

Rearranging the above inequality gives

\displaystyle \begin{aligned} &\frac{(\phi \circ h) (\gamma x)- (\phi\circ h)(0)}{\gamma}\geq \langle{ g},{ x}\rangle \\ \implies & \phi'(s)\cdot \frac{h(\gamma x)-h(0)}{\gamma}\geq \langle{g},{x}\rangle \end{aligned} \ \ \ \ \ (5)

for some ${s}$ between ${h(\gamma x)}$ and ${h(0)}$, by the mean value theorem. Now, letting ${f(\gamma )=h(\gamma x)}$: if ${f}$ is right continuous at ${0}$, then by Lemma 1 in this post, ${l(\gamma) =\frac{h(\gamma x)-h(0)}{\gamma}}$ is nondecreasing in ${\gamma}$, and since ${\phi}$ is continuously differentiable, ${\phi'(s)\rightarrow \phi'(h(0))}$ as ${\gamma \downarrow 0}$, so we have

\displaystyle \begin{aligned} \phi'(h(0)) (h(x)-h(0))\geq \phi'(h(0))\cdot \lim_{\gamma \downarrow 0} \frac{h(\gamma x)-h(0)}{\gamma}\geq \langle{g},{x}\rangle,\forall x\in U. \end{aligned} \ \ \ \ \ (6)

If ${\phi' (h(0))>0}$, then dividing both sides of the above inequality by ${\phi'(h(0))}$ gives

\displaystyle \begin{aligned} h(x)-h(0)\geq \langle{\frac{g}{\phi'(h(0))}},{x}\rangle,\forall x\in U. \end{aligned} \ \ \ \ \ (7)

This shows that ${\frac{g}{\phi'(h(0))}}$ is indeed a member of ${\partial h(0)}$, and thus ${\partial (\phi \circ h) (0) \subset \phi' (h(0)) \cdot[ \partial h (0)]}$. In this case, it only remains to verify why ${f(\gamma ) = h(\gamma x) }$ must be right continuous at ${0}$.

If ${0\in \mbox{int}(U)}$, then ${h}$ is continuous at ${0}$ by a standard result in convex analysis, and so is ${f}$. If ${\phi' (h(0))>0}$, then we are done by inequality (7). If ${\phi' (h(0))=0}$, using the inequality (6), we have

\displaystyle \begin{aligned} 0 \geq \langle{g},{x}\rangle, \forall x\in U \end{aligned} \ \ \ \ \ (8)

Since ${0 \in \mbox{int}(U)}$, ${x}$ can take small positive and negative multiples of the ${n}$ standard basis vectors of ${\mathbb{R}^n}$ in the inequality (8). This shows ${g =0}$, and ${g}$ indeed belongs to the set ${\phi' (h(0)) \cdot[ \partial h (0)] = \{0\}}$, as ${\partial h(0)\not=\emptyset}$ for ${0\in \mbox{int}(U)}$ by a standard convex analysis result.

Thus our task now is to argue why ${f(\gamma ) = h(\gamma x)}$ is indeed right continuous at ${0}$ in the remaining case ${\phi'(h(0))>0}$. Using Lemma 4 in this post, we know the limit ${\lim_{\gamma \downarrow 0 }f(\gamma) = f(0^+)}$ exists and ${f(0^+)\leq f(0)}$. Now if ${f(0^+) = f(0) = h(0)}$, then ${f}$ is indeed right continuous at ${0}$ and our conclusion holds. So we may assume ${f(0^+) < f(0)}$. But in this case ${l(\gamma) = \frac{h(\gamma x)-h(0)}{\gamma} = \frac{f(\gamma)-f(0)}{\gamma}}$ tends to negative infinity as ${\gamma \downarrow 0}$. Recall from inequality (5) that we have

$\displaystyle \phi'(s)l(\gamma )\geq \langle{g},{x}\rangle$

where ${s}$ is between ${h(0)}$ and ${f(\gamma )=h(\gamma x)}$. We claim that as ${\gamma\downarrow 0}$, ${\phi'(s)}$ approaches a positive number. If this claim is true, then from the above inequality, we will have

$\displaystyle -\infty \geq \langle{g},{x}\rangle$

which cannot hold. Thus we must have ${f}$ right continuous at ${0}$.

Finally, we prove our claim that ${\phi'(s)}$ approaches a positive number when ${f(0^+) < f(0)}$. Using the mean value theorem, we have for some ${s_0 \in [ f(0^+), f(0)]}$

\displaystyle \begin{aligned} \phi'(s_0)(f(0^+)-f(0)) & = \phi(f(0^+)) -\phi(f(0))\\ & = \lim_{ \gamma \downarrow 0 } \phi(f(\gamma))-\phi(f(0))\\ & = \lim_{\gamma \downarrow 0}\phi'(s)(f(\gamma)-f(0))\\ & = (f(0^+)-f(0))\lim_{\gamma\downarrow 0}\phi'(s). \end{aligned} \ \ \ \ \ (9)

Now canceling the term ${f(0^+) -f(0)<0}$ above, we see that ${\phi'(s_0) = \lim_{\gamma\downarrow 0}\phi'(s)}$. We claim ${\phi'(s_0)>0}$. Indeed, if ${\phi'(s_0)=0}$, then because ${\phi}$ is increasing, ${\phi}$ is constant on ${[f(0^+),f(0)]}$, as ${\phi(f(0^+)) -\phi(f(0)) = \phi'(s_0)(f(0^+)-f(0)) =0}$. This in turn gives ${\phi'(f(0))=0}$, contradicting our assumption that ${\phi'(h(0))>0}$. Our proof is complete. $\Box$

# LP Conditioning

We consider the global conditioning properties of linear programs (LP) here. Let's first define the problem. The standard form of an LP is

\displaystyle \begin{aligned} & \underset{x \in \mathbb{R}^n}{\text{minimize}} & & c^Tx\\ & {\text{subject to}} & & Ax =b ,\\ & & & x\geq 0 \end{aligned} \ \ \ \ \ (1)

where ${A \in \mathbb{R}^{m\times n}, c\in \mathbb{R}^n}$ and ${b \in\mathbb{R}^m}$. The dual of the standard form is

\displaystyle \begin{aligned} & \underset{y \in \mathbb{R}^m}{\text{maximize}} & & b^Ty\\ & {\text{subject to}} & & A^Ty\leq c. \\ \end{aligned} \ \ \ \ \ (2)

Now suppose we add perturbation ${u\in \mathbb{R}^{n}}$ to Program (2) in the ${c}$ term, we have

\displaystyle \begin{aligned} & \underset{y \in \mathbb{R}^m}{\text{maximize}} & & b^Ty\\ & {\text{subject to}} & & A^Ty\leq c+u.\\ \end{aligned} \ \ \ \ \ (3)

Let ${v(u)}$ be the optimal value of Program (3) and ${y(u)}$ be an optimal solution to the same program. We would like to ask the following question:

\displaystyle \begin{aligned} \textbf{Q}&: \text{Is}\; v(u)\; \text{Lipschitz continuous?} \;\text{Is} \;y(u)\;\text{Lipschitz}? \end{aligned} \ \ \ \ \ (4)

Such questions concern the conditioning of Program (2): under some perturbation, how do the solution and the optimal value change accordingly? In general, we hope that Programs (1) and (2) are stable in the sense that a small perturbation in the data, i.e., ${A,b,c}$, only induces small changes in the optimal value and solution. The reason is that in real applications, there might be some error in the encoding of ${b,c}$ or even ${A}$. If Programs (1) and (2) are not stable with respect to the perturbation, then possibly either more cost needs to be paid for encoding the data, or the LP is not worth solving. At first sight, it is actually not even clear

1. whether ${v(u)}$ is always finite;
2. whether ${y(u)}$ is actually unique. If ${y(u)}$ is not unique, then the Lipschitz property of ${y(u)}$ does not even make sense.

To address the first issue, we make our first assumption that

A1: Program (2) has a solution and the optimal value is finite.

Under A1, by strong duality of LP, we have that Program (1) also has a solution and the optimal value is the same as the optimal value of Program (2). This, in particular, implies that (1) is feasible. We now characterize the region where ${v(u)}$ is finite. Let the feasible region of Program (1) be denoted as

$\displaystyle \mathcal{F}_p :\,= \{x\mid Ax = b,x\geq 0\},$

the set of vertices of ${\mathcal{F}_p}$ be ${\mathcal{V}_p=\{x_i\}_{i=1}^d}$ and the set of extreme rays be ${\mathcal{R}_p=\{\gamma_j\}_{j=1}^l}$. Using our assumption A1, we know

$\displaystyle \mathcal{F}_p \not = \emptyset.$

Now according to the Resolution Theorem of the primal polyhedron, we have

$\displaystyle \mathcal{F}_p = \{ x\mid x = \sum_{i=1}^d \alpha _ix_i + \sum_{j=1}^l\beta_j \gamma_j, x_i \in \mathcal{V}_p, \gamma _j \in \mathcal{R}_p,\sum_{i=1}^d \alpha_i =1, \alpha_i\geq 0, \beta_j \geq 0, \forall i,j\}.$

Using this representation, the fact that the primal of Program (3) is

\displaystyle \begin{aligned} & \underset{x \in \mathbb{R}^n}{\text{minimize}} & & (c+u)^Tx\\ & {\text{subject to}} & & Ax =b \\ & & & x\geq 0, \end{aligned} \ \ \ \ \ (5)

and the fact that strong duality continues to hold between (3) and (5) (as (5) is feasible by A1), the value of ${v(u)}$ is

$\displaystyle v(u) = \min \{ (c+u)^Tx_i\mid x_ i \in \mathcal{V}_p\} + \min\{\beta_j(c+u)^T\gamma_j \mid \gamma_j \in \mathcal{R}_p,\beta_j\geq 0\}.$

Thus it is immediate that the region where ${v(u)}$ is finite is

\displaystyle \begin{aligned} \mathcal{F}_u = \{ u\mid \gamma_j ^T(c+u)\geq 0, \forall \gamma_j \in \mathcal{R}_p\}. \end{aligned} \ \ \ \ \ (6)

In particular, if ${\mathcal{F}_p}$ is bounded, then there are no extreme rays and ${v(u)}$ is finite for every ${u}$.
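To make this boundedness remark concrete, here is a minimal numerical sketch (a hypothetical toy instance, solved with `scipy.optimize.linprog`): the feasible region ${\{x\geq 0\mid x_1+x_2=1\}}$ is a bounded segment, so ${v(u)}$ remains finite for every perturbation ${u}$.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy instance: minimize (c+u)^T x s.t. Ax = b, x >= 0.
# The feasible region {x >= 0 : x1 + x2 = 1} is a bounded segment
# (no extreme rays), so v(u) should be finite for every perturbation u.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])

rng = np.random.default_rng(0)
for _ in range(5):
    u = rng.normal(size=2)  # arbitrary perturbation of the cost vector
    res = linprog(c + u, A_eq=A, b_eq=b, bounds=[(0, None)] * 2)
    assert res.status == 0          # solved to optimality: v(u) is finite
    # The vertices are e1 and e2, so v(u) = min(c1+u1, c2+u2).
    assert np.isclose(res.fun, np.min(c + u))
```

Since the two vertices are the standard basis vectors, the vertex formula ${v(u)=\min_i\,(c+u)^Tx_i}$ can be checked directly against the solver's value.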

To address the issue of whether ${y(u)}$ is unique, we make the following nondegeneracy assumption on vertices.

A2: All basic feasible solutions of (1) are nondegenerate, i.e., every vertex of the feasible region of Program (1) has exactly ${m}$ positive entries.

Under this assumption and the definition of basic feasible solution, if ${x}$ is a basic feasible solution, then ${A_{\cdot V(x)}\in \mathbb{R}^{m\times m}}$, where ${V(x)=\{i\mid x_i \text{ is nonzero}\}}$, has linearly independent columns. Here ${A_{\cdot V(x)}}$ is the submatrix of ${A}$ consisting of the columns with indices in ${V(x)}$. By complementary slackness for LPs (3) and (5), if ${x}$ is a basic feasible solution that solves Program (5), then

$\displaystyle y(u) = (A_{\cdot V(x)})^{-T}(c+u)_{V(x)}$

is the unique solution to Program (3), where ${(A_{\cdot V(x)})^{-T} =((A_{\cdot V(x)})^{T})^{-1} }$. In summary, under A1 and A2,

$\displaystyle v(u) = \min \{ c^Tx_i +u^Tx_i\mid x_ i \in \mathcal{V}_p\}, \quad u \in \mathcal{F}_u$

and

$\displaystyle y(u) = (A_{\cdot V(x)})^{-T}(c+u)_{V(x)}$

where ${x = \arg\min_{x_i \in \mathcal{V}_p}\{ c^Tx_i +u^Tx_i\}}$.
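The basis computation above can be sketched in a few lines of numpy, on a hypothetical instance (note that ${c+u}$ is restricted to the support columns ${V(x)}$ so that the ${m\times m}$ linear system is well posed):

```python
import numpy as np

# Hypothetical instance: minimize (c+u)^T x s.t. Ax = b, x >= 0 with
# A = [1 1], b = 1. The vertices are e1 and e2; each has exactly m = 1
# positive entry, so the nondegeneracy assumption A2 holds.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])
u = np.array([0.1, 0.0])

# Pick the optimal vertex by enumeration: v(u) = min_i (c+u)^T x_i.
vertices = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
vals = [(c + u) @ x for x in vertices]
x = vertices[int(np.argmin(vals))]

# Basis formula: y(u) = (A_{.V(x)})^{-T} (c+u)_{V(x)}.
V = np.flatnonzero(x)                       # support of the optimal vertex
y = np.linalg.solve(A[:, V].T, (c + u)[V])

# Sanity checks: dual feasibility A^T y <= c+u and strong duality b^T y = v(u).
assert np.all(A.T @ y <= c + u + 1e-12)
assert np.isclose(b @ y, min(vals))
print("y(u) =", y)  # -> [1.1]
```

Here complementary slackness pins ${y(u)}$ down uniquely because the basis matrix ${A_{\cdot V(x)}}$ is square and invertible under A2.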

By defining

$\displaystyle \Omega_i = \{ u \mid c^Tx_i +u^Tx_i \leq c^Tx_j +u^Tx_j,\forall j\not =i \}, \forall\, 1\leq i\leq d$

which is a polyhedron, we may write the above compactly as

$\displaystyle v(u) = c^Tx_i +u^Tx_i, \quad y(u) = (A_{\cdot V(x_i)})^{-T}(c+u)_{V(x_i)}, \quad u \in \Omega_i.$

It is then easy to see that ${v(u)}$ is Lipschitz continuous:

$\displaystyle |v(u_1)-v(u_2)| \leq \max\{\|x_i\|_*\mid x_i \in \mathcal{V}_p\} \|u_1-u_2\|$

where ${\|\cdot\|_*}$ is the dual norm of ${\|\cdot\|}$; the norm ${\|\cdot \|}$ here is arbitrary.

Similarly, by using the theorem in this post, we see that ${y(u)}$ is also Lipschitz continuous:

$\displaystyle \| y(u_1) - y(u_2)\|_\alpha \leq\max \{ \|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta }, x \in \mathcal{V}_p\}\|u_1-u_2\|_{\beta }$

where ${\|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta } = \max_{\|x\|_{\beta }=1}\|(A_{\cdot V(x)})^{-T}x\|_\alpha}$ and ${\|\cdot\|_\alpha, \|\cdot\|_\beta}$ are arbitrary norms. We summarize the above discussion in the following theorem.

Theorem 1 Recall that ${\mathcal{V}_p}$ is the set of extreme points of the feasible region of (1), ${\mathcal{R}_p}$ is the set of extreme rays of the feasible region of (1), and ${\mathcal{F}_u = \{ u\mid \gamma_j ^Tu\geq 0, \forall \gamma_j \in \mathcal{R}_p\}}$. Under assumptions A1 and A2, we have for all ${u_1,u_2\in \mathcal{F}_u}$,

$\displaystyle |v(u_1)-v(u_2)| \leq \max\{\|x_i\|_*\mid x_i \in \mathcal{V}_p\} \|u_1-u_2\|$

and

$\displaystyle \| y(u_1) - y(u_2)\|_\alpha \leq\max \{ \|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta }, x \in \mathcal{V}_p\}\|u_1-u_2\|_\beta$

where ${\|\cdot\|,\|\cdot\|_{\alpha},\|\cdot\|_\beta }$ are arbitrary norms. Here ${A_{\cdot V(x)}}$ is the submatrix of ${A}$ consisting of the columns with indices in ${V(x)=\{i\mid x_i \text{ is nonzero}\}}$.

How might one bound the Lipschitz constants

$\displaystyle \max\{\|x_i\|_*\mid x_i \in \mathcal{V}_p\} \text{ and } \max \{ \|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta }\mid x \in \mathcal{V}_p\}?$

The former might be done by bounding the diameter of the feasible region ${\mathcal{F}_p}$ of Program (1), and the latter through some restricted isometry property of ${A}$.
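Before moving on, here is a small randomized check of the value-function bound in Theorem 1 (a hypothetical instance with vertices ${e_1,e_2}$; we pair ${\|\cdot\|=\|\cdot\|_\infty}$ with its dual ${\|\cdot\|_*=\|\cdot\|_1}$):

```python
import numpy as np

# Empirical check of |v(u1)-v(u2)| <= max_i ||x_i||_1 * ||u1-u2||_inf,
# where ||.||_1 is the dual norm of ||.||_inf. Hypothetical bounded instance:
# F_p = {x >= 0 : x1 + x2 = 1} has vertices e1, e2 and no extreme rays,
# so F_u is all of R^2 and v(u) = min_i (c+u)^T x_i.
vertices = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
c = np.array([1.0, 2.0])

def v(u):
    return min((c + u) @ x for x in vertices)

L = max(np.linalg.norm(x, 1) for x in vertices)   # Lipschitz constant = 1

rng = np.random.default_rng(1)
for _ in range(1000):
    u1, u2 = rng.normal(size=2), rng.normal(size=2)
    lhs = abs(v(u1) - v(u2))
    rhs = L * np.linalg.norm(u1 - u2, np.inf)
    assert lhs <= rhs + 1e-12       # the bound holds on every sampled pair
print("Lipschitz bound verified on 1000 random pairs")
```

The check works because ${v}$ is a minimum of affine functions, each of which is ${\|x_i\|_1}$-Lipschitz with respect to ${\|\cdot\|_\infty}$.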

# Lipschitz constant of piecewise linear (vector valued) functions

Suppose we have a collection of sets ${\{\Omega_i \subset \mathbb{R}^{n}\}_{i=1}^d}$ and a set of ${d}$ linear functions ${\{f_i(x) = A_ix +c_i\mid x\in \Omega_i, i=1,\dots, d\}}$, where each ${\Omega_i}$ is a closed polyhedron in ${\mathbb{R}^{n}}$, i.e., ${\Omega_i = \{x\mid a_{i_j}^Tx + b_{i_j} \leq 0,\forall j\leq k(i)\}}$ for some ${a_{i_j} \in \mathbb{R}^{n},b_{i_j}\in \mathbb{R},k(i)\in \mathbb{N}}$, and ${A_i\in \mathbb{R}^{m\times n}}$, ${c_i \in \mathbb{R}^{m}}$. Moreover, for any ${i\not =j}$ and ${x\in \Omega_i \cap \Omega_j}$, we have ${f_i(x) = f_j(x)}$. Then we can define a function ${f}$ on ${\Omega =\cup_{i=1}^d \Omega_i}$ such that

$\displaystyle f(x) = f_i(x) \quad \text{if} \; x\in \Omega_i.$

We further assume $\Omega$ is convex. We now state a theorem concerning the Lipschitz constant of ${f}$.

Theorem 1 Under the above setting, given arbitrary norms ${\|\cdot\|_\alpha,\|\cdot\|_\beta }$, ${f}$ is ${\max_{i\leq d}\{\|A_i\|_{\alpha,\beta}\}}$-Lipschitz continuous with respect to ${\|\cdot\|_\alpha}$ and ${\|\cdot \|_\beta }$, i.e., for all ${x_1,x_2 \in \Omega}$,

\displaystyle \begin{aligned} \| f(x_2) - f(x_1)\|_\alpha \leq \max_{i\leq d}\{\|A_i\|_{\alpha,\beta }\}\|x_2-x_1\|_\beta . \end{aligned} \ \ \ \ \ (1)

Here ${\|A_i\|_{\alpha,\beta}}$ is the operator norm ${\|A_i\|_{\alpha,\beta } = \max_{\|x\|_\beta \leq 1}\|A_ix\|_{\alpha}}.$

Proof: Let ${g(t) = x_1 + t(x_2 - x_1)}$, ${L = \max_{i\leq d}\{\|A_i\|_{\alpha,\beta }\}\|x_2-x_1\|_\beta }$, and ${h= f\circ g}$. We see that inequality (1) is equivalent to

\displaystyle \begin{aligned} \|h(1)- h(0)\|_\alpha \leq L. \end{aligned} \ \ \ \ \ (2)

Using the property that ${f}$ is well-defined on ${\Omega_i\cap \Omega_j}$ for any ${i,j}$, there exist ${0=t_0\leq t_1 \leq \dots \leq t_K\leq t_{K+1}=1}$ and corresponding indices ${i_0,i_1,\dots ,i_K}$, with ${g(t)\in \Omega_{i_l}}$ for ${t\in[t_l,t_{l+1}]}$, such that

\displaystyle \begin{aligned} h(t_l)= A_{i_l}x^{(l)}+c_{i_l} = A_{i_{l-1}}x^{(l)}+c_{i_{l-1}} \end{aligned} \ \ \ \ \ (3)

for all ${1\leq l\leq K}$, where ${x^{(l)} = g(t_l)}$. We also let ${x^{(0)} =g(t_0) =x_1}$ and ${x^{(K+1)}=g(t_{{K+1}}) =x_2}$. These ${x^{(l)}}$'s have the property that

\displaystyle \begin{aligned} \sum_{l=0}^{K}\|x^{(l+1)} -x^{(l)}\|_\beta = \sum_{l=0}^{K}(t_{l+1}-t_l)\|x_2 -x_1\|_\beta = \|x_2-x_1\|_\beta . \end{aligned} \ \ \ \ \ (4)

Thus we can bound the term ${\|h(1)- h(0)\|_\alpha }$ by

\displaystyle \begin{aligned} \|h(1) - h(0)\|_\alpha &\leq \sum_{l=0}^{K} \|h(t_{l+1}) -h(t_{l})\|_\alpha \\ &\overset{(i)}{=} \sum_{l=0}^{K} \|A_{i_{l}}x^{(l+1)} +c_{i_{l}} - A_{i_l}x^{(l)} -c_{i_l}\|_\alpha \\ & = \sum_{l=0}^{K} \|A_{i_{l}}(x^{(l+1)} - x^{(l)}) \|_\alpha \\ &\overset{(ii)}{\leq}\sum_{l=0}^{K}\|A_{i_l}\|_{\alpha,\beta } \|x^{(l+1)}-x^{(l)}\|_\beta \\ & \leq \max_{i\leq d}\{\|A_i\|_{\alpha,\beta }\}\sum_{l=0}^{K} \|x^{(l+1)}-x^{(l)}\|_\beta \\ & \overset{(iii)}{=}\max_{i\leq d}\{\|A_i\|_{\alpha,\beta }\} \|x_2-x_1\|_\beta \\ & = L, \end{aligned} \ \ \ \ \ (5)

where ${(i)}$ uses equation (3) for ${l\leq K-1}$ and, for ${l=K}$, the fact that ${h(t_{K+1}) = A_{i_K}x^{(K+1)}+c_{i_K}}$ since ${x_2\in\Omega_{i_K}}$; ${(ii)}$ is the definition of the operator norm; and ${(iii)}$ uses equality (4). This proves inequality (1). $\Box$
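A quick randomized check of the theorem (an assumed toy example, not from the post): the map ${f(x)=(|x_1|,|x_2|)}$ is piecewise linear with pieces ${A_i=\mathrm{diag}(\pm 1,\pm 1)}$ on the four orthants, and the pieces agree on the axes; with ${\alpha=\beta=2}$, every ${\|A_i\|_{2,2}=1}$, so ${f}$ should be 1-Lipschitz in the Euclidean norm.

```python
import numpy as np

# f(x) = (|x1|, |x2|) is piecewise linear: on each of the four orthants
# Omega_i it equals A_i x with A_i = diag(+-1, +-1) and c_i = 0, and the
# pieces agree on the overlaps (the coordinate axes). With alpha = beta = 2
# the induced norm ||A_i||_{2,2} is the spectral norm, which equals 1 here.
matrices = [np.diag([s1, s2]) for s1 in (1.0, -1.0) for s2 in (1.0, -1.0)]
L = max(np.linalg.norm(A, 2) for A in matrices)   # predicted constant = 1

f = lambda x: np.abs(x)

rng = np.random.default_rng(2)
for _ in range(1000):
    x1, x2 = rng.normal(size=2), rng.normal(size=2)
    assert np.linalg.norm(f(x2) - f(x1)) <= L * np.linalg.norm(x2 - x1) + 1e-12
print("1-Lipschitz verified on 1000 random pairs")
```

The sampled pairs may cross several orthants, which is exactly the segment-splitting situation handled by the proof above.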

# Closure property of conditional expectation under convexity

Suppose we have a random vector ${Z\in \mathbb{R}^n}$ and we know that ${Z}$ only takes value in a convex set ${C}$, i.e.,

$\displaystyle \mathbb{P}(Z\in C) =1.$

Previously, we showed in this post that as long as ${C}$ is convex, we will have

$\displaystyle \mathbb{E}(Z) \in C,$

if ${\mathbb{E}(Z)}$ exists. It is then natural to ask about the conditional expectation: is it true for any reasonable sigma-algebra ${\mathcal{G}}$ that

$\displaystyle \mathbb{P}(\mathbb{E}(Z\mid \mathcal{G}) \in C )=1?$

The answer is affirmative. Let us first recall our previous theorem.

Theorem 1 For any Borel measurable convex set ${C\subset {\mathbb R}^n}$, and for any probability measure ${\mu}$ on ${(\mathbb{R}^n,\mathcal{B})}$ with

$\displaystyle \int_C \mu(dx) =1, \quad \int_{\mathbb{R}^n} \|x\|\mu(dx) <\infty,$

we have

$\displaystyle \int_{\mathbb{R}^n} x\mu(dx) \in C.$

By utilizing the above theorem and regular conditional distribution, we can prove our previous claim.

Theorem 2 Suppose ${Z}$ is a random vector from a probability space ${(\Omega, \mathcal{F},\mathbb{P})}$ to ${(\mathbb{R}^n,\mathcal{B})}$ where ${\mathcal{B}}$ is the usual Borel sigma algebra on ${\mathbb{R}^n}$. If ${C}$ is Borel measurable and

$\displaystyle \mathbb{P}(Z\in C) =1, \mathbb{E}\|Z\| <\infty ,$

then we have

$\displaystyle \mathbb{P}( \mathbb{E}(Z\mid \mathcal{G})\in C) =1$

for any sigma algebra ${\mathcal{G} \subset \mathcal{F}}$.

Proof: Since ${Z}$ takes values in ${\mathbb{R}^n}$, there is a family of conditional distributions ${\{\mu(\omega,\cdot)\}_{\omega \in \Omega}}$ such that for almost all ${\omega \in \Omega}$, we have for any ${A\in \mathcal{B}}$

$\displaystyle \mathbb{E}(1_{\{Z\in A\}}\mid \mathcal{G})(\omega) = \int_{A} \mu(\omega, dz)$

and for any real-valued Borel function ${f}$ with domain ${\mathbb{R}^n}$ and ${\mathbb{E}|f(Z)|<+\infty}$, we have

$\displaystyle \mathbb{E}(f(Z)\mid \mathcal{G}) (\omega)= \int_{\mathbb{R}^n} f(z) \mu(\omega, dz).$

These two facts come from the existence of the regular conditional distribution, an important result in measure-theoretic probability theory.

Now taking ${A=C}$ and ${f(z) = z}$, we have for almost all ${\omega\in \Omega}$,

$\displaystyle \mathbb{E}(1_{\{Z\in C\}}\mid \mathcal{G})(\omega) = \int_{C} \mu(\omega, dz)$

and

$\displaystyle \mathbb{E}(Z\mid \mathcal{G}) (\omega)= \int_{\mathbb{R}^n} z \mu(\omega, dz).$

But since ${\mathop{\mathbb P}(Z\in C)=1}$, we know that for almost all ${\omega \in \Omega}$,

$\displaystyle 1= \mathbb{E}(1_{\{Z\in C\}}\mid \mathcal{G})(\omega) = \int_{C} \mu(\omega, dz).$

Thus for almost all ${\omega}$ we have a probability distribution ${\mu(\omega, dz)}$ on ${{\mathbb R}^n}$ with mean equal to ${\mathop{\mathbb E} (Z\mid \mathcal{G} )(\omega)}$ and ${\int_C \mu (\omega ,dz) =1}$. This is exactly the situation of Theorem 1. Applying Theorem 1 to the probability measure ${\mu(\omega, dz)}$, we find that

$\displaystyle \mathbb{E}(Z\mid \mathcal{G})(\omega) = \int _{\mathbb{R}^n} z\mu (\omega, dz) \in C,$

for almost all ${\omega \in \Omega}$. $\Box$
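To close, a Monte Carlo sanity check of Theorem 2 (an assumed toy setup, not part of the proof): take ${Z}$ uniform on the unit disk ${C}$ and ${\mathcal{G}}$ generated by the sign of the first coordinate; the empirical conditional means over the two half-disks should again lie in ${C}$.

```python
import numpy as np

# Monte Carlo sketch: Z uniform on the unit disk C (a convex set), and G the
# sigma-algebra generated by the sign of Z's first coordinate. E(Z|G) takes
# two values, the mean of Z over each half-disk; both should lie in C again.
rng = np.random.default_rng(3)
pts = rng.uniform(-1, 1, size=(100_000, 2))
Z = pts[(pts ** 2).sum(axis=1) <= 1]       # rejection-sample the unit disk

for mask in (Z[:, 0] >= 0, Z[:, 0] < 0):
    cond_mean = Z[mask].mean(axis=0)       # empirical E(Z | G) on one cell
    assert np.linalg.norm(cond_mean) <= 1  # the conditional mean stays in C
    print("conditional mean:", cond_mean)
```

The assertion can never fail, even with finite samples: each empirical conditional mean is a convex combination of points of ${C}$, which is precisely the mechanism behind Theorem 1.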