One dimensional least absolute deviation problem

We consider the following {\ell_1} minimization problem:

\displaystyle \begin{aligned} & \underset{x \in \mathbb{R} }{\text{minimize}} & &f(x):\,= \sum_{i=1}^n w_i | x - b_i|. \end{aligned} \ \ \ \ \ (1)

The weights {w_i} are all positive and {b_i \in \mathbb{R}}. We call the optimization problem (1) the least absolute deviation problem. Note that (1) has a single variable {x}. If the {w_i} are all equal, then it is well known that the least absolute deviation problem has the median of {\{b_i\}_{i=1}^n} as a minimizer. In the case of unequal weights {w_i}, we are going to show that the median defined by the weights {w_i} is again a minimizer.

To make the presentation simpler, we assume the weights are normalized:

\displaystyle \sum_{i=1}^n w_i = 1. \ \ \ \ \ (2)

Without loss of generality, we assume the values {b_i} are ordered and distinct:

\displaystyle b_1 <b_2<\dots < b_n. \ \ \ \ \ (3)

Next, we give the definition of the median of {\{b_i\}_{i=1}^n} with respect to the weights {\{w_i\}_{i=1}^n}. For each {x\in \mathbb{R}}, we denote by {I_x^-}, {I_x^=}, and {I_x^+} the sets of indices {i} such that {x<b_i}, {x=b_i}, and {x>b_i} respectively. Note that {I_x^=} is either a single index or an empty set.

Definition 1 We say an {x} is a median of {\{b_i\}_{i=1}^n} with respect to the weights {\{w_i\}_{i=1}^n} if {\sum_{i\in I^+_x} w_i \leq 0.5} and {\sum_{i\in I^+_x} w_i + w_{I^=_x} \geq 0.5}, where {w_{I^=_x} =0} if {I^=_x} is empty.

Note that under this definition, the median is actually not always unique.

We now state the theorem that asserts the optimal solution to (1) is the weighted median.

Theorem 2 Under assumptions (2) and (3), a point {x} is an optimal solution to (1) if and only if it is a median in the sense of Definition 1.

The key of the proof is to note that (1) is convex and a point {x} is optimal if and only if { 0 \in \partial f(x)}, where {\partial f(x)} is the subdifferential of {f} at {x}, a generalization of derivative at {x}.

Proof: For any {x\in \mathbb{R}}, the subdifferential at {x} is

\displaystyle \partial f(x) = - \sum_{i\in I^-_x} w_i + \sum_{i\in I^+_x} w_i + w_{I_x^=} \times [-1,1], \ \ \ \ \ (4)

where {w_{I_x^=} \times [-1,1] = [-w_{I_x^=}, w_{I_x^=}]} if {I_x^=} is nonempty and {w_{I_x^=} \times [-1,1] = 0} if {I_x^=} is empty. The addition in (4) should be understood as set addition: {A+B = \{a+b \mid a\in A, b\in B\}}.

Since the function {f} is convex with full domain {\mathbb{R}}, a point {x} is optimal if and only if

\displaystyle 0 \in \partial f(x) = - \sum_{i\in I^-_x} w_i + \sum_{i\in I^+_x} w_i + w_{I_x^=} \times [-1,1]. \ \ \ \ \ (5)

From the definition of the median and the assumption that the weights add up to 1, it is immediate that a median satisfies the above condition and that any point satisfying the above condition is a median. \Box

From Theorem 2, we know that we only need to compute the median of {\{b_i\}_{i=1}^n} w.r.t. {\{w_i\}_{i=1}^n}. This can be done by first ordering the {b_i} so that condition (3) is satisfied (combining equal values), which takes {\mathcal{O}(n\log n)} comparisons plus {\mathcal{O}(n)} operations to create a new set of values {\{b'_i\}_{i=1}^{n'}} and weights {\{w_i'\}_{i=1}^{n'}} by merging equal {b_i} and adding up their weights. Here {n'\leq n}. Then we add the {w_i'} up until the {0.5} threshold is reached to determine the median. This again takes only {\mathcal{O}(n)} additions.
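Below is a minimal NumPy sketch of this procedure (the function name and interface are ours, not part of the original post): sort the values, accumulate the weights, and scan for the {0.5} threshold.

```python
import numpy as np

def weighted_median(b, w):
    """Return a weighted median of the values b with positive weights w,
    i.e., a minimizer of f(x) = sum_i w_i |x - b_i| (Theorem 2)."""
    b = np.asarray(b, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                      # normalization (2)
    order = np.argsort(b)                # O(n log n) sort, as in condition (3)
    b, w = b[order], w[order]
    cum = np.cumsum(w)                   # cum[i] = total weight of b[0..i]
    i = int(np.searchsorted(cum, 0.5))   # first index reaching the 0.5 threshold
    return b[i]

print(weighted_median([3.0, 1.0, 4.0, 2.0], [0.1, 0.2, 0.45, 0.25]))  # 3.0
```

One can check against Definition 1 that the returned value has at most weight {0.5} strictly below it and at least weight {0.5} at or below it.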

Strict complementarity of linear programs

This note will show the following:

strict complementarity always holds for feasible linear programs (LPs).

This result appears in a very early paper by Goldman and Tucker (1956). Here we aim to give a constructive and hopefully easy-to-understand proof.

To explain the meaning of this statement, let us start with the standard form of LP as the primal program

\displaystyle  		\begin{aligned} 			& \underset{x \in \mathbb{R}^n}{\text{minimize}} 			& & c^Tx\\ 			& {\text{subject to}} & & Ax =b ,\\ 			& & & x\geq 0 		\end{aligned} 	\ \ \ \ \ (1)

where {A \in \mathbb{R}^{m\times n}}, {c\in \mathbb{R}^n}, and {b \in\mathbb{R}^m}. The dual program is accordingly

\displaystyle  		\begin{aligned} 			& \underset{y \in \mathbb{R}^m}{\text{maximize}} 			& & b^Ty\\ 			& {\text{subject to}} & & A^Ty\leq c. \\ 		\end{aligned} 	\ \ \ \ \ (2)

Let us now introduce the concept of strict complementarity. We start with complementarity, which is a consequence of strong duality. Suppose that the primal (1) and the dual (2) are both feasible. Then, according to LP strong duality, there is a primal-dual pair {(x,y)\in \mathbb{R}^n \times \mathbb{R}^m} that is feasible for (1) and (2) respectively and satisfies

\displaystyle  \textbf{strong duality:}\quad 	c^\top x = b^\top y. \ \ \ \ \ (3)

Using the fact that {x} is feasible, so that {Ax = b}, the above condition is equivalent to

\displaystyle  0 = c^\top x - (Ax) ^\top y =(c- A^\top y)^\top x.	 \ \ \ \ \ (4)

Defining the slack map {z (y) = c-A^\top y}, and using feasibility, i.e., {x\geq 0 } and {z(y) = c - A^\top y\geq 0}, we conclude

\displaystyle  \textbf{complementarity:} \quad x_i [z(y) ]_i= 0 \quad \text{for all}\;1\leq i\leq n. \ \ \ \ \ (5)

Here {x_i} and {[z(y)]_i} are the {i}-th elements of {x} and {z(y)} respectively. The above derivation shows that strong duality (3) and complementarity (5) are equivalent for feasible {x} and {y}.

The equality (5) implies that at least one of {x_i} and {[z(y)]_i} has to be {0}. However, it is possible that both terms are zero. Strict complementarity, in this context, means that exactly one of {x_i} and {[z(y)]_i} is zero. In other words, strict complementarity means that

\displaystyle  	\textbf{strict complementarity:} \quad \text{for all}\; i, \quad x_i = 0 \iff 	[z(y)]_i > 0. \ \ \ \ \ (6)

The strict complementarity theorem for LP states that as long as both programs are feasible, there is always a strictly complementary optimal pair; this does not hold in general for convex optimization problems. We give a formal statement below.

Theorem 1 Suppose both (1) and (2) are feasible. Then there is always a feasible pair {(x,y)\in \mathbb{R}^n \times \mathbb{R}^m} that is optimal for the primal and dual programs and satisfies the strict complementarity condition (6).

We shall utilize the following lemma to prove Theorem 1.

Lemma 2 Suppose the assumption of Theorem 1 holds. Then for any {1\leq j\leq n}, either there exists a primal optimal {x} such that {x_j>0} or there exists a dual optimal {y} such that {[z(y)]_j >0}.

Let us first prove Theorem 1, assuming Lemma 2.

Proof of Theorem 1: From Lemma 2, we know that for each {1\leq j\leq n}, either there is an optimal {x^j} such that {x^j_j >0} or there is an optimal {y^j} such that {[z(y^j)]_j >0}, but not both, due to (5). Hence, we can let {J} be the subset of {\{1,\dots,n\}} consisting of those {j} for which such an {x^j} with {x^j_j>0} exists. Consequently, using Lemma 2, the complement {J^c} consists of the indices {j} for which an optimal {y^j} with {[z(y^j)]_j>0} exists. Then the desired {x} and {y} in (6) can be taken as

\displaystyle  		x = \frac{1}{|J|} \sum_{j\in J} x^j,\quad y = \frac{1}{|J^c|} \sum_{j\in J^c} y^j. 	\ \ \ \ \ (7)

Here {|J|} is the cardinality of {J}. (If {J} or {J^c} is empty, simply take the corresponding {x} or {y} to be any optimal solution.) Both {x} and {y} are optimal because the primal and dual optimal sets are convex and each {x^j}, {y^j} is optimal. Moreover, for {j\in J} we have {x_j \geq x^j_j/|J| >0}, and for {j\in J^c} we have {[z(y)]_j \geq [z(y^j)]_j/|J^c|>0}; combined with complementarity (5), this is exactly condition (6). \Box

We now prove Lemma 2.

Proof of Lemma 2: Suppose the optimal value of (1) is {\gamma}. Fix an index {j}. Consider the following program

\displaystyle  	\begin{aligned} 		& \underset{x \in \mathbb{R}^n}{\text{minimize}} 		& & -e_j^\top x \\ 		& {\text{subject to}} & & Ax =b , \quad c^\top x \leq \gamma \\ 		& & & x\geq 0. 	\end{aligned} \ \ \ \ \ (8)

Here {e_j} is the {j}-th standard coordinate basis vector in {\mathbb{R}^n}. Note that (8) is feasible because (1) admits optimal solutions (thanks to LP strong duality, which follows from the feasibility of (1) and (2)).

The dual program is

\displaystyle  	\begin{aligned} 		& \underset{y \in \mathbb{R}^m,\lambda \in \mathbb{R}}{\text{maximize}} 		& & b^Ty - \gamma \lambda \\ 		& {\text{subject to}} & & \lambda c - A^Ty - e_j \geq 0, \quad \lambda \geq 0. \\ 	\end{aligned} \ \ \ \ \ (9)

Let us now consider two situations:

  1. The program (8) has optimal value {0}.
  2. The program (8) has optimal value less than {0}.

Suppose that we are in the first situation. Then according to LP strong duality, the dual program (9) is feasible and has optimal value {0}, i.e., there exist {y,\lambda} such that

\displaystyle  	b^\top y - \gamma \lambda =0 , \lambda c - A^\top y -e_j \geq 0, \text{ and } \lambda \geq 0. \ \ \ \ \ (10)

If {\lambda >0}, then {\frac{y}{\lambda}} is feasible for (2) and attains {b^\top \frac{y}{\lambda} = \gamma}, the optimal value of (2), hence it is optimal. Moreover,

\displaystyle  	c - A^\top\left( \frac{y}{\lambda} \right)\geq \frac{1}{\lambda }e_j \implies [z(\frac{y}{\lambda})]_j >0. \ \ \ \ \ (11)

In this case, we see that {\frac{y}{\lambda}} satisfies the conclusion in Lemma 2.

Otherwise, we have {\lambda =0}. Then {b^\top y =0} and {A^\top y \leq - e_j}. Now take any {\bar{y}} that is optimal for (2). Then {\bar{y}+y} satisfies the conclusion of Lemma 2: it is feasible since {A^\top(\bar{y}+y) \leq c - e_j \leq c}, it is optimal since {b^\top(\bar{y}+y) = b^\top \bar{y}}, and {[z(\bar{y}+y)]_j \geq 1 > 0}.

The above shows that in the first situation, we always have a dual optimal {y} such that {[z(y)]_j>0}. Suppose we are now in the second situation. Then there is a feasible {x} for (8) with {-e_j^\top x<0}, i.e., {x_j>0}. Such an {x} satisfies {Ax=b}, {x\geq 0}, and {c^\top x\leq \gamma}, so it is optimal for (1) and has {x_j>0}.

Our proof is complete. \Box
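The proof above is constructive, and one can turn it into a small numerical routine. Below is a sketch in Python using scipy.optimize.linprog (only its standard arguments); the function name, the tolerance, and the caps used to keep the subproblems bounded are our own choices and not part of the original argument.

```python
import numpy as np
from scipy.optimize import linprog

def strictly_complementary_pair(A, b, c, tol=1e-9):
    """Sketch of Theorem 1's construction: for each index j, maximize x_j over
    the primal optimal face or z(y)_j over the dual optimal face, then average
    the maximizers.  Assumes both (1) and (2) are feasible; A is an m x n
    NumPy array, b and c are 1-D NumPy arrays."""
    m, n = A.shape
    res = linprog(c, A_eq=A, b_eq=b, method="highs")         # solve (1)
    gamma, x0 = res.fun, res.x                               # optimal value and a solution
    y0 = linprog(-b, A_ub=A.T, b_ub=c, bounds=(None, None),
                 method="highs").x                           # a dual optimal solution of (2)

    xs, ys = [], []
    for j in range(n):
        e_j = np.zeros(n); e_j[j] = 1.0
        # Program (8): maximize x_j over {Ax = b, c^T x <= gamma, x >= 0};
        # the extra row x_j <= x0_j + 1 just keeps this subproblem bounded.
        px = linprog(-e_j, A_ub=np.vstack([c[None, :], e_j[None, :]]),
                     b_ub=[gamma, x0[j] + 1.0], A_eq=A, b_eq=b, method="highs")
        if -px.fun > tol:
            xs.append(px.x)          # a primal optimal x with x_j > 0
        else:
            # Maximize z(y)_j = c_j - A[:, j]^T y over the dual optimal face
            # {A^T y <= c, b^T y = gamma}, again capped for boundedness.
            py = linprog(A[:, j],
                         A_ub=np.vstack([A.T, -A[:, j][None, :]]),
                         b_ub=np.concatenate([c, [1.0 - A[:, j] @ y0]]),
                         A_eq=b[None, :], b_eq=[gamma],
                         bounds=(None, None), method="highs")
            ys.append(py.x)          # a dual optimal y with z(y)_j > 0

    # Average the collected solutions (the optimal sets are convex); if one
    # list is empty, that side's strict positivity is vacuous and any optimum works.
    x = np.mean(xs, axis=0) if xs else x0
    y = np.mean(ys, axis=0) if ys else y0
    return x, y
```

For a pair produced this way, one can check that for every index {i} exactly one of {x_i} and {[z(y)]_i} is zero, up to numerical tolerance.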

An interlacing property of order statistics

In this post, we present an interlacing property of order statistics.

Suppose we have a distribution {F} on the real line {\mathbb{R}} and {\theta_i \overset{\text{iid}}{\sim} F} for {i=1,\dots, n}, where {\overset{\text{iid}}{\sim}} means “independently and identically distributed as”. We denote by {\theta_{(i)}^n} the {i}-th smallest (equivalently, the {(n-i+1)}-th largest) of the {n} random variables {\theta_i}. So {\theta_{(n)}^n} is the largest value and {\theta_{(1)}^n} is the smallest of the {n} iid {\theta}s. Note that the superscript {n} in {\theta^n_{(i)}} does not mean raising to the {n}-th power but indicates how many {\theta_i}s we are considering.

The ordered variables {\{\theta^{n}_{(i)}\}_{i=1}^n} are called order statistics in Statistics. They are sufficient statistics for distribution {F} when {\theta_i}s are independent and all follow {F}. We now present an interlacing property of order statistics.

Theorem 1 For any {n,m>0}, {1\leq l\leq n}, {m-n+1\leq i\leq m}, and {x\in \mathbb{R}}, we have

\displaystyle \begin{aligned} \mathbb{P}( \theta_{(n+i)}^{n+m}\leq x)\leq \mathbb{P} (\theta^n_{(n-m+i)}\leq x)\text{ and } \mathbb{P}( \theta_{(l)}^n\leq x) \leq \mathbb{ P}(\theta_{(l)}^{n+m}\leq x) \end{aligned} \ \ \ \ \ (1)

If {\theta_i} has finite mean, then we have

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n+i)}^{n+m})\geq \mathbb{E} (\theta^n_{(n-m+i)} )\text{ and } \mathbb{E}( \theta_{(l)}^n) \geq \mathbb{ E}(\theta_{(l)}^{n+m}) \end{aligned} \ \ \ \ \ (2)

The right inequalities in (1) and (2) say that the larger the sample size, the (stochastically) smaller the {l}-th smallest value. The left inequalities say that, with the rank counted from the top fixed (the {(m-i+1)}-th largest value), a larger sample size makes the order statistic (stochastically) larger.

The first inequality chain is the stochastic dominance version of the second inequality chain. The first chain is a stronger property than the second.

Let us now explain why we call Theorem 1 an interlacing property. Consider the case that {n\geq 2} and take {m=1}, {l =n}, and {i=0} in Theorem 1. The inequality (2) gives that

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n)}^n) \geq \mathbb{ E}(\theta_{(n)}^{n+1})\geq \mathbb{E} (\theta^n_{(n-1)}). \end{aligned} \ \ \ \ \ (3)

If {n\geq 3}, then we can take {l=n-1}, {m=1}, and {i=-1}. Then inequality (2) gives

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n-1)}^{n}) \geq \mathbb{ E}(\theta_{(n-1)}^{n+1})\geq \mathbb{E} (\theta^n_{(n-2)}). \end{aligned} \ \ \ \ \ (4)

Thus from (3) and (4) we have

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n)}^n) \geq \mathbb{ E}(\theta_{(n)}^{n+1})\geq \mathbb{E} (\theta^n_{(n-1)})\geq \mathbb{ E}(\theta_{(n-1)}^{n+1})\geq \mathbb{E} (\theta^n_{(n-2)}). \end{aligned} \ \ \ \ \ (5)

Continuing the choices of {l} and {i} while fixing {m=1}, we see that

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n+1)}^{n+1})\geq \mathbb{E}( \theta_{(n)}^n) \geq \mathbb{ E}(\theta_{(n)}^{n+1})\geq \mathbb{E} (\theta^n_{(n-1)})\geq\dots \geq\mathbb{E} (\theta^{n+1}_{(2)})\geq \mathbb{E} (\theta^{n}_{(1)})\geq \mathbb{E} (\theta^{n+1}_{(1)}) . \end{aligned} \ \ \ \ \ (6)

Thus we see the quantities {\{\theta_{(i)}^n\}_{i=1}^n} and {\{\theta_{(i)}^{n+1}\}_{i=1}^{n+1}} are interlacing. The main idea of the proof is to put {\theta_{(l)}^n}, {\theta_{(l)}^{n+m}}, and {\theta_{(l-m)}^n} in the same probability space and argue that we actually have an almost sure inequality, not just an inequality in expectation or stochastic dominance. This technique is known as coupling.
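Before the proof, here is a quick Monte Carlo sanity check of the interlacing chain (6). It is a sketch under our own choices: {F} is taken to be standard normal, and the sample size, number of trials, and tolerance are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000
theta = rng.standard_normal((trials, n + 1))   # n+1 iid draws per trial
small = np.sort(theta[:, :n], axis=1)          # theta_(1)^n <= ... <= theta_(n)^n
full = np.sort(theta, axis=1)                  # theta_(1)^{n+1} <= ... <= theta_(n+1)^{n+1}

# build the chain of estimated means as in (6):
# E theta_(n+1)^{n+1} >= E theta_(n)^n >= E theta_(n)^{n+1} >= E theta_(n-1)^n >= ...
chain = [full[:, n].mean()]
for k in range(n, 0, -1):
    chain.append(small[:, k - 1].mean())       # E theta_(k)^n
    chain.append(full[:, k - 1].mean())        # E theta_(k)^{n+1}
print(np.round(chain, 3))
print(all(a >= b - 3e-3 for a, b in zip(chain, chain[1:])))   # True up to Monte Carlo error
```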

Proof: Let us first prove the right inequality. Suppose we first generate {n} iid {\theta_i}s following {F} and order them to obtain the {\theta_{(i)}^n}s. Now we generate (independently) {m} more samples {\theta_{n+i}}, {i=1,\dots,m}, also following {F}. Considering all the {\theta}s generated, we find that {\theta_{(l)}^{n+m}} is at most {\theta_{(l)}^n}, since we have

\displaystyle \theta_{(l)}^{n+m} = \theta_{(l)}^n \text{ if } \theta_{i+n} \geq \theta_{(l)}^n,\forall 1\leq i\leq m

and

\displaystyle \theta_{(l)}^{n+m} \leq \theta_{(l)}^n, \text{ if } \exists i \text{ s.t. } \theta_{i+n}< \theta_{(l)}^n.

Since we have an almost sure inequality, we certainly have stochastic dominance and the inequality in expectation. Now consider the left inequality. We still follow the previous way of generating the {\theta_i}s. After we have the {\theta_{(i)}^n}s, the value {\theta_{(l-m)}^{n}} (the {(n+m-l+1)}-th largest of the first {n} samples) never exceeds {\theta_{(l)}^{n+m}} (the {(n+m-l+1)}-th largest of all {n+m} samples), since

\displaystyle \theta_{(l)}^{n+m} = \theta_{(l-m)}^{n} \text{ if } \theta_{i+n} \leq \theta_{(l-m)}^n,\forall 1\leq i\leq m

and

\displaystyle \theta_{(l)}^{n+m} \geq \theta_{(l-m)}^{n} \text{ if } \exists i \text{ s.t. } \theta_{i+n} >\theta_{(l-m)}^n.

Thus we have shown stochastic dominance for the random variables under consideration. One may find it helpful to draw a picture of the data-generating process and compare the {\theta_i}s to verify the argument. \Box

Interestingly, the above proof did not use the assumption that the {\theta}s are iid. This means we obtain the same conclusion for any joint distribution. We state this result explicitly below.

Theorem 2 Let {\theta_1,\dots, \theta_{n+m}} have joint distribution {F_{m+n}}. Denote by {\theta_{(1)}\leq \theta_{(2)}\leq \dots \leq \theta_{(n)} \leq \dots \leq \theta_{(n+m)}} the order statistics. Now consider only the first {n} {\theta_i}s, i.e., {\theta_1,\dots,\theta_n}, and denote their marginal distribution by {F_{n}}. We denote the order statistics of these {n} {\theta}s by {\tilde{\theta}_{(1)} \leq \dots \leq \tilde{\theta}_{(n)}}. Then for any {n,m>0}, {1\leq i\leq m}, {m< l\leq n-m+i}, and {x\in \mathbb{R}}, we have

\displaystyle \begin{aligned} \mathbb{P}( \theta_{(n+i)}\leq x)\leq \mathbb{P}( \tilde{\theta}_{(l)}\leq x) \leq \mathbb{ P}(\theta_{(l)}\leq x)\leq \mathbb{P} (\tilde{\theta}_{(l-m)}\leq x). \end{aligned} \ \ \ \ \ (8)

If each {\theta_i} has finite mean, then we have

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n+i)})\geq \mathbb{E}( \tilde{\theta}_{(l)}) \geq\mathbb{ E}(\theta_{(l)})\geq\mathbb{E} (\tilde{\theta}_{(l-m)}). \end{aligned} \ \ \ \ \ (9)

Proof: The proof is a repetition of the previous argument. Let us first prove the middle inequality of (8) and (9). Suppose we first generate the {n} {\theta_i}s following {F_n} and order them to obtain the {\tilde{\theta}_{(i)}}s. Now we generate the extra {m} samples {\theta_{n+1},\dots,\theta_{n+m}} from the conditional distribution determined by {F_{n+m}} given the first {n} {\theta}s. Specifically, the conditional distribution is

\displaystyle P(\theta_{n+1}\leq x_1,\dots, \theta_{n+m} \leq x_m \mid \theta_1,\dots,\theta_n ).

We now consider all the {\theta}s generated. Note that the unconditional distribution of all the {\theta}s is exactly {F_{n+m}}. We find that {\theta_{(l)}} is at most {\tilde{\theta}_{(l)}} since we have

\displaystyle \theta_{(l)}= \tilde{\theta}_{(l)} \text{ if } \theta_{n+i} \geq \tilde{\theta}_{(l)},\forall 1\leq i\leq m

and

\displaystyle \theta_{(l)} \leq \tilde{\theta}_{(l)}, \text{ if } \exists i \text{ s.t. } \theta_{n+i}< \tilde{\theta}_{(l)}.

Since we have an almost sure inequality, we certainly have stochastic dominance and the inequality in expectation. Now consider the right-most inequality of (8) and (9). We still follow the previous way of generating the {\theta_i}s. After we have the {\tilde{\theta}_{(i)}}s, the value {\tilde{\theta}_{(l-m)}} never exceeds {\theta_{(l)}} since

\displaystyle \theta_{(l)} = \tilde{\theta}_{(l-m)}\text{ if } \theta_{n+i} \leq \tilde{\theta}_{(l-m)},\forall 1\leq i\leq m

and

\displaystyle \theta_{(l)} \geq \tilde{\theta}_{(l-m)} \text{ if } \exists i \text{ s.t. } \theta_{n+i} >\tilde{\theta}_{(l-m)}.

Thus we have shown stochastic dominance for the random variables under consideration. The left-most inequality of (8) and (9) follows from the fact that {\theta_{(n+i)}\geq \tilde{\theta}_{(l)}} almost surely whenever {l\leq n-m+i}: if {\theta_{(n+i)} < \tilde{\theta}_{(l)}}, then the {n-l+1} values {\tilde{\theta}_{(l)},\dots,\tilde{\theta}_{(n)}} all strictly exceed {\theta_{(n+i)}}, while at most {m-i} of the {n+m} values can strictly exceed {\theta_{(n+i)}}, a contradiction since {n-l+1>m-i}. \Box

I found this interlacing result while taking a take-home exam on game theory. It turns out the result was not useful for the exam though. I would also like to thank Yueyang Liu for catching a mistake in a previous version of this post.

Natural sufficient statistic is not necessarily minimal for exponential family

This post gives a simple example of an exponential family whose natural parameter space is a single point and whose natural sufficient statistic is not minimal.

Let us define the concept of exponential family with natural parameters.

Definition 1 (Natural exponential family)
A family of probability densities (or probability mass function) {f(\mathbf{x}|\mathbf{\eta} )} with parameter (index) {\mathbf{\eta}\in {\mathbb R}^k} is said to be a natural exponential family if {f} can be written as

\displaystyle f(\mathbf{x}|\mathbf{\eta}) = h(\mathbf{x}) \exp \left\{ \sum_{i=1}^k \mathbf{\eta}_i T_i(\mathbf{x}) - \mathcal{A}(\mathbf {\eta})\right\}. \ \ \ \ \ (1)

Here for {i = 1,\ldots,k}, we call

  • {T_i}: natural sufficient statistic
  • {\eta_i}: natural parameter
  • {H := \{ \mathbf{\eta} = (\eta_1,\ldots,\eta_k): \int h(\mathbf{x}) \exp \left\{ \sum_{i=1}^k \eta_i T_i(\mathbf{x})\right\} \, d \mathbf{x} < \infty \} }: natural parameter space.

A lot of well-known distributions actually are exponential families, e.g., normal distribution, Binomial distribution, Poisson distribution, beta distribution, and gamma distribution. The exponential family is central to the modeling and analysis of data.

Next, we define the concept of sufficiency and minimal sufficiency.

Definition 2 (Sufficiency and minimal sufficiency)
For a family of probability density {f_\eta}, {\eta \in \Theta \subset {\mathbb R}^k}, and a random variable {X\sim f_\eta}, a statistic {T(X)}
is sufficient if the conditional distribution {P(X|T(X))} is independent of {\eta}. A sufficient statistic {T(X)} is minimal if for any other sufficient statistic {S(X)}, there is a function {g} such that {T(X)=g(S(X))}.

Intuitively, a sufficient statistic captures all the information about the underlying parameter {\eta}. Indeed, suppose someone hands you a sufficient statistic {T(X)}. Because {P(X|T(X))} is independent of {\eta}, you know the distribution {P(X|T(X))} already. Now if you generate data {X} according to {P(X|T(X))}, then the unconditional distribution of {X} is simply {f_\eta}! So even though you don’t know the underlying distribution {f_\eta}, you can generate {X} so long as {T(X)} is available.

The minimality of a sufficient statistic means the data is reduced in an optimal way, in the sense that every other sufficient statistic contains at least as much information as a minimal one. One thing to note is that this definition of minimality says nothing about the dimension of a minimal sufficient statistic. Indeed, if {T(X)} is minimal, then {(T(X),1,2)} is also minimal.

It is easily verified, using the factorization theorem, that the natural sufficient statistic is indeed sufficient. A natural question arises at this point: is the natural sufficient statistic always minimal? The following example reveals that we do need to impose a few more conditions on {\eta} or {T(x)}.

Example 1
Consider the density family

\displaystyle f(x,y\mid \eta) = \frac{1}{(1+x^2)(1+y^2)}\exp\left (\eta \left (x^2-y^2\right)-A(\eta)\right).

where {x,y\in {\mathbb R}}.
The natural parameter space is the set of {\eta} for which the integral

\displaystyle \int_{{\mathbb R}^2} \frac{ \exp\left(\eta \left(x^2-y^2\right)\right)}{(1+x^2)(1+y^2)} dxdy = \int_{\mathbb R}\frac{\exp (\eta x^2)}{1+x^2}dx \int_{\mathbb R} \frac{\exp (-\eta y^2)}{1+y^2}dy

is finite (we used Fubini’s theorem in the middle step). For the integral {\int_{\mathbb R}\frac{\exp (\eta x^2)}{1+x^2}dx} to be finite we need {\eta \leq 0 }, and for the integral {\int_{\mathbb R} \frac{\exp (-\eta y^2)}{1+y^2}dy} to be finite we need {\eta\geq 0}. This means {\eta=0}, so the natural parameter space is the single point {\{0\}\subset {\mathbb R}}.
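As a quick numerical sanity check (our own addition, not part of the original argument), one can probe the truncated integrals {\int_{-R}^{R} \exp(\eta x^2)/(1+x^2)\,dx} for growing {R}: they stabilize only when {\eta\leq 0}.

```python
import numpy as np

def partial_integral(eta, R, num=200_000):
    """Riemann-sum approximation of the integral of exp(eta*x^2)/(1+x^2) over [-R, R]."""
    x = np.linspace(-R, R, num)
    dx = x[1] - x[0]
    return float(np.sum(np.exp(eta * x * x) / (1.0 + x * x)) * dx)

for eta in (-0.1, 0.0, 0.1):
    print(eta, [round(partial_integral(eta, R), 3) for R in (5, 10, 20)])
# eta = -0.1 stabilizes quickly, eta = 0 creeps toward pi, and eta = +0.1
# grows without bound as R increases.
```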

The natural sufficient statistic {X^2-Y^2} is indeed sufficient. But any constant statistic is also sufficient here (the family has a single member, so the conditional distribution trivially does not depend on {\eta}) and minimal, since a constant is a function of any other statistic. If {X^2-Y^2} were minimal, it would have to be a function of a constant, hence constant itself, which it is not. So the natural sufficient statistic is not always minimal.

It can be shown that as long as the natural parameter space contains an open set, the natural sufficient statistic is indeed minimal. See Theorem 4.5 b) of this note.

A singular value inequality for nonnegative matrices

This post concerns a question regarding nonnegative matrices, i.e., matrices with all entries nonnegative:

For two nonnegative matrices {A,B \in {\mathbb R}^{m\times n}}, if {A-B\geq 0}, i.e., {A-B} is nonnegative as well, is there any relation between their singular values?

As we shall see, the largest singular value of {A}, denoted {\sigma_1(A)}, is indeed at least the largest singular value of {B}, {\sigma_1(B)}:

\displaystyle \sigma_1(A)\geq \sigma_1(B).

Let us first consider the simpler case when {A,B\in {\mathbb R}^{n\times n}} are symmetric, so that {\sigma_1(A)=\max\{|\lambda_1(A)|,|\lambda_n(A)|\}} and {\sigma_1(B) =\max\{|\lambda_1(B)|,|\lambda_n(B)|\}}. Here, for any symmetric matrix {C}, we denote its eigenvalues by {\lambda_1(C)\geq \dots \geq \lambda_n(C)}.

Lemma 1
If {A,B,A-B\in {\mathbb R}^{n\times n}} are all nonnegative and symmetric, then 

\displaystyle \sigma_1(A)\geq \sigma_1(B).

Proof:
To prove the lemma, we first recall the Perron–Frobenius theorem, which states that the largest eigenvalue in magnitude of a nonnegative matrix is itself nonnegative (it equals the spectral radius) and admits an eigenvector that is entrywise nonnegative as well.

Using this theorem, we can pick a nonnegative unit-norm eigenvector {v_B} corresponding to the eigenvalue {\lambda_1(B)}, which is both nonnegative and largest in magnitude. Next, multiplying {A-B} on the left by {v_B^\top} and on the right by {v_B}, we have

\displaystyle 0\overset{(a)}{\leq} v_B^\top Av_B - v_B^\top Bv_B \overset{(b)}{=} v_B^\top Av_B-\lambda_1(B)\overset{(c)}{\leq} \lambda_1(A)-\lambda_1(B)\overset{(d)}{=}\sigma_1(A)-\sigma_1(B).

Here step {(a)} holds because {A-B} and {v_B} are nonnegative. Step {(b)} holds because {v_B} has unit norm and {Bv_B =\lambda_1(B)v_B}. Step {(c)} holds because {\lambda_1(A) =\max_{\|v\|= 1}v^\top Av}. Step {(d)} holds because {A,B} are symmetric and nonnegative, so by the Perron–Frobenius theorem the largest eigenvalue equals the largest singular value. \Box

To prove the general rectangular case, we use a dilation argument.

Theorem 2
If {A,B,A-B\in {\mathbb R}^{m\times n}} are all nonnegative, then

\displaystyle \sigma_1(A)\geq \sigma_1(B).

Proof:
Consider the symmetric matrices {\tilde{A}} and {\tilde{B}} in {{\mathbb R}^{(n+m)\times (n+m)}}:

\displaystyle \tilde{A} = \begin{bmatrix} 0 & A \\ A^\top & 0 \end{bmatrix},\quad \text{and} \quad \tilde{B} = \begin{bmatrix} 0 & B \\ B^\top & 0 \end{bmatrix}.

Note that {\tilde{A}} has the same largest singular value as {A}, and {\tilde{B}} has the same largest singular value as {B}.
Now since {A,B,A-B\geq 0}, we also have {\tilde{A},\tilde{B},\tilde{A}-\tilde{B}\geq 0}. Using Lemma 1, we prove the theorem.
\Box

How about the second singular values of {A} and {B}? We do not have {\sigma_2(A)\geq \sigma_2(B)} in general, as seen by considering

\displaystyle A = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad B = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.
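A quick NumPy check of Theorem 2 on a random instance, together with the counterexample for the second singular value (a sketch; the matrix sizes and seed are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.random((4, 6))              # nonnegative B
A = B + rng.random((4, 6))          # A - B >= 0, so A is nonnegative too
s = lambda M: np.linalg.svd(M, compute_uv=False)
print(s(A)[0] >= s(B)[0])           # True, as in Theorem 2

# the second singular values need not be ordered:
A2, B2 = np.ones((2, 2)), np.eye(2)
print(s(A2)[1], s(B2)[1])           # approximately 0.0 and 1.0
```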

Random matrices and their rank

In this post, we study a few probability-one results concerning the rank of random matrices: with probability one,

  1. a random asymmetric matrix has full rank;
  2. a random symmetric matrix has full rank;
  3. {\frac{n(n+1)}{2}} many random rank one positive semidefinite matrices (belonging to {\mathbb{R}^{n\times n}}) are linearly independent.

At first glance, at least the first two look quite natural and should be correct. However, proving them rigorously does require some effort. This post shows that one can use the concept of independence to prove the above assertions. Indeed, all the proofs follow an inductive argument where independence makes life much easier:

  1. Draw the first sample; it satisfies the property we want to show.
  2. Suppose the first {k} samples satisfy the property we want to show, and draw the next sample. Since it is independent of the previous {k} samples, the first {k} samples can be considered fixed. Then, with some effort, we show that the first {k+1} samples also satisfy the property.

Before going into each example, we further note that the probability-{1} results listed above still hold for any probability measure that is mutually absolutely continuous with the Gaussian measure introduced below (in which case independence of the entries is not needed).

1. Random Matrices are full rank

Suppose {A\in \mathbb{R}^{n\times m}} with {A=[a_{ij}]} and each entry of {A } is drawn independently from standard Gaussian {N(0,1)}. We prove that {A} has rank {\min\{m,n\}} with probability {1}.

To start, we should assume without loss of generality that {m\leq n}. Otherwise, we just repeat the following argument for {A^\top}.

The trick is to use the independence structure and consider the columns of {A} as drawn from left to right. Let us write the {i}-th column of {A} as {a_{\cdot i}}.

For the first column, we know {a_{\cdot 1}} is linearly independent (i.e., nonzero) with probability {1}, because it equals {0} with probability {0}. For the second column, because it is drawn independently of the first column, we can consider the first column as fixed, and the probability that {a_{\cdot 2}} falls into the span of a fixed vector is {0} because {a_{\cdot 2}} has a continuous density in {\mathbb{R}^n}. Thus the first two columns are linearly independent.

For general {k\leq m}, because {m\leq n}, the first {k-1} columns span a proper subspace of {\mathbb{R}^n}, and so {a_{\cdot k}} falls into that subspace with probability {0} (a proper linear subspace has Lebesgue measure {0} in {\mathbb{R}^n}, and the Gaussian measure is mutually absolutely continuous with the Lebesgue measure). Thus the first {k} columns are linearly independent. Repeating the argument up to {k=m} completes the proof.
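A quick numerical illustration of this first claim (a sketch; the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for (n, m) in [(5, 3), (3, 5), (8, 8)]:
    A = rng.standard_normal((n, m))
    print(np.linalg.matrix_rank(A) == min(n, m))   # True with probability one
```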

2. Symmetric Matrices are full rank

Suppose {A\in \mathbb{S}^n\subset \mathbb{R}^{n\times n}} with {A=[a_{ij}]}, where each entry of the upper half of {A}, {a_{ij}}, {j\geq i}, is drawn independently from the standard Gaussian {N(0,1)}. We prove that {A} has rank {n} with probability {1}.

Let’s write the {i}-th row of {A} by {a_{i\cdot}\in \mathbb{R}^{1\times n}}. For the first {k} entries of the {i}-th row, we write {a_{i,1:k}\in \mathbb{R}^{1\times k}}. Similarly, {a_{1:k,i}} denotes the first {k} entries of the {i}-th column. For the top left {k\times k} submatrix of {A}, we denote it by {a_{1:k,1:k}\in \mathbb{R}^{k\times k}}. Similar notation applies to other submatrices.

The idea is to draw the columns of {A} from left to right sequentially and use the independence structure. Starting from the first column, with probability one we have that {a_{\cdot 1}} is linearly independent (meaning {a_{\cdot 1}\not =0}).

Now {a_{\cdot 1}} has been drawn, and the second column except its first entry is independent of {a_{\cdot 1}}. The second column {a_{\cdot 2}} lies in the span of {a_{\cdot 1}} only if {a_{\cdot 2}} equals {\frac{a_{12}}{a_{11}}} times {a_{\cdot 1}} (note {a_{11}\neq 0} with probability {1}). This happens with probability {0} since each of the remaining entries of {a_{\cdot2}} has a continuous probability density and is drawn independently of {a_{\cdot 1}}.

Now assume that we have drawn the first {k} columns of {A}, that they are linearly independent with probability {1}, and that the top left {k\times k} submatrix of {A} is invertible (the base case is what we just proved). We want to show that after we draw the {(k+1)}-th column, the first {k+1} columns are still linearly independent with probability {1}. Since the first {k} columns have been drawn, the first {k} entries of the {(k+1)}-th column {a_{\cdot (k+1)}} are fixed; we wish to show that the remaining {n-k} entries, which are drawn independently of all previous {k} columns, ensure linear independence of the first {k+1} columns.

Suppose instead that {a_{\cdot (k+1)}} is linearly dependent on the first {k} columns of {A}. Since the top left {k\times k} submatrix of {A} is invertible, there is only one way to write {a_{\cdot (k+1)}} as a linear combination of the previous columns. More precisely, we have

\displaystyle a_{\cdot (k+1)} = a_{1:n,1:k}a_{1:k, 1:k}^{-1}a_{k+1,1:k}^\top.

This means {a_{\cdot (k+1)}} would have to equal a fixed vector, which happens with probability {0} because the last {n-k} entries of {a_{\cdot (k+1)}} are drawn independently of {a_{\cdot i}}, {i=1,\dots ,k}. Thus, with probability {1}, the first {k+1} columns of {A} are linearly independent; moreover, the top left {(k+1) \times (k+1)} submatrix of {A} is invertible with probability {1}, because we can repeat the argument by treating {A} as a {(k+1) \times (k+1)} matrix (instead of {n\times n}). Thus the induction is complete and we have proved that {A} has rank {n} with probability {1}.

3. Rank {1} positive semidefinite matrices are linearly independent

This time, we consider {\mathbf{a}_i}, {i=1,\dots, \frac{n(n+1)}{2}} are drawn independently from standard normal distribution in {\mathbb{R}^n}. Denote the set of symmetric matrices in {\mathbb{R}^{n\times n}} as {\mathbb{S}^n}. We show that with probability {1} the matrices

\displaystyle A_i = \mathbf{a}_i\mathbf{a}_i^\top, \quad i =1,\dots , \frac{n(n+1)}{2}

are linearly independent in {\mathbb{S}^n}.

Denote the {i}-th standard basis vector in {\mathbb{R}^n} by {e_i}. First, let us consider the following basis (easily verified to be one) of {\mathbb{S}^n}:

\displaystyle E_{ii} = e_ie_i^\top, \quad i=1,\dots,n,\quad\text{and}\quad E_{ij} = e_ie_i^\top +e_{j}e_j^\top + e_ie_j^\top +e_je_i^\top, \quad 1\leq i<j\leq n.

We order the basis according to the lexicographical order on {(i,j)}. Suppose {\mathbf{a}} follows the standard normal distribution in {\mathbb{R}^n}; we prove that for any fixed linear subspace {V\subset \mathbb{S}^{n}} with dimension less than {\frac{n(n+1)}{2}}, we have

\displaystyle \mathop{\mathbb P}(\mathbf{a}\mathbf{a} ^\top \in V ) = 0.

We start by applying a fixed invertible linear map {\mathcal{F}:\mathbb{S}^n \rightarrow \mathbb{S}^n} to {V} so that the basis of {\mathcal{F}(V)} consists of the first {\dim(V)} basis vectors of {\{E_{ij}\}_{i\leq j}} (in lexicographical order), and the basis of the orthogonal complement of {V} (defined via the trace inner product), {V^\perp\subset \mathbb{S}^n}, is mapped to the remaining basis vectors of {\{E_{ij}\}_{i\leq j}} under {\mathcal{F}}. We then only need to prove

\displaystyle \mathop{\mathbb P}(\mathcal{F}(\mathbf{a}\mathbf{a}^\top) \in\mathcal{F}(V)) =0.

We argue by contradiction. Suppose {\mathop{\mathbb P}(\mathcal{F}(\mathbf{a}\mathbf{a}^\top) \in\mathcal{F}(V)) >0}. This implies that there exist infinitely many {\mathbf{a}\in \mathbb{R}^n} such that {\mathcal{F}(\mathbf{a}\mathbf{a}^\top) \in\mathcal{F}(V)}; moreover, each component of {\mathbf{a}} takes infinitely many values. We show that this situation cannot occur. Denote the {\dim (V)}-th element of {\{(i,j)\}_{i\leq j}} in lexicographical order by {(i_0,j_0)}. Since {V} has dimension less than {\frac{n(n+1)}{2}}, {\mathcal{F}(V^\perp)} contains a nonzero element, and so

\displaystyle \mathcal{F}(\mathbf{a}\mathbf{a}^\top) \in \mathcal{F}(V) \iff [\mathcal{F}(\mathbf{a}\mathbf{a}^\top)]_{ij}=0,\quad \forall (i,j) > (i_0,j_0) \,\text{in lexicographical order.}

Now fix an {(i_1,j_1)>(i_0,j_0)}. The condition { [\mathcal{F}(\mathbf{a}\mathbf{a}^\top)]_{i_1j_1}=0} means that there is some {F\in \mathbb{S}^n} (depending only on {V} and {\mathcal{F}}) such that

\displaystyle \mathbf{tr}(F\mathbf{a}\mathbf{a}^\top) = \sum_{1\leq i,j\leq n} F_{ij} a_ia_j =0,

where {a_i} is the {i}-th component of {\mathbf{a}}. Now if we fix {a_i}, {i=2,\dots,n}, and only vary {a_1}, then we have

\displaystyle \sum_{1\leq i,j\leq n} F_{ij} a_ia_j =F_{11}a_1^2 +2 (\sum_{1<j\leq n}F_{1j}a_j)a_1 + \sum_{i\not =1,j\not =1}F_{ij}a_ia_j=0,

which holds for infinitely many {a_1 \in \mathbb{R}} only if

\displaystyle F_{11}=0, \quad \sum_{1<j\leq n}F_{1j}a_j=0,\quad\text{and}\quad \sum_{i\not =1,j\not =1}F_{ij}a_ia_j =0.

Thus we can repeat the argument with the other components and conclude that {\mathbf{tr}(F\mathbf{a}\mathbf{a}^\top) = \sum_{1\leq i,j\leq n} F_{ij} a_ia_j =0} holds for infinitely many {\mathbf{a}}, with each component taking infinitely many values, only if {F=0}. This contradicts the fact that {\mathcal{F}} is invertible, and hence we must have

\displaystyle \mathop{\mathbb P}(\mathcal{F}(\mathbf{a}\mathbf{a}^\top) \in\mathcal{F}(V)) =0.

Now proving that the matrices {A_i=\mathbf{a}_i\mathbf{a}_i^\top} are linearly independent (and hence span {\mathbb{S}^n}) with probability {1} is easy. For each {k}, the {A_i} with {i<k} that have been drawn can be considered fixed because {A_k} is drawn independently of them. The probability that {A_k} lies in the span of {A_i}, {i<k}, is {0} by the previous result, and hence {A_1,\dots,A_k} are linearly independent. This argument continues to hold up to {k= \frac{n(n+1)}{2}} because the space spanned by {A_1,\dots, A_{\frac{n(n+1)}{2}-1}} has dimension at most {\frac{n(n+1)}{2}-1}, so the previous result applies. This completes the proof.
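A quick NumPy check of the second and third claims (a sketch; the dimension and seed are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

# claim 2: a symmetric matrix with iid Gaussian upper half is full rank
U = np.triu(rng.standard_normal((n, n)))
S = U + np.triu(U, 1).T                      # symmetrize without doubling the diagonal
print(np.linalg.matrix_rank(S) == n)         # True with probability one

# claim 3: n(n+1)/2 random rank-one PSD matrices a_i a_i^T are linearly
# independent; stack their vectorizations and check the rank
k = n * (n + 1) // 2
V = np.vstack([np.outer(a, a).ravel() for a in rng.standard_normal((k, n))])
print(np.linalg.matrix_rank(V) == k)         # True with probability one
```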

Testing homogeneity of data for location-scale family

Suppose you have a sequence of independent data {X_1,\dots, X_n\in \mathbb{R}^d}. How do you test that the {X_i}s all come from the same distribution, i.e., how do you test homogeneity of the data?

To make the problem more precise, suppose we have a distribution family indexed by {\theta \in \mathbb{R}^s}, i.e., a set

\displaystyle \Omega = \{ F_\theta\mid F_\theta \;\text{is a cumulative distribution function,} \;\theta \in \mathbb{R}^s\},

and each {X_i} follows the distribution {F_{\theta_i}} for some {\theta_i}. Our problem is

Is {F_{\theta_1} =F_{\theta_2}=\dots =F_{\theta_n}}?

If we have that {\theta \not =\bar{\theta} \implies F_{\theta}\not= F_{\bar{\theta}}} (known as the identifiability of {\theta}), then our question becomes

Is {\theta_1 =\theta_2=\dots =\theta_n}?

Now suppose further that each {F_\theta\in \Omega} has a density {f_\theta} (so that we can write down the likelihood), the likelihood of seeing the independent sequence {X_1,\dots, X_n} is

\displaystyle \mathcal{L}_{(X_1,\dots,X_n) }{(\theta_1,\dots,\theta_n)}= \prod_{i=1}^nf_{\theta_i}(X_i).

To test our question in a statistical way, we use hypothesis testing. Our null hypothesis is

\displaystyle \begin{aligned} H_0:\quad \theta_1=\theta_2=\cdots = \theta_n, \end{aligned} \ \ \ \ \ (1)

and our alternative hypothesis is

\displaystyle \begin{aligned} H_1:\quad \theta_i\not= \theta_j \quad \text{for some}\; i\not=j. \end{aligned} \ \ \ \ \ (2)

Further denote the space of the null by {\Theta_0=\{ (\theta_1,\dots,\theta_n)\mid \theta_1=\dots =\theta_n\}} and the space of the alternative by {\Theta_1=\mathbb{R}^{sn} \setminus \Theta_0}. A popular and natural approach is the likelihood ratio test. We construct the test statistic, called the likelihood ratio, as

\displaystyle \begin{aligned} R = \frac{\sup_{(\theta_1,\dots, \theta_n)\in \Theta_0}\mathcal{L}_{(X_1,\dots, X_n)}(\theta_1,\dots, \theta_n)}{\sup_{(\theta_1,\dots, \theta_n)\in (\Theta_0\cup \Theta_1)}\mathcal{L}_{(X_1,\dots, X_n)}(\theta_1,\dots, \theta_n)}. \end{aligned} \ \ \ \ \ (3)

Intuitively, if our null hypothesis is indeed true, i.e., there is some {\theta^*} such that {\theta_1=\dots =\theta_n=\theta^*} and each {X_i} follows { f_{\theta^*}}, then this ratio should be large and we have confidence that the null hypothesis is true. This means we should reject the null hypothesis if we find that {R} is small. Thus, if we want a significance level {\alpha} test of the null hypothesis, we should reject it when {R< c}, where {c} satisfies

\displaystyle \begin{aligned} \mathbb{P}(R<c\mid H_0)\leq \alpha. \end{aligned} \ \ \ \ \ (4)

However, the main issue is that we don’t know the distribution of {R} under {H_0}, even if we know how to sample from each {f_\theta} and know the functional form of {f_\theta} for each {\theta}. The reason is that {H_0} does not specify which {\theta^*} (with {\theta^*=\theta_1=\dots =\theta_n}) generates the data. So the distribution of {R} may depend on {\theta^*} as well, and what we really need for {c} is

\displaystyle \begin{aligned} \sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*)\leq \alpha. \end{aligned} \ \ \ \ \ (5)

Thus, even if we want to approximate {\sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*)} through computational methods, we would have to simulate for each {\theta^* \in \mathbb{R}^s}. As {\Theta_0} could be rather large (in fact as large as {\mathbb{R}^s}), this approximation can be time consuming as well.

Fortunately, if {\Omega} is a so-called location-scale family, then the distribution of {R} is independent of {\theta^*} and we are free to choose whichever {\theta^*} we like. Let us define the location-scale family, then state the theorem and prove it.

Definition 1 Suppose we have a family of probability densities {\Omega} on {\mathbb{R}^d} indexed by {\theta =(\mu,\Sigma)}, where {\mu \in \mathbb{R}^d} and {\Sigma\in GL(\mathbb{R},d)}, the set of invertible matrices in {\mathbb{R}^{d \times d}}. The family {\Omega} is a location-scale family if there is a family member {f} (called the pivot) such that every other {f_\theta} with {\theta =(\mu,\Sigma)} satisfies

\displaystyle f_\theta (x) = f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}.

Thus if {Z\in \mathbb{R}^d} follows {f}, then {X=\Sigma Z+\mu} has probability density {f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}}. Indeed, for any Borel set {B\subset \mathbb{R}^d }

\displaystyle \begin{aligned} \mathop{\mathbb P}(X\in B ) &=\mathop{\mathbb P}(\Sigma Z+\mu\in B)\\ & = \mathop{\mathbb P}(Z\in \Sigma ^{-1}(B-\mu)) \\ & = \int_{ \Sigma ^{-1}(B-\mu )} f(z) dz \\ & =\int_B f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}dx \end{aligned} \ \ \ \ \ (6)

where we use a change of variable {x=\Sigma z+\mu} in the last equality and the last equality shows {X} follows {f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}}. We are now ready to state the theorem and prove it.

Theorem 2 Suppose our family of distributions {\Omega} is a location-scale family. Then, under the null hypothesis, there is a {\theta^*= (\mu^*,\Sigma^*)} such that each {X_i} follows {f_{\theta^*}(x) = \frac{1}{|\det(\Sigma^*)|}f((\Sigma^*)^{-1}(x-\mu^*))}, and the distribution of {R} is independent of {\theta^*}.

Since the distribution of {R} is independent of {\theta^*} under the null, for any {\theta} and any {c} we have

\displaystyle \begin{aligned} \sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*) = \mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta). \end{aligned} \ \ \ \ \ (7)

Thus we can choose any family member of {\Omega} to sample the {X_i} and approximate the distribution of {R} by the empirical distribution, as long as {\Omega} is a location-scale family!
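Here is a minimal Monte Carlo sketch of this idea. To keep the denominator of (3) finite with a single observation per {\theta_i}, we use a pure location family (unit-variance normal), for which the same invariance argument goes through; the sample size, number of simulations, and significance level are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sims, alpha = 10, 20_000, 0.05

def lik_ratio(x):
    """Likelihood ratio (3) for the unit-variance normal location family:
    the numerator is maximized at the common mean, the denominator at mu_i = x_i."""
    return np.exp(-0.5 * np.sum((x - x.mean()) ** 2))

# approximate the null distribution of R; by the invariance argument we may
# simulate with any theta*, e.g. mu* = 0
R_null = np.array([lik_ratio(rng.standard_normal(n)) for _ in range(sims)])
c = np.quantile(R_null, alpha)                 # reject H0 when R < c

# the same critical value works for data generated with a different mu*
R_shift = np.array([lik_ratio(5.0 + rng.standard_normal(n)) for _ in range(sims)])
print(np.mean(R_null < c), np.mean(R_shift < c))   # both approximately alpha
```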

Proof:  We need to show that the ratio {R} has distribution independent of {\theta^*}. Since {X_i \sim \frac{1}{|\det(\Sigma^*)|}f((\Sigma^*)^{-1}(x-\mu^*))} and {\Omega} is a location scale family, we can assume they are generated via {Z_i} where {Z_i} follows a pivot {f} and {X_i = \Sigma^*Z_i+\mu^*}. Then the likelihood of {\theta_1,\dots, \theta_n} is

\displaystyle \begin{aligned} \mathcal{L}_{(X_1,\dots,X_n) }(\theta_1,\dots,\theta_n) &=\prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}(X_i-\mu_i))\\ & = \prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}(\Sigma^*Z_i+\mu^*-\mu_i))\\ &= \prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*)))\\ & = \bigl(|\det((\Sigma^*)^{-1})|\bigr)^n\prod_{i=1}^n\frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*))).\\ \end{aligned} \ \ \ \ \ (8)

Thus the likelihood ratio {R} reduces to

\displaystyle \begin{aligned} R &= \frac{\sup_{(\mu,\Sigma)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{|\det(\Sigma(\Sigma^*)^{-1})|} f(\Sigma^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu-\mu^*)))}{\sup_{(\mu_i,\Sigma_i)\in \mathbb{R}^d\times GL(\mathbb{R},d),\,i=1,\dots,n} \prod_{i=1}^n\frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*))}\\ &=\frac{\sup_{(\mu,\Sigma)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{|\det(\Sigma(\Sigma^*)^{-1})|} f(\Sigma^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu-\mu^*))) }{\prod_{i=1}^n\sup_{(\mu_i,\Sigma_i)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*))}\\ \end{aligned} \ \ \ \ \ (9)

Now let us define {\hat{\Sigma} = (\Sigma^*)^{-1}\Sigma}, {\hat{\mu} = (\Sigma^*)^{-1}(\mu-\mu^*) }, {\hat{\mu}_i = (\Sigma^*)^{-1}(\mu_i -\mu^*) }, and {\hat{\Sigma}_i= (\Sigma^*)^{-1}\Sigma_i }. Since {(\mu,\Sigma)} and {(\mu_i,\Sigma_i)} can vary over the whole space {\mathbb{R}^d\times GL(\mathbb{R},d)}, so can {(\hat{\mu},\hat{\Sigma})} and {(\hat{\mu}_i,\hat{\Sigma}_i)}. The equality (9) can be rewritten as

\displaystyle \begin{aligned} R &= \frac{\sup_{(\hat{\mu},\hat{\Sigma})\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{\det(\hat{\Sigma})} f((\hat{\Sigma})^{-1}(Z_i-\hat{\mu})) }{\prod_{i=1}^n\sup_{(\hat{\mu}_i,\hat{\Sigma}_i)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \frac{1}{\det(\hat{\Sigma}_i)} f((\hat{\Sigma}_i)^{-1}(Z_i -\hat{\mu}_i))}.\\ \end{aligned} \ \ \ \ \ (10)

As we just argued, {\hat{\mu},\hat{\Sigma}}, {\hat{\mu}_i}, and {\hat{\Sigma}_i} can vary over the whole space without any restriction, so the suprema in the numerator and denominator do not depend on the choice of {\mu^*} and {\Sigma^*} at all. So our theorem is proved. \Box

Chain rule of convex function

This post studies a specific chain rule for compositions of convex functions. Specifically, we have the following theorem.

Theorem 1 Let {\phi: \mathbb{R} \rightarrow \mathbb{R}} be a continuously differentiable, increasing, convex function, let {h: U \rightarrow \mathbb{R}} be a convex function where {U\subset \mathbb{R}^n} is a convex set, and let {x\in U}. If {\phi'(h(x))>0} or {x\in \mbox{int}(U)}, then

\displaystyle \begin{aligned} \partial (\phi \circ h) (x) = \phi' (h(x)) \cdot[ \partial h (x)], \end{aligned} \ \ \ \ \ (1)

where {\partial } is the operator of taking subdifferentials of a function, i.e., {\partial h (x) = \{ g\mid h(y)\geq h(x) +\langle{g},{y-x}\rangle,\forall y\in U\}} for any {x\in U}, and {\mbox{int}(U)} is the interior of {U} with respect to the standard topology in {\mathbb{R}^n}.

A counterexample. We note that if our condition fails, the equality may not hold. For example, let {\phi(x) =1} for all {x\in \mathbb{R}} and let {h(x) = 1} be defined on {[0,1]}. Then {0} is not in the interior of {[0,1]}, {\phi'(h(0)) = 0}, and {\partial h(0) =(-\infty,0]}. However, in this case {\partial (\phi\circ h)(0)= (-\infty,0]} while {\phi' (h(0)) \cdot[ \partial h (0)] =\{0\}}. Thus, the equality fails.

It should be noted that if {U} is open and {h} is also differentiable, then the above reduces to the common chain rule of smooth functions.
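As a small numerical illustration (our own, with {h(x)=|x|} and {\phi(t)=e^t}, which satisfy the hypotheses), the chain rule predicts {\partial(\phi\circ h)(0) = e^{0}\cdot[-1,1] = [-1,1]}, and one can check the subgradient inequality on a grid:

```python
import numpy as np

# check Theorem 1 for h(x) = |x| (so that the subdifferential of h at 0 is [-1, 1])
# and phi(t) = exp(t), at x = 0: every g in phi'(h(0)) * [-1, 1] = [-1, 1]
# is a subgradient of exp(|x|) at 0
ys = np.linspace(-3.0, 3.0, 2001)
for g in np.linspace(-1.0, 1.0, 21):
    assert np.all(np.exp(np.abs(ys)) >= np.exp(0.0) + g * ys - 1e-12)

# while a slope outside that interval violates the subgradient inequality somewhere
g_bad = 1.5
print(np.any(np.exp(np.abs(ys)) < np.exp(0.0) + g_bad * ys))   # True
```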

Proof: We first prove that {\partial (\phi \circ h) (x) \supset \phi' (h(x)) \cdot[ \partial h (x)]}. For any {g \in \partial h(x)} and all {y\in U}, we have

\displaystyle \begin{aligned} \phi (h(y)) &\overset{(a)}{\geq} \phi(h(x)) + \phi' (h(x))(h(y)-h(x))\\ & \overset{(b)}{\geq} \phi (h(x)) + \phi '(h(x)) \langle g,y-x\rangle \\ & = \phi (h(x)) + \langle{\phi' (h(x))g},{y-x}\rangle \end{aligned} \ \ \ \ \ (2)

where {(a)} is the subgradient inequality for the convex function {\phi} at {h(x)}, and {(b)} uses the subgradient inequality for {h} at {x} together with the fact that {\phi'(h(x))\geq 0}, which holds because {\phi} is increasing.

Now we prove the other direction. Without loss of generality (after a translation), suppose the point of interest is {0\in U} and that {\partial (\phi \circ h)(0)} is nonempty. Let {g\in \partial (\phi \circ h)(0)}; we wish to show that {g} lies in the set { \phi' (h(0)) \cdot[ \partial h (0)]}. First, according to the definition of the subdifferential, we have

\displaystyle \begin{aligned} (\phi \circ h) (x)\geq (\phi\circ h)(0) + \langle{ g},x\rangle, \forall x \in U \end{aligned} \ \ \ \ \ (3)

This gives

\displaystyle \begin{aligned} (\phi \circ h) (\gamma x)\geq (\phi\circ h)(0) + \langle{ g} ,{\gamma x}\rangle, \forall x \in U, \gamma \in [0,1]. \end{aligned} \ \ \ \ \ (4)

Rearranging the above inequality gives

\displaystyle \begin{aligned} &\frac{(\phi \circ h) (\gamma x)- (\phi\circ h)(0)}{\gamma}\geq \langle{ g},{ x}\rangle \\ \implies & \phi'(s)\cdot \frac{h(\gamma x)-h(0)}{\gamma}\geq \langle{g},{x}\rangle \end{aligned} \ \ \ \ \ (5)

for some {s} between {h(\gamma x)} and {h(0)}, by the mean value theorem. Now let {f(\gamma )=h(\gamma x)}. If {f} is right continuous at {0}, then by Lemma 1 in this post the quotient {l(\gamma) =\frac{h(\gamma x)-h(0)}{\gamma}} is nondecreasing in {\gamma}, and letting {\gamma\downarrow 0} in (5) (using that {\phi} is continuously differentiable) we obtain

\displaystyle \begin{aligned} \phi'(h(0)) (h(x)-h(0))\geq \phi'(h(0))\cdot \lim_{\gamma \downarrow 0} \frac{h(\gamma x)-h(0)}{\gamma}\geq \langle{g},{x}\rangle,\forall x\in U. \end{aligned} \ \ \ \ \ (6)

If {\phi' (h(0))>0}, then dividing both sides of the above inequality by {\phi'(h(0))} gives

\displaystyle \begin{aligned} h(x)-h(0)\geq \langle{\frac{g}{\phi'(h(0))}},{x}\rangle,\forall x\in U. \end{aligned} \ \ \ \ \ (7) 

This shows that {\frac{g}{\phi'(h(0))}} is indeed a member of {\partial h(0)}, and thus {\partial (\phi \circ h) (0) \subset \phi' (h(0)) \cdot[ \partial h (0)]}, provided {\phi'(h(0))>0} and {f} is right continuous at {0}. It remains to handle the case {\phi'(h(0))=0} and to verify the right continuity of {f(\gamma ) = h(\gamma x)} at {0}.

If {0\in \mbox{int}(U)}, then {h} is continuous at {0} by a standard result in convex analysis, and so {f} is right continuous at {0}. In this case, if {\phi' (h(0))>0}, then we are done by inequality (7). If {\phi' (h(0))=0}, then using inequality (6) we have

\displaystyle \begin{aligned} 0 \geq \langle{g},{x}\rangle, \forall x\in U \end{aligned} \ \ \ \ \ (8)

Since {0 \in \mbox{int}(U)}, we can take {x} in inequality (8) to be small positive and negative multiples of the {n} standard basis vectors of {\mathbb{R}^n}. This shows {g =0}, and {0} indeed belongs to the set {\phi' (h(0)) \cdot[ \partial h (0)] = \{0\}}, as {\partial h(0)\not=\emptyset} for {0\in \mbox{int}(U)} by a standard convex analysis result.

Thus our remaining task is to argue that {f(\gamma ) = h(\gamma x)} is right continuous at {0} when {\phi'(h(0))>0}. Using Lemma 4 in this post, we know the limit {\lim_{\gamma \downarrow 0 }f(\gamma) = f(0^+)} exists and {f(0^+)\leq f(0)}. Now if {f(0^+) = f(0) = h(0)}, then {f} is indeed right continuous at {0} and our conclusion holds. So we may assume {f(0^+) <f(0) =h(0)}. But in this case {l(\gamma) = \frac{h(\gamma x)-h(0)}{\gamma} = \frac{f(\gamma)-f(0)}{\gamma}} tends to negative infinity as {\gamma \downarrow 0}. Recall from inequality (5) that we have

\displaystyle \phi'(s)l(\gamma )\geq \langle{g},{x}\rangle

where {s} is between {h(0)} and {f(\gamma )=h(\gamma x)}. We claim that as {\gamma\downarrow 0}, {\phi'(s)} approaches a positive number. If this claim is true, then from the above inequality we would have

\displaystyle -\infty \geq \langle{g},{x}\rangle

which cannot hold. Thus {f} must be right continuous at {0}.

Finally, we prove our claim that {\phi'(s)} approaches a positive number if {f(0^+) <f(0)}. Using the mean value theorem, we have, for some {s_0 \in [ f(0^+), f(0)]},

\displaystyle \begin{aligned} \phi'(s_0)(f(0^+)-f(0)) & = \phi(f(0^+)) -\phi(f(0))\\ & = \lim_{ \gamma \downarrow 0 } \phi(f(\gamma))-\phi(f(0))\\ & = \lim_{\gamma \downarrow 0}\phi'(s)(f(\gamma)-f(0))\\ & = (f(0^+)-f(0))\lim_{\gamma\downarrow 0}\phi'(s). \end{aligned} \ \ \ \ \ (9)

Now, cancelling the term {f(0^+) -f(0)<0} above, we see that {\phi'(s_0) = \lim_{\gamma\downarrow 0}\phi'(s)}. We claim {\phi'(s_0)>0}. If {\phi'(s_0)=0}, then {\phi(f(0^+)) -\phi(f(0)) = \phi'(s_0)(f(0^+)-f(0)) =0}, and because {\phi} is increasing this forces {\phi} to be constant on {[f(0^+),f(0)]}. This contradicts our assumption that {\phi'(h(0))>0} (recall {f(0)=h(0)}), and our proof is complete. \Box

Special Properties of 1-D convex function

We discuss a few special properties of one-dimensional convex functions. The first concerns the slope of one-dimensional convex functions.

Lemma 1 Let {f:[0,1]\rightarrow \mathbb{R}} be a (strictly) convex function. Then the function {g(x) = \frac{f(x)-f(0)}{x}}, {x\in(0,1]}, is nondecreasing (strictly increasing).

Proof: By the definition of convex function, we have for any {0<s<t\leq 1}

\displaystyle f(s) = f\left(\frac{s}{t}\cdot t + \frac{t-s}{t}\cdot 0\right) \leq \frac{s}{t}f(t) +\frac{t-s}{t}f(0).

Rearranging the above inequality gives

\displaystyle \frac{f(s) -f(0)}{s}\leq \frac{f(t)-f(0)}{t}.

The case of strict convexity follows by replacing the above {\leq} with {<}. \Box
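A tiny numerical check of Lemma 1 (our own, with the convex function {f(x)=e^x-1} on {[0,1]}): the difference quotient {g(x)=(f(x)-f(0))/x} should be nondecreasing.

```python
import numpy as np

f = lambda x: np.exp(x) - 1.0          # a convex function on [0, 1]
x = np.linspace(1e-6, 1.0, 1000)
g = (f(x) - f(0.0)) / x                # the slope function of Lemma 1
print(np.all(np.diff(g) >= 0))         # True: g is nondecreasing
```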

The second lemma asserts that the slope always increases as the points move to the right.

Lemma 2 Let {f:I\rightarrow \mathbb{R}} be convex, where {I\subset\mathbb{R}} is any interval (open, closed, or half open, half closed). Then for any {x<y<z} in {I}, we have

\displaystyle \frac{f(x)-f(y)}{x-y}\leq \frac{f(y)-f(z)}{y-z}

Proof: This is just a twist of algebra.

\displaystyle \begin{aligned} & \frac{f(x)-f(y)}{x-y}\leq \frac{f(y)-f(z)}{y-z}\\ \iff& \frac{y-z}{x-y}f(x) +f(z)\geq (1+ \frac{y-z}{x-y})f(y)\\ \iff & \frac{y-z}{x-y}f(x) +f(z)\geq \frac{x-z}{x-y}f(y)\\ \iff & \frac{y-z}{x-z}f(x) +\frac{x-y}{x-z}f(z)\geq f(y)\\ \iff & \frac{z-y}{z-x}f(x) +\frac{y-x}{z-x}f(z)\geq f(\frac{z-y}{z-x}x +\frac{y-x}{z-x}z)\\ \end{aligned} \ \ \ \ \ (1)

where the last inequality holds by the definition of convexity. \Box

The next lemma shows that a convex function on an interval of the real line must be either always increasing, always decreasing, or first decreasing and then increasing.

Lemma 3 Let {f:I \rightarrow \mathbb{R}} be convex, where {I} is an interval in {\mathbb{R}}. The interval {I} can be of any of the forms {[a,b],(a,b),(a,b],[a,b)} with {-\infty \leq a<b\leq \infty}. Then {f} is either always increasing, always decreasing, or first decreasing and then increasing.

Proof: We may suppose {f} is not constant, since in that case the assertion is trivially true. Now suppose that {f} is neither always increasing nor always decreasing. Then there exist {x<y <z}, all in {I}, such that

\displaystyle f(x) > f(y) , f(y)< f(z)

or

\displaystyle f(x) <f(y), f(y) > f(z).

The latter case is not possible because it implies that

\displaystyle \frac{f(x)-f(y)}{x-y}> 0 > \frac{f(y) - f(z)}{y-z}

which contradicts the convexity of {f}. Thus only the first case is possible.

Now suppose {x,z} are interior points of {I}. Then {f} is continuous on the interior of {I} and hence continuous on {[x,z]}. This means that {f} attains a minimum on {[x,z]} at some {x_0\in (x,z)}, as {f(y)<\min \{f(x),f(z)\}}. Now for any {x_1<x_2<x<x_3<x_4<x_0} in {I}, by convexity and the monotonicity of slopes (Lemma 2), we have

\displaystyle \begin{aligned} \frac{f(x_1)-f(x_2)}{x_1-x_2}\leq \frac{f(x_2)-f(x)}{x_2-x}\leq \frac{f(x)-f(x_3)}{x-x_3}\leq \frac{f(x_3)-f(x_4)}{x_3-x_4}\leq \frac{f(x_4)-f(x_0)}{x_4-x_0}\leq 0, \end{aligned} \ \ \ \ \ (2)

which implies {f(x_1)\geq f(x_2)\geq f(x)\geq f(x_3)\geq f(x_4)}. Thus {f} is decreasing on {I\cap (-\infty,x_0]}. Similarly, using the monotonicity of slopes, {f} is increasing on {I\cap(x_0,\infty)}. Now suppose instead that either {x} or {z} is an endpoint of {I}. We take the midpoints {a_1 = \frac{x+y}{2}} and {b_1 = \frac{y+z}{2}}. Then there are only three possible cases:

  1. {f(x)\geq f(a_1)\geq f(y)\geq f(b_1)\leq f(z)}
  2. {f(x)\geq f(a_1)\geq f(y)\leq f(b_1)\leq f(z)}
  3. {f(x)\geq f(a_1)\leq f(y)\leq f(b_1)\leq f(z)}

The second case is ideal, as {a_1} and {b_1} are interior points and we can apply our previous argument with {a_1,y,b_1} in place of {x,y,z}. In the first case, we can take {b_2 = \frac{b_1+z}{2}} and ask about the relations among {f(y),f(b_1),f(b_2)}, and {f(z)}. The only non-ideal case is {f(b_1)\geq f(b_2)\leq f(z)} (the other case gives three interior points with {f(y)\geq f(b_1)\leq f(b_2)}, and we can employ our previous argument). We then take {b_3 = \frac{b_2+z}{2}}; again there is only one non-ideal case, namely {f(b_1)\geq f(b_2)\geq f(b_3)}.

Continuing this process, we either stop at some step with three interior points {u<v<w} such that {f(u)\geq f(v)\leq f(w)}, in which case the previous argument applies, or the sequence {b_i} satisfies {f(b_i)\geq f(b_{i+1})} for all {i} and {\lim_{i\rightarrow \infty} b_i=z}. If {z} is an interior point, then continuity of {f} implies {\lim_{i\rightarrow \infty}f(b_i) = f(z)}. But this is not possible because {f(z)> f(y)\geq f(b_i)}. Thus {z} must be an endpoint of {I}. Moreover, {f} is actually decreasing on {I\setminus\{z\}}. Indeed, for any {x_1<x_2} in {I\setminus\{z\}}, there is a {b_i} such that {x_2<b_i}, and so

\displaystyle \frac{f(x_1)-f(x_2)}{x_1-x_2}\leq \frac{f(x_2)-f(b_i)}{x_2-b_i}\leq \frac{f(b_i)-f(b_{i+1})}{b_i-b_{i+1}}\leq 0.

Since {f(z)>f(y)}, we see that {f} is indeed decreasing on {I\setminus\{z\}} and jumps up at the point {z}.

The third case is similar: either {f} is increasing on {I\setminus\{x\}} and jumps down at the endpoint {x}, or {f} first decreases on some interval and then increases. \Box

Utilizing the above lemma, we can say 1) a one-dimensional convex function is almost everywhere differentiable (since monotone functions are almost everywhere differentiable), and 2) something about the boundary behavior of {f} on a finite interval.

Lemma 4 Suppose {f:[a,b] \rightarrow \mathbb{R}} is convex for some {a,b \in \mathbb{R}} with {a<b}. Then the one-sided limits below exist and {f(a) \geq \lim _{x\downarrow a}f(x) = f(a^+)}, {f(b)\geq \lim_{x\uparrow b}f(x) =f(b^-)}.

The above lemma shows that a convex function on a closed bounded interval is uniformly continuous on the open interval, as it can be continuously extended to the boundary. However, this property does not hold in higher dimensions: consider {f(x,y) = \frac{x^2}{y}} on {\{x^2 \leq y,\, 0<y <1\}} and the boundary point {(0,0)}.
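A quick numerical look at that two-dimensional counterexample (our own): approaching {(0,0)} along the parabola {y=x^2} gives the value {1}, while approaching along the {y}-axis gives {0}, so {f} admits no continuous extension to {(0,0)}.

```python
import numpy as np

f = lambda x, y: x * x / y
t = np.geomspace(1e-1, 1e-6, 6)
print(f(t, t * t))      # along the parabola y = x^2: all ones
print(f(0.0 * t, t))    # along the y-axis: all zeros
```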

LP Conditioning

We consider a global conditioning property of linear programs (LPs) here. Let us first define the problem. The standard form of an LP is

\displaystyle \begin{aligned} & \underset{x \in \mathbb{R}^n}{\text{minimize}} & & c^Tx\\ & {\text{subject to}} & & Ax =b ,\\ & & & x\geq 0 \end{aligned} \ \ \ \ \ (1)

where {A \in \mathbb{R}^{m\times n}}, {c\in \mathbb{R}^n}, and {b \in\mathbb{R}^m}. The dual of the standard form is

\displaystyle \begin{aligned} & \underset{y \in \mathbb{R}^m}{\text{maximize}} & & b^Ty\\ & {\text{subject to}} & & A^Ty\leq c. \\ \end{aligned} \ \ \ \ \ (2)

Now suppose we add a perturbation {u\in \mathbb{R}^{n}} to the {c} term of Program (2); we have

\displaystyle \begin{aligned} & \underset{y \in \mathbb{R}^m}{\text{maximize}} & & b^Ty\\ & {\text{subject to}} & & A^Ty\leq c+u.\\ \end{aligned} \ \ \ \ \ (3)

Let {v(u)} be the optimal value of Program (3) and {y(u)} be an optimal solution to the same program. We would like to ask the following question:

\displaystyle \begin{aligned} \textbf{Q}&: \text{Is}\; v(u)\; \text{Lipschitz continuous?} \;\text{Is} \;y(u)\;\text{Lipschitz}? \end{aligned} \ \ \ \ \ (4)

Such questions concern the conditioning of Program (2): under a perturbation, how do the solution and the optimal value change? In general, we hope that Programs (1) and (2) are stable in the sense that a small perturbation of the data, i.e., {A,b,c}, only induces small changes in the optimal value and solution. The reason is that in real applications, there might be some error in the encoding of {b,c} or even {A}. If Programs (1) and (2) are not stable with respect to perturbations, then possibly either more effort needs to be spent on encoding the data or the LP is not worth solving. At first sight, it is actually not even clear

  1. Whether {v(u)} is always finite?
  2. Whether {y(u)} is actually unique? If {y(u)} is not unique, then the Lipschitz property of {y(u)} does not even make sense.

To address the first issue, we make our first assumption that

A1: Program (2) has a solution and the optimal value is finite.

Under A1, by strong duality of LP, we have that Program (1) also has a solution and the optimal value is the same as the optimal value of Program (2). This, in particular, implies that (1) is feasible. We now characterize the region where {v(u)} is finite. Let the feasible region of Program (1) be denoted as

\displaystyle \mathcal{F}_p :\,= \{x\mid Ax = b,x\geq 0\},

the set of vertices of {\mathcal{F}_p} be {\mathcal{V}_p=\{x_i\}_{i=1}^d} and the set of extreme rays be {\mathcal{R}_p=\{\gamma_j\}_{j=1}^l}. Using our assumption A1, we know

\displaystyle \mathcal{F}_p \not = \emptyset.

Now, according to the resolution theorem applied to the primal polyhedron, we have

\displaystyle \mathcal{F}_p = \{ x\mid x = \sum_{i=1}^d \alpha _ix_i + \sum_{j=1}^l\beta_j \gamma_j, x_i \in \mathcal{V}_p, \gamma _j \in \mathcal{R}_p,\sum_{i=1}^d \alpha_i =1, \alpha_i\geq 0, \beta_j \geq 0, \forall i,j\}.

Using this representation and the fact that the primal of Program (3) is

\displaystyle \begin{aligned} & \underset{x \in \mathbb{R}^n}{\text{minimize}} & & (c+u)^Tx\\ & {\text{subject to}} & & Ax =b \\ & & & x\geq 0, \end{aligned} \ \ \ \ \ (5)

together with the fact that strong duality continues to hold for (3) and (5) since (5) is feasible by A1, the value of {v(u)} is

\displaystyle v(u) = \min_{x_i \in \mathcal{V}_p} (c+u)^\top x_i + \sum_{j=1}^l \min_{\beta_j\geq 0}\, \beta_j (c+u)^\top\gamma_j.

The second term is {0} if {(c+u)^\top \gamma_j\geq 0} for all {j} and {-\infty} otherwise. Recalling that {c^\top\gamma_j\geq 0} for all {j} under A1 (since (1) has a finite optimal value), we see in particular that {v(u)} is finite on the region

\displaystyle \begin{aligned} \mathcal{F}_u = \{ u\mid \gamma_j ^Tu\geq 0, \forall \gamma_j \in \mathcal{R}_p\}. \end{aligned} \ \ \ \ \ (6)

In particular, if {\mathcal{F}_p} is bounded, then {\mathcal{R}_p} is empty and {v(u)} is always finite.

To address the issue of whether {y(u)} is unique, we make the following nondegeneracy assumption on the vertices.

A2: All basic feasible solutions of (1) are nondegenerate, i.e., all vertices of the feasible region of Program (1) have exactly {m} positive entries.

Under this condition and the definition of a basic feasible solution, we know that if {x} is a basic feasible solution, then {A_{\cdot V(x)}\in \mathbb{R}^{m\times m}}, where {V(x)=\{i\mid x_i \text{ is nonzero}\}}, has linearly independent columns and hence is invertible. Here {A_{\cdot V(x)}} is the submatrix of {A} consisting of the columns with indices in {V(x)}. By the complementarity of LPs (3) and (5), we thus know that if {x} is a basic feasible solution and it is a solution to Program (5), then

\displaystyle y(u) = (A_{\cdot V(x)})^{-T}(c+u)_{V(x)}

is the unique solution to Program (3), where {(A_{\cdot V(x)})^{-T} =((A_{\cdot V(x)})^{T})^{-1} } and {(c+u)_{V(x)}} is the subvector of {c+u} with indices in {V(x)}. To summarize, under A1 and A2,

\displaystyle v(u) = \min \{ c^Tx_i +u^Tx_i, x_ i \in \mathcal{V}_p\}, u \in \mathcal{F}_u

and

\displaystyle y(u) = (A_{\cdot V(x)})^{-T}(c+u)_{V(x)}

where {x = \arg\min_{x_i}\{ c^Tx_i +u^Tx_i, x_ i \in \mathcal{V}_p\}}.

By defining

\displaystyle \Omega_i = \{ u \mid c^Tx_i +u^Tx_i \leq c^Tx_j +u^Tx_j,\forall j\not =i \}, \forall\, 1\leq i\leq d

which is a polyhedron, we may write the above compactly as

\displaystyle v(u) = c^Tx_i +u^Tx_i, \quad y(u) = (A_{\cdot V(x_i)})^{-T}(c+u)_{V(x_i)}, \quad u \in \Omega_i.

Then it is easy to see that {v(u)} satisfies the Lipschitz bound

\displaystyle |v(u_1)-v(u_2)| \leq \max\{\|x_i\|_* \mid x_i \in \mathcal{V}_p\} \|u_1-u_2\|

where {\|\cdot\|_*} is the dual norm of {\|\cdot\|}. We note {\|\cdot \|} here is arbitrary.

Similarly, by using the theorem in this post, we see that {y(u)} is also Lipschitz continuous with Lipschitz property

\displaystyle \| y(u_1) - y(u_2)\|_\alpha \leq\max \{ \|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta }, x \in \mathcal{V}_p\}\|u_1-u_2\|_{\beta }

where {\|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta } = \max_{\|z\|_{\beta }=1}\|(A_{\cdot V(x)})^{-T}z\|_\alpha} and {\|\cdot\|_\alpha, \|\cdot\|_\beta} are arbitrary norms. We summarize the above discussion in the following theorem.

Theorem 1 Recall that {\mathcal{V}_p} is the set of extreme points of the feasible region of (1), {\mathcal{R}_p} is the set of extreme rays of the feasible region of (1), and {\mathcal{F}_u = \{ u\mid \gamma_j ^Tu\geq 0, \forall \gamma_j \in \mathcal{R}_p\}}. Under assumptions A1 and A2, we have, for all {u_1,u_2\in \mathcal{F}_u},

\displaystyle |v(u_1)-v(u_2)| \leq \max\{\|x_i\|_* \mid x_i \in \mathcal{V}_p\} \|u_1-u_2\|

and

\displaystyle \| y(u_1) - y(u_2)\|_\alpha \leq\max \{ \|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta }, x \in \mathcal{V}_p\}\|u_1-u_2\|_\beta

where {\|\cdot\|,\|\cdot\|_{\alpha},\|\cdot\|_\beta } are arbitrary norms ({\|\cdot\|_*} being the dual norm of {\|\cdot\|}). Here {A_{\cdot V(x)}} is the submatrix of {A} consisting of the columns with indices in {V(x)=\{i\mid x_i \text{ is nonzero}\}}.
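A small numerical sketch of the bound on {v(u)} (our own construction): take the standard simplex as the feasible region, so every vertex {x_i} satisfies {\|x_i\|_1=1} and Theorem 1 with the norm pair {\|\cdot\|_\infty}, {\|\cdot\|_1} gives {|v(u_1)-v(u_2)|\leq \|u_1-u_2\|_\infty}.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n)
A_eq, b_eq = np.ones((1, n)), [1.0]        # feasible region = standard simplex

def v(u):
    # optimal value of the perturbed primal (5), which equals that of (3) by strong duality
    return linprog(c + u, A_eq=A_eq, b_eq=b_eq, method="highs").fun

for _ in range(5):
    u1, u2 = rng.standard_normal(n), rng.standard_normal(n)
    print(abs(v(u1) - v(u2)) <= np.max(np.abs(u1 - u2)) + 1e-9)   # True
```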

How might one bound the Lipschitz constants

\displaystyle \max\{\|x_i\|_* \mid x_i \in \mathcal{V}_p\} \quad \text{and} \quad \max \{ \|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta } \mid x \in \mathcal{V}_p\}?

The former might be done by bounding the diameter of the feasible region {\mathcal{F}_p} of Program (1), and the latter through some restricted isometry property of {A}.