A singular value inequality for nonnegative matrices

This post concerns a question regarding nonnegative matrices, i.e., matrices with all entries nonnegative:

For two nonnegative matrices {A,B \in {\mathbb R}^{m\times n}}, if {A-B\geq 0}, i.e., {A-B} is nonnegative as well, is there any relation with their singular values?

As we shall see, indeed, the largest singular value of {A}, denoted as {\sigma_1(A)}, is larger than the largest singular value of {B}, {\sigma_1(B)}:

\displaystyle \sigma_1(A)\geq \sigma_1(B).

Let us first consider a simpler case when {A,B\in {\mathbb R}^{n\times n}} are symmetric, so that {\sigma_1(A)=\max\{|\lambda_1(A)|,|\lambda_n(A)|\}}, {\sigma_1(B) =\{|\lambda_1(A)|,|\lambda_n(B)|\}}. Here for any symmetric matrix {C}, we denote its eigenvalues as {\lambda_1(C)\geq \dots \geq \lambda_n(C)}.

Lemma 1
If {A,B,A-B\in {\mathbb R}^{n\times n}} are all nonnegative and symmetric, then 

\displaystyle \sigma_1(A)\geq \sigma_1(B).

Proof:
To prove the lemma, we first recall the Perron-Frobenius theorem which states that the largest eigenvalue (in magnitude) of a
nonnegative matrix is nonnegative and the eigenvalue admits an eigenvector which is entrywise nonnegative as well.

Using this theorem, we can pick a nonnegative unit norm eigenvector {v_B} corresponding to the eigenvalue {\lambda_1(B)}, which is both nonnegative and largest in magnitude. Next, by multiplying left and right of {A-B} by {v_B^\top} and {v_B} respectively, we have

\displaystyle 0\overset{(a)}{\leq} v_B^\top Av_B - v_B^\top Bv_B \overset{(b)}{=} v_B^\top Av_B-\lambda_1(B)\overset{(c)}{\leq} \lambda_1(A)-\lambda_1(B)\overset{(d)}{=}\sigma_1(A)-\sigma_1(B).

Here step {(a)} is because {A-B} is nonnegative and {v_B} is nonnegative.
The step {(b)} is because {v_B} has unit norm and {Bv_B =\lambda_1(B)v_B}. The step {(c)} is because {\lambda_1(A) =\max_{\|v\|= 1}v^\top Av}. The step {(d)} is because {A,B} are symmetric and both are nonnegative so largest eigenvalue is indeed just the singular value due to Perron-Frobenius theorem. \Box

To prove the general rectangular case, we use a dilation argument.

Theorem 2
If {A,B,A-B\in {\mathbb R}^{m\times n}} are all nonnegative, then

\displaystyle \sigma_1(A)\geq \sigma_1(B).

Proof:
Consider the symmetric matrices {\tilde{A}} and {\tilde{B}} in {{\mathbb R}^{(n+m)\times (n+m)}}:

\displaystyle \tilde{A} = \begin{bmatrix} 0 & A \\ A^\top & 0 \end{bmatrix},\quad \text{and} \quad \tilde{B} = \begin{bmatrix} 0 & B \\ B^\top & 0 \end{bmatrix}.

Note that {\tilde{A}} has the largest singular value as {A} and {\tilde{B}} has the largest singular value as {B}.
Now since {A,B,A-B\geq 0}, we also have {\tilde{A},\tilde{B},\tilde{A}-\tilde{B}\geq 0}. Using Lemma 1, we prove the theorem.
\Box

How about the second singular value of {A} and {B}? We don’t have {\sigma_2(A)\geq \sigma_2(B)} in this case by considering

\displaystyle A = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad B = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.

Chain rule of convex function

This post studies a specific chain rule of composition of convex functions. Specifically, we have the following theorem.

Theorem 1 For a continuously differentiable increasing function {\phi: \mathbb{R} \rightarrow \mathbb{R}}, a convex function {h: U \rightarrow \mathbb{R}} where {U\in \mathbb{R}^n} is a convex set and an {x\in U}, if {\phi'(h(x))>0} or {x\in \mbox{int}(U)}, then

\displaystyle \begin{aligned} \partial (\phi \circ h) (x) = \phi' (h(x)) \cdot[ \partial h (x)], \end{aligned} \ \ \ \ \ (1)

where {\partial } is the operator of taking subdifferentials of a function, i.e., {\partial h (x) = \{ g\mid h(x)\geq h(y) +\langle{g},{y-x}\rangle,\forall y\in U\}} for any {x\in U}, and {\mbox{int}(U)} is the interior of {U} with respect to the standard topology in {\mathbb{R}^n}.

A negative example. We note that if our condition fails, the inequality may not hold. For example, let {\phi(x) =1} for all {x\in \mathbb{R}} and {h(x) = 1 } defined on {[0,1]}. Then {0} is a point which is not in the interior of {[0,1],\phi'(0) = 0}, {\partial h(0) =(-\infty,0]}. However, in this case {\partial (\phi\circ h)(0)= (-\infty,0]} and {\phi' (h(0)) \cdot[ \partial h (0)] =0}. Thus, the equality fails.

It should be noted that if {U} is open and {h} is also differentiable, then the above reduces to the common chain rule of smooth functions.

Proof: We first prove that {\partial (\phi \circ h) (x) \supset \phi' (h(x)) \cdot[ \partial h (x)]}. We have for all {x\in U , g \in \partial h(x)},

\displaystyle \begin{aligned} \phi (h(y)) &\overset{(a)}{\geq} \phi(h(x)) + \phi' (h(x))(h(y)-h(x))\\ & \overset{(b)}{\geq} \phi (h(x)) + \phi '(h(x)) \langle g,y-x\rangle \\ & = \phi (h(x)) + \langle{\phi' (h(x))g},{y-x}\rangle \end{aligned} \ \ \ \ \ (2)

where {(a),(b)} are just the definition of subdifferential of {\phi} at {h(x)} and {h} at {x}. We also use the fact that {\phi(x)\geq 0} in the inequality {(b)} as {\phi} is increasing.

Now we prove the other direction. Without lost of generality, suppose {0\in U} such that {\partial (\phi \circ h)(0)} is not empty. Let {g\in \partial (\phi \circ h)(0)}, we wish to show that {g} is in the set { \phi' (h(0)) \cdot[ \partial h (0)]}. First according to the definition of subdifferential, we have

\displaystyle \begin{aligned} (\phi \circ h) (x)\geq (\phi\circ h)(0) + \langle{ g},x\rangle, \forall x \in U \end{aligned} \ \ \ \ \ (3)

This gives

\displaystyle \begin{aligned} (\phi \circ h) (\gamma x)\geq (\phi\circ h)(0) + \langle{ g} ,{\gamma x}\rangle, \forall x \in U, \gamma \in [0,1]. \end{aligned} \ \ \ \ \ (4)

Rearranging the above inequality gives

\displaystyle \begin{aligned} &\frac{(\phi \circ h) (\gamma x)- (\phi\circ h)(0)}{\gamma}\geq \langle{ g},{ x}\rangle \\ \implies & \phi'(s)\cdot \frac{h(\gamma x)-h(0)}{\gamma}\geq \langle{g},{x}\rangle \end{aligned} \ \ \ \ \ (5)

for some {s} between {h(\gamma x)} and {h(0)} by mean value theorem. Now, by letting {f(\gamma )=h(\gamma x)}, if {f} is right continuous at {0}, then by Lemma 1 in this post and {\phi} is continuously differentiable, we know {l(\gamma) =\frac{h(\gamma x)-h(0)}{\gamma}} is nondecreasing in {\gamma} and we have

\displaystyle \begin{aligned} \phi'(h(0)) (h(x)-h(0))\geq \phi'(h(0))\cdot \lim_{\gamma \downarrow 0} \frac{h(\gamma x)-h(0)}{\gamma}\geq \langle{g},{x}\rangle,\forall x\in U. \end{aligned} \ \ \ \ \ (6)

If {\phi' (h(0))>0}, then dividing both sides of the above inequality by {\phi'(h(0))} gives

\displaystyle \begin{aligned} h(x)-h(0)\geq \langle{\frac{g}{\phi'(h(0))}},{x}\rangle,\forall x\in U. \end{aligned} \ \ \ \ \ (7) 

This shows that {\frac{g}{\phi'(h(0))}} is indeed a member of {\partial h(x)} and thus {\partial (\phi \circ h) (x) \subset \phi' (h(x)) \cdot[ \partial h (x)]}. In this case, we only need to verify why {f(\gamma ) = h(\gamma x) } must be right continuous at {0}.

If {0\in \mbox{int}(U)}, then {h} is definitely continuous at {x} and so is {f} by standard result in convex analysis. If {\phi' (h(0))>0}, then we are done by inequality (7). If {\phi' (h(0))=0}, using the inequality (6), we have

\displaystyle \begin{aligned} 0 \geq \langle{g},{x}\rangle, \forall x\in U \end{aligned} \ \ \ \ \ (8)

Since {0 \in \mbox{int}(U)}, then {x} can take a small positive and negative multiple of {n} standard basis vectors in {\mathbb{R}^n} in the inequality (8). This shows {g =0} and it indeed belongs to the set {\phi' (h(0)) \cdot[ \partial h (x)] = \{0\}} as {\partial (h(0))\not=\emptyset} for {0\in \mbox{int}(U)} by standard convex analysis result.

Thus our task now is to argue why {f(\gamma ) = h(\gamma x)} is indeed right continuous at {0}. Using Lemma 4 in this post, we know the limit {\lim_{\gamma \downarrow 0 }f(\gamma) = f(0^+)} exists and {f(0^+)\leq f(0)}. Now if {f(0^+) = f(0) = h(0)}, then {f} is indeed right continuous at {0} and our conclusion holds. So we may assume {f(0^+) <f(0) =h(0)}. But in this case {l(\gamma) = \frac{h(\gamma x)-h(0)}{\gamma} = \frac{f(\gamma)-f(0)}{\gamma}} is going to be negative infinity as {\gamma \downarrow 0}. Recall from inequality (5), we have

\displaystyle \phi'(s)l(\gamma )\geq \langle{g},{x}\rangle

where {s} is between {h(0)} and {f(\gamma )=h(\gamma x)}. We claim that as {\gamma\downarrow 0}, {\phi'(s)} approaches a positive number. If this claim is true, then from the above inequality, we will have

\displaystyle -\infty \geq \langle{g},{x}\rangle

which cannot hold. Thus we must have {f} right continuous at {0}.

Finally, we prove our claim that {\phi'(s)} is approaching a positive number if {f(0^+) <f(0)}. Using mean value theorem, we have for some {s_0 \in [ f(0^+), f(0)]}

\displaystyle \begin{aligned} \phi(s_0)(f(0^+)-f(0)) & = \phi(f(0^+)) -\phi(f(0))\\ & = \lim_{ \gamma \downarrow 0 } \phi(f(\gamma))-\phi(f(0))\\ & = \lim_{\gamma \downarrow 0}\phi(s)(f(\gamma)-f(0))\\ & = (f(0^+)-f(0))\lim_{\gamma\downarrow 0}\phi(s). \end{aligned} \ \ \ \ \ (9)

Now cancel the term {f(0^+) -f(0)<0} above, we see that {\phi(s_0) = \lim_{\gamma\downarrow}\phi(s)}. We claim {\phi(s_0)>0}. If {\phi(s_0)=0}, then because {\phi} is increasing, we have that {\phi} is constant in {[f(0^+),f(0)]} as {\phi(f(0^+)) -\phi(f(0)) = \phi(s_0)(f(0^+)-f(0)) =0}. This contradicts our assumption that {\phi'(f(0))>0} and our proof is complete. \Box

Special Properties of 1-D convex function

We discuss a few special properties of one dimensional function convex function. The first asserts about the slope of one dimensional convex function.

Lemma 1 Let {f:[0,1]\rightarrow \mathbb{R}} be a (strict) convex function, then the function {g(x) = \frac{f(x)-f(0)}{x}, x\in(0,1]} is non-decreasing (strictly increasing).

Proof: By the definition of convex function, we have for any {0<s<t\leq 1}

\displaystyle f(s) = f(\frac{s}{t} \times t + 0 \times \frac{t-s}{t}) \leq \frac{s}{t}f(t) +\frac{t-s}{t}f(0).

Rearrange the above inequality gives

\displaystyle \frac{f(s) -f(0)}{s}\leq \frac{f(t)-f(0)}{t}.

The case for strict convexity is simply replacing the above {\leq} by {<}. \Box

The second asserts that slope should always increases.

Lemma 2 If {f:I\rightarrow \mathbb{R}} where {I} is any connected set in {\mathbb{R}},i.e., an interval which can be half open half closed or open or closed. The for any {x<y<z\in I}, we have

\displaystyle \frac{f(x)-f(y)}{x-y}\leq \frac{f(y)-f(z)}{y-z}

Proof: This is just a twist of algebra.

\displaystyle \begin{aligned} & \frac{f(x)-f(y)}{x-y}\leq \frac{f(y)-f(z)}{y-z}\\ \iff& \frac{y-z}{x-y}f(x) +f(z)\geq (1+ \frac{y-z}{x-y})f(y)\\ \iff & \frac{y-z}{x-y}f(x) +f(z)\geq \frac{x-z}{x-y}f(y)\\ \iff & \frac{y-z}{x-z}f(x) +\frac{x-y}{x-z}f(z)\geq f(y)\\ \iff & \frac{z-y}{z-x}f(x) +\frac{y-x}{z-x}f(z)\geq f(\frac{z-y}{z-x}x +\frac{y-x}{z-x}z)\\ \end{aligned} \ \ \ \ \ (1)

where the last one holds due to the definition of convexity. \Box

The next lemma shows that convex function in a real line must be always increasing or always decreasing or first decreasing and then increasing.

Lemma 3 Let {f:I \rightarrow \mathbb{R}} where {I} is an interval in {\mathbb{R}}. The interval {I} can be any one kind of {[a,b],(a,b),(a,b],[a,b)} where {-\infty \leq a<b\leq \infty}. Then {f} is either always increasing or always decreasing or first decreasing and then increasing.

Proof: We may suppose {f} is not always a constant since in this case, the assertion is true. Now suppose that {f} is not always increasing and is also not always decreasing. Then there exists {x<y <z} all in {I} such that

\displaystyle f(x) > f(y) , f(y)< f(z)

or

\displaystyle f(x) <f(y), f(y) > f(z).

The latter case is not possible because it implies that

\displaystyle \frac{f(x)-f(y)}{x-y}> 0 > \frac{f(y) - f(z)}{y-z}

which contradicts the convexity of {f}. Thus only the first case is possible.

Now if {x,z} are interior point of {I}, Then {f} is continuous on the interior of {I} and so is continuous on {[x,z]}. This means that {f} attains a minimum on {[x,z]} at some {x_0\in (x,z)} as {f(y)<\min \{f(x),f(z)\}}. Now for any {x_1<x_2<x<x_3<x_4<x_0}, by convexity and monotonicity of slope, we have

\displaystyle \begin{aligned} \frac{f(x_1)-f(x_2)}{x_1-x_2}\leq \frac{f(x_2)-f(x)}{x_2-x_3}\leq \frac{f(x)-f(x_3)}{x-x_3}\leq \frac{f(x_3)-f(x_4)}{x_3-x_4}\leq \frac{f(x_4)-f(x_0)}{x_4-x_0}\leq 0. \end{aligned} \ \ \ \ \ (2)

which implies {f(x_1)\geq f(x_2)\geq f(x)\geq f(x_3)\geq f(x_4)}. Thus {f} is decreasing on {I\cap (-\infty,x_0]}. Similarly, using monotonicity of slope, we have {f} is increasing on {I\cap(x_0,\infty)}. Now if suppose either {x} or {z} is the end point of {I}. We take the half points {a_1 = \frac{x+y}{2},b_1 = \frac{y+z}{2}}. Then there are only three possible cases

  1. {f(x)\geq f(a_1)\geq f(y)\geq f(b_1)\leq f(z)}
  2. {f(x)\geq f(a_1)\geq f(y)\leq f(b_1)\leq f(z)}
  3. {f(x)\geq f(a_1)\leq f(y)\leq f(b_1)\leq f(z)}

The second case is ideal as {a_1} and {b_1} are interior point and we can proceed our previous argument and show {f} is first decreasing and the then increasing. In the first case, we can take {b_2 = \frac{b_1+z}{2}} and ask the relation between {f(y),f(b_1),f(b_2)} and {f(z)}. The only non-ideal case is that {f(b_1)\geq f(b_2)\leq f(z)} (the other case gives three interior point such that {f(y)\geq f(b_1)\leq f(b_2)} and we can employ our previous argument). We can then further have {b_3 = \frac{b_2+z}{2}}. Again there will be only one non-ideal case that {f(b_1)\geq f(b_2)\geq f(b_3)}.

Continuing this process, we either stop at some point to obtain a pattern {f(y)\geq f(b_1)\leq f(b_i)} or the sequence {b_n} satisfies {f(b_i)\geq f(b_{i+1})} for all {i} and {\lim_{i\rightarrow \infty} b_i=z}. If {z} is an interior point, then continuity of {f} implies that {\lim_{i\rightarrow \infty}f(b_i) = f(z)}. But this is not possible because {f(z)> f(y)\geq f(b_i)}. Thus {z} must be the end point of {I}. Moreover, {f} is actually decreasing on {I/\{z\}}. Indeed, for any {x_1>x_2, x_i\in I/\{z\}} for {i=1,2}, there is a {b_i} such that {x_2>b_i} and so

\displaystyle \frac{f(x_1)-f(x_2)}{x_1-x_2}\leq \frac{f(x_2)-f(b_i)}{x_2-b_i}\leq \frac{f(b_i)-f(b_{i+1})}{b_i-b_{i+1}}\leq 0.

But since {f(z)>f(y)}, we see {f} is indeed decreasing on {I/\{z\}} and increase/jump at the point {z}.

The third case is similar, we will either have {f} always increasing and only jump down at the end point {x} or {f} is just first decreases for some interval and then increases for some interval. \Box

Utilizing the above lemma, we can say 1) one dimensional convex function is almost everywhere differentiable and 2) something about the boundary points of {f} on a finite interval.

Lemma 4 Suppose {f:[a,b] \rightarrow \mathbb{R}} is convex for some {a\in \mathbb{R},b \in \mathbb{R}}. Then {f(a) \geq \lim _{x\downarrow a}f(a) = f(a^+), f(b)\geq \lim_{x\uparrow b}f(x) =f(b^+)}.

The above lemma shows that a convex function on a finite open interval is uniformly continuous as it can be extend to the boundary. However, this property does not hold in higher dimension by considering {f(x,y) = \frac{x^2}{y}} on {x^2 \leq y, 0<y <1} and the point {(0,0)}.

LP Conditioning

We consider global conditioning property of Linear Program (LP) here. Let’s first define the problem. The standard form of LP is

\displaystyle \begin{aligned} & \underset{x \in \mathbb{R}^n}{\text{minimize}} & & c^Tx\\ & {\text{subject to}} & & Ax =b ,\\ & & & x\geq 0 \end{aligned} \ \ \ \ \ (1)

where {A \in \mathbb{R}^{n\times m}, c\in \mathbb{R}^n} and {b \in\mathbb{R}^m}. The dual of the standard form is

\displaystyle \begin{aligned} & \underset{y \in \mathbb{R}^m}{\text{maximize}} & & b^Ty\\ & {\text{subject to}} & & A^Ty\leq c. \\ \end{aligned} \ \ \ \ \ (2)

Now suppose we add perturbation {u\in \mathbb{R}^{n}} to Program (2) in the {c} term, we have

\displaystyle \begin{aligned} & \underset{y \in \mathbb{R}^n}{\text{maximize}} & & b^Ty\\ & {\text{subject to}} & & A^Ty\leq c+u.\\ \end{aligned} \ \ \ \ \ (3)

Let {v(u)} be the optimal value of Program (3) and {y(u)} be an optimal solution to the same program. We would like to ask the following question:

\displaystyle \begin{aligned} \textbf{Q}&: \text{Is}\; v(u)\; \text{Lipschitz continuous?} \;\text{Is} \;y(u)\;\text{Lipschitz}? \end{aligned} \ \ \ \ \ (4)

Such questions are called the conditioning of Program (2). It asks under some perturbation, how does the solution and optimal value changes accordingly. In general, we hope that Program (1) and (2) to be stable in the sense that small perturbation in the data, i.e., {A,b,c}, only induces small changes in the optimal value and solution. The reason is that in real application, there might be some error in the coding of {b,c} or even {A}. If Program (1) and (2) are not stable with respect to the perturbation, then possibly either more costs need to be paid to the encoding of the data or the LP is not worth solving. At first sight, it is actually not even clear that

  1. Whether {v(u)} will be always finite?
  2. Whether {y(u)} is actually unique? If {y(u)} is not unique, then the Lipschitz property of {y(u)} does not even make sense.

To address the first issue, we make our first assumption that

A1: Program (2) has a solution and the optimal value is finite.

Under A1, by strong duality of LP, we have that Program (1) also has a solution and the optimal value is the same as the optimal value of Program (2). This, in particular, implies that (1) is feasible. We now characterize the region where {v(u)} is finite. Let the feasible region of Program (1) be denoted as

\displaystyle \mathcal{F}_p :\,= \{x\mid Ax = b,x\geq 0\},

the set of vertices of {\mathcal{F}_p} be {\mathcal{V}_p=\{x_i\}_{i=1}^d} and the set of extreme rays be {\mathcal{R}_p=\{\gamma_j\}_{j=1}^l}. Using our assumption A1, we know

\displaystyle \mathcal{F}_p \not = \emptyset.

Now according to the Resolution Theorem of the primal polyhedron, we have

\displaystyle \mathcal{F}_p = \{ x\mid x = \sum_{i=1}^d \alpha _ix_i + \sum_{j=1}^l\beta_j \gamma_j, x_i \in \mathcal{V}_p, \gamma _j \in \mathcal{R}_p,\sum_{i=1}^d \alpha_i =1, \alpha_i\geq 0, \beta_j \geq 0, \forall i,j\}.

Using this representation, the fact that the primal of Program (3) is

\displaystyle \begin{aligned} & \underset{x \in \mathbb{R}^n}{\text{minimize}} & & (c+u)^Tx\\ & {\text{subject to}} & & Ax =b \\ & & & x\geq 0, \end{aligned} \ \ \ \ \ (5)

and strong duality continue to hold for (3) and (5) as (5) is feasible by A1, the value of {v(u)} is

\displaystyle v(u) = \min \{ c^Tx_i +u^Tx_i\mid x_ i \in \mathcal{V}_p\} + \min\{c^T\gamma_j +\beta_ju^T\gamma_j \mid \gamma_j \in \mathcal{R}_p,\beta_j\geq 0\}.

Thus it is immediate that the region where {v(u)} is finite is

\displaystyle \begin{aligned} \mathcal{F}_u = \{ u\mid \gamma_j ^Tu\geq 0, \forall \gamma_j \in \mathcal{R}_p\}. \end{aligned} \ \ \ \ \ (6)

In particular, if {\mathcal{F}_p} is bounded, then we have {v(u)} to be always finite.

To address the issue whether {y(u)} is unique, we make the following nondegenerate assumption on vertices.

A2: All the basic feasible solution of (1) are non-degenerate, i.e., all vertices of the feasible region of Program (1) have exactly {m} positive entries.

Under this condition, and the definition of basic feasible solution, we know if {x} is a basic feasible solution, then {A_{\cdot V(x)}\in \mathbb{R}^{m\times m}} where {V(x)=\{i\mid x_i \text{ is nonzero}\}} has all column independent. Here {A_{\cdot V(x)}} is the submatrix of {A} having columns with indices in {\mathcal{V}(x)}. By complementarity of LP (3) and (5), we thus know that if {x} is a baisc feasible solution and it is a solution to Program (5), then

\displaystyle y(u) = (A_{\cdot V(x)})^{-T}(c+u)

is the unique solution to Program (3) where {(A_{\cdot V(x)})^{-T} =((A_{\cdot V(x)})^{T})^{-1} }. What we have done is that under A1 and A2,

\displaystyle v(u) = \min \{ c^Tx_i +u^Tx_i, x_ i \in \mathcal{V}_p\}, u \in \mathcal{F}_u

and

\displaystyle y(u) = (A_{\cdot V(x)})^{-T}(c+u)

where {x = \arg\min_{x_i}\{ c^Tx_i +u^Tx_i, x_ i \in \mathcal{V}_p\}}.

By defining

\displaystyle \Omega_i = \{ u \mid c^Tx_i +u^Tx_i \leq c^Tx_j +u^Tx_j,\forall j\not =i \}, \forall\, 1\leq i\leq d

which is a polyhedron, we may write the above compactly as

\displaystyle v(u) = c^Tx_i +u^Tx_i, \quad y(u) = (A_{\cdot V(x_i)})^{-T}(c+u), u \in \Omega_i.

Then it is easy to see that the Lipschitz condition of {v(u)} is

\displaystyle |v(u_1)-v(u_2)| \leq \max\{\|x_i\|_*,x_i \in \mathcal{V}_p\|\} \|u_1-u_2\|

where {\|\cdot\|_*} is the dual norm of {\|\cdot\|}. We note {\|\cdot \|} here is arbitrary.

Similarly, by using the theorem in this post, we see that {y(u)} is also Lipschitz continuous with Lipschitz property

\displaystyle \| y(u_1) - y(u_2)\|_\alpha \leq\max \{ \|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta }, x \in \mathcal{V}_p\}\|u_1-u_2\|_{\beta }

where {\|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta } = \max_{\|x\|_{\beta }=1}\|(A_{\cdot V(x)})^{-T}\|_\alpha} and {\|\cdot\|_\alpha, \|\cdot\|_\beta} are arbitrary norms. We conclude the above discussion in the following theorem.

Theorem 1 Recall {\mathcal{V}_p} is the set of extreme points of the feasible region of (1), {\mathcal{R}_p} is the set of extreme points of the feasible region of (1) and {\mathcal{F}_u = \{ u\mid \gamma_j ^Tu\geq 0, \forall \gamma_j \in \mathcal{R}_p\}}. Under assumption A1 and A2, we have for all {x\in \mathcal{F}_u},

\displaystyle |v(u_1)-v(u_2)| \leq \max\{\|x_i\|_*,x_i \in \mathcal{V}_p\|\} \|u_1-u_2\|

and

\displaystyle \| y(u_1) - y(u_2)\|_\alpha \leq\max \{ \|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta }, x \in \mathcal{V}_p\}\|u_1-u_2\|_\beta

where {\|\cdot\|,\|\cdot\|_{\alpha},\|\cdot\|_\beta } are arbitrary norms. Here {A_{\cdot V(x)}} is the submatrix of {A} having columns with indices in {\mathcal{V}(x)=\{i\mid x_i \text{ is nonzero}\}}.

How might one bound the Lipschitz constants

\displaystyle \max\{\|x_i\|_*,x_i \in \mathcal{V}_p\|\} \text{ and } \max \{ \|(A_{\cdot V(x)})^{-T}\|_{\alpha,\beta }, x \in \mathcal{V}_p\}?

The former might be done by giving a bound of the diameter of the feasible region {\mathcal{F}_p} of Program (1) and the later might be done through some restricted isometry property of {A}.

Lipschitz constant of piecewise linear (vector valued) functions

Suppose we have a collection of sets {\{\Omega_i \subset \mathbb{R}^{n}\}_{i=1}^d} and a set of {d} linear functions {\{f_i(x) = A_ix +c_i\mid x\in \Omega_i, i=1,\dots, d\}} where each {\Omega_i} being closed polyhedrons in {\mathbb{R}^{n}}, i.e., {\Omega_i = \{x\mid a_{i_j}^Tx + b_{i_j} \leq 0,\forall j\leq k(i)\}} for some {a_{i_j} \in \mathbb{R}^{n},b_{i_j}\in \mathbb{R},k(i)\in \mathbb{N}}, and {A_i\in \mathbb{R}^{m\times n}} and {c_i \in \mathbb{R}^{m}}. Moreover, for any {i\not =j}, and {x\in \Omega_i \cap \Omega_j}, we have {f_i(x) = f_j(x)}. Then we can define function {f} on {\Omega =\cup_{i=1}^d \Omega_i} such that

\displaystyle f(x) = f_i(x) \quad \text{if} \; x\in \Omega_i.

We further assume \Omega is convex. We now state a theorem concerning the Lipschitz constant of {f}.

Theorem 1 Under the above setting, given arbitrary norms {\|\cdot\|_\alpha,\|\cdot\|_\beta }, {f} is {\max_{i\leq d}\{\|A_i\|_{\alpha,\beta}\}}-Lipschitz continuous with respect to {\|\cdot\|_\alpha} and {\|\cdot \|_\beta }, i.e., for all {x_1,x_2 \in \Omega},

\displaystyle \begin{aligned} \| f(x_2) - f(x_1)\|_\alpha \leq \max_{i\leq d}\{\|A_i\|_{\alpha,\beta }\}\|x_2-x_1\|_\beta . \end{aligned} \ \ \ \ \ (1)

Here {\|A_i\|} is {\|A_i\|_{\alpha,\beta } = \max_{\|x\|_\beta \leq 1}\|A_ix\|_{\alpha}}.

Proof: By letting {g(t) = x_1 + t(x_2 - x_1)}, {L = \max_{i\leq d}\{\|A_i\|_{\alpha,\beta }\}\|x_2-x_1\|_\beta } and {h= f\circ g}. we see the inequality (1) is equivalent to

\displaystyle \begin{aligned} \|h(1)- h(0)\|_\alpha \leq L. \end{aligned} \ \ \ \ \ (2)

Using the property that {f} is well-defined on {\Omega_i\cap \Omega_j} for any {i,j}, we have that there exists {0=t_0\leq \dots \leq t_l\leq \dots \leq t_K\leq t_{K+1}=1, l=1,\dots,K} and corresponding {i_1,\dots ,i_K} such that

\displaystyle \begin{aligned} h(t_l)= A_{i_l}x^{(l)}+c_{i_l} = A_{i_{l-1}}x^{(l)}+c_{i_{l-1}} \end{aligned} \ \ \ \ \ (3)

for all {1\leq l\leq K} where {x^{(l)} = g(t_l)}. We also let {x^{(0)} =g(t_0) =x_1} and {x^{(K+1)}=g(t_{{K+1}}) =x_2}. These {x_l}s have the property that

\displaystyle \begin{aligned} \sum_{l=0}^{K}\|x_{i_{l+1}} -x_{i_l}\|_\beta = \sum_{l=0}^{K}(t_{l+1}-t_l)\|x_2 -x_1\|_\beta = \|x_2-x_1\|_\beta . \end{aligned} \ \ \ \ \ (4)

Thus we can bound the term {\|h(1)- h(0)\|_\alpha } by

\displaystyle \begin{aligned} \|h(1) - h(0)\|_\alpha &\leq \sum_{l=0}^{K} \|h(t_{l+1}) -h(t_{l})\|_\alpha \\ & \leq \sum_{l=0}^{K} \|A_{i_{l+1}}x_{l+1} +c_{i_{l+1}} - A_{i_l}x_{l} -c_{i_l}\|_\alpha \\ &\overset{(i)}{\leq } \sum_{l=0}^{K-1} \|A_{i_{l}}x_{l+1} +c_{i_{l}} - A_{i_l}x_{l} -c_{i_l}\|_\alpha \\ & = \sum_{l=0}^{K} \|A_{i_{l}}x_{l+1} - A_{i_l}x_{l} \|_\alpha \\ &\overset{(ii)}{\leq}\sum_{l=0}^{K}\|A_{i_l}\|_{\alpha,\beta } \|x_{l+1}-x_{l}\|_\beta \\ & \leq \max_{i}\{\|A_i\|_{\alpha,\beta },i\leq d\}\sum_{l=0}^{K} \|x_{l+1}-x_{l}\|_\beta \\ & \overset{(iii)}{\leq}\max_{i}\{\|A_i\|_{\alpha,\beta },i\leq d\} \|x_2-x_1\|_\beta \\ & = L, \end{aligned} \ \ \ \ \ (5)

where {(i)} is due to inequality (3), {(ii)} is due to the definition of operator norm and {(iii)} is using equality (4). This proves inequality (1). \Box

Sandwich inequality of optimal gap and distance to solutions of LP

Suppose we have a general optimization problem.

\displaystyle \begin{aligned} & \underset{x \in \mathbb{R}^n}{\text{minimize}} & & f(x)\\ & {\text{subject to}} & & x \in \Omega. \\ \end{aligned} \ \ \ \ \ (1)

Also suppose problem (1) has a minimum and the minimum can be achieved by an unique minimizer {x^*}.

Now if I have a point {x} such that {f(x) - f(x^*) } is very small, then how small is the distance {\|x-x^*\|}. We might expect that {f(x) - f(x^*) \rightarrow 0} will imply that {x \rightarrow x^*}. This is true if {\Omega} is compact and {f} is continuous. But this does not tell what is the quantitative relationship between the optimal gap, i.e., {f(x) -f(x^*)} and the distance to the solution, i.e., {\|x-x^*\|}.

In this post, I am going to show that for linear programming (LP) problem, the optimal gap and distance to solutions are the same up to a multiplicative constant which only depend on the problem data.

To start, consider an LP in the standard form, i.e.,

\displaystyle \begin{aligned} & \underset{x \in \mathbb{R}^n}{\text{minimize}} & & c^Tx\\ & {\text{subject to}} & & Ax =b \\ & & & x\geq 0. \end{aligned} \ \ \ \ \ (2)

where the decision variable is {x} and problem data are {A\in \mathbb{R}^{m\times n},b\in \mathbb{R}^m,c\in \mathbb{}R^n}. {x\geq 0} means each coordinate of {x} is nonnegative.

Denote the solution set of problem (1) to be {X^* = \arg \min\{ c^Tx| Ax = b, x\geq 0\} }, and the distance to the solution set {X^*} to be {\text{dist}(x,X^*) = \inf_{x^* \in X^*} \|x-x^*\|}. Note that the norm here is arbitrary, not necessarily the Euclidean norm.

We are now ready to state the theorem.

Theorem 1 (Sandwich inequality of optimal gap and distance to solutions of LP) For problem (1), theres exist constants {C_0, c_0>0} which depends only on {A,b,c} and {\|\cdot\|} such that for all feasible {x}, i.e., {Ax= b, x\geq 0},

\displaystyle C_0 \text{dist}(x,X^*)\geq c^Tx - c^Tx^* \geq c_0 \text{dist}(x,X^*).

The above theorem shows that the role of optimal gap, i.e., {c^Tx - c^Tx^*}, and the distance to the solution set, i.e. {\text{dist}(x,X^*)}, are the same up to a multiplicative constant. The right inequality  of the theorem is usually referred as linear growth  in the optimization literature.

The proof below is constructive and we can in fact take {c_0 = \frac{\epsilon}{\max\{2B,1\}}} where { B = \max_{ 1\leq i \leq l} \|x_i\|} and {\epsilon = \min_{ m+1\leq i\leq l, n+1\leq j \leq s} \{ c^Tx_i-c^Tx^*, c^T\gamma_j\}} for {x^*\in X^*} and {C_0 = \|c\|_*}. Here {\| y\|_* = \sup \{ u^Ty\mid \|u\|\leq 1\}} is the dual norm and x_i,\gamma_j are extreme points and extreme rays. We assume there are l many extreme points and s many extreme rays with first m\; x_is and first n \; \gamma_j are in the optimal set X^*. See the proof for more detail.

The idea of the proof mainly relies on the extreme point and extreme rays representation of the feasible region and the optimal set,i.e., {\{x : Ax = b, x\geq 0\}} and X^*.

Proof: The feasible region can be written as

\displaystyle \{x|Ax=b,x\geq 0\} = \{ \sum_{i=1}^l \alpha_i x_i + \sum_{j=1}^s \beta_j \gamma_j | \sum_i \alpha_i = 1, \alpha_i\geq 0, \beta_j\geq 0\}.

Here {x_i}s are extreme points of the feasible region and {\gamma_j}s are extreme rays. By scaling the extreme rays, we can assume that {\|\gamma_j\| =1} for all {j}.

The optimal set can also be written as

\displaystyle X^* = \{ \sum_{i=1}^m \alpha_i x_i + \sum_{j=1}^n \beta_j \gamma_j | \sum_i \alpha_i = 1, \alpha_i\geq 0, \beta_j\geq 0\}.

We assume here the first {m} many {x_i} and {n} many {\gamma_j} are in the optimal set and the rest of {x_i} and {\gamma_j}s are not for notation simplicity.

We denote {B = \max_{ 1\leq i \leq l} \|x_i\|} and {\epsilon = \min_{ m+1\leq i\leq l, n+1\leq j \leq s} \{ c^Tx_i-c^Tx^*, c^T\gamma_j\}} where {x^*\in X^*}. Note {\epsilon>0} since the {\gamma_j}s not in the optimal set should have inner product with {c} to be positive.

We first prove the second inquality, i.e., {c^Tx - c^Tx^* \geq c_0 \text{dist}(x,X^*)}.

Now take an arbitrary feasible {x}, it can be written as

\displaystyle x = \sum_{i=1}^m a_i x_i + \sum_{i=m+1}^l a_i x_i + \sum _{j=1}^n b_j \gamma_j + \sum_{j=n+1}^s b_j \gamma_j

for some {a_i \geq 0, \sum_i a_i =1} and {b_j\geq 0}.

The objective value of {x} is then

\displaystyle c^Tx = \sum_{i=1}^m a_i c^T x_i + \sum_{i=m+1}^l a_i c^T x_i +\sum_{j=n+1}^s b_j c^T \gamma_j.

We use the fact that {c^T\gamma_j=0} for all {j\leq m} here.

Subtract the above by {c^Tx^*}. We have

\displaystyle \begin{aligned} c^Tx -c^Tx^*& = (\sum_{i=1}^m a_i-1)c^T x_i + \sum_{i=m+1}^l a_i c^T x_i +\sum_{j=n+1}^s b_j c^T \gamma_j \\ & = (-\sum_{i=m+1}^l a_i)c^T x_i + \sum_{i=m+1}^l a_i c^T x_i +\sum_{j=n+1}^s b_j c^T \gamma_j \\ &\geq (\sum_{i=m+1}^l a_i) \epsilon + (\sum_{j=n+1}^s b_j )\epsilon \end{aligned} \ \ \ \ \ (3)

The second equality is due to {\sum_i a_i =1} and the inequality is because of the definition of {\epsilon} and the {b_j,a_i}s are positive.

The distance between {x} and {X^*} is the infimum of

\displaystyle \| \sum_{i=1}^m( a_i-\alpha_i)x_i + \sum_{i=m+1}^l a_i x_i + \sum _{j=1}^n (b_j-\beta_j) \gamma_j + \sum_{j=n+1}^s b_j \gamma_j \|.

By taking {\alpha_i \geq a_i} with {\sum_i^m \alpha_i =1} and {\beta_j = b_j} and apply triangular inequality to the above quantity, we have

\displaystyle \begin{aligned} &\| \sum_{i=1}^m( a_i-\alpha_i)x_i + \sum_{i=m+1}^l a_i x_i + \sum _{j=1}^n (b_j-\beta_j) \gamma_j + \sum_{j=n+1}^s b_j \gamma_j \| \\ &\leq \sum_{i=1}^m( \alpha_i-a_i)\|x_i \|+ \sum_{i=m+1}^l a_i\|x_i \|+\sum_{j=n+1}^s b_j \|\gamma_j \|\\ &\leq \sum_{i=1}^m (\alpha_i -a_i) B + \sum_{i=m+1}^l a_i B + \sum_{j=n+1}^s b_j\\ &= (1-\sum_{i=1}^m a_i)B + (\sum_{i=m+1}^l a_i)B + \sum_{j=n+1}^s b_j\\ &= 2B (\sum_{i=m+1}^l a_i) +\sum_{j=n+1}^s b_j. \end{aligned} \ \ \ \ \ (4)

The first inequality is the triangular inequality and {\alpha_i\geq a_i, a_i\geq 0, b_j\geq 0}. The second inequality is applying the definition of {B} and {\|\gamma_j\|=1}. The first equality is due to {\sum_i \alpha_i =1} and the second equality is due to {\sum_i a_i =1}.

Thus the distance between {x} and {X^*} is bounded above by

\displaystyle \text{dist}(x,X^*) \leq 2B(\sum_{i=m+1}^l a_i) +\sum_{j=n+1}^s b_j.

Since {c^Tx -c^Tx^* \geq( \sum_{i = m+1}^l a_i + \sum_{j=n+1}^s b_j)\epsilon} by our previous argument, we see that setting

\displaystyle c_0 = \frac{\epsilon}{\max\{2B,1\}}

should give us

\displaystyle c^Tx - c^Tx^* \geq c_0 \text{dist}(x,X^*).

We now prove the inequality

\displaystyle C_0 \text{dist}(x,X^*)\geq c^Tx - c^Tx^*.

Note that the infimum in {\text{dist}(x,X^*) = \inf_{x^* \in X^*} \|x-x^*\|} is actually achieved by some {x^*}. The reason is that we can first pick a {x'\in X^*}, then

\displaystyle \inf_{x^* \in X^*} \|x-x^*\| = \inf_{x^* \in X^*, \|x-x^*\| \leq \|x-x'\|} \|x-x^*\|.

But the set {X^* \cap \{x^* | \|x-x^*\| \leq \|x-x'\| \}} is actually bounded and closed ({X^*} is closed as it is a convex combination of finite points plus a conic combination of extreme vectors), thus Weierstrass theorems tells us that the infimum is actually achieved by some {x^*\in X}.

Now take {x^*} such that {\|x-x^*\| =\text{dist}(x,X)}. We have

\displaystyle c^Tx -c^Tx^* \leq \|c\|_* \|x-x^*\|=\|c\|_*\text{dist}(x,X^*)

where {\|\cdot\|_*} is the dual norm of {\|\cdot\|}. Thus letting {C_0 = \|c\|_*} finishes the proof. \Box

From the proof, we see that two possible choice of {c_0} and {C_0} are {c_0 = \frac{\epsilon}{\max\{2B,1\}}} where { B = \max_{ 1\leq i \leq l} \|x_i\|} and {\epsilon = \min_{ m+1\leq i\leq l, n+1\leq j \leq s} \{ c^Tx_i-c^Tx^*, c^T\gamma_j\}} for {x^*\in X^*} and {C_0 = \|c\|_*}. These are not optimal and can be sharpened. I probably will give a sharper constant in a future post.