An interlacing property of order statistics

In this post, we present an interlacing property of order statistics.

Suppose we have a distribution {F} on the real line {\mathbb{R}} and {\theta_i \overset{\text{iid}}{\sim} F} for {i=1,\dots, n}, where {\overset{\text{iid}}{\sim}} means “independently and identically distributed as”. We denote by {\theta_{(i)}^n} the {n-i+1}-th largest (equivalently, the {i}-th smallest) of the {n} random variables {\theta_i}. So {\theta_{(n)}^n} is the largest and {\theta_{(1)}^n} is the smallest of the {n} iid {\theta}s. Note that the superscript {n} in {\theta^n_{(i)}} does not mean the {n}-th power; it records how many {\theta_i}s we are considering.
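In code, the {\theta^n_{(i)}}s are simply the sorted values of the sample; here is a minimal sketch in Python (assuming NumPy is available):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    theta = rng.standard_normal(n)           # iid draws theta_1, ..., theta_n from F = N(0, 1)
    order_stats = np.sort(theta)             # order_stats[i - 1] plays the role of theta^n_(i)
    print(order_stats[0], order_stats[-1])   # theta^n_(1) (smallest) and theta^n_(n) (largest)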

The ordered variables {\{\theta^{n}_{(i)}\}_{i=1}^n} are called order statistics in statistics. Together they form a sufficient statistic for the distribution {F} when the {\theta_i}s are independent and all follow {F}. We now present an interlacing property of order statistics.

Theorem 1 For any {n,m>0}, {1\leq l\leq n}, {m-n+1\leq i\leq m} and {x\in \mathbb{R}}, we have

\displaystyle \begin{aligned} \mathbb{P}( \theta_{(n+i)}^{n+m}\leq x)\leq \mathbb{P} (\theta^n_{(n-m+i)}\leq x)\text{ and } \mathbb{P}( \theta_{(l)}^n\leq x) \leq \mathbb{ P}(\theta_{(l)}^{n+m}\leq x) \end{aligned} \ \ \ \ \ (1)

If {\theta_i} has finite mean, then we have

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n+i)}^{n+m})\geq \mathbb{E} (\theta^n_{(n-m+i)} )\text{ and } \mathbb{E}( \theta_{(l)}^n) \geq \mathbb{ E}(\theta_{(l)}^{n+m}) \end{aligned} \ \ \ \ \ (2)

The right inequalities in (1) and (2) say that enlarging the sample from {n} to {n+m} can only make the {l}-th smallest value stochastically smaller, and smaller in expectation. The left inequalities say that enlarging the sample can only make the {m-i+1}-th largest value stochastically larger, and larger in expectation: both {\theta_{(n+i)}^{n+m}} and {\theta^n_{(n-m+i)}} are the {m-i+1}-th largest value of their respective samples.

Inequality (1) is the stochastic dominance version of inequality (2); since stochastic dominance implies the corresponding comparison of expectations (when the means exist), the first chain is the stronger property.
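As a quick Monte Carlo sanity check of the right inequality of (1), here is a sketch in Python (assuming NumPy; the values {n=5}, {m=3}, {l=2}, {x=0} and the standard normal {F} are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, l, x, reps = 5, 3, 2, 0.0, 200_000

    samples = rng.standard_normal((reps, n + m))
    small = np.sort(samples[:, :n], axis=1)[:, l - 1]   # theta^n_(l), from the first n columns only
    large = np.sort(samples, axis=1)[:, l - 1]          # theta^{n+m}_(l), from all n + m columns

    # Right inequality of (1): P(theta^n_(l) <= x) <= P(theta^{n+m}_(l) <= x)
    print((small <= x).mean(), (large <= x).mean())     # roughly 0.81 versus 0.96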

Let us now explain why we call Theorem 1 an interlacing property. Consider the case {n\geq 2} and take {m=1}, {l =n}, and {i=0} in Theorem 1. Inequality (2) gives

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n)}^n) \geq \mathbb{ E}(\theta_{(n)}^{n+1})\geq \mathbb{E} (\theta^n_{(n-1)}). \end{aligned} \ \ \ \ \ (3)

If {n\geq 3}, then we can take {l=n-1}, {m=1}, and {i=-1}, and inequality (2) gives

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n-1)}^{n}) \geq \mathbb{ E}(\theta_{(n-1)}^{n+1})\geq \mathbb{E} (\theta^n_{(n-2)}). \end{aligned} \ \ \ \ \ (4)

Thus from (3) and (4) we have

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n)}^n) \geq \mathbb{ E}(\theta_{(n)}^{n+1})\geq \mathbb{E} (\theta^n_{(n-1)})\geq \mathbb{ E}(\theta_{(n-1)}^{n+1})\geq \mathbb{E} (\theta^n_{(n-2)}). \end{aligned} \ \ \ \ \ (5)

Continuing to decrease {l} and {i} in this way while fixing {m=1}, and noting that {\mathbb{E}( \theta_{(n+1)}^{n+1})\geq \mathbb{E}( \theta_{(n)}^n)} (the left inequality of (2) applied with {i=m=1}), we see that

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n+1)}^{n+1})\geq \mathbb{E}( \theta_{(n)}^n) \geq \mathbb{ E}(\theta_{(n)}^{n+1})\geq \mathbb{E} (\theta^n_{(n-1)})\geq\dots \geq\mathbb{E} (\theta^{n+1}_{(2)})\geq \mathbb{E} (\theta^{n}_{(1)})\geq \mathbb{E} (\theta^{n+1}_{(1)}) . \end{aligned} \ \ \ \ \ (6)

Thus we see that the quantities {\{\theta_{(i)}^n\}_{i=1}^n} and {\{\theta_{(i)}^{n+1}\}_{i=1}^{n+1}} interlace. The main idea of the proof is to put {\theta_{(l)}^n}, {\theta_{(l)}^{n+m}}, and {\theta_{(l-m)}^n} on the same probability space and argue that we actually have almost sure inequalities, which imply both stochastic dominance and the inequalities in expectation. This technique is known as coupling.
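Before giving the proof, here is a quick numerical look at (6) (a sketch in Python assuming NumPy, taking {F} to be Uniform(0,1) and {n=4}; the exact means are {i/(n+1)} and {i/(n+2)}):

    import numpy as np

    rng = np.random.default_rng(2)
    n, reps = 4, 200_000

    samples = rng.random((reps, n + 1))                        # iid Uniform(0, 1)
    mean_n  = np.sort(samples[:, :n], axis=1).mean(axis=0)     # estimates of E[theta^n_(i)], i = 1, ..., n
    mean_n1 = np.sort(samples, axis=1).mean(axis=0)            # estimates of E[theta^{n+1}_(i)], i = 1, ..., n+1

    print(mean_n)    # approximately [1/5, 2/5, 3/5, 4/5]
    print(mean_n1)   # approximately [1/6, 2/6, 3/6, 4/6, 5/6]

Merging the two printed sequences from largest to smallest reproduces the interlaced chain (6).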

Proof: Let us first prove the right inequality. Generate {n} iid {\theta_i}s following {F} and order them so that we have the {\theta_{(i)}^n}s. Now generate (independently) {m} more variables {\theta_{n+1},\dots,\theta_{n+m}}, also following {F}. Considering all the {\theta}s generated, we find that {\theta_{(l)}^{n+m}} is at most {\theta_{(l)}^n}, since

\displaystyle \theta_{(l)}^{n+m} = \theta_{(l)}^n \text{ if } \theta_{i+n} \geq \theta_{(l)}^n,\forall 1\leq i\leq m

and

\displaystyle \theta_{(l)}^{n+m} \leq \theta_{(l)}^n, \text{ if } \exists i \text{ s.t. } \theta_{i+n}< \theta_{(l)}^n.

Since we have an almost sure inequality, we certainly have stochastic dominance and the inequality in expectation. Now consider the left inequality. We use the same coupling as before. After we have the {\theta_{(i)}^n}s, the {n-(l-m)+1}-th largest value of the first {n} samples, {\theta_{(l-m)}^{n}}, is never larger than the {n-(l-m)+1}-th largest value of all {n+m} samples, {\theta_{(l)}^{n+m}}, since

\displaystyle \theta_{(l)}^{n+m} = \theta_{(l-m)}^{n} \text{ if } \theta_{i+n} \leq \theta_{(l-m)}^n,\forall 1\leq i\leq m

and

\displaystyle \theta_{(l)}^{n+m} \geq \theta_{(l-m)}^{n} \text{ if } \exists i \text{ s.t. } \theta_{i+n} >\theta_{(l-m)}^n.

Thus we have shown stochastic dominance for the random variables under consideration. The argument is a bit wordy; drawing a picture of the data-generating process and comparing the {\theta_i}s makes the inequalities clear. \Box
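The coupling in the proof is also easy to simulate (a sketch in Python assuming NumPy, with arbitrary choices {n=6}, {m=2}, {l=4}): generate the {n+m} draws once and compare order statistics row by row.

    import numpy as np

    rng = np.random.default_rng(3)
    n, m, l, reps = 6, 2, 4, 100_000

    samples = rng.standard_normal((reps, n + m))   # one probability space for both samples
    first_n = np.sort(samples[:, :n], axis=1)      # order statistics theta^n_(.)
    all_nm  = np.sort(samples, axis=1)             # order statistics theta^{n+m}_(.)

    # The two almost sure inequalities from the coupling argument:
    print(np.all(all_nm[:, l - 1] <= first_n[:, l - 1]))       # theta^{n+m}_(l) <= theta^n_(l)
    print(np.all(all_nm[:, l - 1] >= first_n[:, l - m - 1]))   # theta^{n+m}_(l) >= theta^n_(l-m)

Both checks print True, reflecting that the inequalities hold almost surely under the coupling, not merely on average.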

Interestingly, the above proof did not actually use the assumption that the {\theta_i}s are iid. This means the same conclusion holds for an arbitrary joint distribution. We state this result explicitly below.

Theorem 2 Let {\theta_1,\dots, \theta_{n+m}} have joint distribution {F_{n+m}}. Denote by {\theta_{(1)}\leq \theta_{(2)}\leq \dots \leq \theta_{(n)} \leq \dots \leq \theta_{(n+m)}} the order statistics. Now consider only the first {n} of the {\theta_i}s, i.e., {\theta_1,\dots,\theta_n}, and denote their marginal distribution by {F_{n}}. We denote the order statistics of these {n} {\theta}s by {\tilde{\theta}_{(1)} \leq \dots \leq \tilde{\theta}_{(n)}}. Then for any {n,m>0}, {m< l\leq n}, {1\leq i\leq m} with {n-m+i\geq l}, and {x\in \mathbb{R}}, we have

\displaystyle \begin{aligned} \mathbb{P}( \theta_{(n+i)}\leq x)\leq \mathbb{P}( \tilde{\theta}_{(l)}\leq x) \leq \mathbb{ P}(\theta_{(l)}\leq x)\leq \mathbb{P} (\tilde{\theta}_{(l-m)}\leq x). \end{aligned} \ \ \ \ \ (8)

If each {\theta_i} has finite mean, then we have

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n+i)})\geq \mathbb{E}( \tilde{\theta}_{(l)}) \geq\mathbb{ E}(\theta_{(l)})\geq\mathbb{E} (\tilde{\theta}_{(l-m)}). \end{aligned} \ \ \ \ \ (9)
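Theorem 2 requires no independence at all. As a quick illustration (a sketch in Python assuming NumPy, with an arbitrarily chosen equicorrelated Gaussian vector), one can check the middle inequality of (8):

    import numpy as np

    rng = np.random.default_rng(4)
    n, m, l, x, reps = 5, 2, 4, 0.5, 200_000

    # Equicorrelated Gaussian vector: theta_j = sqrt(0.5) * Z + sqrt(0.5) * eps_j, correlation 0.5
    z = rng.standard_normal((reps, 1))
    eps = rng.standard_normal((reps, n + m))
    theta = np.sqrt(0.5) * z + np.sqrt(0.5) * eps

    tilde_l = np.sort(theta[:, :n], axis=1)[:, l - 1]   # tilde{theta}_(l): l-th smallest of the first n
    full_l  = np.sort(theta, axis=1)[:, l - 1]          # theta_(l): l-th smallest of all n + m

    # Middle inequality of (8): P(tilde{theta}_(l) <= x) <= P(theta_(l) <= x)
    print((tilde_l <= x).mean(), (full_l <= x).mean())  # the first number does not exceed the second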

Proof: The proof is a repetition of the previous argument. Let us first prove the middle inequality of (8) and (9). Suppose we first generate the {n} variables {\theta_1,\dots,\theta_n} following {F_n} and order them so that we have the {\tilde{\theta}_{(i)}}s. Now generate the extra {m} variables {\theta_{n+1},\dots,\theta_{n+m}} from the conditional distribution, given the first {n} {\theta}s, induced by {F_{n+m}}. Specifically, the conditional distribution is

\displaystyle P(\theta_{n+1}\leq x_1,\dots, \theta_{n+m} \leq x_m \mid \theta_1,\dots,\theta_n ).

We now consider all the {\theta}s we generated. Note that the unconditional distribution of all the {\theta}s is just {F_{n+m}}. We find that {\theta_{(l)}} is at most {\tilde{\theta}_{(l)}}, since we have

\displaystyle \theta_{(l)}= \tilde{\theta}_{(l)} \text{ if } \theta_{n+i} \geq \tilde{\theta}_{(l)},\forall 1\leq i\leq m

and

\displaystyle \theta_{(l)} \leq \tilde{\theta}_{(l)}, \text{ if } \exists i \text{ s.t. } \theta_{n+i}< \tilde{\theta}_{(l)}.

Since we have an almost sure inequality, we certainly have stochastic dominance and the inequality in expectation. Now consider the right-most inequality of (8) and (9). We still follow the previous way of generating the {\theta_i}s. After we have the {\tilde{\theta}_{(i)}}s, the {n-(l-m)+1}-th largest value of the first {n} samples, {\tilde{\theta}_{(l-m)}}, is never larger than the {n-(l-m)+1}-th largest value of all {n+m} samples, {\theta_{(l)}}, since

\displaystyle \theta_{(l)} = \tilde{\theta}_{(l-m)}\text{ if } \theta_{n+i} \leq \tilde{\theta}_{(l-m)},\forall 1\leq i\leq m

and

\displaystyle \theta_{(l)} \geq \tilde{\theta}_{(l-m)} \text{ if } \exists i \text{ s.t. } \theta_{n+i} >\tilde{\theta}_{(l-m)}.

Thus we have shown stochastic dominance for the random variables under consideration. The left-most inequality of (8) and (9) follows from the fact that {\theta_{(n+i)}\geq \tilde{\theta}_{(n-m+i)}\geq \tilde{\theta}_{(l)}} whenever {n-m+i\geq l}: the first inequality is the coupling inequality just proved, applied with {l} replaced by {n+i}, and the second is the monotonicity of order statistics. \Box

I found this interlacing result while taking a take-home exam on game theory, though it turned out not to be useful for the exam. I would also like to thank Yueyang Liu for catching a mistake in a previous version of this post.

Natural sufficient statistic is not necessarily minimal for exponential family

This post gives a simple example of an exponential family whose natural parameter space is a single point and whose natural sufficient statistic is not minimal.

Let us first define the concept of an exponential family with natural parameters.

Definition 1 (Natural exponential family)
A family of probability densities (or probability mass functions) {f(\mathbf{x}|\mathbf{\eta} )} with parameter (index) {\mathbf{\eta}\in {\mathbb R}^k} is said to be a natural exponential family if {f} can be written as

\displaystyle f(\mathbf{x}|\mathbf{\eta}) = h(\mathbf{x}) \exp \left\{ \sum_{i=1}^k \mathbf{\eta}_i T_i(\mathbf{x}) - \mathcal{A}(\mathbf {\eta})\right\}. \ \ \ \ \ (1)

Here for {i = 1,\ldots,k}, we call

  • {T_i}: natural sufficient statistic
  • {\eta_i}: natural parameter
  • {H := \{ \mathbf{\eta} = (\eta_1,\ldots,\eta_k): \int h(\mathbf{x}) \exp \left\{ \sum_{i=1}^k \eta_i T_i(\mathbf{x})\right\} \, d \mathbf{x} < \infty \} }: natural parameter space.

Many well-known distributions are in fact exponential families, e.g., the normal, binomial, Poisson, beta, and gamma distributions. The exponential family is central to the modeling and analysis of data.
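For instance, the Poisson distribution with mean {\lambda>0} fits the form (1): writing {\eta=\log \lambda},

\displaystyle p(x\mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!}\exp\left\{ x\log\lambda - \lambda\right\} = h(x)\exp\left\{\eta T(x) - \mathcal{A}(\eta)\right\},

with {h(x)=1/x!}, {T(x)=x}, and {\mathcal{A}(\eta)=e^{\eta}=\lambda}.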

Next, we define the concept of sufficiency and minimal sufficiency.

Definition 2 (Sufficiency and minimal sufficiency)
For a family of probability densities {f_\eta}, {\eta \in \Theta \subset {\mathbb R}^k}, and a random variable {X\sim f_\eta}, a statistic {T(X)}
is sufficient if the conditional distribution {P(X|T(X))} does not depend on {\eta}. A sufficient statistic {T(X)} is minimal if for any other sufficient statistic {S(X)}, there is a function {g} such that {T(X)=g(S(X))}.

Intuitively, a sufficient statistic captures all the information about the underlying parameter {\eta}. Indeed, suppose someone hands you a sufficient statistic {T(X)}. Because {P(X|T(X))} does not depend on {\eta}, you know the distribution {P(X|T(X))} already. Now if you generate data {X} according to {P(X|T(X))}, then the unconditional distribution of {X} is simply {f_\eta}! So even though you don’t know the underlying distribution {f_\eta}, you can regenerate {X} so long as {T(X)} is available.
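As a small illustration of this regeneration idea (a sketch in Python assuming NumPy; the Bernoulli model and the values of {n} and {p} are arbitrary choices), take iid Bernoulli({p}) data: the number of ones is sufficient, and given that count the data is uniform over all arrangements of the ones, which involves no {p}.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, reps = 10, 0.3, 100_000

    regenerated = np.empty((reps, n))
    for r in range(reps):
        x = rng.random(n) < p                # original data: iid Bernoulli(p)
        t = x.sum()                          # sufficient statistic T(X) = number of ones
        # Given T = t, X is uniform over arrangements of t ones in n slots (no p involved).
        x_new = np.zeros(n)
        x_new[rng.choice(n, size=t, replace=False)] = 1
        regenerated[r] = x_new

    # Each coordinate of the regenerated data is again Bernoulli(p):
    print(regenerated.mean(axis=0))          # all entries close to p = 0.3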

The minimality of a sufficient statistic means the data is reduced in an optimal way, since any other sufficient statistic retains at least as much information as a minimal one. One thing to note is that this definition of minimality says nothing about the dimension of a minimal sufficient statistic. Indeed, if {T(X)} is minimal, then {(T(X),1,2)} is also minimal.

It is easily verified, using the factorization theorem, that the natural sufficient statistic is indeed sufficient. A natural question arises at this point: is the natural sufficient statistic always minimal? The following example reveals that we do need to put a few more conditions on {\eta} or {T(\mathbf{x})}.

Example 1
Consider the density family

\displaystyle f(x,y\mid \eta) = \frac{1}{(1+x^2)(1+y^2)}\exp\left (\eta \left (x^2-y^2\right)-A(\eta)\right).

where {x,y\in {\mathbb R}}.
The natural parameter space is the set of {\eta} for which the integral

\displaystyle \int_{{\mathbb R}^2} \frac{ \exp\left(\eta \left(x^2-y^2\right)\right)}{(1+x^2)(1+y^2)} dxdy = \int_{\mathbb R}\frac{\exp (\eta x^2)}{1+x^2}dx \int_{\mathbb R} \frac{\exp (-\eta y^2)}{1+y^2}dy

is finite (the factorization in the middle step uses Fubini–Tonelli, the integrand being nonnegative). For {\eta>0} the integrand {\exp(\eta x^2)/(1+x^2)} tends to infinity as {x\to\infty}, so the first integral {\int_{\mathbb R}\frac{\exp (\eta x^2)}{1+x^2}dx} is finite only when {\eta \leq 0}; similarly, the second integral {\int_{\mathbb R} \frac{\exp (-\eta y^2)}{1+y^2}dy} is finite only when {\eta\geq 0}. This means {\eta=0}, and so the natural parameter space is the single point {\{0\}\subset {\mathbb R}}.
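As a quick numerical sanity check (a sketch in Python assuming NumPy), one can look at the truncated integral {\int_{-R}^{R} \exp(\eta x^2)/(1+x^2)\, dx}: at {\eta=0} it approaches {\pi}, while for any {\eta>0} it keeps growing with {R}.

    import numpy as np

    def truncated_integral(eta, R, num=200_001):
        # Riemann-sum approximation of \int_{-R}^{R} exp(eta * x^2) / (1 + x^2) dx
        x = np.linspace(-R, R, num)
        y = np.exp(eta * x**2) / (1.0 + x**2)
        return float(np.sum(y) * (x[1] - x[0]))

    for R in (10, 50, 100):
        print(R, truncated_integral(0.0, R), truncated_integral(0.01, R))
    # eta = 0: values approach pi = 3.14159...; eta = 0.01: values blow up as R grows.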

The natural sufficient statistic {X^2-Y^2} is indeed sufficient. Since the family consists of a single distribution ({\eta=0}), any constant statistic is also sufficient, and it is minimal because every other sufficient statistic, composed with a constant function, yields that constant. If {X^2-Y^2} were minimal, it would have to be a function of this constant statistic and hence be constant itself. But {X^2-Y^2} is not constant, so we see that the natural sufficient statistic is not necessarily minimal.

It can be shown that, so long as the natural parameter space contains an open set, the natural sufficient statistic is indeed minimal; see Theorem 4.5 b) of this note.