A note on the symmetrization technique

This note addresses the steps of the symmetrization technique in empirical process theory. I will walk through each step and provide explanations. If you have some confusion and questions about this technique, like me, then this might be helpful to you! I encountered the argument perhaps 5 or 6 years ago and thought I knew what was going on. Only recently, when I tried to use it in my own research, did I realize I had missed something (especially step {(d)} below).

Setup. Consider a finite function class of size {m}, {\mathcal{F}:=\{f_i, 1\leq i\leq m\mid f_i :\mathcal{X}\rightarrow \mathbb{R}\}}, where {\mathcal{X}} is a set. We are given a sequence of independent random variables {X_i}, {i=1,\dots,n}, which take values in {\mathcal{X}}. Let {X = (X_i)_{i=1}^n}. A typical quantity appearing in empirical process theory is

\displaystyle \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \left( f(X_i) - \mathbb{E}f(X_i) \right) \right]. \ \ \ \ \ (1)

We aim to upper bound it using the so-called Rademacher complexity (to be defined later).

The symmetrization argument. The symmetrization argument uses (1) an independent copy of {X}, denoted as {X'}, and (2) {n} independent Rademacher random variables, {\epsilon_i \in \{\pm 1\}, 1\leq i\leq n} with {\mathbb{P}(\epsilon_i =1) = \mathbb{P}(\epsilon_i = -1) = \frac{1}{2}}. Denote {\epsilon = (\epsilon_i)_{i=1}^n}. Let us display the symmetrization argument:

\displaystyle \begin{aligned} & \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n f(X_i)- \mathbb{E}f(X_i) \right) \right] \\ \overset{(a)}{=} & \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n f(X_i)- \mathbb{E}_{X_i'}f(X_i') \right) \right] \\ \overset{(b)}{=} & \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n \mathbb{E}_{X_i'}(f(X_i)-f(X_i')) \right) \right]\\ \overset{(c)}{\leq} & \mathbb{E}_{X,X'}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right) \right]\\ \overset{(d)}{=} & \mathbb{E}_{X,X',\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n \epsilon_i(f(X_i)-f(X_i')) \right) \right]\\ \overset{(e)}{\leq} & \mathbb{E}_{X,X',\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i) \right| \right]+ \mathbb{E}_{X,X',\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i') \right| \right]\\ \overset{(f)}{=} & 2 \mathbb{E}_{X,\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i) \right| \right]. \end{aligned} \ \ \ \ \ (2)

The last quantity {\mathbb{E}_{X,\epsilon}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\left| \sum_{i=1}^n \epsilon_if(X_i) \right| \right]} is called the Rademacher complexity of {\mathcal{F}} w.r.t. {X}. In steps {(a)} and {(f)}, we use the fact that {X} and {X'} have the same distribution. In step {(b)}, we use the fact that {X} and {X'} are independent, so that we can take the expectation w.r.t. {X'} first (or consider {\mathbb{E}_{X'}[\cdot]} as {\mathbb{E}[ \cdot \mid X]}). In step {(e)}, we use the triangle inequality. Next, we explain the steps {(c)} and {(d)}.
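
To make the bound concrete, here is a small Monte Carlo sketch (my own illustration, not part of the original argument) that estimates the final quantity {2\,\mathbb{E}_{X,\epsilon}\left[ \sup_{f\in \mathcal{F}} \frac{1}{n}\left|\sum_{i=1}^n \epsilon_i f(X_i)\right|\right]} for a toy finite class of threshold functions; the class and the uniform distribution of the {X_i}s are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite class F = {f_t : x -> 1{x <= t}} for a few thresholds t (so m = 4).
thresholds = np.array([0.2, 0.4, 0.6, 0.8])

def rademacher_bound(n=100, reps=2000):
    """Monte Carlo estimate of 2 * E_{X,eps} sup_f (1/n) |sum_i eps_i f(X_i)|."""
    vals = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0.0, 1.0, size=n)                       # X_1, ..., X_n
        eps = rng.choice([-1.0, 1.0], size=n)                   # Rademacher signs
        fx = (x[:, None] <= thresholds[None, :]).astype(float)  # n x m matrix of f_j(X_i)
        vals[r] = np.abs(eps @ fx).max() / n                    # sup_f |(1/n) sum_i eps_i f(X_i)|
    return 2.0 * vals.mean()

print(rademacher_bound())
```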

Step {(c)}. Here in step {(c)}, we use the fact that for each {f}, we have

\displaystyle \left(\frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right)\right)\leq \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right).

Taking the expectation over {X'} gives

\displaystyle \mathbb{E}_{X'} \left(\frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right)\right) \leq \mathbb{E}_{X'} \sup_{f\in \mathcal{F}} \frac{1}{n}\left( \sum_{i=1}^n (f(X_i)-f(X_i')) \right).

Taking a supremum over {f} on the left-hand side of the above and then taking an expectation over {X}, we conclude step {(c)}.

Step {(d)}. Step {(d)} looks puzzling (to me) at first. A naive justification of {(d)} is that {f(X_i) - f(X'_i)} has the same distribution as {\epsilon_i (f(X_i) -f(X'_i))}. I was not convinced by this argument, since we are taking a supremum over {\mathcal{F}}.

Denote “equal in distribution” as {\overset{\text{d}}{=}}. The thing we are (or I am) missing is that there are {m} functions {f} in {\mathcal{F}}. Hence, knowing that for each {f},

\displaystyle (f(X_i) - f(X'_i))_{1\leq i\leq n} \overset{\text{d}}{=} (\epsilon_i (f(X_i) -f(X'_i)))_{1\leq i\leq n}

is not enough. Let us first assume {\epsilon \in \{\pm 1\}^n} is fixed. What we really need is that for any fixed {\epsilon \in \{\pm 1\}^n}, the joint distributions of the following two arrays are the same:

\displaystyle [f_j(X_i)-f_j(X_i')]_{1\leq i\leq n,1\leq j\leq m} \overset{\text{d}}{=} \left[\epsilon_i \left(f_j(X_i)-f_j(X_i')\right)\right]_{1\leq i\leq n,1\leq j\leq m}. \ \ \ \ \ (3)

To argue (3), first, for each {i}, we have

\displaystyle [f_j(X_i) - f_j(X_i')]_{j=1}^m \overset{\text{d}}{=} [\epsilon_i (f_j(X_i) - f_j(X_i'))]_{j=1}^m,

by noting that {[\epsilon_i (f_j(X_i) - f_j(X_i'))]_{j=1}^m = \epsilon_i [f_j(X_i) - f_j(X_i')]_{j=1}^m}: if {\epsilon_i = 1} this is trivial, and if {\epsilon_i = -1} it holds because {(X_i,X_i')} and {(X_i',X_i)} have the same distribution, so swapping them flips the sign of every difference {f_j(X_i)-f_j(X_i')} without changing the joint distribution. Hence, by the independence of {(X_1,X_1'),\dots, (X_n,X_n')}, we know that

\displaystyle [[f_j(X_i)-f_j(X_i')]_{1\leq j\leq m}]_{1\leq i\leq n} \overset{\text{d}}{=} [[\epsilon_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n}.

This establishes the equality (3) for each fixed {\epsilon}. If {\epsilon} is uniformly distributed on {\{\pm 1\}^n} and independent of {(X,X')}, then (3) continues to hold by the law of total probability: for any measurable set {B},

\displaystyle \begin{aligned} &\mathbb{P}( [[\epsilon_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B)\\ \overset{(a)}{=}& \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[\epsilon_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B | \epsilon = \bar{\epsilon})\\ \overset{(b)}{=} & \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[\bar{\epsilon}_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B | \epsilon = \bar{\epsilon})\\ \overset{(c)}{=} & \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[\bar{\epsilon}_i(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B )\\ \overset{(d)}{=} & \sum_{\bar{\epsilon} \in \{\pm 1\}^n}\frac{1}{2^n} \mathbb{P}( [[(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B )\\ =& \mathbb{P}( [[(f_j(X_i)-f_j(X_i'))]_{1\leq j\leq m}]_{1\leq i\leq n} \in B ). \end{aligned} \ \ \ \ \ (4)

Here in {(a)}, we use the law of total probability by splitting the event according to the value {\epsilon} takes. In {(b)}, we replace the value of {\epsilon} by {\bar{\epsilon}} since this is the condition in the conditional probability. In {(c)}, we use the fact that {\epsilon} is independent of {(X,X')}. In {(d)}, we use (3) for fixed {\epsilon}. Thus, the equality (3) also holds when {\epsilon} is random. This means for any measurable function {h: \mathbb{R}^{m\times n} \rightarrow \mathbb{R}},

\displaystyle \mathbb{E}\left[h\left([f_j(X_i)-f_j(X_i')]_{1\leq j\leq m, \;1\leq i\leq n} \right)\right] = \mathbb{E}\left[h\left(\left[\epsilon_i(f_j(X_i)-f_j(X_i'))\right]_{1\leq j\leq m, \;1\leq i\leq n} \right)\right].

Step {(d)} in the symmetrization argument now follows by taking {h\left([a_{ji}]_{1\leq j\leq m,\,1\leq i\leq n}\right) = \sup_{1\leq j\leq m} \frac{1}{n}\sum_{i=1}^n a_{ji}}.
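
As a quick numerical sanity check of step {(d)} (again my own illustration, using the same toy threshold class as above), one can compare the Monte Carlo distribution of {\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^n (f(X_i)-f(X_i'))} with and without the random signs {\epsilon_i}; the two empirical distributions should agree up to simulation error.

```python
import numpy as np

rng = np.random.default_rng(1)
thresholds = np.array([0.2, 0.4, 0.6, 0.8])   # toy class f_t(x) = 1{x <= t}

def sup_stat(n=50, reps=20000, with_signs=False):
    """Monte Carlo samples of sup_f (1/n) sum_i [eps_i] (f(X_i) - f(X_i'))."""
    out = np.empty(reps)
    for r in range(reps):
        x, x_prime = rng.uniform(size=n), rng.uniform(size=n)
        diff = (x[:, None] <= thresholds).astype(float) - (x_prime[:, None] <= thresholds)
        if with_signs:
            diff = rng.choice([-1.0, 1.0], size=n)[:, None] * diff
        out[r] = diff.sum(axis=0).max() / n
    return out

a, b = sup_stat(with_signs=False), sup_stat(with_signs=True)
# A few quantiles of the two empirical distributions; they should match closely.
print(np.quantile(a, [0.25, 0.5, 0.75, 0.9]))
print(np.quantile(b, [0.25, 0.5, 0.75, 0.9]))
```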

Remark. To deal with the case that {\mathcal{F}} is infinite, we may set

\displaystyle \mathcal{A} =\{\mathcal{G}\mid \mathcal{G}\subset \mathcal{F},\; \mathcal{G} \text{ has finitely many elements}\},

and define

\displaystyle \begin{aligned} \mathbb{E}\left [ \sup_{f\in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \left( f(X_i) - \mathbb{E}f(X_i) \right) \right] := \sup_{\mathcal{G}\in \mathcal{A}} \mathbb{E}\left [ \sup_{f\in \mathcal{G}} \frac{1}{n}\sum_{i=1}^n \left( f(X_i) - \mathbb{E}f(X_i) \right) \right]. \end{aligned} \ \ \ \ \ (5)

Then our previous argument continues to work.

An interlacing property of order statistics

In this post, we present an interlacing property of order statistics.

Suppose we have a distribution {F} on the real line {\mathbb{R}} and {\theta_i \overset{\text{iid}}{\sim} F} where {i=1,\dots, n} and {\overset{\text{iid}}{\sim}} means “independently and identically distributed as”. We denote by {\theta_{(i)}^n} the {(n-i+1)}-th largest of the {n} many {\theta_i}s. So {\theta_{(n)}^n} is the largest and {\theta_{(1)}^n} is the smallest of the {n} many iid {\theta}s. Note that the superscript {n} in {\theta^n_{(i)}} does not mean the {n}-th power; it records how many {\theta_i}s we are considering.

The ordered variables {\{\theta^{n}_{(i)}\}_{i=1}^n} are called order statistics in statistics. They form a sufficient statistic for the distribution {F} when the {\theta_i}s are independent and all follow {F}. We now present an interlacing property of order statistics.
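
In code, the order statistics are just the sorted sample; here is a trivial illustration of the indexing convention (my own, not from the original post).

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=5)             # theta_1, ..., theta_n with n = 5
order_stats = np.sort(theta)           # order_stats[i - 1] is theta^n_{(i)}
print(order_stats[0] == theta.min())   # theta^n_{(1)} is the smallest
print(order_stats[-1] == theta.max())  # theta^n_{(n)} is the largest
```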

Theorem 1 For any {n,m>0}, {1\leq l\leq n}, {m-n+1\leq i\leq m} and {x\in \mathbb{R}}, we have

\displaystyle \begin{aligned} \mathbb{P}( \theta_{(n+i)}^{n+m}\leq x)\leq \mathbb{P} (\theta^n_{(n-m+i)}\leq x)\text{ and } \mathbb{P}( \theta_{(l)}^n\leq x) \leq \mathbb{ P}(\theta_{(l)}^{n+m}\leq x) \end{aligned} \ \ \ \ \ (1)

If {\theta_i} has finite mean, then we have

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n+i)}^{n+m})\geq \mathbb{E} (\theta^n_{(n-m+i)} )\text{ and } \mathbb{E}( \theta_{(l)}^n) \geq \mathbb{ E}(\theta_{(l)}^{n+m}) \end{aligned} \ \ \ \ \ (2)

The right-hand inequalities in (1) and (2) say that the {l}-th smallest value becomes (stochastically) smaller as the sample size grows from {n} to {n+m}. The left-hand inequalities in (1) and (2) compare the {(m-i+1)}-th largest value in the two samples (note that {\theta_{(n+i)}^{n+m}} and {\theta^n_{(n-m+i)}} are both the {(m-i+1)}-th largest in their respective samples) and say that it becomes (stochastically) larger as the sample size grows.

The first chain of inequalities is the stochastic-dominance version of the second, and is the stronger of the two.

Let us now explain why we call Theorem 1 an interlacing property. Consider the case that {n\geq 2} and take {m=1}, {l =n}, and {i=0} in Theorem 1. The inequality (2) gives that

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n)}^n) \geq \mathbb{ E}(\theta_{(n)}^{n+1})\geq \mathbb{E} (\theta^n_{(n-1)}). \end{aligned} \ \ \ \ \ (3)

If {n\geq 3}, then we can take {l=n-1}, {m=1}, and {i=-1}. Then equation (2) gives

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n-1)}^{n}) \geq \mathbb{ E}(\theta_{(n-1)}^{n+1})\geq \mathbb{E} (\theta^n_{(n-2)}). \end{aligned} \ \ \ \ \ (4)

Thus from (3) and (4) we have

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n)}^n) \geq \mathbb{ E}(\theta_{(n)}^{n+1})\geq \mathbb{E} (\theta^n_{(n-1)})\geq \mathbb{ E}(\theta_{(n-1)}^{n+1})\geq \mathbb{E} (\theta^n_{(n-2)}). \end{aligned} \ \ \ \ \ (5)

Continuing the choices of {l} and {i} while fixing {m=1}, we see that

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n+1)}^{n+1})\geq \mathbb{E}( \theta_{(n)}^n) \geq \mathbb{ E}(\theta_{(n)}^{n+1})\geq \mathbb{E} (\theta^n_{(n-1)})\geq\dots \geq\mathbb{E} (\theta^{n+1}_{(2)})\geq \mathbb{E} (\theta^{n}_{(1)})\geq \mathbb{E} (\theta^{n+1}_{(1)}) . \end{aligned} \ \ \ \ \ (6)

Thus we see the expectations of {\{\theta_{(i)}^n\}_{i=1}^n} and {\{\theta_{(i)}^{n+1}\}_{i=1}^{n+1}} are interlacing. The main idea of the proof is to put {\theta_{(l)}^n}, {\theta_{(l)}^{n+m}}, and {\theta_{(l-m)}^n} in the same probability space and argue that we actually have almost sure inequalities instead of just inequalities in expectation or stochastic dominance. This technique is known as coupling.
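
Before turning to the proof, here is a short Monte Carlo check of the chain (6) (my own sketch; the standard normal {F} and the sample size {n=4} are arbitrary choices): the estimated expected order statistics of samples of size {n} and {n+1} should interleave.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_order_stats(n, reps=200_000):
    """Monte Carlo estimates of E[theta^n_{(1)}], ..., E[theta^n_{(n)}] for F = N(0, 1)."""
    samples = rng.normal(size=(reps, n))
    return np.sort(samples, axis=1).mean(axis=0)

n = 4
small, large = expected_order_stats(n), expected_order_stats(n + 1)

# Interleave the two sequences exactly as in (6):
# E[theta^{n+1}_{(n+1)}] >= E[theta^n_{(n)}] >= E[theta^{n+1}_{(n)}] >= ... >= E[theta^{n+1}_{(1)}]
chain = [large[n]]
for k in range(n, 0, -1):
    chain += [small[k - 1], large[k - 1]]
print(np.round(chain, 3))
print(all(a >= b - 1e-3 for a, b in zip(chain, chain[1:])))  # True up to Monte Carlo error
```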

Proof: Let us first prove the right inequality. We first generate {n} many iid {\theta_i}s following {F} and order them so that we have the {\theta_{(i)}^n}s. We then generate (independently) {m} many more {\theta_{i+n}\sim F}, {i=1,\dots,m}. Considering all the {\theta}s generated, we find that {\theta_{(l)}^{n+m}} is at most {\theta_{(l)}^n}, since we have

\displaystyle \theta_{(l)}^{n+m} = \theta_{(l)}^n \text{ if } \theta_{i+n} \geq \theta_{(l)}^n,\forall 1\leq i\leq m

and

\displaystyle \theta_{(l)}^{n+m} \leq \theta_{(l)}^n, \text{ if } \exists i \text{ s.t. } \theta_{i+n}< \theta_{(l)}^n.

Since we have an almost sure inequality, we surely have stochastic dominance and inequality in expectation. Now consider the left inequality. We still follow the previous way of generating the {\theta_i}s. After we have the {\theta_{(i)}^n}s, the {(n-(l-m)+1)}-th largest value {\theta_{(l-m)}^{n}} is always at most the {(n+m-l+1)}-th largest value {\theta_{(l)}^{n+m}} (note that these ranks from the top coincide), since

\displaystyle \theta_{(l)}^{n+m} = \theta_{(l-m)}^{n} \text{ if } \theta_{i+n} \leq \theta_{(l-m)}^n,\forall 1\leq i\leq m

and

\displaystyle \theta_{(l)}^{n+m} \geq \theta_{(l-m)}^{n} \text{ if } \exists i \text{ s.t. } \theta_{i+n} >\theta_{(l-m)}^n.

Thus we have shown stochastic dominance for the random variables we are considering. The proof here is a bit lengthy and perhaps not so transparent, but it is correct. One may draw a picture of the data-generating process and then compare the {\theta_i}s to verify it. \Box
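
The coupling can also be checked by simulation (my own sketch, with an exponential {F} and {n=6}, {m=2}, {l=4} chosen arbitrarily): the two almost sure inequalities from the proof hold for every simulated draw, not just on average.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, l = 6, 2, 4                              # any l with m < l <= n works here
ok = True
for _ in range(10_000):
    theta = rng.exponential(size=n + m)        # first n are the "old" draws, last m the "new" ones
    old = np.sort(theta[:n])
    combined = np.sort(theta)
    # Coupling inequalities from the proof (the 1-based index (l) is index l - 1 here):
    ok &= combined[l - 1] <= old[l - 1]        # theta^{n+m}_{(l)} <= theta^n_{(l)}
    ok &= combined[l - 1] >= old[l - m - 1]    # theta^{n+m}_{(l)} >= theta^n_{(l-m)}
print(ok)   # True: the inequalities hold almost surely, not merely in expectation
```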

Interestingly, the above proof did not actually use the assumption that the {\theta_i}s are iid. This means we have the same conclusion for any joint distribution. We state this result explicitly below.

Theorem 2 Let {\theta_1,\dots, \theta_{n+m}} have joint distribution {F_{m+n}}. Denote by {\theta_{(1)}\leq \theta_{(2)}\leq \dots \leq \theta_{(n)} \leq \dots \leq \theta_{(n+m)}} the order statistics. Now consider only the first {n} many {\theta_i}s, i.e., {\theta_1,\dots,\theta_n}. Denote their marginal distribution as {F_{n}}. We denote the order statistics of these {n} many {\theta}s by {\tilde{\theta}_{(1)} \leq \dots \leq \tilde{\theta}_{(n)}}. Then for any {n,m>0}, {m< l\leq n}, {\max(1,\,l+m-n)\leq i\leq m} and {x\in \mathbb{R}}, we have

\displaystyle \begin{aligned} \mathbb{P}( \theta_{(n+i)}\leq x)\leq \mathbb{P}( \tilde{\theta}_{(l)}\leq x) \leq \mathbb{ P}(\theta_{(l)}\leq x)\leq \mathbb{P} (\tilde{\theta}_{(l-m)}\leq x). \end{aligned} \ \ \ \ \ (8)

If each {\theta_i} has finite mean, then we have

\displaystyle \begin{aligned} \mathbb{E}( \theta_{(n+i)})\geq \mathbb{E}( \tilde{\theta}_{(l)}) \geq\mathbb{ E}(\theta_{(l)})\geq\mathbb{E} (\tilde{\theta}_{(l-m)}). \end{aligned} \ \ \ \ \ (9)

Proof: The proof is a repetition of the previous argument. Let us first prove the middle inequality of (8) and (9). We first generate {n} many {\theta_i}s following {F_n} and order them so that we have the {\tilde{\theta}_{(i)}}s. We then generate the extra {m} variables {\theta_{n+1},\dots,\theta_{n+m}} from the conditional distribution defined by {F_{n+m}} given the first {n} many thetas. Specifically, the conditional distribution is

\displaystyle P(\theta_{n+1}\leq x_1,\dots, \theta_{n+m} \leq x_m \mid \theta_1,\dots,\theta_n ).

We now consider all the {\theta}s generated. Note that the unconditional distribution of all the {\theta}s is just {F_{n+m}}. We find that {\theta_{(l)}} is at most {\tilde{\theta}_{(l)}}, since we have

\displaystyle \theta_{(l)}= \tilde{\theta}_{(l)} \text{ if } \theta_{n+i} \geq \tilde{\theta}_{(l)},\forall 1\leq i\leq m

and

\displaystyle \theta_{(l)} \leq \tilde{\theta}_{(l)}, \text{ if } \exists i \text{ s.t. } \theta_{n+i}< \tilde{\theta}_{(l)}.

Since we have an almost sure inequality, we surely have stochastic dominance and inequality in expectation. Now consider the right-most inequality of (8) and (9). We still follow the previous way of generating the {\theta_i}s. After we have the {\tilde{\theta}_{(i)}}s, the {(n-(l-m)+1)}-th largest value {\tilde{\theta}_{(l-m)}} (among the first {n} thetas) is always at most the {(n+m-l+1)}-th largest value {\theta_{(l)}} (among all {n+m} thetas), since

\displaystyle \theta_{(l)} = \tilde{\theta}_{(l-m)}\text{ if } \theta_{n+i} \leq \tilde{\theta}_{(l-m)},\forall 1\leq i\leq m

and

\displaystyle \theta_{(l)} \geq \tilde{\theta}_{(l-m)} \text{ if } \exists i \text{ s.t. } \theta_{n+i} >\tilde{\theta}_{(l-m)}.

Thus we have shown stochastic dominance for the random variables we are considering. The left-most inequality of (8) and (9) follows from the fact that {\theta_{(n+i)}\geq \tilde{\theta}_{(n+i-m)}\geq \tilde{\theta}_{(l)}} whenever {n+i-m\geq l}; the first of these inequalities can be proved by the same coupling argument as above. \Box

I found this interlacing result while taking a take-home exam on game theory. It turns out the result was not useful for the exam though. I would also like to thank Yueyang Liu for catching a mistake in a previous version of this post.

Natural sufficient statistic is not necessarily minimal for exponential family

This post gives a simple example of an exponential family whose natural parameter space is a single point and whose natural sufficient statistic is not minimal.

Let us define the concept of exponential family with natural parameters.

Definition 1 (Natural exponential family)
A family of probability densities (or probability mass function) {f(\mathbf{x}|\mathbf{\eta} )} with parameter (index) {\mathbf{\eta}\in {\mathbb R}^k} is said to be a natural exponential family if {f} can be written as

\displaystyle f(\mathbf{x}|\mathbf{\eta}) = h(\mathbf{x}) \exp \left\{ \sum_{i=1}^k \mathbf{\eta}_i T_i(\mathbf{x}) - \mathcal{A}(\mathbf {\eta})\right\}. \ \ \ \ \ (1)

Here for {i = 1,\ldots,k}, we call

  • {T_i}: natural sufficient statistic
  • {\eta_i}: natural parameter
  • {H := \{ \mathbf{\eta} = (\eta_1,\ldots,\eta_k): \int h(\mathbf{x}) \exp \left\{ \sum_{i=1}^k \eta_i T_i(\mathbf{x})\right\} \, d \mathbf{x} < \infty \} }: natural parameter space.

Many well-known distributions are actually exponential families, e.g., the normal, binomial, Poisson, beta, and gamma distributions. The exponential family is central to the modeling and analysis of data.
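
For instance (a standard example, added here for concreteness), the Poisson distribution with mean {\lambda>0} has probability mass function

\displaystyle f(x\mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!}\exp\left\{ (\log \lambda)\, x - \lambda \right\},\qquad x=0,1,2,\dots,

so it is a natural exponential family with {h(x)=1/x!}, natural parameter {\eta = \log \lambda}, natural sufficient statistic {T(x)=x}, and {\mathcal{A}(\eta) = e^{\eta} = \lambda}.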

Next, we define the concept of sufficiency and minimal sufficiency.

Definition 2 (Sufficiency and minimal sufficiency)
For a family of probability densities {f_\eta}, {\eta \in \Theta \subset {\mathbb R}^k}, and a random variable {X\sim f_\eta}, a statistic {T(X)}
is sufficient if the conditional distribution {P(X|T(X))} does not depend on {\eta}. A sufficient statistic {T(X)} is minimal if for any other sufficient statistic {S(X)}, there is a function {g} such that {T(X)=g(S(X))}.

Intuitively, a sufficient statistic captures all the information about the underlying parameter {\eta}. Indeed, suppose someone hands you a sufficient statistic {T(X)}. Because {P(X|T(X))} does not depend on {\eta}, you know the distribution {P(X|T(X))} already. Now if you generate data {X} according to {P(X|T(X))}, then the unconditional distribution of {X} is simply {f_\eta}! So even though you don’t know the underlying distribution {f_\eta}, you can generate {X} so long as {T(X)} is available.
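
Here is a small simulation sketch of this point (my own illustration, using iid Bernoulli({p}) data with {T(X)=\sum_i X_i}): the conditional law of {X} given {T} is uniform over the arrangements of the ones and does not involve {p}, so data with the correct distribution can be regenerated from {T(X)} alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def regenerate_from_T(t, n):
    """Given only T(X) = sum(X) = t for n iid Bernoulli(p) trials, draw a fresh X.

    The conditional law of X given T = t is uniform over 0/1 sequences with t ones,
    and it does not involve p, so no knowledge of p is needed here.
    """
    x = np.zeros(n, dtype=int)
    x[rng.choice(n, size=t, replace=False)] = 1
    return x

p, n, reps = 0.3, 10, 100_000
original = rng.binomial(1, p, size=(reps, n))
regenerated = np.array([regenerate_from_T(row.sum(), n) for row in original])
# Each coordinate of the regenerated data has mean approximately p, even though p was never used above.
print(original.mean(axis=0))
print(regenerated.mean(axis=0))
```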

The minimality of a sufficient statistic means the data is reduced in an optimal way, as every other sufficient statistic contains at least as much information as a minimal one. One thing to note is that this definition of minimality says nothing about the dimension of a minimal sufficient statistic. Indeed, if {T(X)} is minimal, then {(T(X),1,2)} is also minimal.

It is easily verified, using the factorization theorem, that the natural sufficient statistic is indeed sufficient. A natural question occurs at this point: is the natural sufficient statistic always minimal? The following example reveals that we do need to put a few more conditions on {\eta} or {T(\mathbf{x})}.

Example 1
Consider the density family

\displaystyle f(x,y\mid \eta) = \frac{1}{(1+x^2)(1+y^2)}\exp\left (\eta \left (x^2-y^2\right)-A(\eta)\right).

where {x,y\in {\mathbb R}}.
The natural parameter space is the set of {\eta} for which the integral

\displaystyle \int_{{\mathbb R}^2} \frac{ \exp\left(\eta \left(x^2-y^2\right)\right)}{(1+x^2)(1+y^2)} dxdy = \int_{\mathbb R}\frac{\exp (\eta x^2)}{1+x^2}dx \int_{\mathbb R} \frac{\exp (-\eta y^2)}{1+y^2}dy

is finite (we use Fubini’s theorem in the middle step). Hence the parameter {\eta} needs to satisfy
{\eta \leq 0 } for the integral {\int_{\mathbb R}\frac{\exp (\eta x^2)}{1+x^2}dx} to be finite and {\eta\geq 0} for the integral {\int_{\mathbb R} \frac{\exp (-\eta y^2)}{1+y^2}dy} to be finite. This means that {\eta=0}, so the natural parameter space is the single point {\{0\}\subset {\mathbb R}}.
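
One can also see the divergence numerically (a quick check of my own): the truncated integral {\int_{-T}^{T}\frac{\exp(\eta x^2)}{1+x^2}dx} stabilizes near {\pi} as {T} grows when {\eta=0}, but blows up for {\eta>0} (and, by the same computation with {-\eta}, the {y}-integral blows up for {\eta<0}).

```python
import numpy as np

def truncated_integral(eta, T, num=400_000):
    """Riemann-sum approximation of the integral of exp(eta * x^2) / (1 + x^2) over [-T, T]."""
    x = np.linspace(-T, T, num)
    return np.sum(np.exp(eta * x**2) / (1 + x**2)) * (x[1] - x[0])

for T in [10, 20, 40]:
    print(T, truncated_integral(0.0, T), truncated_integral(0.1, T))
# The eta = 0 column approaches pi; the eta = 0.1 column grows without bound.
```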

The natural sufficient statistic {X^2-Y^2} is indeed sufficient. But any constant statistic is also sufficient (the family has only one member) and is minimal, since a constant is trivially a function of any other sufficient statistic. Because {X^2-Y^2} is not constant, it cannot be written as a function of a constant statistic, so the natural sufficient statistic is not minimal here.

It can be shown that so long as the natural parameter space contains an open set, the natural sufficient statistic is indeed minimal. See Theorem 4.5 b) of this note.

Testing homogeneity of data for location-scale family

Suppose you have a sequence of independent data points {X_1,\dots, X_n\in \mathbb{R}^d}. How do you test that the {X_i}s all come from the same distribution, i.e., how do you test the homogeneity of the data?

To make the problem more precise, suppose we have a distribution family indexed by {\theta \in \mathbb{R}^s}, i.e., a set

\displaystyle \Omega = \{ F_\theta\mid F_\theta \;\text{is a cumulative distribution function,} \;\theta \in \mathbb{R}^s\},

and each {X_i} follows the distribution {F_{\theta_i}} for some {\theta_i}. Our problem is

Is {F_{\theta_1} =F_{\theta_2}=\dots =F_{\theta_n}}?

If we have that {\theta \not =\bar{\theta} \implies F_{\theta}\not= F_{\bar{\theta}}} (known as the identifiability of {\theta}), then our question becomes

Is {\theta_1 =\theta_2=\dots =\theta_n}?

Now suppose further that each {F_\theta\in \Omega} has a density {f_\theta} (so that we can write down the likelihood). The likelihood of seeing the independent sequence {X_1,\dots, X_n} is

\displaystyle \mathcal{L}_{(X_1,\dots,X_n) }{(\theta_1,\dots,\theta_n)}= \prod_{i=1}^nf_{\theta_i}(X_i).

To test our question in a statistical way, we use hypothesis testing. Our null hypothesis is

\displaystyle \begin{aligned} H_0:\quad \theta_1=\theta_2=\cdots = \theta_n, \end{aligned} \ \ \ \ \ (1)

and our alternative hypothesis is

\displaystyle \begin{aligned} H_1:\quad \theta_i\not= \theta_j \quad \text{for some}\; i\not=j. \end{aligned} \ \ \ \ \ (2)

Further denote the space of the null as {\Theta_0=\{ (\theta_1,\dots,\theta_n)\mid \theta_1=\dots =\theta_n\}} and the space of the alternative as {\Theta_1=\mathbb{R}^{sn} \setminus \Theta_0}. A popular and natural approach is the likelihood ratio test. We construct the test statistic, called the likelihood ratio, as

\displaystyle \begin{aligned} R = \frac{\sup_{(\theta_1,\dots, \theta_n)\in \Theta_0}\mathcal{L}_{(X_1,\dots, X_n)}(\theta_1,\dots, \theta_n)}{\sup_{(\theta_1,\dots, \theta_n)\in (\Theta_0\cup \Theta_1)}\mathcal{L}_{(X_1,\dots, X_n)}(\theta_1,\dots, \theta_n)}. \end{aligned} \ \ \ \ \ (3)

Intuitively, if our null hypothesis is indeed true, i.e., there is some {\theta^*} such that {\theta_1=\dots =\theta_n=\theta^*} and each {X_i} follows { f_{\theta^*}}, then this ratio should be large, and we have confidence that our null hypothesis is true. This means we should reject our null hypothesis if we find {R} is small. Thus, if we want a significance level {\alpha} test of our null hypothesis, we should reject the null hypothesis when {R< c}, where {c} satisfies

\displaystyle \begin{aligned} \mathbb{P}(R<c\mid H_0)\leq \alpha. \end{aligned} \ \ \ \ \ (4)

However, the main issue is that we don’t know the distribution of {R} under {H_0}, even if we know how to sample from each {f_\theta} and know the functional form of {f_\theta} for each {\theta}. The reason is that {H_0} does not specify which {\theta^*} (the common value {\theta^*=\theta_1=\dots =\theta_n}) generates the data. So the distribution of {R} may depend on {\theta^*} as well, and what we really need for {c} is

\displaystyle \begin{aligned} \sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*)\leq \alpha. \end{aligned} \ \ \ \ \ (5)

Thus, even if we want to approximate {\sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*)} through computational methods, we have to simulate for each {\theta^* \in \mathbb{R}^s}. As {\Theta_0} could be rather large (in fact as large as {\mathbb{R}^s}), the approximation can be time consuming as well.

Fortunately, if {\Omega} is a so-called location-scale family, the distribution of {R} does not depend on {\theta^*}, and we are free to choose whichever {\theta^*} we like. Let us define what a location-scale family is, then state the theorem and prove it.

Definition 1 Suppose we have a family of probability densities {\Omega} on {\mathbb{R}^d} indexed by {\theta =(\mu,\Sigma)}, where {\mu \in \mathbb{R}^d} and {\Sigma\in GL(\mathbb{R},d)}, the set of invertible matrices in {\mathbb{R}^{d \times d}}. The family {\Omega} is a location-scale family if there is a family member {f} (called the pivot) such that any other {f_\theta} with {\theta =(\mu,\Sigma)} satisfies

\displaystyle f_\theta (x) = f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}.

Thus if {Z\in \mathbb{R}^d} follows {f}, then {X=\Sigma Z+\mu} has probability density {f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}}. Indeed, for any Borel set {B\subset \mathbb{R}^d }

\displaystyle \begin{aligned} \mathop{\mathbb P}(X\in B ) &=\mathop{\mathbb P}(\Sigma Z+\mu\in B)\\ & = \mathop{\mathbb P}(Z\in \Sigma ^{-1}(B-\mu)) \\ & = \int_{ \Sigma ^{-1}(B-\mu )} f(z) dz \\ & =\int_B f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}dx \end{aligned} \ \ \ \ \ (6)

where we use the change of variables {x=\Sigma z+\mu} in the last equality, which shows that {X} has density {f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}}. We are now ready to state the theorem and prove it.
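
For example (a standard instance, written out here for concreteness), taking the pivot {f} to be the standard normal density {f(z) = (2\pi)^{-d/2}\exp(-\|z\|^2/2)} gives

\displaystyle f_\theta(x) = \frac{1}{(2\pi)^{d/2}\,|\det(\Sigma)|}\exp\left( -\tfrac{1}{2}\left\|\Sigma^{-1}(x-\mu)\right\|^2 \right),

which is the density of {N(\mu, \Sigma\Sigma^\top)}; so the family of all nondegenerate Gaussian distributions on {\mathbb{R}^d} is a location-scale family.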

Theorem 2 Suppose our family of distributions {\Omega} is a location-scale family. Then, under the null hypothesis, there is a {\theta^*= (\mu^*,\Sigma^*)} such that each {X_i} follows {f_{\theta^*}(x) = \frac{1}{|\det(\Sigma^*)|}f((\Sigma^*)^{-1}(x-\mu^*))}, and the distribution of {R} does not depend on {\theta^*}.

Since the distribution of {R} does not depend on {\theta^*} under the null, for any {\theta=(\mu,\Sigma)} and any {c},

\displaystyle \begin{aligned} \sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*) = \mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta). \end{aligned} \ \ \ \ \ (7)

Thus, as long as {\Omega} is a location-scale family, we can choose any family member of {\Omega} to sample the {X_i}s and approximate the distribution of {R} by its empirical distribution!
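
To make this concrete, here is a minimal Monte Carlo sketch (my own illustration, not from the original post) for a one-dimensional Gaussian location family with the scale fixed at {1}, so that both suprema in {R} have closed forms: the numerator is maximized at the common sample mean and each factor of the denominator at the observation itself, giving {R = \exp\left(-\tfrac{1}{2}\sum_{i=1}^n (X_i-\bar{X})^2\right)}. The names lrt_stat and simulate_null are my own. The estimated critical value barely changes with the choice of the generating mean, as the theorem (specialized to this pure location family) predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

def lrt_stat(x):
    """Likelihood ratio R for H0: common mean, in a N(mu, 1) location family.

    Numerator: all observations share one mean (MLE = overall sample mean).
    Denominator: each observation gets its own mean (MLE = the observation itself).
    With unit variance, log R = -0.5 * sum((x - x.mean())**2).
    """
    return np.exp(-0.5 * np.sum((x - x.mean()) ** 2))

def simulate_null(mu_star, n=20, reps=5000):
    """Simulate the null distribution of R when the common mean is mu_star."""
    samples = rng.normal(loc=mu_star, scale=1.0, size=(reps, n))
    return np.array([lrt_stat(x) for x in samples])

alpha = 0.05
for mu_star in [0.0, 3.0, -10.0]:
    r = simulate_null(mu_star)
    print(mu_star, np.quantile(r, alpha))  # reject H0 when R is below this critical value
```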

Proof: We need to show that the ratio {R} has a distribution that does not depend on {\theta^*}. Since {X_i \sim \frac{1}{|\det(\Sigma^*)|}f((\Sigma^*)^{-1}(x-\mu^*))} and {\Omega} is a location-scale family, we can assume the {X_i}s are generated via {Z_i}, where {Z_i} follows the pivot {f} and {X_i = \Sigma^*Z_i+\mu^*}. Then the likelihood of {\theta_1,\dots, \theta_n} is

\displaystyle \begin{aligned} \mathcal{L}_{(X_1,\dots,X_n) }(\theta_1,\dots,\theta_n) &=\prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}(X_i-\mu_i))\\ & = \prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}(\Sigma^*Z_i+\mu^*-\mu_i))\\ &= \prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*)))\\ & = \bigl(|\det((\Sigma^*)^{-1})|\bigr)^n\prod_{i=1}^n\frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*))).\\ \end{aligned} \ \ \ \ \ (8)

Thus the likelihood ratio {R} reduces to

\displaystyle \begin{aligned} R &= \frac{\sup_{(\mu,\Sigma)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{|\det(\Sigma(\Sigma^*)^{-1})|} f\left(\Sigma^{-1}\Sigma^*\left(Z_i-(\Sigma^*)^{-1}(\mu-\mu^*)\right)\right)}{\sup_{(\mu_i,\Sigma_i)\in \mathbb{R}^d\times GL(\mathbb{R},d),\,i=1,\dots,n} \prod_{i=1}^n\frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f\left(\Sigma_i^{-1}\Sigma^*\left(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*)\right)\right)}\\ &=\frac{\sup_{(\mu,\Sigma)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{|\det(\Sigma(\Sigma^*)^{-1})|} f\left(\Sigma^{-1}\Sigma^*\left(Z_i-(\Sigma^*)^{-1}(\mu-\mu^*)\right)\right) }{\prod_{i=1}^n\sup_{(\mu_i,\Sigma_i)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f\left(\Sigma_i^{-1}\Sigma^*\left(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*)\right)\right)}.\\ \end{aligned} \ \ \ \ \ (9)

Now let’s define {\hat{\Sigma} = (\Sigma^*)^{-1}\Sigma}, {\hat{\mu} = (\Sigma^*)^{-1}(\mu-\mu^*) }, {\hat{\mu}_i = (\Sigma^*)^{-1}(\mu_i -\mu^*) } and {\hat{\Sigma}_i= (\Sigma^*)^{-1}\Sigma_i }. Note that since {(\mu,\Sigma)} and {(\mu_i,\Sigma_i)} can vary over the whole space {\mathbb{R}^d\times GL(\mathbb{R},d)}, so can {(\hat{\mu},\hat{\Sigma})} and {(\hat{\mu}_i,\hat{\Sigma}_i)}. The equality (9) can then be rewritten as

\displaystyle \begin{aligned} R &= \frac{\sup_{(\hat{\mu},\hat{\Sigma})\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{|\det(\hat{\Sigma})|} f(\hat{\Sigma}^{-1}(Z_i-\hat{\mu})) }{\prod_{i=1}^n\sup_{(\hat{\mu}_i,\hat{\Sigma}_i)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \frac{1}{|\det(\hat{\Sigma}_i)|} f(\hat{\Sigma}_i^{-1}(Z_i -\hat{\mu}_i))}.\\ \end{aligned} \ \ \ \ \ (10)

As we just argued, {(\hat{\mu},\hat{\Sigma})} and {(\hat{\mu}_i,\hat{\Sigma}_i)} vary over the whole space without any restriction, so the right-hand side of (10) is a function of {Z_1,\dots,Z_n} alone and does not depend on the choice of {\mu^*} and {\Sigma^*} at all. So our theorem is proved. \Box

A variance-bias decomposition of L1 norm

Suppose we have real random variables {X,X',Y,Y'} with {X,X'\sim F} and {Y,Y'\sim G}, where {F} and {G} are cumulative distribution functions, {X,X',Y,Y'} are all independent, and {\mathbb{E}|X|,\mathbb{E}|Y|} are finite. We prove the following theorem.

Theorem 1 (Variance-Bias decomposition of {L1} norm) For independent random variables {X,X',Y,Y'} defined above, we have

\displaystyle 2\mathbb{E}|X-Y| = \mathbb{E}|X-X'| + \mathbb{E}|Y-Y'| +2 \int (G(u)-F(u))^2du.

Thus

\displaystyle \begin{aligned} 2\mathbb{E}|X-Y| \geq \mathbb{E}|X-X'| + \mathbb{E}|Y-Y'| \end{aligned} \ \ \ \ \ (1)

with equality if and only if {G=F}.

The quantity {\mathbb{E}|X-X'|} is usually referred to as the mean absolute difference, and it measures the spread of a distribution. I don’t know an established term for the quantity {\mathbb{E}|X-Y|}, but what it measures is the difference between the distributions {F} and {G}. I think cross mean difference would be a nice name.

The equality can be considered as an analogue of the well-known variance-bias decomposition of estimators/predictors in statistics. If we think of using {X} to estimate/predict {Y}, then the expected error (the cross mean difference) in terms of absolute value ({L1} norm, in fancier terms) is the sum of the mean absolute differences of {X} and {Y}, i.e., {\mathbb{E}|X-X'|} and {\mathbb{E}|Y-Y'|}, which can be considered as variance terms, and the difference between the two distributions, i.e., {\int(F-G)^2}, which can be considered as a bias term.

There is an analogue in terms of the usual square loss (or {L2} norm) and it is

\displaystyle 2\mathbb{E}(X-Y)^2 = \mathbb{E}(X-X')^2 + \mathbb{E}(Y-Y')^2 + 2(\mathbb{E}X-\mathbb{E}Y)^2.

Under this setting, we also see a decomposition of the estimation/prediction error in terms of the variances of {X} and {Y}, i.e., {\mathbb{E}(X-X')^2} and {\mathbb{E}(Y-Y')^2}, and the difference of the means, which can be considered as the bias.
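
As a quick numerical check of both decompositions (my own sketch; {F} uniform on {[0,1]} and {G} uniform on {[0.5,1.5]} are arbitrary choices with simple closed-form CDFs):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2_000_000

# X, X' ~ F = Uniform(0, 1); Y, Y' ~ G = Uniform(0.5, 1.5); all independent.
x, x2 = rng.uniform(0, 1, N), rng.uniform(0, 1, N)
y, y2 = rng.uniform(0.5, 1.5, N), rng.uniform(0.5, 1.5, N)

lhs = 2 * np.abs(x - y).mean()                          # 2 E|X - Y|
spread = np.abs(x - x2).mean() + np.abs(y - y2).mean()  # E|X - X'| + E|Y - Y'|

F = lambda u: np.clip(u, 0, 1)        # CDF of Uniform(0, 1)
G = lambda u: np.clip(u - 0.5, 0, 1)  # CDF of Uniform(0.5, 1.5)
u = np.linspace(-1, 3, 400_001)
bias = 2 * np.sum((G(u) - F(u)) ** 2) * (u[1] - u[0])   # 2 * integral of (G - F)^2

print(lhs, spread + bias)  # L1 decomposition: both sides agree up to Monte Carlo error
print(2 * ((x - y) ** 2).mean(),                        # L2 analogue, left-hand side
      ((x - x2) ** 2).mean() + ((y - y2) ** 2).mean()
      + 2 * (x.mean() - y.mean()) ** 2)                 # L2 analogue, right-hand side
```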

The theorem assumes {X} and {Y} both have finite first moments. In the case where either {X} or {Y} has no finite first moment, the decomposition and the inequality still hold (both sides may be infinite), as can be seen by inspecting the proof below. But in that case, equality in (1) does not necessarily imply that {F =G}.

Proof: The trick to establish the identity is to write the quantity {2\mathbb{E}|X-Y|} in the following form.

\displaystyle \begin{aligned} 2\mathbb{E}|X-Y| & = 2\mathbb{E}(X-Y) 1_{\{ X\geq Y\}} + 2\mathbb{E}(Y-X) 1_{\{ Y\geq X\}}\\ & =2 \mathbb{E}\int 1_{\{Y\leq u\leq X\}}du + 2 \mathbb{E}\int 1_{\{ X\leq u\leq Y\}}du\\ & =2 \int \mathbb{P}(Y\leq u\leq X)du+ 2\int \mathbb{P}(X\leq u\leq Y)du\\ & =2\int G(u)(1-F(u))du +2\int F(u)(1-G(u))du\\ & = 2\int G(u)(1-F(u)) + F(u)(1-G(u)) du. \end{aligned} \ \ \ \ \ (2)

The third equality is due to Fubini’s theorem and the fourth is because of the independence between {X} and {Y}. Similarly, we have

\displaystyle \mathbb{E}|X-X'| +\mathbb{E}|Y-Y'| =2 \int F(u)(1-F(u)) + G(u)(1-G(u)) du.

Thus the difference of {2\mathbb{E}|X-Y|} and {\mathbb{E}|X-X'|+\mathbb{E}|Y-Y'|}, which is finite because {\mathbb{E}|X|,\mathbb{E}|Y|<\infty}, is

\displaystyle \begin{aligned} &2\mathbb{E}|X-Y|-\mathbb{E}|X-X'|-\mathbb{E}|Y-Y'|\\ =&2\int G(u)(1-F(u)) + F(u)(1-G(u)) du -2 \int F(u)(1-F(u)) + G(u)(1-G(u)) du\\ = &2\int (G(u)-F(u))^2du \geq 0\\ \end{aligned} \ \ \ \ \ (3)

Thus the equality of the decomposition in the theorem and the inequality (1) are established. Now we argue that equality in (1) holds if and only if {F=G}. If {F=G}, then inequality (1) obviously becomes an equality. Conversely, if inequality (1) becomes an equality, then by the last line of (3), we have

\displaystyle \int (G(u)-F(u))^2 du = 0.

This means that

\displaystyle G(u) = F(u)\quad \text{almost everywhere}.

Since {G} and {F} are right continuous, we then have, for all {u\in \mathbb{R}},

\displaystyle G(u) = F(u).

\Box