Testing homogeneity of data for location-scale family

Suppose you have a sequence of independent data {X_1,\dots, X_n\in \mathbb{R}^d}. How do you test that the {X_i} all come from the same distribution, i.e., how do you test the homogeneity of the data?

To make the problem more precise, suppose we have a distribution family indexed by {\theta \in \mathbb{R}^s}, i.e., a set

\displaystyle \Omega = \{ F_\theta\mid F_\theta \;\text{is a cumulative distribution function,} \;\theta \in \mathbb{R}^s\},

and each {X_i} follows the distribution {F_{\theta_i}} for some {\theta_i}. Our problem is

Is {F_{\theta_1} =F_{\theta_2}=\dots =F_{\theta_n}}?

If we have that {\theta \not =\bar{\theta} \implies F_{\theta}\not= F_{\bar{\theta}}} (known as the identifiability of {\theta}), then our question becomes

Is {\theta_1 =\theta_2=\dots =\theta_n}?

Now suppose further that each {F_\theta\in \Omega} has a density {f_\theta} (so that we can write down the likelihood). The likelihood of observing the independent sequence {X_1,\dots, X_n} is

\displaystyle \mathcal{L}_{(X_1,\dots,X_n) }{(\theta_1,\dots,\theta_n)}= \prod_{i=1}^nf_{\theta_i}(X_i).

To test our question in a statistical way, we use hypothesis testing. Our null hypothesis is

\displaystyle \begin{aligned} H_0:\quad \theta_1=\theta_2=\cdots = \theta_n, \end{aligned} \ \ \ \ \ (1)

and our alternative hypothesis is

\displaystyle \begin{aligned} H_1:\quad \theta_i\not= \theta_j \quad \text{for some}\; i\not=j. \end{aligned} \ \ \ \ \ (2)

Further denote the space of the null by {\Theta_0=\{ (\theta_1,\dots,\theta_n)\mid \theta_1=\dots =\theta_n\}} and the space of the alternative by {\Theta_1=\mathbb{R}^{sn} \setminus \Theta_0}. A popular and natural approach is the likelihood ratio test. We construct the test statistic, called the likelihood ratio, as

\displaystyle \begin{aligned} R = \frac{\sup_{(\theta_1,\dots, \theta_n)\in \Theta_0}\mathcal{L}_{(X_1,\dots, X_n)}(\theta_1,\dots, \theta_n)}{\sup_{(\theta_1,\dots, \theta_n)\in (\Theta_0\cup \Theta_1)}\mathcal{L}_{(X_1,\dots, X_n)}(\theta_1,\dots, \theta_n)}. \end{aligned} \ \ \ \ \ (3)

Intuitively, if our null hypothesis is indeed true, i.e., there is some {\theta^*} such that {\theta_1=\dots =\theta_n=\theta^*} and each {X_i} follows {f_{\theta^*}}, then this ratio should be large, which gives us confidence that the null hypothesis is true. This means we should reject the null hypothesis if we find that {R} is small. Thus if we want a significance level {\alpha} test of our null hypothesis, we should reject the null hypothesis when {R< c}, where {c} satisfies

\displaystyle \begin{aligned} \mathbb{P}(R<c\mid H_0)\leq \alpha. \end{aligned} \ \ \ \ \ (4)

However, the main issue is that we don’t know the distribution of {R} under {H_0}, even if we know how to sample from each {f_\theta} and know the functional form of {f_\theta} for each {\theta}. The reason is that {H_0} does not specify which {\theta^*} (the common value {\theta^*=\theta_1=\dots =\theta_n}) generates the data. So the distribution of {R} may depend on {\theta^*} as well, and what we really need for {c} is

\displaystyle \begin{aligned} \sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*)\leq \alpha. \end{aligned} \ \ \ \ \ (5)

Thus even if we want to approximate {\sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*)} through computational methods, we would have to simulate for each {\theta^* \in \mathbb{R}^s}. As {\Theta_0} can be rather large (in fact as large as {\mathbb{R}^s}), this approximation can be very time consuming.

Fortunately, if {\Omega} is a so-called location-scale family, the distribution of {R} turns out to be independent of {\theta^*}, and we are free to choose whichever {\theta^*} we like. Let us define what a location-scale family is, then state the theorem and prove it.

Definition 1 Suppose we have a family of probability densities {\Omega} on {\mathbb{R}^d} indexed by {\theta =(\mu,\Sigma)} where {\mu \in \mathbb{R}^d} and {\Sigma\in GL(\mathbb{R},d)}, the set of invertible matrices in {\mathbb{R}^{d \times d}}. The family {\Omega} is a location-scale family if there is a family member {f} (called the pivot) such that for any other {f_\theta} with {\theta =(\mu,\Sigma)},

\displaystyle f_\theta (x) = f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}.

Thus if {Z\in \mathbb{R}^d} follows {f}, then {X=\Sigma Z+\mu} has probability density {f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}}. Indeed, for any Borel set {B\subset \mathbb{R}^d }

\displaystyle \begin{aligned} \mathop{\mathbb P}(X\in B ) &=\mathop{\mathbb P}(\Sigma Z+\mu\in B)\\ & = \mathop{\mathbb P}(Z\in \Sigma ^{-1}(B-\mu)) \\ & = \int_{ \Sigma ^{-1}(B-\mu )} f(z) dz \\ & =\int_B f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}dx \end{aligned} \ \ \ \ \ (6)

where we use the change of variable {x=\Sigma z+\mu} in the last equality, and the last expression shows that {X} has density {f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}}.
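For a concrete example, the multivariate normal family is a location-scale family: take the pivot {f} to be the standard normal density on {\mathbb{R}^d}. Then the member with parameter {\theta=(\mu,\Sigma)} is

\displaystyle f_\theta(x) = \frac{1}{|\det(\Sigma)|}(2\pi)^{-d/2}\exp\Bigl(-\tfrac{1}{2}\|\Sigma^{-1}(x-\mu)\|^2\Bigr),

which is exactly the density of {N(\mu,\Sigma\Sigma^\top)}, i.e., of {X=\Sigma Z+\mu} with {Z\sim N(0,I_d)}. We are now ready to state the theorem and prove it.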

Theorem 2 Suppose our family of distributions {\Omega} is a location-scale family. Then under the null hypothesis there is a {\theta^*= (\mu^*,\Sigma^*)} such that each {X_i} follows {f_{\theta^*}(x) = \frac{1}{|\det(\Sigma^*)|}f((\Sigma^*)^{-1}(x-\mu^*))}, and the distribution of {R} is independent of {\theta^*}.

Since the distribution of {R} is independent of {\theta^*} under the null, for any parameter value {\theta} and any {c},

\displaystyle \begin{aligned} \sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*) = \mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta). \end{aligned} \ \ \ \ \ (7)

Thus, as long as {\Omega} is a location-scale family, we can choose any family member of {\Omega} to sample the {X_i} and approximate the distribution of {R} by the resulting empirical distribution!
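To make this recipe concrete, here is a minimal Monte Carlo sketch, assuming the simplest special case: the univariate normal location family with known unit scale, for which the suprema in (3) have closed forms. The numerator is maximized at the common mean {\bar{X}} and each factor of the denominator at {\mu_i = X_i}, so that {-2\log R = \sum_{i=1}^n (X_i-\bar{X})^2}. Simulating the null distribution of {R} under one convenient {\theta^*} and rerunning with a different {\theta^*} gives essentially the same critical value, illustrating the invariance.

```python
import numpy as np

# Sketch: univariate normal location family with known unit scale (an assumption
# made for simplicity, not the general location-scale setting of Theorem 2).
# For this family the likelihood ratio (3) satisfies -2 log R = sum_i (X_i - Xbar)^2,
# and its null distribution should not depend on the common location mu_star.

rng = np.random.default_rng(0)
n, n_sim, alpha = 20, 100_000, 0.05

def simulate_neg2logR(mu_star):
    """Simulate -2 log R under H0: X_1, ..., X_n i.i.d. N(mu_star, 1)."""
    X = rng.normal(loc=mu_star, scale=1.0, size=(n_sim, n))
    return ((X - X.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

# Null distribution simulated under two different choices of mu_star.
stat_a = simulate_neg2logR(0.0)
stat_b = simulate_neg2logR(5.0)

# Reject H0 when R < c, i.e. when -2 log R exceeds the (1 - alpha) empirical quantile.
print(np.quantile(stat_a, 1 - alpha))  # critical value computed with mu_star = 0
print(np.quantile(stat_b, 1 - alpha))  # essentially the same value with mu_star = 5
# Both are close to the chi-square(n - 1) quantile scipy.stats.chi2.ppf(1 - alpha, n - 1).
```

We now turn to the proof of Theorem 2.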

Proof: We need to show that the ratio {R} has a distribution independent of {\theta^*}. Since {X_i \sim \frac{1}{|\det(\Sigma^*)|}f((\Sigma^*)^{-1}(x-\mu^*))} and {\Omega} is a location-scale family, we may assume the data are generated via {Z_i}, where {Z_i} follows the pivot {f} and {X_i = \Sigma^*Z_i+\mu^*}. Then the likelihood of {\theta_1,\dots, \theta_n} is

\displaystyle \begin{aligned} \mathcal{L}_{(X_1,\dots,X_n) }(\theta_1,\dots,\theta_n) &=\prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}(X_i-\mu_i))\\ & = \prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}(\Sigma^*Z_i+\mu^*-\mu_i))\\ &= \prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*)))\\ & = \bigl(|\det((\Sigma^*)^{-1})|\bigr)^n\prod_{i=1}^n\frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*))). \end{aligned} \ \ \ \ \ (8)

Thus the likelihood ratio {R} reduces to

\displaystyle \begin{aligned} R &= \frac{\sup_{(\mu,\Sigma)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{|\det(\Sigma(\Sigma^*)^{-1})|} f(\Sigma^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu-\mu^*)))}{\sup_{(\mu_i,\Sigma_i)\in \mathbb{R}^d\times GL(\mathbb{R},d),\,i=1,\dots,n} \prod_{i=1}^n\frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*)))}\\ &=\frac{\sup_{(\mu,\Sigma)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{|\det(\Sigma(\Sigma^*)^{-1})|} f(\Sigma^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu-\mu^*))) }{\prod_{i=1}^n\sup_{(\mu_i,\Sigma_i)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*)))}. \end{aligned} \ \ \ \ \ (9)

Now let’s define {\hat{\Sigma} = (\Sigma^*)^{-1}\Sigma}, {\hat{\mu} = (\Sigma^*)^{-1}(\mu-\mu^*) }, {\hat{\mu}_i = (\Sigma^*)^{-1}(\mu_i -\mu^*) } and {\hat{\Sigma}_i= (\Sigma^*)^{-1}\Sigma_i }. Note that since {\mu,\Sigma} and {\mu_i,\Sigma_i} can vary over the whole space {\mathbb{R}^d\times GL(\mathbb{R},d)}, so can {\hat{\mu},\hat{\Sigma}}, {\hat{\mu}_i} and {\hat{\Sigma}_i}. Noting also that {|\det(\Sigma(\Sigma^*)^{-1})| = |\det((\Sigma^*)^{-1}\Sigma)| = |\det(\hat{\Sigma})|}, equality (9) can be rewritten as

\displaystyle \begin{aligned} R &= \frac{\sup_{(\hat{\mu},\hat{\Sigma})\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{|\det(\hat{\Sigma})|} f((\hat{\Sigma})^{-1}(Z_i-\hat{\mu})) }{\prod_{i=1}^n\sup_{(\hat{\mu}_i,\hat{\Sigma}_i)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \frac{1}{|\det(\hat{\Sigma}_i)|} f((\hat{\Sigma}_i)^{-1}(Z_i -\hat{\mu}_i))}. \end{aligned} \ \ \ \ \ (10)

As we just argued, {\hat{\mu},\hat{\Sigma}}, {\hat{\mu}_i} and {\hat{\Sigma}_i} can vary over the whole space without any restriction, so the suprema in the numerator and the denominator do not depend on the choice of {\mu^*} and {\Sigma^*} at all. Since the {Z_i} are drawn from the pivot {f}, the distribution of {R} does not depend on {\theta^*}, and the theorem is proved. \Box

A variance-bias decomposition of the L1 norm

Suppose we have real random variables {X,X',Y,Y'} with {X,X'\sim F} and {Y,Y'\sim G}, where {F} and {G} are cumulative distribution functions, {X,X',Y,Y'} are all independent, and {\mathbb{E}|X|,\mathbb{E}|Y|} are finite. We prove the following theorem.

Theorem 1 (Variance-Bias decomposition of {L1} norm) For independent random variables {X,X',Y,Y'} defined above, we have

\displaystyle 2\mathbb{E}|X-Y| = \mathbb{E}|X-X'| + \mathbb{E}|Y-Y'| +2 \int (G(u)-F(u))^2du.

Thus

\displaystyle \begin{aligned} 2\mathbb{E}|X-Y| \geq \mathbb{E}|X-X'| + \mathbb{E}|Y-Y'| \end{aligned} \ \ \ \ \ (1)

with equality if and only if {G=F}.

The quantity {\mathbb{E}|X-X'|} is usually referred to as the mean absolute difference, and it measures the spread of a distribution. I don’t know a standard term for the quantity {\mathbb{E}|X-Y|}, but what it measures is the difference between the distributions {F} and {G}. I think cross mean difference would be a nice name.

The equality can be considered an analogue of the well-known variance-bias decomposition of estimators/predictors in statistics. If we think of using {X} to estimate/predict {Y}, then the expected error (the cross mean difference) in terms of absolute value (the {L1} norm, in more advanced terms) is the sum of the mean absolute differences of {X} and {Y}, i.e., {\mathbb{E}|X-X'|} and {\mathbb{E}|Y-Y'|}, which can be considered as variance, plus the difference between the two distributions, i.e., {\int(F-G)^2}, which can be considered as bias.

There is an analogue in terms of the usual square loss (or {L2} norm) and it is

\displaystyle 2\mathbb{E}(X-Y)^2 = \mathbb{E}(X-X')^2 + \mathbb{E}(Y-Y')^2 + 2(\mathbb{E}X-\mathbb{E}Y)^2.

Under this setting, we again see a decomposition of the estimation/prediction error in terms of the variances of {X} and {Y}, i.e., {\mathbb{E}(X-X')^2, \mathbb{E}(Y-Y')^2}, and the difference of the means, which again can be considered as bias.
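For completeness, the square-loss identity follows from expanding the squares and using independence (assuming finite second moments):

\displaystyle \begin{aligned} \mathbb{E}(X-Y)^2 &= \mathrm{Var}(X-Y) + \bigl(\mathbb{E}(X-Y)\bigr)^2 = \mathrm{Var}(X)+\mathrm{Var}(Y) + (\mathbb{E}X-\mathbb{E}Y)^2,\\ \mathbb{E}(X-X')^2 &= \mathrm{Var}(X-X') = 2\,\mathrm{Var}(X), \qquad \mathbb{E}(Y-Y')^2 = 2\,\mathrm{Var}(Y), \end{aligned}

so doubling the first line and substituting the second gives the displayed identity.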

The theorem assumes that {X,Y} both have finite first moments. If either {X} or {Y} has no finite first moment, the decomposition and the inequality still hold, as can be seen by inspecting the proof below. However, equality in inequality (1) then no longer necessarily implies that {F =G}.
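Before the proof, here is a quick numerical sanity check of the decomposition (a sketch with an arbitrary choice of distributions, say {F=N(0,1)} and {G=N(1,1)}): estimate the expectations by Monte Carlo and the integral term by numerical quadrature of the two CDFs.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Sanity check of 2 E|X-Y| = E|X-X'| + E|Y-Y'| + 2 * integral of (G - F)^2,
# for the arbitrary choice F = N(0, 1), G = N(1, 1).
rng = np.random.default_rng(0)
m = 1_000_000
X, Xp = rng.normal(0.0, 1.0, size=(2, m))  # X, X' ~ F
Y, Yp = rng.normal(1.0, 1.0, size=(2, m))  # Y, Y' ~ G

lhs = 2 * np.abs(X - Y).mean()
spread = np.abs(X - Xp).mean() + np.abs(Y - Yp).mean()
gap, _ = quad(lambda u: (norm.cdf(u, loc=1.0) - norm.cdf(u, loc=0.0)) ** 2, -10, 10)

print(lhs, spread + 2 * gap)  # the two numbers agree up to Monte Carlo error
```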

Proof: The trick for establishing the equality is to write the quantity {2\mathbb{E}|X-Y|} in the following form.

\displaystyle \begin{aligned} 2\mathbb{E}|X-Y| & = 2\mathbb{E}(X-Y) 1_{\{ X\geq Y\}} + 2\mathbb{E}(Y-X) 1_{\{ Y\geq X\}}\\ & =2 \mathbb{E}\int 1_{\{Y\leq u\leq X\}}du + 2 \mathbb{E}\int 1_{\{ X\leq u\leq Y\}}du\\ & =2 \int \mathbb{P}(Y\leq u\leq X)du+ 2\int \mathbb{P}(X\leq u\leq Y)du\\ & =2\int G(u)(1-F(u))du +2\int F(u)(1-G(u))du\\ & = 2\int G(u)(1-F(u)) + F(u)(1-G(u)) du. \end{aligned} \ \ \ \ \ (2)

The third equality is due to Fubini’s theorem and the fourth is because of the independence between {X} and {Y}. Similarly, we have

\displaystyle \mathbb{E}|X-X'| +\mathbb{E}|Y-Y'| =2 \int F(u)(1-F(u)) + G(u)(1-G(u)) du.

Thus the difference between {2\mathbb{E}|X-Y|} and {\mathbb{E}|X-X'|+\mathbb{E}|Y-Y'|}, which is finite because {\mathbb{E}|X|,\mathbb{E}|Y|<\infty}, is

\displaystyle \begin{aligned} &2\mathbb{E}|X-Y|-\mathbb{E}|X-X'|-\mathbb{E}|Y-Y'|\\ =&2\int G(u)(1-F(u)) + F(u)(1-G(u)) du -2 \int F(u)(1-F(u)) + G(u)(1-G(u)) du\\ = &2\int (G(u)-F(u))^2du \geq 0\\ \end{aligned} \ \ \ \ \ (3)

Thus the equality of the decomposition in the theorem and the inequality (1) are established. Now we argue that equality in (1) holds if and only if {F=G}. If {F=G}, then inequality (1) obviously becomes an equality. Conversely, if inequality (1) becomes an equality, then by the last line of (3) we have

\displaystyle \int (G(u)-F(u))^2 du = 0.

This means that

\displaystyle G(u) = F(u)\quad \text{almost everywhere}.

But {G} and {F} are right continuous, so the almost everywhere equality implies that for all {u\in \mathbb{R}},

\displaystyle G(u) = F(u).

\Box