Testing homogeneity of data for location-scale family

Suppose you have a sequence of independent data {X_1,\dots, X_n\in \mathbb{R}^d}, how do you test that {X_i}s all come from the same distribution, i.e., how do you test homogeneity of the data?

To make the problem more precise, suppose we have a distribution family indexed by {\theta \in \mathbb{R}^s}, i.e., a set

\displaystyle \Omega = \{ F_\theta\mid F_\theta \;\text{is a probability cumulative function,} \;\theta \in \mathbb{R}^s\},

and each {X_i} follows the distribution {F_{\theta_i}} for some {\theta_i}. Our problem is

Is {F_{\theta_1} =F_{\theta_2}=\dots =F_{\theta_n}}?

If we have that {\theta \not =\bar{\theta} \implies F_{\theta}\not= F_{\bar{\theta}}} (known as the identifiability of {\theta}), then our question becomes

Is {\theta_1 =\theta_2=\dots =\theta_n}?

Now suppose further that each {F_\theta\in \Omega} has a density {f_\theta} (so that we can write down the likelihood), the likelihood of seeing the independent sequence {X_1,\dots, X_n} is

\displaystyle \mathcal{L}_{(X_1,\dots,X_n) }{(\theta_1,\dots,\theta_n)}= \prod_{i=1}^nf_{\theta_i}(X_i).

To test our question in a statistical way, we use hypothesis testing. Our null hypothesis is

\displaystyle \begin{aligned} H_0:\quad \theta_1=\theta_2=\cdots = \theta_n, \end{aligned} \ \ \ \ \ (1)

and our alternative hypothesis is

\displaystyle \begin{aligned} H_1:\quad \theta_i\not= \theta_j \quad \text{for some}\; i\not=j. \end{aligned} \ \ \ \ \ (2)

Further denote the space of the null as {\Theta_0=\{ (\theta_1,\dots,\theta_n)\mid \theta_1=\dots =\theta_n\}} and the space of the alternative as {\Theta_1=\mathbb{R}^{sn} / \Theta_0}. A popular and natural approach is the likelihood ratio test. We construct the test statistic which is called likelihood ratio as

\displaystyle \begin{aligned} R = \frac{\sup_{(\theta_1,\dots, \theta_n)\in \Theta_0}\mathcal{L}_{(X_1,\dots, X_n)}(\theta_1,\dots, \theta_n)}{\sup_{(\theta_1,\dots, \theta_n)\in (\Theta_0\cup \Theta_1)}\mathcal{L}_{(X_1,\dots, X_n)}(\theta_1,\dots, \theta_n)}. \end{aligned} \ \ \ \ \ (3)

Intuitively, if our null hypothesis is indeed true, i.e., there is some {\theta^*} such that {\theta_1=\dots =\theta_n=\theta^*} and {X_i} follows { f_{\theta^*}}, then this ratio should be large and we have confidence that our null hypothesis is true. This means we should reject our null hypothesis if we find {R} is small. Thus if we want to have a significance level {\alpha} test of our null hypothesis, we should reject null hypothesis when {R\leq c} where {c} satisfies

\displaystyle \begin{aligned} \mathbb{P}(R<c\mid H_0)\leq \alpha. \end{aligned} \ \ \ \ \ (4)

However, the main issue is that we don’t know the distribution of {R} under {H_0} even if we know how to sample from each {f_\theta} and the functional form of {f_\theta} for each {\theta}. The reason is that {H_0} did not specify which {\theta^*} (which equals to {\theta^*=\theta_1=\dots =\theta_n}) generates the data. So the distribution of {R} may depend on {\theta^*} as well and the real thing we need for {c} is

\displaystyle \begin{aligned} \sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*)\leq \alpha. \end{aligned} \ \ \ \ \ (5)

Thus even if we want to know approximate the {\sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*)} through computational methods, we have to simulate for each {\theta^* \in \mathbb{R}^s}. As {\Theta_0} could be rather large (in fact as large as {\mathbb{R}^s}), approximation can be time consuming as well.

Fortunately, if {\Omega} is the so called location-scale family, we find that the distribution of {R} is independent of {\theta^*} and we are free to chose whichever {\theta^*} we like. Let us define what is location-scale family, then state the theorem and prove it.

Definition 1 Suppose we have a family of probability densities {\Omega} on {\mathbb{R}^d} indexed by {\theta =(\mu,\Sigma)} where {\mu \in \mathbb{R}^d} and {\Sigma\in GL(\mathbb{R},d)}, the set of  invertible matrices in {\mathbb{R}^{d \times d}}. The family {\Omega} is a local-scale family if there is a family member {f} (called pivot) such that for any other {f_\theta} with {\theta =(\mu,\Sigma)},

\displaystyle f_\theta (x) = f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}.

Thus if {Z\in \mathbb{R}^d} follows {f}, then {X=\Sigma Z+\mu} has probability density {f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}}. Indeed, for any Borel set {B\subset \mathbb{R}^d }

\displaystyle \begin{aligned} \mathop{\mathbb P}(X\in B ) &=\mathop{\mathbb P}(\Sigma Z+\mu\in B)\\ & = \mathop{\mathbb P}(Z\in \Sigma ^{-1}(B-\mu)) \\ & = \int_{ \Sigma ^{-1}(B-\mu )} f(z) dz \\ & =\int_B f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}dx \end{aligned} \ \ \ \ \ (6)

where we use a change of variable {x=\Sigma z+\mu} in the last equality and the last equality shows {X} follows {f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}}. We are now ready to state the theorem and prove it.

Theorem 2 Suppose our family of distribution {\Omega} is a local-scale family, then under the null hypothesis, there is a {\theta^*= (\mu^*,\Sigma^*)} such that each {X_i} follows {f_{\theta^*}(x) = \frac{1}{|\det(\Sigma^*)|}f((\Sigma^*)^{-1}(x-\mu^*))} and the distribution of {R} is independent of {\theta^*}.

Since the distribution of {R} is independent of {\theta^*} under the null. This means that for any {\theta \in \mathbb{R}^d}, and any {c}

\displaystyle \begin{aligned} \sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*) = \mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta). \end{aligned} \ \ \ \ \ (7)

Thus we can choose any family member of {\Omega} to sample {X_i} and approximates the distribution of {R} using empirical distribution as long as {\Omega} is a location-scale family!

Proof:  We need to show that the ratio {R} has distribution independent of {\theta^*}. Since {X_i \sim \frac{1}{|\det(\Sigma^*)|}f((\Sigma^*)^{-1}(x-\mu^*))} and {\Omega} is a location scale family, we can assume they are generated via {Z_i} where {Z_i} follows a pivot {f} and {X_i = \Sigma^*Z_i+\mu^*}. Then the likelihood of {\theta_1,\dots, \theta_n} is

\displaystyle \begin{aligned} \mathcal{L}_{(X_1,\dots,X_n) }(\theta_1,\dots,\theta_n) &=\prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}(X_i-\mu_i))\\ &=\prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}(X_i-\mu_i))\\ & = \prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}(\Sigma^*Z_i+\mu^*-\mu_i))\\ &= \prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*)))\\ & = \bigr(|\det((\Sigma^*)^{-1}|)\bigr)^n\prod_{i=1}^n\frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*))).\\ \end{aligned} \ \ \ \ \ (8)

Thus the likelihood ratio {R} reduces to

\displaystyle \begin{aligned} R &= \frac{\sup_{(\mu,\Sigma)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{|\det(\Sigma(\Sigma^*)^{-1})|} f(\Sigma^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu-\mu^*)))}{\sup_{(\mu_i,\Sigma_i)\in \mathbb{R}^d\times GL(\mathbb{R},d),\,i=1,\dots,n} \prod_{i=1}^n\frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*))}\\ &=\frac{\sup_{(\mu,\Sigma)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{|\det(\Sigma(\Sigma^*)^{-1})|} f(\Sigma^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu-\mu^*))) }{\prod_{i=1}^n\sup_{(\mu_i,\Sigma_i)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*))}\\ \end{aligned} \ \ \ \ \ (9)

Now let’s define {\hat{\Sigma} = (\Sigma^*)^{-1}\Sigma}, {\hat{\mu} = (\Sigma^*)^{-1}(\mu-\mu^*) }, {\hat{\mu}_i = (\Sigma^*)^{-1}(\mu_i -\mu^*) } and {\hat{\Sigma}_i= (\Sigma^*)^{-1}\Sigma_i }. Note that since {\mu,\Sigma}, {\mu_i,\Sigma_i} can vary all over the space {\mathbb{R}^d\times GL(\mathbb{R},d)}, so is {\hat{\mu},\hat{\Sigma}}, {\hat{\mu}_i} and {\hat{\Sigma}_i}. The equality (10) can be rewritten as

\displaystyle \begin{aligned} R &= \frac{\sup_{(\hat{\mu},\hat{\Sigma})\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{\det(\hat{\Sigma})} f((\hat{\Sigma})^{-1}(Z_i-\hat{\mu})) }{\prod_{i=1}^n\sup_{(\hat{\mu}_i,\hat{\Sigma}_i)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \frac{1}{\det(\hat{\Sigma}_i)} f((\hat{\Sigma}_i)^{-1}(Z_i -\hat{\mu}_i))}.\\ \end{aligned} \ \ \ \ \ (10)

As we just argued, {\hat{\mu},\hat{\Sigma}}, {\hat{\mu}_i} and {\hat{\Sigma}_i} can vary all over the space without any restriction, the supremum in the numerator and denominator thus does not depend on the choice {\mu^*} and {\Sigma^*} at all. So our theorem is proved. \Box