December | 2018 | Three Squirrels

Suppose you have a sequence of independent data ${X_1,\dots, X_n\in \mathbb{R}^d}$ , how do you test that ${X_i}$ s all come from the same distribution, i.e., how do you test homogeneity of the data?

To make the problem more precise, suppose we have a distribution family indexed by ${\theta \in \mathbb{R}^s}$ , i.e., a set

$\displaystyle \Omega = \{ F_\theta\mid F_\theta \;\text{is a probability cumulative function,} \;\theta \in \mathbb{R}^s\},$

and each ${X_i}$ follows the distribution ${F_{\theta_i}}$ for some ${\theta_i}$ . Our problem is

Is ${F_{\theta_1} =F_{\theta_2}=\dots =F_{\theta_n}}$ ?

If we have that ${\theta \not =\bar{\theta} \implies F_{\theta}\not= F_{\bar{\theta}}}$ (known as the identifiability of ${\theta}$ ), then our question becomes

Is ${\theta_1 =\theta_2=\dots =\theta_n}$ ?

Now suppose further that each ${F_\theta\in \Omega}$ has a density ${f_\theta}$ (so that we can write down the likelihood), the likelihood of seeing the independent sequence ${X_1,\dots, X_n}$ is

$\displaystyle \mathcal{L}_{(X_1,\dots,X_n) }{(\theta_1,\dots,\theta_n)}= \prod_{i=1}^nf_{\theta_i}(X_i).$

To test our question in a statistical way, we use hypothesis testing. Our null hypothesis is

$\displaystyle \begin{aligned} H_0:\quad \theta_1=\theta_2=\cdots = \theta_n, \end{aligned} \ \ \ \ \ (1)$

and our alternative hypothesis is

$\displaystyle \begin{aligned} H_1:\quad \theta_i\not= \theta_j \quad \text{for some}\; i\not=j. \end{aligned} \ \ \ \ \ (2)$

Further denote the space of the null as ${\Theta_0=\{ (\theta_1,\dots,\theta_n)\mid \theta_1=\dots =\theta_n\}}$ and the space of the alternative as ${\Theta_1=\mathbb{R}^{sn} / \Theta_0}$ . A popular and natural approach is the likelihood ratio test. We construct the test statistic which is called likelihood ratio as

$\displaystyle \begin{aligned} R = \frac{\sup_{(\theta_1,\dots, \theta_n)\in \Theta_0}\mathcal{L}_{(X_1,\dots, X_n)}(\theta_1,\dots, \theta_n)}{\sup_{(\theta_1,\dots, \theta_n)\in (\Theta_0\cup \Theta_1)}\mathcal{L}_{(X_1,\dots, X_n)}(\theta_1,\dots, \theta_n)}. \end{aligned} \ \ \ \ \ (3)$

Intuitively, if our null hypothesis is indeed true, i.e., there is some ${\theta^*}$ such that ${\theta_1=\dots =\theta_n=\theta^*}$ and ${X_i}$ follows ${ f_{\theta^*}}$ , then this ratio should be large and we have confidence that our null hypothesis is true. This means we should reject our null hypothesis if we find ${R}$ is small. Thus if we want to have a significance level ${\alpha}$ test of our null hypothesis, we should reject null hypothesis when ${R\leq c}$ where ${c}$ satisfies

$\displaystyle \begin{aligned} \mathbb{P}(R<c\mid H_0)\leq \alpha. \end{aligned} \ \ \ \ \ (4)$

However, the main issue is that we don’t know the distribution of ${R}$ under ${H_0}$ even if we know how to sample from each ${f_\theta}$ and the functional form of ${f_\theta}$ for each ${\theta}$ . The reason is that ${H_0}$ did not specify which ${\theta^*}$ (which equals to ${\theta^*=\theta_1=\dots =\theta_n}$ ) generates the data. So the distribution of ${R}$ may depend on ${\theta^*}$ as well and the real thing we need for ${c}$ is

$\displaystyle \begin{aligned} \sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*)\leq \alpha. \end{aligned} \ \ \ \ \ (5)$

Thus even if we want to know approximate the ${\sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*)}$ through computational methods, we have to simulate for each ${\theta^* \in \mathbb{R}^s}$ . As ${\Theta_0}$ could be rather large (in fact as large as ${\mathbb{R}^s}$ ), approximation can be time consuming as well.

Fortunately, if ${\Omega}$ is the so called location-scale family, we find that the distribution of ${R}$ is independent of ${\theta^*}$ and we are free to chose whichever ${\theta^*}$ we like. Let us define what is location-scale family, then state the theorem and prove it.

Definition 1 Suppose we have a family of probability densities ${\Omega}$ on ${\mathbb{R}^d}$ indexed by ${\theta =(\mu,\Sigma)}$ where ${\mu \in \mathbb{R}^d}$ and ${\Sigma\in GL(\mathbb{R},d)}$ , the set of invertible matrices in ${\mathbb{R}^{d \times d}}$ . The family ${\Omega}$ is a local-scale family if there is a family member ${f}$ (called pivot) such that for any other ${f_\theta}$ with ${\theta =(\mu,\Sigma)}$ ,

$\displaystyle f_\theta (x) = f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}.$

Thus if ${Z\in \mathbb{R}^d}$ follows ${f}$ , then ${X=\Sigma Z+\mu}$ has probability density ${f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}}$ . Indeed, for any Borel set ${B\subset \mathbb{R}^d }$

$\displaystyle \begin{aligned} \mathop{\mathbb P}(X\in B ) &=\mathop{\mathbb P}(\Sigma Z+\mu\in B)\\ & = \mathop{\mathbb P}(Z\in \Sigma ^{-1}(B-\mu)) \\ & = \int_{ \Sigma ^{-1}(B-\mu )} f(z) dz \\ & =\int_B f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}dx \end{aligned} \ \ \ \ \ (6)$

where we use a change of variable ${x=\Sigma z+\mu}$ in the last equality and the last equality shows ${X}$ follows ${f(\Sigma^{-1}(x-\mu))\frac{1}{|\det(\Sigma)|}}$ . We are now ready to state the theorem and prove it.

Theorem 2 Suppose our family of distribution ${\Omega}$ is a local-scale family, then under the null hypothesis, there is a ${\theta^*= (\mu^*,\Sigma^*)}$ such that each ${X_i}$ follows ${f_{\theta^*}(x) = \frac{1}{|\det(\Sigma^*)|}f((\Sigma^*)^{-1}(x-\mu^*))}$ and the distribution of ${R}$ is independent of ${\theta^*}$ .

Since the distribution of ${R}$ is independent of ${\theta^*}$ under the null. This means that for any ${\theta \in \mathbb{R}^d}$ , and any ${c}$

$\displaystyle \begin{aligned} \sup_{\theta^* \in\mathbb{R}^s }\mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta^*) = \mathbb{P}(R<c\mid H_0,\theta_1=\dots =\theta_n =\theta). \end{aligned} \ \ \ \ \ (7)$

Thus we can choose any family member of ${\Omega}$ to sample ${X_i}$ and approximates the distribution of ${R}$ using empirical distribution as long as ${\Omega}$ is a location-scale family!

Proof: We need to show that the ratio ${R}$ has distribution independent of ${\theta^*}$ . Since ${X_i \sim \frac{1}{|\det(\Sigma^*)|}f((\Sigma^*)^{-1}(x-\mu^*))}$ and ${\Omega}$ is a location scale family, we can assume they are generated via ${Z_i}$ where ${Z_i}$ follows a pivot ${f}$ and ${X_i = \Sigma^*Z_i+\mu^*}$ . Then the likelihood of ${\theta_1,\dots, \theta_n}$ is

$\displaystyle \begin{aligned} \mathcal{L}_{(X_1,\dots,X_n) }(\theta_1,\dots,\theta_n) &=\prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}(X_i-\mu_i))\\ &=\prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}(X_i-\mu_i))\\ & = \prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}(\Sigma^*Z_i+\mu^*-\mu_i))\\ &= \prod_{i=1}^n\frac{1}{|\det(\Sigma_i)|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*)))\\ & = \bigr(|\det((\Sigma^*)^{-1}|)\bigr)^n\prod_{i=1}^n\frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*))).\\ \end{aligned} \ \ \ \ \ (8)$

Thus the likelihood ratio ${R}$ reduces to

$\displaystyle \begin{aligned} R &= \frac{\sup_{(\mu,\Sigma)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{|\det(\Sigma(\Sigma^*)^{-1})|} f(\Sigma^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu-\mu^*)))}{\sup_{(\mu_i,\Sigma_i)\in \mathbb{R}^d\times GL(\mathbb{R},d),\,i=1,\dots,n} \prod_{i=1}^n\frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*))}\\ &=\frac{\sup_{(\mu,\Sigma)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{|\det(\Sigma(\Sigma^*)^{-1})|} f(\Sigma^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu-\mu^*))) }{\prod_{i=1}^n\sup_{(\mu_i,\Sigma_i)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \frac{1}{|\det(\Sigma_i(\Sigma^*)^{-1})|} f(\Sigma_i^{-1}\Sigma^*(Z_i-(\Sigma^*)^{-1}(\mu_i -\mu^*))}\\ \end{aligned} \ \ \ \ \ (9)$

Now let’s define ${\hat{\Sigma} = (\Sigma^*)^{-1}\Sigma}$ , ${\hat{\mu} = (\Sigma^*)^{-1}(\mu-\mu^*) }$ , ${\hat{\mu}_i = (\Sigma^*)^{-1}(\mu_i -\mu^*) }$ and ${\hat{\Sigma}_i= (\Sigma^*)^{-1}\Sigma_i }$ . Note that since ${\mu,\Sigma}$ , ${\mu_i,\Sigma_i}$ can vary all over the space ${\mathbb{R}^d\times GL(\mathbb{R},d)}$ , so is ${\hat{\mu},\hat{\Sigma}}$ , ${\hat{\mu}_i}$ and ${\hat{\Sigma}_i}$ . The equality (10) can be rewritten as

$\displaystyle \begin{aligned} R &= \frac{\sup_{(\hat{\mu},\hat{\Sigma})\in \mathbb{R}^d\times GL(\mathbb{R},d)} \prod_{i=1}^n\frac{1}{\det(\hat{\Sigma})} f((\hat{\Sigma})^{-1}(Z_i-\hat{\mu})) }{\prod_{i=1}^n\sup_{(\hat{\mu}_i,\hat{\Sigma}_i)\in \mathbb{R}^d\times GL(\mathbb{R},d)} \frac{1}{\det(\hat{\Sigma}_i)} f((\hat{\Sigma}_i)^{-1}(Z_i -\hat{\mu}_i))}.\\ \end{aligned} \ \ \ \ \ (10)$

As we just argued, ${\hat{\mu},\hat{\Sigma}}$ , ${\hat{\mu}_i}$ and ${\hat{\Sigma}_i}$ can vary all over the space without any restriction, the supremum in the numerator and denominator thus does not depend on the choice ${\mu^*}$ and ${\Sigma^*}$ at all. So our theorem is proved. $\Box$

Three Squirrels

Little Discoveries of Optimization, Statistics, and Probability

Month: December 2018

Testing homogeneity of data for location-scale family