
Statistics: Hypothesis Testing

Reading time: ~35 min

Hypothesis testing is a disciplined framework for adjudicating whether observed data support a given hypothesis.

Consider an unknown distribution from which we will observe n samples X_1, \ldots, X_n.

  • We state a hypothesis H_0—called the null hypothesis—about the distribution.
  • We come up with a test statistic T, which is a function of the data X_1, \ldots, X_n, for which we can evaluate the distribution of T assuming the null hypothesis.
  • We give an alternative hypothesis H_{\mathrm{a}} under which T is expected to be significantly different from its value under H_0.
  • We give a significance level \alpha (like 5% or 1%), and based on H_{\mathrm{a}} we determine a set of values for T—called the critical region—which T would be in with probability at most \alpha under the null hypothesis.
  • After setting \boldsymbol{H_0}, \boldsymbol{H_{\mathrm{a}}}, \boldsymbol{\alpha}, \boldsymbol{T}, and the critical region, we run the experiment, evaluate T on the samples we get, and record the result as t_{\mathrm{obs}}.
  • If t_{\mathrm{obs}} falls in the critical region, we reject the null hypothesis. The corresponding p-value is defined to be the minimum \alpha-value which would have resulted in rejecting the null hypothesis, with the critical region chosen in the same way.

Muriel Bristol claims that she can tell by taste whether the tea or the milk was poured into the cup first. She is given eight cups of tea, four poured milk-first and four poured tea-first.

We posit a null hypothesis that she isn't able to discern the pouring method, and an alternative hypothesis that she can tell the difference. How many cups does she have to identify correctly to reject the null hypothesis with 95% confidence?

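Solution. Assuming she knows that four cups were poured each way, her answer amounts to choosing which four cups were milk-first. Under the null hypothesis, all \binom{8}{4} = 70 choices are equally likely, so the number of milk-first cups identified correctly is 4 with probability 1/\binom{8}{4} \approx 1.4% and at least 3 with probability 17/70 \approx 24%. Therefore, at the 5% significance level, only a correct identification of all the cups would give us grounds to reject the null hypothesis. The p-value in that case would be 1.4%.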
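The probabilities in this solution can be checked with a short Julia sketch. It assumes, as the count 1/\binom{8}{4} does, that under the null hypothesis every choice of four cups as milk-first is equally likely; the name p_correct is introduced here purely for illustration.

```julia
# Assumption: under H_0 she picks 4 of the 8 cups as "milk-first" uniformly at random.
# p_correct(k) is the probability that exactly k of the 4 milk-first cups are picked.
p_correct(k) = binomial(4, k) * binomial(4, 4 - k) / binomial(8, 4)

p_all_four = p_correct(4)                        # equals 1/70
p_at_least_three = p_correct(3) + p_correct(4)   # equals 17/70

println("P(all four correct) = ", round(p_all_four, digits = 4))
println("P(at least three correct) = ", round(p_at_least_three, digits = 4))
```

Only the first of these two probabilities is below 5%, which is why three correct cups would not be enough.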

Failure to reject the null hypothesis is not necessarily evidence for the null hypothesis. The power of a hypothesis test is the conditional probability of rejecting the null hypothesis given that the alternative hypothesis is true. A p-value may be high either because the null hypothesis is true or because the test has low power.

The Wald test and the t-test

The Wald test is based on the normal approximation. Consider a null hypothesis \theta = 0 and the alternative hypothesis \theta \neq 0, and suppose that \widehat{\theta} is approximately normally distributed. The Wald test rejects the null hypothesis at the 5% significance level if |\widehat{\theta}| > 1.96 \operatorname{se}(\widehat{\theta}).
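As a minimal sketch, the rejection rule can be written as a one-line Julia function. The names wald_rejects, estimate, se, and critical_value are hypothetical and chosen here for illustration; the default 1.96 corresponds to the 5% level described above.

```julia
# Hypothetical helper: two-sided Wald test of θ = 0.
# `estimate` plays the role of the estimator's observed value and
# `se` is its (estimated) standard error.
wald_rejects(estimate, se; critical_value = 1.96) = abs(estimate) > critical_value * se

println(wald_rejects(2.5, 1.0))    # estimate is more than 1.96 standard errors from zero
println(wald_rejects(1.5, 1.0))    # estimate is within 1.96 standard errors of zero
println(wald_rejects(-2.5, 1.0))   # the rule is symmetric in the sign of the estimate
```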

Consider the alternative hypothesis that 8-cylinder engines have lower fuel economy than 6-cylinder engines (with null hypothesis that they are the same). Apply the Wald test, using the data below from the R dataset mtcars.

six_cyl_mpgs = [21.0, 21.0, 21.4, 18.1, 19.2, 17.8, 19.7]
eight_cyl_mpgs = [18.7, 14.3, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 15.5, 15.2, 13.3, 19.2, 15.8, 15.0]

Solution. We frame the problem as a question about whether the difference in means between the distribution of 8-cylinder mpg values and the distribution of 6-cylinder mpg values is zero. We use the difference between the sample means \overline{X} and \overline{Y} of the two populations as an estimator of the difference in means. If we think of the records in the data frame as independent, then \overline{X} and \overline{Y} are independent. Since each is approximately normally distributed by the central limit theorem, their difference is therefore also approximately normal. So, let's calculate the sample mean and sample variance for the 8-cylinder cars and for the 6-cylinder cars.

using Statistics
six = [21.0, 21.0, 21.4, 18.1, 19.2, 17.8, 19.7]
eight = [18.7, 14.3, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 15.5, 15.2, 13.3, 19.2, 15.8, 15.0]
m₁, m₂ = mean(six), mean(eight)
s₁, s₂ = std(six), std(eight)
n₁, n₂ = length(six), length(eight)

library(dplyr)  # provides %>%, group_by, filter, summarise, and n()
stats <- mtcars %>%
  group_by(cyl) %>%
  filter(cyl %in% c(6,8)) %>%
  summarise(m = mean(mpg), S2 = var(mpg), n = n(), se = sqrt(S2/n))

Given that the distribution of 8-cylinder mpg values has variance \sigma_{\mathrm{eight}}^2, the variance of the sample mean \overline{X} is \sigma_{\mathrm{eight}}^2/n_{\mathrm{eight}}, where n_{\mathrm{eight}} is the number of 8-cylinder vehicles (and similarly for \overline{Y}). Therefore, we estimate the variance of the difference in sample means as

\begin{align*} \operatorname{Var}(\overline{X} - \overline{Y}) = \operatorname{Var}(\overline{X}) + \operatorname{Var}(\overline{Y}) =\sigma_{\mathrm{eight}}^2/n_{\mathrm{eight}} + \sigma_{\mathrm{six}}^2/n_{\mathrm{six}}.\end{align*}

Under the null hypothesis, therefore, \overline{X} - \overline{Y} has mean zero and standard error \sqrt{\sigma_{\mathrm{eight}}^2/n_{\mathrm{eight}} + \sigma_{\mathrm{six}}^2/n_{\mathrm{six}}}. We therefore reject the null hypothesis with 95% confidence if the value of \overline{X} - \overline{Y} divided by its estimated standard error exceeds 1.96. We find that

z <- (stats$m[1] - stats$m[2]) / sqrt(sum(stats$se^2))
z = (m₁ - m₂) / sqrt(s₁^2/n₁ + s₂^2/n₂)

returns 5.29, so we do reject the null hypothesis at the 95% confidence level. The p-value of this test is 1-cdf(Normal(0,1),z) = 6.08 \times 10^{-8}.

The Wald test can be overconfident because it doesn't account for the fact that the standard deviation values are estimated from the data:

Experiment with the code block below to see how, even when the initial distribution is normal, standardizing the mean using the estimated standard deviation results in a non-normal distribution. How is this distribution different? Around what value of n does the graph become visually indistinguishable from the normal distribution (in this visualization)?

using Plots, Distributions
n = 6
μ = 3
sample(n) = [μ + 0.5randn() for _ in 1:n]
standardize(X) = (mean(X) - μ)/(std(X)/√(length(X)))
histogram([standardize(sample(n)) for _ in 1:1_000_000],
           xlims = (-6,6), normed=true, label="standardized mean")
plot!(-6:0.05:6, x-> pdf(Normal(0,1),x), linewidth = 3,
      label = "standard normal density", opacity = 0.75)

Solution. The distribution of \frac{A_n - \mu}{S/\sqrt{n}} (where A_n denotes the sample mean) apparently has heavier tails than the normal distribution. Based on the graph, it appears that this effect is more noticeable for n less than 30 than for n greater than 100. (Both of these numbers are arbitrary; the main point is that it doesn't take huge values of n for the distribution to start looking fairly normal.)

If X_1, X_2, \ldots, X_n is a sequence of independent normal random variables with mean \mu and variance \sigma^2, let's define \overline{X} to be the average of the X_i's and S^2 to be the sample variance, so S^{2}=\frac{1}{n-1} \sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2}. Then the distribution of (\overline{X} - \mu)/(S/\sqrt{n}) is called the t-distribution with n-1 degrees of freedom.
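One way to see the heavier tails is to compare two-sided 5% critical values. The sketch below assumes the Distributions package is installed; the sample sizes in the loop are arbitrary choices for illustration.

```julia
using Distributions

# Two-sided 5% critical value for the standard normal distribution (about 1.96).
z_crit = quantile(Normal(0, 1), 0.975)

# Corresponding critical values for the t-distribution with n - 1 degrees of freedom.
for n in (6, 10, 30, 100)
    t_crit = quantile(TDist(n - 1), 0.975)
    println("n = $n: t critical value = ", round(t_crit, digits = 3),
            ", normal critical value = ", round(z_crit, digits = 3))
end
```

The t critical values are larger than the normal one and shrink toward it as n grows, so a test based on the t-distribution demands more evidence when the sample is small.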

Use your knowledge of the t-distribution to test the hypothesis that the distribution used to generate the following list of numbers has mean greater than 4.

Note: you can create an object to represent the t-distribution with ν degrees of freedom using the expression TDist(ν). To evaluate its cumulative distribution function at x, use cdf(TDist(ν), x).

X = [4.1, 5.12, 3.39, 4.97, 3.07, 4.17, 4.46, 5.53, 3.28, 3.62]

Solution. We define the statistic t = (\overline{X} - 4)/(S/\sqrt{n}), which under the null hypothesis is t-distributed with 9 degrees of freedom. We compute

t = (mean(X) - 4) / (std(X)/sqrt(length(X)))

which is approximately 0.64, and then

1 - cdf(TDist(length(X)-1), t)

which is about 27%. Since this p-value is well above 5%, we cannot reject the null hypothesis at the 5% significance level.

There are a variety of t-tests, including one appropriate to the mpg problem discussed above:

Redo the mpg problem above with Welch's t-test instead of the Wald test. This test says that the statistic

\begin{align*}t = \frac{\overline{X}_{1}-\overline{X}_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}\end{align*}

is, under the null hypothesis, t-distributed with

\begin{align*}\frac{\left(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}\right)^{2}}{\frac{\left(s_{1}^{2} / n_{1}\right)^{2}}{n_{1}-1}+\frac{\left(s_{2}^{2} / n_{2}\right)^{2}}{n_{2}-1}}\end{align*}

degrees of freedom.

using Statistics, Distributions
six = [21.0, 21.0, 21.4, 18.1, 19.2, 17.8, 19.7]
eight = [18.7, 14.3, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 15.5, 15.2, 13.3, 19.2, 15.8, 15.0]
m₁, m₂ = mean(six), mean(eight)
s₁, s₂ = std(six), std(eight)
n₁, n₂ = length(six), length(eight)

Solution. We calculate

a = s₁^2/n₁
b = s₂^2/n₂
ν = (a + b)^2 / (a^2/(n₁-1) + b^2/(n₂-1))
t = (m₁ - m₂)/sqrt(a+b)
ccdf(TDist(ν), t)

(Note that ccdf is the same as 1-cdf.) The value returned is 2.27 \times 10^{-5}, so we still reject the null hypothesis, but the p-value is higher than what we got previously.

Random Permutation Test

The following test is more flexible than the Wald test, since it doesn't rely on the normal approximation. It's based on a simple idea: if there's no difference in labels, the data shouldn't look very different if we shuffle them around.

The random permutation test is applicable when the null hypothesis is that two distributions are the same.

  • We compute the difference between the sample means for the two groups.
  • We randomly re-assign the group labels and compute the resulting sample mean differences. Repeat many times.
  • We check where the original difference falls in the sorted list of re-sampled differences.

Suppose the heights of the Romero sons are 72, 69, 68, and 66 inches, and the heights of the Larsen sons are 70, 65, and 64 inches. Consider the null hypothesis that the height distributions for the two families are the same, with the alternative hypothesis that they are not. Determine whether a random permutation test applied to the absolute sample mean difference rejects the null hypothesis at significance level \alpha = 0.05.

Solution. We find that the absolute sample mean difference of about 2.4 inches is larger than only about 68% of the mean differences obtained by resampling many times.

romero <- c(72, 69, 68, 66)
larsen <- c(70, 65, 64)
actual.diff <- abs(mean(romero) - mean(larsen))

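# Shuffle the pooled heights and recompute the absolute difference in group means.
resample.diff <- function(n) {
  shuffled <- sample(c(romero,larsen))
  abs(mean(shuffled[1:4]) - mean(shuffled[5:7]))
}

# Proportion of resampled differences smaller than the observed one.
mean(sapply(1:10000,resample.diff) < actual.diff)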

Since 68% < 95%, we retain the null hypothesis.
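Because there are only \binom{7}{3} = 35 ways to split seven heights into groups of four and three, the resampling can also be replaced by an exact enumeration. The following Julia sketch is one way to do that, not the method used above; abs_diff, fraction_smaller, and p_value are names introduced here, and rational arithmetic is used so that ties are compared exactly.

```julia
romero = [72, 69, 68, 66]
larsen = [70, 65, 64]
heights = vcat(romero, larsen)
total = sum(heights)

# Absolute difference in group means when the three-person group has height sum s.
# Rational arithmetic (//) avoids floating-point trouble when values tie.
abs_diff(s) = abs((total - s) // 4 - s // 3)
actual = abs_diff(sum(larsen))

# Enumerate every way to choose which three heights form the smaller group.
n = length(heights)
diffs = [abs_diff(heights[i] + heights[j] + heights[k])
         for i in 1:n for j in (i + 1):n for k in (j + 1):n]

fraction_smaller = count(<(actual), diffs) / length(diffs)   # strictly smaller differences
p_value = count(>=(actual), diffs) / length(diffs)           # at least as extreme

println("fraction smaller = ", round(fraction_smaller, digits = 3))
println("exact p-value = ", round(p_value, digits = 3))
```

Here 24 of the 35 relabelings give a strictly smaller difference (24/35 is about 0.686), in line with the roughly 68% found by random resampling, and the exact p-value of 11/35 is far above 0.05.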

Multiple testing

If we conduct many hypothesis tests, then the probability of obtaining some false rejections is high. This is called the multiple testing problem.

Credit: xkcd.com
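To get a feel for how quickly false rejections accumulate, here is a small Julia sketch. It rests on two simplifying assumptions that the text does not make: the tests are independent, and every null hypothesis is true. The function name prob_any_false_rejection is hypothetical.

```julia
# Assumption: m independent tests, all null hypotheses true, each run at level α.
# Each test avoids a false rejection with probability 1 - α, so
# P(at least one false rejection) = 1 - (1 - α)^m.
prob_any_false_rejection(α, m) = 1 - (1 - α)^m

for m in (1, 5, 20, 100)
    println("m = $m: ", round(prob_any_false_rejection(0.05, m), digits = 3))
end
```

Under these assumptions, twenty tests at the 5% level already produce at least one false rejection about 64% of the time.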

The Bonferroni method is to reject the null hypothesis only for those tests whose p-values are less than \alpha divided by the number of hypothesis tests being run. This ensures that the probability of having even one false rejection is less than \alpha, so it is very conservative.

Suppose that 10 different genes are tested to determine whether they have an effect on heart disease. The 10 p-values resulting from these hypothesis tests are (rounded to the nearest hundredth of a percent):

\begin{align*}0.89\%, 2.71\%, 9.11\%, 2.18\%, 9.17\%, 7.48\%, 5.0\%, 2.02\%, 5.22\%, 9.46\%\end{align*}

Which results are reported as significant at the 5% level, according to the Bonferroni method?

Solution. At the 5% level, only p values less than 5%/10 = 0.5% are reported as significant (since we ran ten hypothesis tests). Since none of the p values are below 0.5%, none of the genes will be considered significant.
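The same check can be sketched in Julia. The p-values are the ones listed in the exercise, written as proportions; the variable names are chosen here for illustration.

```julia
# p-values from the exercise, expressed as proportions rather than percentages.
p_values = [0.0089, 0.0271, 0.0911, 0.0218, 0.0917, 0.0748, 0.05, 0.0202, 0.0522, 0.0946]
α = 0.05

threshold = α / length(p_values)                 # Bonferroni threshold for ten tests
significant = findall(<(threshold), p_values)    # indices of tests that are rejected
uncorrected = count(<(α), p_values)              # rejections if no correction were applied

println("threshold = ", threshold)
println("significant after Bonferroni: ", significant)
println("below 5% without correction: ", uncorrected)
```

Without the correction, four of the ten p-values fall below 5%; with it, none survive, which illustrates how conservative the method is.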

Hypothesis testing is often viewed by learners of statistics as potentially misleading. In fact, this thought is not uncommon among professional statisticians and other scientists as well. See, for example, this comment in Nature, which was part of a widespread discussion of p-values in the statistics community in early 2019.

Despite these concerns, it's useful to understand the basics of hypothesis testing, because it remains a widely used framework, and conveys a critical lesson about the hazards of extracting hypotheses from data rather than the other way around (using data to scrutinize hypotheses).
