Hypothesis Testing
Hypothesis testing is a disciplined framework for adjudicating whether observed data do not support a given hypothesis.
Consider an unknown distribution from which we will observe $n$ samples $X_1, \ldots, X_n$.
- We state a hypothesis $H_0$, called the null hypothesis, about the distribution.
- We come up with a test statistic $T$, which is a function of the data $X_1, \ldots, X_n$, for which we can evaluate the distribution of $T$ assuming the null hypothesis.
- We give an alternative hypothesis $H_a$ under which $T$ is expected to be significantly different from its value under $H_0$.
- We give a significance level $\alpha$ (like 5% or 1%), and based on $H_a$ we determine a set of values for $T$, called the critical region, which $T$ would be in with probability at most $\alpha$ under the null hypothesis.
- After setting $H_0$, $H_a$, $\alpha$, $T$, and the critical region, we run the experiment, evaluate $T$ on the samples we get, and record the result as $t_{\text{obs}}$.
- If $t_{\text{obs}}$ falls in the critical region, we reject the null hypothesis. The corresponding p-value is defined to be the minimum $\alpha$-value which would have resulted in rejecting the null hypothesis, with the critical region chosen in the same way.
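The steps above can be sketched in Julia on a toy problem that is not part of the original text: deciding from 20 flips whether a coin favors heads. The coin setup, the made-up observation of 16 heads, and the helper names `null_pmf` and `upper_tail` are all illustrative assumptions.

```julia
# Hypothetical setup: H₀ says the coin is fair, Hₐ says heads is more likely.
# The test statistic T is the number of heads in n flips.
n = 20
α = 0.05

# Distribution of T under H₀: Binomial(n, 1/2)
null_pmf(k) = binomial(n, k) / 2^n

# Probability under H₀ that T ≥ c
upper_tail(c) = sum(null_pmf(k) for k in c:n)

# Critical region {c, …, n}: the smallest c whose tail probability is at most α
c = first(k for k in 0:n if upper_tail(k) ≤ α)

# "Run the experiment" (16 heads is a made-up observation)
t_obs = 16
reject = t_obs ≥ c

# p-value: the smallest α for which t_obs would fall in the critical region
p_value = upper_tail(t_obs)

println("critical region: T ≥ $c, reject: $reject, p-value: $(round(p_value, digits = 4))")
```

Because Hₐ says heads is more likely, the critical region is taken in the upper tail; a two-sided alternative would split the probability $\alpha$ between both tails.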
Example
Muriel Bristol claims that she can tell by taste whether the tea or the milk was poured into the cup first. She is given eight cups of tea, four poured milk-first and four poured tea-first.
We posit a null hypothesis that she isn't able to discern the pouring method, and an alternative hypothesis that she can tell the difference. How many cups does she have to identify correctly to reject the null hypothesis with 95% confidence?
Solution. Under the null hypothesis, the number of milk-first cups she identifies correctly is 4 with probability $1/\binom{8}{4} = 1/70 \approx 1.4\%$ and at least 3 with probability $17/70 \approx 24\%$. Therefore, at the 5% significance level, only a correct identification of all the cups would give us grounds to reject the null hypothesis. The $p$-value in that case would be 1.4%.
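The two probabilities in this solution can be checked by counting. The sketch below assumes that, under the null hypothesis, her choice of which four cups to call milk-first is a uniformly random 4-element subset of the eight cups; the helper name `ways` is illustrative.

```julia
# All equally likely ways to pick 4 of the 8 cups as "milk-first"
total = binomial(8, 4)                             # 70

# Number of picks with exactly k of the 4 true milk-first cups among them
ways(k) = binomial(4, k) * binomial(4, 4 - k)

p_all_four = ways(4) // total                      # 1//70
p_at_least_three = (ways(4) + ways(3)) // total    # 17//70

println(float(p_all_four))          # ≈ 0.0143
println(float(p_at_least_three))    # ≈ 0.2429
```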
Failure to reject the null hypothesis is not necessarily evidence for the null hypothesis. The power of a hypothesis test is the conditional probability of rejecting the null hypothesis given that the alternative hypothesis is true. A $p$-value may be high either because the null hypothesis is true or because the test has low power.
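To make the notion of power concrete, here is a small sketch for a hypothetical coin-flip test (20 flips, rejecting the hypothesis of a fair coin when at least 15 are heads). The test, the alternative values 0.6 and 0.9, and the name `reject_prob` are assumptions chosen for illustration.

```julia
# Hypothetical test: n = 20 flips, H₀: P(heads) = 1/2, reject when heads ≥ 15
n, c = 20, 15

# Probability of landing in the critical region when each flip is heads with probability p
reject_prob(p) = sum(binomial(n, k) * p^k * (1 - p)^(n - k) for k in c:n)

size_of_test = reject_prob(0.5)   # chance of rejecting a true null hypothesis (≈ 0.021)
power_at_06  = reject_prob(0.6)   # power against a mildly biased coin
power_at_09  = reject_prob(0.9)   # power against a heavily biased coin

println((size_of_test, power_at_06, power_at_09))
```

Against the mildly biased coin the test usually fails to reject even though the null hypothesis is false, which is why a failure to reject is weak evidence on its own.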
The Wald test and the t-test
Definition
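The Wald test is based on the normal approximation. Consider a null hypothesis $\theta = 0$ and the alternative hypothesis $\theta \neq 0$, and suppose that the estimator $\widehat{\theta}$ is approximately normally distributed. The Wald test rejects the null hypothesis at the 5% significance level if $|\widehat{\theta}| > 1.96 \operatorname{se}(\widehat{\theta})$, where $\operatorname{se}(\widehat{\theta})$ is the standard error of $\widehat{\theta}$.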
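A minimal sketch of this rule as a Julia function follows. The function name `wald_test`, its keyword argument, and the numbers in the usage line are hypothetical; the cutoff 1.96 is recovered as the 97.5th percentile of the standard normal distribution.

```julia
using Distributions

# Wald test of H₀: θ = 0 against Hₐ: θ ≠ 0 (hypothetical helper).
# `estimate` is the value of the estimator and `se` its standard error.
function wald_test(estimate, se; α = 0.05)
    z = estimate / se
    cutoff = quantile(Normal(0, 1), 1 - α / 2)    # ≈ 1.96 when α = 0.05
    p_value = 2 * ccdf(Normal(0, 1), abs(z))      # two-sided p-value
    return (z = z, reject = abs(z) > cutoff, p_value = p_value)
end

result = wald_test(1.3, 0.5)   # made-up numbers: z = 2.6, so H₀ is rejected
```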
Example
Consider the alternative hypothesis that 8-cylinder engines have lower fuel economy than 6-cylinder engines (with null hypothesis that they are the same). Apply the Wald test, using the data below from the R dataset mtcars.
six_cyl_mpgs = [21.0, 21.0, 21.4, 18.1, 19.2, 17.8, 19.7]
eight_cyl_mpgs = [18.7, 14.3, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 15.5, 15.2, 13.3, 19.2, 15.8, 15.0]
Solution. We frame the problem as a question about whether the difference in means between the distribution of 8-cylinder mpg values and the distribution of 6-cylinder mpg values is zero. We use the difference between the sample means $\overline{X}_1$ (6-cylinder) and $\overline{X}_2$ (8-cylinder) as an estimator of the difference in means of the two distributions.
using Statistics
six = [21.0, 21.0, 21.4, 18.1, 19.2, 17.8, 19.7]
eight = [18.7, 14.3, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 15.5, 15.2, 13.3, 19.2, 15.8, 15.0]
m₁, m₂ = mean(six), mean(eight)
s₁, s₂ = std(six), std(eight)
n₁, n₂ = length(six), length(eight)
library(tidyverse)
stats <- mtcars %>%
  group_by(cyl) %>%
  filter(cyl %in% c(6,8)) %>%
  summarise(m = mean(mpg), S2 = var(mpg), n = n(), se = sqrt(S2/n))
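Given that the distribution of 8-cylinder mpg values has variance $\sigma_2^2$ and the distribution of 6-cylinder mpg values has variance $\sigma_1^2$, the difference of the sample means $\overline{X}_1 - \overline{X}_2$ (6-cylinder minus 8-cylinder, with sample sizes $n_1$ and $n_2$) has variance $\sigma_1^2/n_1 + \sigma_2^2/n_2$ when the two samples are independent. We estimate this variance by $s_1^2/n_1 + s_2^2/n_2$, where $s_1^2$ and $s_2^2$ are the sample variances.
Under the null hypothesis, therefore, $z = (\overline{X}_1 - \overline{X}_2)/\sqrt{s_1^2/n_1 + s_2^2/n_2}$ is approximately standard normal. Computing this value with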
z <- (stats$m[1] - stats$m[2]) / sqrt(sum(stats$se^2))
z = (m₁ - m₂) / sqrt(s₁^2/n₁ + s₂^2/n₂)
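returns approximately 5.29, so we reject the null hypothesis at the 5% significance level. The corresponding one-sided $p$-value, 1-cdf(Normal(0,1),z), is about $6.1 \times 10^{-8}$.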
The Wald test can be overconfident because it doesn't account for the fact that the standard deviation values are estimated from the data:
Exercise
Experiment with the code block below to see how, even when the initial distribution is normal, standardizing the mean using the estimated standard deviation results in a non-normal distribution. How is this distribution different? Around what value of $n$ does the distribution become visually indistinguishable from the standard normal distribution?
using Plots, Distributions
n = 6
μ = 3
sample(n) = [μ + 0.5randn() for _ in 1:n]
standardize(X) = (mean(X) - μ)/(std(X)/√(length(X)))
histogram([standardize(sample(n)) for _ in 1:1_000_000],
           xlims = (-6,6), normed=true, label="standardized mean")
plot!(-6:0.05:6, x-> pdf(Normal(0,1),x), linewidth = 3,
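      label = "standard normal density", opacity = 0.75)
Solution. The distribution of the standardized mean has heavier tails than the standard normal distribution. By around $n = 30$ the two are hard to tell apart by eye.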
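If $X_1, \ldots, X_n$ are independent samples from a normal distribution with mean $\mu$, then the standardized mean $(\overline{X} - \mu)/(S/\sqrt{n})$, where $S$ is the sample standard deviation, follows the t-distribution with $n - 1$ degrees of freedom.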
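The practical effect of the heavier tails can be quantified with the Distributions package. The sketch below assumes $n = 6$ normal samples, as in the simulation above, and compares the normal cutoff 1.96 with the matching cutoff of the t-distribution with 5 degrees of freedom; the variable names are illustrative.

```julia
using Distributions

n = 6
normal_cutoff = quantile(Normal(0, 1), 0.975)   # ≈ 1.96
t_cutoff = quantile(TDist(n - 1), 0.975)        # ≈ 2.57 for 5 degrees of freedom

# Chance that the standardized mean of n normal samples exceeds 1.96 in absolute
# value: the rate at which a nominal 5% Wald test rejects a true null hypothesis
actual_level = 2 * ccdf(TDist(n - 1), normal_cutoff)

println((normal_cutoff, t_cutoff, actual_level))
```

With so few samples the nominal 5% test rejects a true null hypothesis roughly twice as often as advertised, which is the sense in which the Wald test is overconfident.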
Exercise
Use your knowledge of the t-distribution to test the hypothesis that the distribution used to generate the following list of numbers has mean greater than 4.
Note: you can create an object to represent the t-distribution with ν degrees of freedom using the expression TDist(ν). To evaluate its cumulative distribution function at x, use cdf(TDist(ν), x).
X = [4.1, 5.12, 3.39, 4.97, 3.07, 4.17, 4.46, 5.53, 3.28, 3.62]
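Solution. We define the statistic $t = (\overline{X} - 4)/(S/\sqrt{n})$, where $S$ is the sample standard deviation and $n$ is the number of samples:
t = (mean(X) - 4) / (std(X)/sqrt(length(X)))
which is approximately 0.642, and then
1 - cdf(TDist(length(X)-1), t)
which is about 27%. So we are not able to reject the null hypothesis at the 5% significance level.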
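There are a variety of t-tests, including Welch's t-test for comparing the means of two samples whose variances may differ.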
Exercise
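Redo the mpg problem above with Welch's t-test instead of the Wald test. Writing $\overline{X}_i$, $s_i^2$, and $n_i$ for the sample mean, sample variance, and size of group $i$, this test says that the statistic
$$t = \frac{\overline{X}_1 - \overline{X}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$
is, under the null hypothesis, approximately t-distributed with
$$\nu = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$
degrees of freedom.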
using Statistics
six = [21.0, 21.0, 21.4, 18.1, 19.2, 17.8, 19.7]
eight = [18.7, 14.3, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 15.5, 15.2, 13.3, 19.2, 15.8, 15.0]
m₁, m₂ = mean(six), mean(eight)
s₁, s₂ = std(six), std(eight)
n₁, n₂ = length(six), length(eight)
Solution. We calculate
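using Distributions
a = s₁^2/n₁   # estimated variance of the 6-cylinder sample mean
b = s₂^2/n₂   # estimated variance of the 8-cylinder sample mean
ν = (a + b)^2 / (a^2/(n₁-1) + b^2/(n₂-1))   # Welch degrees of freedom
t = (m₁ - m₂)/sqrt(a+b)
ccdf(TDist(ν), t)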
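(Note that ccdf is the same as 1-cdf.) The value returned is approximately $2.3 \times 10^{-5}$, so we again reject the null hypothesis at the 5% significance level.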
Random Permutation Test
The following test is more flexible than the Wald test, since it doesn't rely on the normal approximation. It's based on a simple idea: if the group labels make no difference, the data shouldn't look very different when we shuffle the labels around.
Definition
The random permutation test is applicable when the null hypothesis is that two distributions are the same.
- We compute the difference between the sample means for the two groups.
- We randomly re-assign the group labels and compute the resulting sample mean differences. Repeat many times.
- We check where the original difference falls in the sorted list of re-sampled differences.
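The three steps above can be sketched as a short Julia function. The name `permutation_fraction`, its keyword arguments, and the two small data sets are hypothetical; the function reports the fraction of random relabelings whose absolute mean difference is strictly smaller than the observed one.

```julia
using Random, Statistics

# Hypothetical helper: where does the observed |mean difference| fall among
# the differences obtained after randomly reassigning the group labels?
function permutation_fraction(a, b; trials = 10_000, rng = Random.default_rng())
    observed = abs(mean(a) - mean(b))
    pooled = vcat(a, b)
    na = length(a)
    smaller = 0
    for _ in 1:trials
        shuffled = shuffle(rng, pooled)          # random re-assignment of labels
        d = abs(mean(shuffled[1:na]) - mean(shuffled[na+1:end]))
        smaller += d < observed
    end
    return smaller / trials
end

# Made-up data: two clearly separated groups
frac = permutation_fraction([5.1, 4.8, 5.6, 5.3, 4.9], [3.9, 4.2, 3.7, 4.4, 4.0];
                            rng = MersenneTwister(1))
```

A fraction above $1 - \alpha$ places the observed difference in the critical region. For the made-up data here nearly every relabeling gives a smaller difference, so the test rejects.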
Example
Suppose the heights of the Romero sons are 72, 69, 68, and 66 inches, and the heights of the Larsen sons are 70, 65, and 64 inches. Consider the null hypothesis that the height distributions for the two families are the same, with the alternative hypothesis that they are not. Determine whether a random permutation test applied to the absolute sample mean difference rejects the null hypothesis at significance level $\alpha = 5\%$.
Solution. We find that the absolute sample mean difference of about 2.4 inches is larger than only about 68% of the mean differences obtained by resampling many times.
set.seed(123)
romero <- c(72, 69, 68, 66)
larsen <- c(70, 65, 64)
actual.diff <- abs(mean(romero) - mean(larsen))
resample.diff <- function(n) {
  shuffled <- sample(c(romero,larsen))
  abs(mean(shuffled[1:4]) - mean(shuffled[5:7]))
}
# proportion of resampled differences smaller than the observed one
mean(sapply(1:10000, resample.diff) < actual.diff)
Since 68% < 95%, we retain the null hypothesis.
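Because there are only $\binom{7}{4} = 35$ ways to assign four of the seven heights to the first family, the 68% figure can be cross-checked by enumerating every relabeling instead of sampling. The Julia sketch below is an illustrative alternative to the random resampling, not the method used in the solution; the bitmask enumeration is just one convenient way to list the subsets.

```julia
using Statistics

romero = [72, 69, 68, 66]
larsen = [70, 65, 64]
pooled = vcat(romero, larsen)
observed = abs(mean(romero) - mean(larsen))

# Each 7-bit mask with exactly four 1-bits picks which heights get the first label
diffs = Float64[]
for mask in 0:(2^7 - 1)
    count_ones(mask) == 4 || continue
    in_first = [(mask >> (i - 1)) & 1 == 1 for i in 1:7]
    push!(diffs, abs(mean(pooled[in_first]) - mean(pooled[.!in_first])))
end

# Fraction of relabelings with a strictly smaller difference: 24/35 ≈ 0.686
fraction_smaller = count(d -> d < observed, diffs) / length(diffs)
```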
Multiple testing
If we conduct many hypothesis tests, then the probability of obtaining some false rejections is high. This is called the multiple testing problem.
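To see how quickly the problem grows, the sketch below computes the probability of at least one false rejection among $m$ tests run at level $\alpha$. It assumes the tests are independent and that every null hypothesis is true; the function name `familywise_error` is illustrative.

```julia
# Probability of at least one false rejection among m independent tests,
# each run at level α, when every null hypothesis is true
familywise_error(m; α = 0.05) = 1 - (1 - α)^m

for m in (1, 5, 20, 100)
    println("m = $m: ", round(familywise_error(m), digits = 3))
end
```

With 20 tests at the 5% level, the chance of at least one false rejection is already about 64%.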

[Figure: xkcd comic. Credit: xkcd.com]
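The Bonferroni method is to reject the null hypothesis only for those tests whose $p$-values are less than $\alpha/m$, where $m$ is the number of hypothesis tests being run. This ensures that the probability of making even one false rejection is at most $\alpha$.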
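A minimal sketch of the Bonferroni rule follows: with $m$ tests at overall level $\alpha$, a test is reported as significant only if its $p$-value is below $\alpha/m$. The function name `bonferroni_rejections` and the five $p$-values are made up for illustration.

```julia
# Indices of the tests rejected by the Bonferroni method at overall level α
bonferroni_rejections(pvalues; α = 0.05) =
    findall(p -> p < α / length(pvalues), pvalues)

# Made-up p-values for five hypothetical tests
pvalues = [0.001, 0.03, 0.2, 0.008, 0.012]
rejected = bonferroni_rejections(pvalues)   # threshold is 0.05 / 5 = 0.01
```

Here only the first and fourth tests are reported as significant, even though four of the five $p$-values are below 5%.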
Example
Suppose that 10 different genes are tested to determine whether they have an effect on heart disease. The 10 
Which results are reported as significant at the 5% level, according to the Bonferroni method?
Solution. At the 5% level, only 
Hypothesis testing is often viewed by learners of statistics as potentially misleading. In fact, this thought is not uncommon among professional statisticians and other scientists as well. See, for example, this comment in Nature, which was part of a widespread discussion of $p$-values and statistical significance.
Despite these concerns, it's useful to understand the basics of hypothesis testing, because it remains a widely used framework, and conveys a critical lesson about the hazards of extracting hypotheses from data rather than the other way around (using data to scrutinize hypotheses).