# Hypothesis Testing

**Hypothesis testing** is a disciplined framework for adjudicating whether observed data do not support a given hypothesis.

Consider an unknown distribution from which we will observe $n$ samples $X_1, \ldots, X_n$.

- We state a hypothesis $H_0$, called the **null hypothesis**, about the distribution.
- We come up with a **test statistic** $T$, which is a function of the data $X_1, \ldots, X_n$, for which we can evaluate the distribution of $T$ assuming the null hypothesis.
- We give an **alternative hypothesis** $H_a$ under which $T$ is expected to be significantly different from its value under $H_0$.
- We give a significance level $\alpha$ (like 5% or 1%), and based on $H_a$ we determine a set of values for $T$, called the *critical region*, which $T$ would be in with probability at most $\alpha$ under the null hypothesis.
- **After setting $H_0$, $H_a$, $\alpha$, $T$, and the critical region**, we run the experiment, evaluate $T$ on the samples we get, and record the result as $t_{\text{obs}}$.
- If $t_{\text{obs}}$ falls in the critical region, we reject the null hypothesis. The corresponding ***p*-value** is defined to be the minimum $\alpha$-value which would have resulted in rejecting the null hypothesis, with the critical region chosen in the same way.

**Example**

Muriel Bristol claims that she can tell by taste whether the tea or the milk was poured into the cup first. She is given eight cups of tea, four poured milk-first and four poured tea-first.

We posit a null hypothesis that she isn't able to discern the pouring method, and an alternative hypothesis that she can tell the difference. How many cups does she have to identify correctly to reject the null hypothesis with 95% confidence?

*Solution.* Under the null hypothesis, the number of milk-first cups she identifies correctly is 4 with probability $1/\binom{8}{4} = 1/70 \approx 1.4\%$ and at least 3 with probability $17/70 \approx 24\%$. Therefore, at the 5% significance level, only a correct identification of all the cups would give us grounds to reject the null hypothesis. The $p$-value in that case would be 1.4%.
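The two probabilities in this solution can be checked by counting. Here is a minimal Julia sketch, assuming she always labels exactly four cups as milk-first, so that the number of milk-first cups she gets right is hypergeometric under the null hypothesis. The helper name `prob_correct` is ours.

```julia
# Probability of getting exactly k of the 4 milk-first cups right when
# 4 of the 8 cups are chosen uniformly at random (the null hypothesis).
prob_correct(k) = (binomial(4, k) * binomial(4, 4 - k)) // binomial(8, 4)

p_all_four = prob_correct(4)                          # exact rational value
p_at_least_three = prob_correct(3) + prob_correct(4)  # exact rational value

println(p_all_four, " ≈ ", float(p_all_four))
println(p_at_least_three, " ≈ ", float(p_at_least_three))
```

Working with rationals (`//`) keeps the counts exact, so the results can be compared directly with $1/70$ and $17/70$.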

Failure to reject the null hypothesis is not necessarily evidence *for* the null hypothesis. The **power** of a hypothesis test is the conditional probability of rejecting the null hypothesis given that the alternative hypothesis is true. A $p$-value may be high either because the null hypothesis is true or because the test has low power.
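Power can be estimated by simulation. The sketch below rests on assumptions that are not part of the text above: a one-sided test of mean zero on normal data with known standard deviation 1, rejecting when $\sqrt{n}\,\overline{X} > 1.645$. The function name `estimate_power` and the values of `δ` and `n` are hypothetical.

```julia
using Random, Statistics

# Fraction of simulated experiments in which the test rejects, when the
# data really have mean δ (δ = 0 corresponds to the null hypothesis).
function estimate_power(δ, n; trials = 20_000, rng = MersenneTwister(1))
    rejections = 0
    for _ in 1:trials
        x = δ .+ randn(rng, n)        # n draws from a normal with mean δ, sd 1
        z = sqrt(n) * mean(x)         # standardized sample mean (σ = 1 known)
        rejections += z > 1.645       # one-sided 5% critical value
    end
    return rejections / trials
end

println(estimate_power(0.0, 10))   # close to the significance level
println(estimate_power(0.3, 10))   # small effect: usually not detected
println(estimate_power(1.0, 10))   # large effect: usually detected
```

With a small effect and only ten observations the test rejects in a minority of runs, so a failure to reject says little about whether the null hypothesis is true.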

## The Wald test and the t-test

**Definition**

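The **Wald test** is based on the normal approximation. Consider a null hypothesis $\theta = \theta_0$ and the alternative hypothesis $\theta \neq \theta_0$, and suppose that the estimator $\widehat{\theta}$ is approximately normally distributed. The Wald test rejects the null hypothesis at the 5% significance level if $\left|\frac{\widehat{\theta} - \theta_0}{\operatorname{se}(\widehat{\theta})}\right| > 1.96$.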
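As a minimal sketch, the rejection rule can be written as a one-line function: reject when the estimate lies more than 1.96 standard errors from the hypothesized value. The name `wald_rejects`, its argument names, and the numbers in the usage lines are ours.

```julia
# Two-sided Wald test at the 5% level: reject when the estimate is more
# than 1.96 standard errors away from the hypothesized value θ₀.
wald_rejects(θhat, θ₀, se) = abs((θhat - θ₀) / se) > 1.96

println(wald_rejects(0.53, 0.5, 0.01))  # 3 standard errors away: true
println(wald_rejects(0.51, 0.5, 0.01))  # 1 standard error away: false
```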

**Example**

Consider the alternative hypothesis that 8-cylinder engines have lower fuel economy than 6-cylinder engines (with null hypothesis that they are the same). Apply the Wald test, using the data below from the R dataset `mtcars`.

```
six_cyl_mpgs = [21.0, 21.0, 21.4, 18.1, 19.2, 17.8, 19.7]
eight_cyl_mpgs = [18.7, 14.3, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 15.5, 15.2, 13.3, 19.2, 15.8, 15.0]
```

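*Solution.* We frame the problem as a question about whether the *difference in means* between the distribution of 8-cylinder `mpg` values and the distribution of 6-cylinder `mpg` values is zero. We use the difference between the sample means $\overline{X}_6$ and $\overline{X}_8$ to estimate the difference between the means of the two distributions.

```julia
using Statistics

six = [21.0, 21.0, 21.4, 18.1, 19.2, 17.8, 19.7]
eight = [18.7, 14.3, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 15.5, 15.2, 13.3, 19.2, 15.8, 15.0]
m₁, m₂ = mean(six), mean(eight)       # sample means
s₁, s₂ = std(six), std(eight)         # sample standard deviations
n₁, n₂ = length(six), length(eight)   # sample sizes
```

```r
library(tidyverse)

stats <- mtcars %>%
  group_by(cyl) %>%
  filter(cyl %in% c(6, 8)) %>%
  summarise(m = mean(mpg), S2 = var(mpg), n = n(), se = sqrt(S2 / n))
```

Given that the distribution of 8-cylinder `mpg` values has variance $\sigma_8^2$, the sample mean of $n_8$ such values has variance $\sigma_8^2/n_8$, which we estimate by $s_8^2/n_8$ using the sample variance $s_8^2$. The same reasoning applies to the 6-cylinder values.

Under the null hypothesis, therefore, $\overline{X}_6 - \overline{X}_8$ has mean zero and estimated standard error $\sqrt{s_6^2/n_6 + s_8^2/n_8}$. Dividing the observed difference by this standard error gives the test statistic:

```r
z <- (stats$m[1] - stats$m[2]) / sqrt(sum(stats$se^2))
```

```julia
z = (m₁ - m₂) / sqrt(s₁^2/n₁ + s₂^2/n₂)
```

This returns $z \approx 5.29$. Because the alternative hypothesis is one-sided, we reject at the 5% significance level when $z > 1.645$, so we reject the null hypothesis. The $p$-value of this test is `1-cdf(Normal(0,1),z)`, which is about $6.1 \times 10^{-8}$ (in Julia, `Normal` and `cdf` come from the `Distributions` package).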
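For reference, the arithmetic behind the statistic can be written out with the numbers from the data above, rounded to three decimals (subscripts 6 and 8 denote the 6-cylinder and 8-cylinder samples, with sample sizes 7 and 14):

$$
\overline{X}_6 \approx 19.743, \quad \overline{X}_8 = 15.100, \quad s_6^2 \approx 2.113, \quad s_8^2 \approx 6.554,
$$

$$
z = \frac{19.743 - 15.100}{\sqrt{2.113/7 + 6.554/14}} \approx \frac{4.643}{0.877} \approx 5.29.
$$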

The Wald test can be overconfident because it doesn't account for the fact that the standard deviation values are estimated from the data:

**Exercise**

Experiment with the code block below to see how, even when the initial distribution is normal, standardizing the mean using the *estimated* standard deviation results in a non-normal distribution. How is this distribution different? Around what value of `n` does the histogram become hard to tell apart from the standard normal density?

```julia
using Plots, Distributions, Statistics

n = 6   # sample size; try larger values
μ = 3   # true mean
sample(n) = [μ + 0.5randn() for _ in 1:n]                  # n normal draws with mean μ, sd 0.5
standardize(X) = (mean(X) - μ) / (std(X) / √(length(X)))   # uses the estimated sd
histogram([standardize(sample(n)) for _ in 1:1_000_000],
          xlims = (-6, 6), normalize = :pdf, label = "standardized mean")
plot!(-6:0.05:6, x -> pdf(Normal(0, 1), x), linewidth = 3,
      label = "standard normal density", opacity = 0.75)
```

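*Solution.* The distribution of the standardized mean has heavier tails than the standard normal distribution, and the discrepancy shrinks as $n$ grows.

If $X_1, \ldots, X_n$ are independent normal random variables with mean $\mu$, and $\overline{X}$ and $S$ denote their sample mean and sample standard deviation, then $\frac{\overline{X} - \mu}{S/\sqrt{n}}$ has the **t-distribution** with $n - 1$ **degrees of freedom**.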
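The heavier tails can also be seen numerically. The sketch below uses the same setup as the exercise (normal draws with mean 3 and standard deviation 0.5); the function name `tail_fraction` and the seed are ours. If the standardized mean were standard normal, it would fall outside $[-1.96, 1.96]$ about 5% of the time.

```julia
using Random, Statistics

# Fraction of simulated samples whose standardized mean, computed with the
# sample standard deviation, lands outside [-1.96, 1.96].
function tail_fraction(n; trials = 200_000, μ = 3.0, σ = 0.5, rng = MersenneTwister(42))
    hits = 0
    for _ in 1:trials
        x = μ .+ σ .* randn(rng, n)
        t = (mean(x) - μ) / (std(x) / sqrt(n))
        hits += abs(t) > 1.96
    end
    return hits / trials
end

println(tail_fraction(6))    # noticeably above 0.05
println(tail_fraction(50))   # much closer to 0.05
```

For $n = 6$ the fraction is roughly twice the nominal 5%, which is why a Wald test with a small sample rejects too often.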

**Exercise**

Use your knowledge of the t-distribution to test the hypothesis that the distribution used to generate the following list of numbers has mean greater than 4.

Note: you can create an object to represent the t-distribution with `ν` degrees of freedom using the expression `TDist(ν)`. To evaluate its cumulative distribution function at `x`, use `cdf(TDist(ν), x)`.

```julia
X = [4.1, 5.12, 3.39, 4.97, 3.07, 4.17, 4.46, 5.53, 3.28, 3.62]
```

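*Solution.* We define the statistic

```julia
using Statistics, Distributions

t = (mean(X) - 4) / (std(X) / sqrt(length(X)))   # the standard error is S/√n
```

which is approximately 0.642, and then compute the one-sided $p$-value

```julia
1 - cdf(TDist(length(X) - 1), t)
```

which is about 27%. So we are not able to reject the null hypothesis at the 5% significance level.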

There are a variety of

**Exercise**

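Redo the mpg problem above with *Welch's t-test* instead of the Wald test. This test says that the statistic

$$
t = \frac{\overline{X}_1 - \overline{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$$

is, under the null hypothesis, approximately $t$-distributed with

$$
\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}
$$

degrees of freedom. Here subscript 1 refers to the 6-cylinder sample and subscript 2 to the 8-cylinder sample, as in the code.

```julia
using Statistics

six = [21.0, 21.0, 21.4, 18.1, 19.2, 17.8, 19.7]
eight = [18.7, 14.3, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 15.5, 15.2, 13.3, 19.2, 15.8, 15.0]
m₁, m₂ = mean(six), mean(eight)
s₁, s₂ = std(six), std(eight)
n₁, n₂ = length(six), length(eight)
```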

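*Solution.* We calculate

```julia
using Distributions

a = s₁^2/n₁
b = s₂^2/n₂
ν = (a + b)^2 / (a^2/(n₁-1) + b^2/(n₂-1))   # Welch's degrees of freedom, about 18.5
t = (m₁ - m₂)/sqrt(a+b)                      # same statistic as in the Wald test
ccdf(TDist(ν), t)                            # one-sided p-value
```

(Note that `ccdf` is the same as `1-cdf`.) The value returned is about $2.3 \times 10^{-5}$, so we still reject the null hypothesis at the 5% significance level.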

## Random Permutation Test

The following test is more flexible than the Wald test, since it doesn't rely on the normal approximation. It's based on a simple idea: if there's no difference in labels, the data shouldn't look very different if we shuffle them around.

**Definition**

The **random permutation test** is applicable when the null hypothesis is that two distributions are the same.

- We compute the difference between the sample means for the two groups.
- We randomly re-assign the group labels and compute the resulting sample mean differences. Repeat many times.
- We check where the original difference falls in the sorted list of re-sampled differences.
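The three steps above can be sketched in Julia as follows. This is one possible implementation, not a reference one: the function name `permutation_test`, the number of shuffles, and the choice to report the fraction of shuffled differences at least as large as the observed one are all assumptions made for illustration, and the data in the usage line are made up.

```julia
using Random, Statistics

# Returns the fraction of random relabelings whose absolute difference in
# sample means is at least as large as the observed one (an estimated p-value).
function permutation_test(x, y; shuffles = 10_000, rng = MersenneTwister(0))
    observed = abs(mean(x) - mean(y))
    pooled = vcat(x, y)
    nx = length(x)
    at_least_as_large = 0
    for _ in 1:shuffles
        shuffle!(rng, pooled)   # randomly re-assign the group labels
        d = abs(mean(view(pooled, 1:nx)) - mean(view(pooled, nx+1:length(pooled))))
        at_least_as_large += d >= observed - 1e-12   # tolerance for rounding
    end
    return at_least_as_large / shuffles
end

# Made-up data: clearly separated groups give a small estimated p-value.
println(permutation_test([10.1, 9.8, 10.4, 10.0, 9.9], [12.2, 11.9, 12.5, 12.1, 12.0]))
```

Reporting the fraction of shuffles at least as extreme as the observed difference is equivalent to checking where the observed value falls in the sorted list of resampled differences.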

**Example**

Suppose the heights of the Romero sons are 72, 69, 68, and 66 inches, and the heights of the Larsen sons are 70, 65, and 64 inches. Consider the null hypothesis that the height distributions for the two families are the same, with the alternative hypothesis that they are not. Determine whether a random permutation test applied to the absolute sample mean difference rejects the null hypothesis at significance level $\alpha = 5\%$.

*Solution.* We find that the absolute sample mean difference of about 2.4 inches is larger than only about 68% of the mean differences obtained by resampling many times.

```r
set.seed(123)
romero <- c(72, 69, 68, 66)
larsen <- c(70, 65, 64)
actual.diff <- abs(mean(romero) - mean(larsen))
resample.diff <- function(n) {
  shuffled <- sample(c(romero, larsen))
  abs(mean(shuffled[1:4]) - mean(shuffled[5:7]))
}
# proportion of resampled differences smaller than the observed one
mean(sapply(1:10000, resample.diff) < actual.diff)
```

Since 68% < 95%, we retain the null hypothesis.
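Because there are only $\binom{7}{4} = 35$ ways to assign four of the seven heights to the Romero family, the random shuffling can also be replaced by an exact enumeration. The Julia sketch below is an illustrative alternative, not the method used above; it uses rational arithmetic so that ties are compared exactly, and the variable names are ours.

```julia
romero = [72, 69, 68, 66]
larsen = [70, 65, 64]
heights = vcat(romero, larsen)

avg(v) = sum(v) // length(v)                    # exact rational mean
actual_diff = abs(avg(romero) - avg(larsen))    # 29//12, about 2.4 inches

# Each 7-bit mask with exactly four bits set picks one possible "Romero" group.
diffs = Rational{Int}[]
for mask in 0:(2^7 - 1)
    count_ones(mask) == 4 || continue
    in_first = [((mask >> (i - 1)) & 1) == 1 for i in 1:7]
    push!(diffs, abs(avg(heights[in_first]) - avg(heights[.!in_first])))
end

smaller = count(<(actual_diff), diffs)
println("$smaller of $(length(diffs)) relabelings give a smaller difference")
# prints "24 of 35 relabelings give a smaller difference"
```

The exact proportion $24/35 \approx 68.6\%$ agrees with the resampling estimate of about 68%.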

## Multiple testing

If we conduct many hypothesis tests, then the probability of obtaining some false rejections is high. This is called the **multiple testing problem**.
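To see how quickly this probability grows, suppose (as a simplifying assumption not stated above) that the tests are independent and that every null hypothesis is true. Then the chance of at least one false rejection among $m$ tests at level $\alpha$ is $1 - (1 - \alpha)^m$. The function name `prob_some_false_rejection` is ours.

```julia
# Probability of at least one false rejection among m independent tests,
# each run at significance level α, when all null hypotheses are true.
prob_some_false_rejection(α, m) = 1 - (1 - α)^m

for m in (1, 10, 100)
    println(m, " tests: ", round(prob_some_false_rejection(0.05, m), digits = 3))
end
```

With ten tests at the 5% level the probability is already about 40%.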

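The **Bonferroni method** is to reject the null hypothesis only for those tests whose $p$-values are less than $\alpha/m$, where $\alpha$ is the desired significance level and $m$ is the number of tests being run. This ensures that the probability of even one false rejection is at most $\alpha$.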
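Here is a minimal sketch of the Bonferroni rule, assuming it compares each $p$-value with $\alpha/m$ for $m$ tests. The $p$-values below are hypothetical numbers chosen for illustration; they are not the data from the example that follows. The function name `bonferroni_reject` is ours.

```julia
# Indices of the tests rejected by the Bonferroni method at overall level α.
bonferroni_reject(pvalues; α = 0.05) = findall(p -> p < α / length(pvalues), pvalues)

# Hypothetical p-values for 5 tests; the threshold is 0.05 / 5 = 0.01.
pvals = [0.20, 0.004, 0.03, 0.0001, 0.012]
println(bonferroni_reject(pvals))   # tests 2 and 4
```

Tests 3 and 5 would be significant on their own at the 5% level but do not survive the correction.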

**Example**

Suppose that 10 different genes are tested to determine whether they have an effect on heart disease. The 10

Which results are reported as significant at the 5% level, according to the Bonferroni method?

*Solution.* At the 5% level, only

Hypothesis testing is often viewed by learners of statistics as potentially misleading. In fact, this thought is not uncommon among professional statisticians and other scientists as well. See, for example, this comment in Nature, which was part of a widespread discussion of

Despite these concerns, it's useful to understand the basics of hypothesis testing, because it remains a widely used framework, and it conveys a critical lesson about the hazards of extracting hypotheses from data rather than the other way around (using data to scrutinize hypotheses).