5  Hypothesis Testing

Goals:

  1. Use permutation tests to compare the centers of two groups without assuming normality.
  2. Conduct an asymptotic likelihood ratio test in a logistic regression setting.
  3. Use simulation to approximate the power of a test (empirical power).

Lab 5.1: Permutation Tests

In this lab, we use permutation tests to assess whether there is evidence that the mean delay times of two airlines differ, and whether there is evidence that completing a difficult task with a dog in the room lowers stress levels, on average.

Example 1: Flight Data

Consider again the flight delay data. Here, we will consider whether or not there is statistical evidence that the mean delay of United Airlines is different from the mean delay of American Airlines.

Recall that, for a permutation test, we do not need to assume that the underlying populations follow a normal distribution. However, if we are interested in testing a difference in means, we do need to assume that the populations have the same variance and the same shape (but possibly a different center). Note that, if we have a randomized experiment, then these assumptions can be relaxed.

First, we obtain some summary statistics and make a plot of the data:

library(tidyverse)
library(resampledata)
delay_df <- FlightDelays |> as_tibble() |>
  select(Carrier, Delay, everything())
## set.seed(13617)  ## uncomment to make the permutations below reproducible

delay_df |> group_by(Carrier) |>
  summarise(n = n(),
            xbar = mean(Delay),
            sd = sd(Delay))
# A tibble: 2 × 4
  Carrier     n  xbar    sd
  <fct>   <int> <dbl> <dbl>
1 AA       2906  10.1  40.1
2 UA       1123  16.0  45.1
ggplot(delay_df,
       aes(x = Carrier, y = Delay)) + 
  geom_boxplot(fill = "steelblue") +
  theme_minimal()

We next store the observed difference in sample means as a value called teststat:

teststat <- delay_df |> group_by(Carrier) |>
  summarise(mean_delay = mean(Delay)) |>
  pull(mean_delay) |>
  diff()
## teststat is the United mean delay minus the American mean delay
## (diff() subtracts in the order of the Carrier factor levels: AA, then UA)

Under the null hypothesis, the distribution of delays is identical for American and United Airlines. To reshuffle the delay values randomly across both airlines, we can use the code below:

delay_df |>
  mutate(delay_perm = sample(Delay, size = nrow(delay_df), replace = FALSE)) |>
  relocate(delay_perm)
# A tibble: 4,029 × 11
   delay_perm Carrier Delay    ID FlightNo Destination DepartTime Day   Month
        <int> <fct>   <int> <int>    <int> <fct>       <fct>      <fct> <fct>
 1         -4 UA         -1     1      403 DEN         4-8am      Fri   May  
 2         -6 UA        102     2      405 DEN         8-Noon     Fri   May  
 3         -2 UA          4     3      409 DEN         4-8pm      Fri   May  
 4          0 UA         -2     4      511 ORD         8-Noon     Fri   May  
 5         87 UA         -3     5      667 ORD         4-8am      Fri   May  
 6        120 UA          0     6      669 ORD         4-8am      Fri   May  
 7         -9 UA         -5     7      673 ORD         8-Noon     Fri   May  
 8         -3 UA          0     8      677 ORD         8-Noon     Fri   May  
 9         -6 UA         10     9      679 ORD         Noon-4pm   Fri   May  
10          6 UA         60    10      681 ORD         Noon-4pm   Fri   May  
# ℹ 4,019 more rows
# ℹ 2 more variables: FlightLength <int>, Delayed30 <fct>

Rerun the previous code chunk a few times to see different permutations of the Delay variable. (Note that sample(Delay), with no other arguments, also returns a random permutation of Delay; the size and replace arguments are spelled out above for clarity.)

With one permutation, we want to recalculate the difference in means:

delay_df |>
  mutate(delay_perm = sample(Delay, size = nrow(delay_df), replace = FALSE)) |>
  relocate(delay_perm) |>
  group_by(Carrier) |>
  summarise(mean_delay_perm = mean(delay_perm)) |>
  pull(mean_delay_perm) |>
  diff()
[1] 1.032533

Rerun the code chunk above a few times to get a few different mean differences under different permutations. The big idea is that a large number of these mean differences will form the null distribution for the difference in sample means (if there really is no difference in the underlying population distributions).

So, let’s wrap the code above in a function, and then iterate over that function a large number of times to obtain our null distribution:

get_delay_perm_diff <- function() {
  delay_df |>
    mutate(delay_perm = sample(Delay, size = nrow(delay_df), replace = FALSE)) |>
    relocate(delay_perm) |>
    group_by(Carrier) |>
    summarise(mean_delay_perm = mean(delay_perm)) |>
    pull(mean_delay_perm) |>
    diff()
}

n_perm <- 5000
diff_means <- map_dbl(1:n_perm, \(i) get_delay_perm_diff())
null_df <- tibble(diff_means)

ggplot(data = null_df, aes(x = diff_means)) +
  geom_histogram(colour = "orange4", fill = "orange2", bins = 15) +
  theme_minimal()

Exercise. Write a 1-2 sentence interpretation of what this null distribution means in context of the problem.

Exercise. Write a 1-2 sentence explanation on what the code above is doing.

Now, let’s add a vertical line to the graph marking where the test statistic from our data falls:

ggplot(data = null_df, aes(x = diff_means)) +
  geom_histogram(colour = "orange4", fill = "orange2", bins = 15) +
  geom_vline(xintercept = teststat, colour = "grey75") +
  theme_minimal()

Exercise. Before explicitly calculating a p-value, assess visually whether a p-value for this hypothesis test will be relatively large or relatively small.

Now, let’s explicitly compute a conservative p-value for the test:

(sum(diff_means > abs(teststat)) + sum(diff_means < -abs(teststat)) + 1) / (n_perm + 1)
[1] 0.00019996
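
Equivalently, the two tail counts can be collapsed into a single count using absolute values (the same calculation as above, just written more compactly):

(sum(abs(diff_means) > abs(teststat)) + 1) / (n_perm + 1)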

Exercise. What is the alternative hypothesis for the test that produced this p-value?

Exercise. What is the purpose of the + 1 in the code above?

Exercise. Why are there two different summations in the code above to get the p-value?

Exercise. Write a conclusion in context of the problem for this hypothesis test.

Example 2: Stress Levels

Many of you took our probability exams with the adorable Mipha providing moral support throughout the exam. There has actually been formal research assessing whether being in the presence of a dog can have beneficial effects for people. In one such study, researchers recruited 30 women who were self-proclaimed dog lovers. They randomly assigned 15 women to complete a stressful task alone (control: C) and 15 women to complete a stressful task with a pet dog present (pet: P). The response variable is heart rate during the task, with higher heart rates presumed to mean that there was more stress during the task.

Use the code below to read in the data and do some brief data exploration:

library(tidyverse)
stress_df <- read_csv("https://raw.githubusercontent.com/highamm/stat326_labs/master/data/stress_dog.csv")
Rows: 30 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): group
dbl (1): rate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Exploration:

ggplot(data = stress_df, aes(x = group, y = rate, fill = group)) +
  geom_boxplot(outlier.shape = 8) +
  theme_minimal() +
  scale_fill_viridis_d(begin = 0.4, end = 0.9) +
  guides(fill = "none")

stress_df |> group_by(group) |>
  summarise(mean = mean(rate),
            n = n())
# A tibble: 2 × 3
  group  mean     n
  <chr> <dbl> <int>
1 C      82.5    15
2 P      73.5    15

Exercise. What do you find from your data exploration plot? What are the observed sample mean heart rates for each group?


Test Statistic:

## observed difference in sample means
teststat <- stress_df |> group_by(group) |>
  summarise(mean = mean(rate),
            n = n()) |>
  pull(mean) |>
  diff()
## Pet minus Control
teststat
[1] -9.041

Exercise. Modify the following code to create the null distribution for a test of whether the mean heart rate differs between the two groups.

get_stress_perm_diff <- function() {
  stress_df |>
    mutate(rate_perm = sample(____,
                                size = nrow(____), replace = FALSE)) |>
    relocate(rate_perm) |>
    group_by(_______) |>
    summarise(mean_rate_perm = mean(rate_perm)) |>
    pull(mean_rate_perm) |>
    diff()
}
get_stress_perm_diff()

n_perm <- 5000
diff_means <- map_dbl(1:n_perm, \(i) get_stress_perm_diff())
null_df <- tibble(diff_means)

ggplot(data = null_df, aes(x = diff_means)) +
  geom_histogram(colour = "skyblue4", fill = "skyblue1", bins = 15) +
  ## add test stat to null dist
  geom_vline(xintercept = teststat, colour = "purple") +
  theme_minimal()

Exercise. Calculate a p-value for the test.

Exercise. Write a conclusion in context of the problem for this hypothesis test.


Lab 5.2: Asymptotic LRTs in Practice

In this lab, we will see an example of how an asymptotic likelihood ratio test is used in practice.

Logistic Regression Example

If you took STAT 213, recall the logistic regression model that you used in that course:

\(Y_i \sim \text{Bernoulli}(p_i)\), where each \(Y_i\) is independent of all other \(Y_j\) but the \(Y_i\) are not identically distributed. Instead, each \(Y_i\) is allowed to have its own probability of success, \(p_i\), which can be modeled as:

\[ \text{E}(Y_i) = p_i = \frac{\exp(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki})}{1 + \exp(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki})}, \]

where \(x_{1i}, x_{2i}, \ldots, x_{ki}\) are possible predictor variables for the probability of success.

Recall: For what type of response variable was logistic regression an appropriate model?

Recall: What values is the quantity \(\frac{\exp(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki})}{1 + \exp(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki})}\) bounded by?
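
If you want to check this numerically, note that this quantity is the inverse logit function, which base R provides as plogis() (a quick sanity check; it is not needed for the rest of the lab):

## plogis(x) returns exp(x) / (1 + exp(x)); every output lies strictly between 0 and 1
plogis(c(-100, -5, 0, 5, 100))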

Consider the following data set on survival of passengers aboard the Titanic.

library(tidyverse)

titanic_df <- read_csv("https://raw.githubusercontent.com/highamm/stat326_labs/master/data/titanic.csv",
                       col_types = list(Pclass = col_factor())) |>
  filter(!is.na(Age) & !is.na(Sex)) |>
  mutate(Pclass = fct_recode(Pclass,
                             "1st" = "1",
                             "2nd" = "2",
                             "3rd" = "3"))
titanic_df |> slice(1:3) |>
  dplyr::select(Survived, Pclass, Name, Sex, Age) |>
  pander::pander()
Survived   Pclass   Name                                                   Sex      Age
--------   ------   ----------------------------------------------------   ------   ---
0          3rd      Braund, Mr. Owen Harris                                male     22
1          1st      Cumings, Mrs. John Bradley (Florence Briggs Thayer)    female   38
1          3rd      Heikkinen, Miss. Laina                                 female   26

Exercise. Suppose that we want to build a logistic regression model with Survived (a 1 is survived and a 0 is died) as the response variable and Age and Sex as predictors. We want to test whether there is an association between Age and whether or not a passenger survived (after accounting for the effect of Sex).

  1. Write out the logistic regression model with Age and Sex as predictors.

  2. Write the null hypothesis for our hypothesis test.

  3. Write both the “null model” and the “alternative model.”

  4. What is the value of the maximum likelihood under the null model? We will start by writing the likelihood function, assuming that the null model is true.

  5. What is the maximum likelihood under the alternative model? We will start by writing the likelihood function, assuming that the alternative model is true.
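
As a starting point for items 4 and 5, recall that, because the \(Y_i\) are independent Bernoulli random variables, the likelihood is the product of the individual Bernoulli probability mass functions:

\[ L(\boldsymbol{\beta}) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}, \]

where, under the null model, \(p_i = \frac{\exp(\beta_0 + \beta_1 \text{Sex}_i)}{1 + \exp(\beta_0 + \beta_1 \text{Sex}_i)}\), and, under the alternative model, the linear combination gains an additional \(\beta_2 \text{Age}_i\) term (here \(\text{Sex}_i\) is the indicator for male that R creates by default).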

Now, we have reached a point where the calculations have become a bit too unwieldy. So, we will let R calculate the maximum likelihoods of these two models. R only has functionality to return the log-likelihood of a fitted model, through the logLik() function, so we will back-transform the log-likelihood using exp() to obtain the maximized likelihood for each model.

First, for the “null model” that does not include Age:

titanic_null <- glm(Survived ~ Sex,
                    data = titanic_df, family = "binomial")
logLik(titanic_null) |> exp() |> as.numeric()
[1] 9.716759e-164

Now, we obtain the value of the likelihood maximized under the alternative model (that does include Age):

titanic_alt <- glm(Survived ~ Sex + Age,
                    data = titanic_df, family = "binomial")
logLik(titanic_alt) |> exp() |> as.numeric()
[1] 1.409022e-163

  6. Using these values for the maximum likelihoods of each model, calculate the test statistic for the asymptotic likelihood ratio test.

  7. Using the test statistic, calculate the p-value for the test.

  8. Examine the output from the alternative model below. Can you locate the p-value we just calculated?

broom::tidy(titanic_alt)
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  1.28      0.230       5.55  2.87e- 8
2 Sexmale     -2.47      0.185     -13.3   2.26e-40
3 Age         -0.00543   0.00631    -0.860 3.90e- 1
summary(titanic_alt)

Call:
glm(formula = Survived ~ Sex + Age, family = "binomial", data = titanic_df)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.277273   0.230169   5.549 2.87e-08 ***
Sexmale     -2.465920   0.185384 -13.302  < 2e-16 ***
Age         -0.005426   0.006310  -0.860     0.39    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52  on 713  degrees of freedom
Residual deviance: 749.96  on 711  degrees of freedom
AIC: 755.96

Number of Fisher Scoring iterations: 4
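
As a check on your hand calculation, R can also carry out the likelihood ratio test directly by comparing the two fitted models; anova() with test = "LRT" computes the same statistic from the model deviances:

anova(titanic_null, titanic_alt, test = "LRT")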

Lab 5.3: Empirical Power

We have seen some examples of how to compute power analytically but, as the statistical test we use gets more complex, computing power becomes much more tedious (and sometimes cannot be done exactly). In these scenarios, people often use empirical power to approximate the power of a test for a few different sample sizes or for a few different parameter values under the alternative hypothesis.

In this lab, we will suppose that a researcher wants to assess whether there is evidence that a new drug helps to lower blood pressure. Note that “clinical trials” is a branch of statistics that deals with how to analyze medical data; real clinical trials are more complex than what we are dealing with here, often with many different “phases” and specialized types of analysis.

Recruiting subjects in clinical trials is quite expensive! So the researcher would like to limit costs as much as possible, while still recruiting enough subjects to be able to assess whether the drug is effective.

In particular, the researcher can apply for funding to recruit either 50 participants (25 in a group that receives the new drug and 25 in a group that receives a placebo) or 200 participants (100 in each group).

Finally, the researcher wants to be fairly certain that they do not falsely reject the null hypothesis, so they will only reject the null hypothesis if they get a p-value that is less than or equal to \(\alpha = 0.01\), and they would like to conduct a two-sided test that assumes the population variances are equal.

Exercise. What else do we need to know from the researcher to conduct a power analysis? There are actually a couple of things that we need!

Exercise. What are the null and alternative hypotheses for this test?


Once we have that information, use the following code to compute empirical power.

library(tidyverse)
delta <- 5
sigma <- 20
n <- 10 ## sample size for one group 

compute_twosamp_pval <- function(delta, sigma, n) {
  samp1 <- rnorm(n, 0, sigma)
  samp2 <- rnorm(n, 0 + delta, sigma)
  
  test_out <- t.test(samp1, samp2, var.equal = TRUE)
  p_val <- test_out$p.value
  
  return(p_val)
}

compute_twosamp_pval(delta = delta,
                   sigma = sigma,
                   n = n)
[1] 0.2635063

Exercise. What does delta represent in the code above?

Exercise. We do not know the true mean blood pressures \(\mu_1\) and \(\mu_2\) in the population. So why are we “allowed” to set one of the true means to be 0 in the code above?

Exercise. If we were to repeatedly run the function many times and we were to set delta to be equal to 0, what proportion of p-values would you expect to be less than 0.01?

Now, to calculate empirical power, we need to map through the function we wrote a large number of times, recording a p-value for the test each time we run the function.

n_sim <- 10000
p_vals <- map_dbl(1:n_sim, \(i) compute_twosamp_pval(delta = delta, sigma = sigma, n = n))

And, we will reject the null hypothesis if the p-value we observe from one of the simulations is less than or equal to alpha:

emp_power_df <- tibble(p_vals) |>
  mutate(reject = if_else(p_vals <= 0.01,
                          true = 1, false = 0))
emp_power_df
# A tibble: 10,000 × 2
   p_vals reject
    <dbl>  <dbl>
 1 0.995       0
 2 0.134       0
 3 0.0393      0
 4 0.978       0
 5 0.923       0
 6 0.358       0
 7 0.755       0
 8 0.0200      0
 9 0.169       0
10 0.330       0
# ℹ 9,990 more rows

Exercise. How would you calculate empirical power from the previous data frame? (you do not have to answer this in terms of specific R code but should instead give the idea of how you would calculate empirical power here).

Exercise. Before conducting the power analysis, what do you expect to happen to the power as we increase the sample size? What about as we increase the (absolute) value of delta? What about if we increase alpha?

Exercise. Before conducting the analysis, it might also be nice to check and make sure that our function is working as we would expect. Run the code below. Is the empirical power what you expect? Why or why not?

p_vals_0_25 <- map_dbl(1:n_sim,
                       \(i) compute_twosamp_pval(delta = 0,
                                               sigma = sigma, n = 25))

p_val_df <- tibble(p_vals_0_25) |>
  mutate(across(everything(), ~ if_else(.x <= 0.01,
                                        true = 1, false = 0)))
p_val_df
p_val_df |>
  summarise(across(everything(), ~ mean(.x)))

Exercise. As a class, we will modify the mapping code above to create two different power curves as functions of delta. First, we will create a power curve for the smaller sample size (\(n = 25\) for each group).

Exercise. Next, we will obtain a power curve for the larger sample size (\(n = 100\) for each group).
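
For reference, one possible skeleton for the power-curve code in the two exercises above is given below. This is a sketch only: the grid of delta values and the number of simulations per value are illustrative choices, and you would swap n = 25 for n = 100 to obtain the second curve.

delta_grid <- seq(0, 20, by = 2.5)
n_sim_curve <- 2000   ## simulations per value of delta

power_df <- map_dfr(delta_grid, \(d) {
  p_vals <- map_dbl(1:n_sim_curve,
                    \(i) compute_twosamp_pval(delta = d, sigma = 20, n = 25))
  tibble(delta = d, power = mean(p_vals <= 0.01))
})

ggplot(power_df, aes(x = delta, y = power)) +
  geom_line() +
  geom_point() +
  theme_minimal()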

Exercise. What do you suppose would happen to the power if we fixed the sample size, but increased the population standard deviation? Come up with a hypothesis and we will subsequently test your hypothesis.


Mini Project 5: Advantages and Drawbacks of Using p-values

AI Usage: You may not use generative AI for this project in any way.

Collaboration: For this project, you may work with a self-contained group of 3. Keep in mind that you may not work with the same person on more than one mini-project (so, if you worked with a student on the first, third, or fourth mini-projects as a partner or in a small group, you may not work with that person on this project). Finally, if working with a partner or in a group of 3, your write-ups must be distinct and written individually (so you can chat with your group about what you might say, but you should never copy a group member’s response).

Statement of Integrity: At the top of your submission, copy and paste the following statement and type your name, certifying that you have followed all AI and collaboration rules for this mini-project.

“All work presented is my own, and I have followed all rules for collaboration. I have not used generative AI on this project.”


For this final mini-project, you will read an editorial published in The American Statistician and respond to the questions below. You will find a pdf of the editorial on Canvas.

First, read through the first six sections (stopping at the middle of page 10) of the editorial titled Moving to a World Beyond ‘p < 0.05’ by Wasserstein, Schirm, and Lazar published in The American Statistician in 2019 as the introduction to a special issue about p-values. I strongly encourage you to read the first six sections at least twice before you attempt to respond to the questions that appear below. This mini-project will be graded, primarily, on how thoughtfully you respond to these questions.

Submission: Upload your typed responses to these questions to the assignment in Canvas by the deadline.

Questions:

  1. Towards the end of Section 1, the authors say “As ‘statistical significance’ is used less, statistical thinking will be used more.” Elaborate on what you think the authors mean. Give some examples of what you think embodies “statistical thinking.”

  2. Section 2, third paragraph: The authors state “A label of statistical significance adds nothing to what is already conveyed by the value of \(p\); in fact, this dichotomization of p-values makes matters worse.” Elaborate on what you think the authors mean.

  3. Section 2, end of first column: The authors state “For the integrity of scientific publishing and research dissemination, therefore, whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight.” Do you agree or disagree? How should it be decided which results to present/highlight in scientific publishing?

  4. Section 3, end of page 2: The authors state “The statistical community has not yet converged on a simple paradigm for the use of statistical inference in scientific research – and in fact it may never do so. A one-size-fits-all approach to statistical inference is an inappropriate expectation.” Do you agree or disagree? Explain.

  5. Section 3.2: The authors note that they are envisioning “a sort of ‘statistical thoughtfulness’.” What do you think “statistical thoughtfulness” means? What are some ways to demonstrate “statistical thoughtfulness” in an analysis?

  6. Section 3.2.4: A few of the authors of papers in this special issue argue that some of the terminology used in statistics, such as “significance” and “confidence,” can be misleading, and they propose the use of “compatibility” instead. What do you think they believe the problem is? Do you agree or disagree (that there is a problem and that changing the name will help)?

  7. Find a quote or point that really strikes you (i.e., made you think). What is the quote (and tell me where to find it), and why does it stand out to you?