Making decisions with randomization tests

Lecture 21

2024-06-17

Announcements

Exam 2 on Thursday!
Take-home exam 2 will be released after lab on Thursday and is due 6/22 at 11:59 PM.
Project final presentation during lab on 6/24.
Project final report due 6/25 at 11:59 PM. Submit your report by pushing it to GitHub.
All information regarding the project is on the course webpage: sta199-summer24.github.io/project/

Goals

Review model comparison
Review constructing confidence intervals via bootstrapping
Hypothesis testing, p-values, and making conclusions
- Test a claim about a population parameter
- Use simulation-based methods to generate the null distribution
- Calculate and interpret the p-value
- Use the p-value to draw conclusions in the context of the data and the research question

Review 1 - Comparing models

Model 1

First, predict mpg using gear as the only predictor.

library(tidymodels)

mtcars_fit <- linear_reg() |>
  fit(mpg ~ gear, data = mtcars)

tidy(mtcars_fit)

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)     5.62      4.92      1.14 0.262  
2 gear            3.92      1.31      3.00 0.00540

The regression equation is: \[ \widehat{mpg} = 5.62 + (3.92)\times gear \]

Model 1 evaluation

augment(mtcars_fit, new_data = mtcars) |>
  summarize(SSR = sum(.resid^2))

# A tibble: 1 × 1
    SSR
  <dbl>
1  866.

glance(mtcars_fit)$r.squared

[1] 0.2306734

glance(mtcars_fit)$adj.r.squared

[1] 0.2050292

Interpretation of R-squared: 23.07% of the variability observed in mpg is explained by this regression model.

What is the goal of a regression model? What percentage of the variability do we want to explain?
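To see where the 23.07% comes from, here is a minimal sketch that rebuilds R-squared from the sums of squares. It uses base R `lm()` rather than the `linear_reg()` workflow on the slides (an assumption made to keep the example dependency-free), and the names `fit_1`, `ssr`, `sst`, and `r_squared` are illustrative.

```r
# Base R sketch: lm() stands in for the linear_reg() workflow on the slides.
fit_1 <- lm(mpg ~ gear, data = mtcars)

# Sum of squared residuals: variability in mpg the model leaves unexplained
ssr <- sum(residuals(fit_1)^2)

# Total sum of squares: variability of mpg around its own mean
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# R-squared is the share of total variability that the model explains
r_squared <- 1 - ssr / sst
r_squared
```

The result should agree with `glance(mtcars_fit)$r.squared`.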

Model 2

Next, predict mpg with an additive model including gear and disp as predictors.

mtcars_fit_2 <- linear_reg() |>
  fit(mpg ~ gear + disp, data = mtcars) 

tidy(mtcars_fit_2)

# A tibble: 3 × 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)  29.1      4.49        6.49  0.000000421 
2 gear          0.111    0.968       0.115 0.909       
3 disp         -0.0408   0.00576    -7.09  0.0000000847

The regression equation is: \[ \widehat{mpg} = 29.1 + (0.111)\times gear + (-0.0408)\times disp \]

Model 2 evaluation

augment(mtcars_fit_2, new_data = mtcars) |>
  summarize(SSR = sum(.resid^2))

# A tibble: 1 × 1
    SSR
  <dbl>
1  317.

glance(mtcars_fit_2)$r.squared

[1] 0.7184715

glance(mtcars_fit_2)$adj.r.squared

[1] 0.6990557

Model comparison with adjusted R-squared

Model 2 adds a predictor to Model 1, so its R-squared is guaranteed to be at least as large as Model 1's, whether or not the extra predictor is useful. Is it actually a better model?

Let’s compare adjusted R-squared, which includes a penalty for including more predictors.

Model 1: 0.2050292
Model 2: 0.6990557
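The penalty can be made concrete with the standard adjusted R-squared formula, \(1 - (1 - R^2)\frac{n - 1}{n - p - 1}\), where \(n\) is the number of observations and \(p\) the number of predictors. The sketch below plugs in the R-squared values reported above; `adj_r_squared()` is a hypothetical helper written for this illustration, not a tidymodels function.

```r
# Hypothetical helper (not from the slides): adjusted R-squared
# n = number of observations, p = number of predictors (intercept excluded)
adj_r_squared <- function(r_squared, n, p) {
  1 - (1 - r_squared) * (n - 1) / (n - p - 1)
}

n <- nrow(mtcars)  # number of cars in the data set

adj_1 <- adj_r_squared(0.2306734, n = n, p = 1)  # Model 1: gear
adj_2 <- adj_r_squared(0.7184715, n = n, p = 2)  # Model 2: gear + disp

c(model_1 = adj_1, model_2 = adj_2)
```

These should reproduce the `adj.r.squared` values from `glance()` up to rounding.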

Review 2

Bootstrap intervals

Why do we construct confidence intervals?
What is bootstrapping?
What does each dot on the plot represent? Note: The plot is of a bootstrap distribution of a sample mean.

Why do we construct confidence intervals?

To estimate plausible values of a parameter of interest, e.g., a slope (\(\beta_1\)), a mean (\(\mu\)), a proportion (\(p\)).

What is bootstrapping?

Bootstrapping is a statistical procedure that resamples (with replacement) from a single data set to create many simulated samples.
We then use these simulated samples to quantify the uncertainty around the sample statistic we’re interested in, e.g., a slope (\(b_1\)), a mean (\(\bar{x}\)), a proportion (\(\hat{p}\)).

What does each dot on the plot represent?

Note: The plot is of a bootstrap distribution of a sample mean.

Resample, with replacement, from the original data
Do this 20 times (since there are 20 dots on the plot)
Calculate the summary statistic of interest in each of these samples
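The three steps above can be sketched in base R. The data behind the slide's plot are not shown, so `mtcars$mpg` is used as a stand-in sample; the seed and the name `boot_means` are also assumptions for illustration.

```r
set.seed(199)  # arbitrary seed so the resamples are reproducible

x <- mtcars$mpg  # stand-in sample; not the data behind the slide's plot

# For each of 20 repetitions: resample length(x) values from x with
# replacement, then compute the statistic of interest (here, the mean)
boot_means <- replicate(
  20,
  mean(sample(x, size = length(x), replace = TRUE))
)

length(boot_means)  # one bootstrap mean per dot on the plot
```

In practice far more than 20 resamples are used; 20 simply matches the number of dots.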

Bootstrapping for categorical data

specify(response = x, success = "success level")
calculate(stat = "prop")
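A fuller sketch of how these two calls might sit in an infer pipeline. The variable `transmission`, recoded from `mtcars$am`, is a stand-in categorical response chosen for illustration, and the seed, number of reps, and confidence level are assumptions.

```r
library(tidymodels)  # attaches infer and dplyr

set.seed(199)  # arbitrary seed for reproducibility

# Stand-in categorical variable: transmission type recoded from mtcars$am
cars <- mtcars |>
  mutate(transmission = factor(if_else(am == 1, "manual", "automatic")))

# Bootstrap distribution of the sample proportion of manual cars
boot_props <- cars |>
  specify(response = transmission, success = "manual") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "prop")

# 95% percentile interval for the population proportion
ci <- boot_props |>
  get_confidence_interval(level = 0.95, type = "percentile")
ci
```

The `success` argument names the level whose proportion is being estimated.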

Bootstrapping for other `stat`s

calculate() documentation: infer.tidymodels.org/reference/calculate.html
infer pipelines: infer.tidymodels.org/articles/observed_stat_examples.html

Hypothesis testing

A hypothesis test is a statistical technique used to evaluate competing claims using data.

Null hypothesis, \(H_0\): An assumption about the population. “There is nothing going on.”
Alternative hypothesis, \(H_A\): A research question about the population. “There is something going on.”

Note: Hypotheses are always at the population level!

Writing hypotheses

As a researcher, you are interested in the average number of cups of coffee Duke students drink in a day. An article in The Chronicle suggests that Duke students drink, on average, 1.2 cups of coffee. You are interested in evaluating if The Chronicle’s claim is too high. What are your hypotheses?

Writing hypotheses

As a researcher, you are interested in the average number of cups of coffee Duke students drink in a day.

An article in The Chronicle suggests that Duke students drink, on average, 1.2 cups of coffee. \(\rightarrow H_0: \mu = 1.2\)
You are interested in evaluating if The Chronicle’s claim is too high. \(\rightarrow H_A: \mu < 1.2\)

Collecting data

Let’s suppose you manage to take a random sample of 100 Duke students, ask them how many cups of coffee they drink in a day, and calculate the sample average to be \(\bar{x} = 1\).

Hypothesis testing “mindset”

Assume you live in a world where the null hypothesis is true: \(\mu = 1.2\).
Ask yourself how likely you are to observe the sample statistic, or something even more extreme, in this world: \(P(\bar{x} < 1 | \mu = 1.2)\) = ?
- Read: Probability that the sample mean is smaller than 1 given that the population mean is 1.2.
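One way this mindset might be carried out by simulation with infer. The individual survey responses are not given, so the `coffee` data below are made-up values chosen only so that \(n = 100\) and \(\bar{x} = 1\); the seed and number of reps are also assumptions.

```r
library(tidymodels)  # attaches infer and tibble

set.seed(199)  # arbitrary seed for reproducibility

# Made-up responses (NOT real survey data): 100 values with mean exactly 1
coffee <- tibble(cups = rep(c(0, 1, 2), times = c(30, 40, 30)))

obs_mean <- mean(coffee$cups)  # observed sample statistic, x-bar

# Simulate sample means in a world where the null is true (mu = 1.2)
null_dist <- coffee |>
  specify(response = cups) |>
  hypothesize(null = "point", mu = 1.2) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "mean")

# Share of simulated means at or below the observed mean (H_A: mu < 1.2)
p_value <- null_dist |>
  get_p_value(obs_stat = obs_mean, direction = "less")
p_value
```

`direction = "less"` mirrors the alternative hypothesis \(\mu < 1.2\); the p-value obtained depends on the made-up data and should not be read as a result about Duke students.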

Application exercise

Application exercise: `ae-16-equality-randomization`

Go back to your project called ae.
If there are any uncommitted files, commit them, and push.
Then pull and work on ae-16-equality-randomization.qmd.

Recap of AE

A hypothesis test is a statistical technique used to evaluate competing claims (null and alternative hypotheses) using data.
We simulate a null distribution using our original data.
We use our sample statistic and direction of the alternative hypothesis to calculate the p-value.
We use the p-value to draw conclusions about the hypotheses.
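The recap can be sketched as a randomization (permutation) test with infer. This is not the AE's data or code: comparing mean quarter-mile time (`qsec`) across a `transmission` variable recoded from `mtcars$am` is a stand-in example, and the seed, reps, and two-sided direction are assumptions.

```r
library(tidymodels)  # attaches infer and dplyr

set.seed(199)  # arbitrary seed for reproducibility

# Stand-in data: quarter-mile time by transmission type
cars <- mtcars |>
  mutate(transmission = factor(if_else(am == 1, "manual", "automatic")))

# Sample statistic: difference in mean qsec, manual minus automatic
obs_diff <- cars |>
  specify(qsec ~ transmission) |>
  calculate(stat = "diff in means", order = c("manual", "automatic"))

# Null distribution: shuffle the transmission labels, as if the two
# groups did not differ, and recompute the difference each time
null_dist <- cars |>
  specify(qsec ~ transmission) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in means", order = c("manual", "automatic"))

# p-value: share of shuffled differences at least as extreme as observed
p_value <- null_dist |>
  get_p_value(obs_stat = obs_diff, direction = "two-sided")
p_value
```

The `direction` argument should match the alternative hypothesis being tested.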
