Quantifying uncertainty with bootstrap intervals

Lecture 20

2024-06-13

Upcoming Schedule

Exam 2

  • Review: Next Monday 6/17 in lab. Come with questions!

  • In-class: Next Thursday 6/20, for the first 1 hour 15 min

  • Take-home: Next Thursday after lab - Saturday 6/22 at 11:59 PM

Project progress

  • You should aim to make some progress on your project before lab on 6/20. At the very least: select a dataset and research question to work on; discuss meeting times before the final presentation on 6/24.

  • Milestone 3 (6/20) is a peer review session. The more you have completed, the more helpful the feedback you receive is!

  • Milestone 4 (6/24) is the final presentation and your final report is due at 11:59 PM. The final report should be at least 10 pages, not including code.

Recap

Predict interest rate

Recall the loans data from last class. We wish to predict the interest rate for a loan applicant.

rate_util_home_fit <- linear_reg() |>
  fit(interest_rate ~ credit_util + homeownership, 
      data = loans)

tidy(rate_util_home_fit)
# A tibble: 4 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)              9.93      0.140    70.8   0        
2 credit_util              5.34      0.207    25.7   2.20e-141
3 homeownershipMortgage    0.696     0.121     5.76  8.71e-  9
4 homeownershipOwn         0.128     0.155     0.827 4.08e-  1

Interpreting a 1% increase in regression output

What is the expected effect of a 1% increase in credit utilization for a renter?

\[ \begin{aligned} \widehat{interest\;rate} = {}& 9.93 +5.34 \times \big(credit \; utilization + 1/100 \big)\\ = {}& 9.93 +5.34 \times credit \; utilization + 5.34 \times 1/100\\ = {}& 9.93 +5.34 \times credit \; utilization + 0.0534 \end{aligned} \]

  • All else held constant, for each additional percentage point of credit utilization, the interest rate is predicted to be higher, on average, by 0.0534 percentage points.
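The same arithmetic can be checked numerically. The sketch below hard-codes the rounded coefficients from the tidy() output rather than using the fitted model, and the starting utilization of 0.30 is an arbitrary, hypothetical value; the function name predict_rate_renter is also made up for illustration.

```r
# Rounded coefficients copied from the tidy() output (renter is the baseline level)
intercept <- 9.93
slope_credit_util <- 5.34

# Predicted interest rate for a renter; credit_util is a proportion (0.30 = 30%)
predict_rate_renter <- function(credit_util) {
  intercept + slope_credit_util * credit_util
}

# Hypothetical starting point: 30% utilization, then one percentage point higher
baseline <- predict_rate_renter(0.30)
one_pct_higher <- predict_rate_renter(0.30 + 1 / 100)

# The difference does not depend on the starting point: it is slope * 1/100
change <- one_pct_higher - baseline
round(change, 4)
```

Because the model is linear in credit utilization, any other starting value would give the same change.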

False positive rate

We may want to minimize the number of false positives, that is, true negatives that are predicted to be positive. A way to measure this is with a false positive rate defined as follows:

\[ false\;positive\;rate = \frac{\# false\; positives}{\# all \;true \;negatives} \]

False negative rate

False negatives, true positives that are predicted to be negative, may also be undesirable. The false negative rate is defined as follows:

\[ false\;negative\;rate = \frac{\# false\; negatives}{\# all \;true \;positives} \]
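As a rough illustration of both definitions, the sketch below computes the two rates in base R from a small, made-up set of ten labels. The vectors truth and predicted are hypothetical and are not taken from any dataset used in class.

```r
# Hypothetical true classes and model predictions for ten observations
truth     <- c("pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "neg")
predicted <- c("pos", "pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg", "neg")

# False positive: truly negative but predicted positive
false_positives <- sum(truth == "neg" & predicted == "pos")

# False negative: truly positive but predicted negative
false_negatives <- sum(truth == "pos" & predicted == "neg")

# Each rate divides by the number of observations in the relevant TRUE class
false_positive_rate <- false_positives / sum(truth == "neg")  # 2 of 6 true negatives
false_negative_rate <- false_negatives / sum(truth == "pos")  # 1 of 4 true positives

c(fpr = false_positive_rate, fnr = false_negative_rate)
```

Note that the denominators are counts of true classes, not counts of predictions.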

Quantifying uncertainty

Case study: Airbnb in Asheville, NC

We have data on the price per guest (ppg) for a random sample of 50 Airbnb listings in 2020 for Asheville, NC. We are going to use these data to investigate what we would have expected to pay for an Airbnb in Asheville, NC in June 2020.

abb <- read_csv("data/asheville.csv")

glimpse(abb)
Rows: 50
Columns: 1
$ ppg <dbl> 48.00000, 40.00000, 99.00000, 13.00000, 55.00000, 75.00000, 74.000…

Terminology

  • Population parameter - what we are interested in: a statistical measure that describes an entire population.

  • Sample statistic (point estimate) - a statistical measure that describes a sample: a piece of information computed from a fraction of the population.

abb |> 
  summarize(ppg_mean = mean(ppg))
# A tibble: 1 × 1
  ppg_mean
     <dbl>
1     76.6

Catching a fish

Suppose you’re fishing in a murky lake. Are you more likely to catch a fish using a spear or a net?

  • Spear \(\rightarrow\) point estimate
  • Net \(\rightarrow\) interval estimate

Constructing confidence intervals

We quantify the variability of the sample statistic to calculate a range of plausible values for the population parameter of interest:

  • Via simulation \(\rightarrow\) using bootstrapping, a statistical procedure that resamples a single data set to create many simulated samples.

  • Via mathematical formulas \(\rightarrow\) using the Central Limit Theorem
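For reference, the formula-based route for a single mean usually takes the form below. This is the standard textbook expression, not something derived in this lecture: \(\bar{x}\) is the sample mean, \(s\) the sample standard deviation, \(n\) the sample size, and \(t^{*}_{n-1}\) the critical value for the chosen confidence level.

\[ \bar{x} \pm t^{*}_{n-1} \times \frac{s}{\sqrt{n}} \]

The bootstrap approach replaces the \(s / \sqrt{n}\) term with variability measured directly from resampled data.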

Bootstrapping, what?

  • The term bootstrapping comes from the phrase “pulling oneself up by one’s bootstraps”, which is a metaphor for accomplishing an impossible task without any outside help.

  • Impossible task: estimating a population parameter using data from only the given sample.

Note

This notion of saying something about a population parameter using only information from an observed sample is the crux of statistical inference; it is not limited to bootstrapping.

Bootstrapping, how?

  • Resample with replacement from our data n times, where n is the sample size
  • Calculate the sample statistic of interest of the new, resampled (bootstrapped) sample and record the value
  • Do this entire process many, many times to build the bootstrap distribution
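The three steps above can be sketched by hand in base R. This is only an illustration under stated assumptions: it uses the seven ppg values visible in the glimpse() output rather than the full sample of 50, the number of resamples (1000) is an arbitrary choice, and the helper one_boot_mean is a made-up name.

```r
set.seed(25)

# First seven ppg values shown by glimpse(); NOT the full sample of 50 listings
ppg <- c(48, 40, 99, 13, 55, 75, 74)
reps <- 1000  # assumed number of bootstrap resamples

# Steps 1 and 2: resample n values with replacement, then record the statistic
one_boot_mean <- function(x) {
  resampled <- sample(x, size = length(x), replace = TRUE)
  mean(resampled)
}

# Step 3: repeat many times to build the bootstrap distribution of the mean
boot_means <- replicate(reps, one_boot_mean(ppg))

# The bootstrap distribution should be centered near the observed sample mean
mean(ppg)
summary(boot_means)
```

Sampling with replacement is what makes each resample differ from the original data even though both have n observations.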

Bootstrapping Airbnb rentals

set.seed(25) 

boot_dist_abb <- abb |>
  specify(response = ppg) |>                    # variable of interest
  generate(reps = 100, type = "bootstrap") |>   # 100 resamples, drawn with replacement
  calculate(stat = "mean")                      # mean of each resample

The bootstrap distribution

glimpse(boot_dist_abb)
Rows: 100
Columns: 2
$ replicate <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ stat      <dbl> 73.11500, 78.78333, 80.19333, 83.42000, 70.15000, 73.03667, …

Visualizing the bootstrap distribution

What do you expect the center of the bootstrap distribution to be? Why? Check your guess by visualizing the distribution.

ggplot(boot_dist_abb, aes(x = stat)) + 
  geom_histogram(binwidth = 2)

Calculating the bootstrap interval

boot_dist_abb |>
  summarize(
    lower = quantile(stat, 0.025),
    upper = quantile(stat, 0.975)
  )
# A tibble: 1 × 2
  lower upper
  <dbl> <dbl>
1  64.7  89.6
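The 0.025 and 0.975 quantiles come from splitting the leftover 5% equally between the two tails. A minimal sketch of that idea for any confidence level follows; percentile_ci is a hypothetical helper, and the bootstrap means are rebuilt from the seven ppg values visible in the glimpse() output, so the numbers will not match the interval above.

```r
set.seed(25)

# Stand-in bootstrap distribution built from the seven visible ppg values only
ppg <- c(48, 40, 99, 13, 55, 75, 74)
boot_means <- replicate(1000, mean(sample(ppg, size = length(ppg), replace = TRUE)))

# Percentile interval: cut (1 - level) / 2 off each tail of the bootstrap distribution
percentile_ci <- function(stats, level = 0.95) {
  alpha <- 1 - level
  c(
    lower = unname(quantile(stats, alpha / 2)),
    upper = unname(quantile(stats, 1 - alpha / 2))
  )
}

ci_95 <- percentile_ci(boot_means, level = 0.95)
ci_90 <- percentile_ci(boot_means, level = 0.90)

ci_95
ci_90
```

A lower confidence level trims more from each tail, so the 90% interval sits inside the 95% interval.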

Interpretation

Which of the following is the correct interpretation of the bootstrap interval?

  1. There is a 95% probability the true mean price per guest for an Airbnb in Asheville is between $64.7 and $89.6.

  2. There is a 95% probability the price per guest for an Airbnb in Asheville in this sample is between $64.7 and $89.6.

  3. We are 95% confident the true mean price per guest for Airbnbs in Asheville is between $64.7 and $89.6.

  4. We are 95% confident the price per guest for an Airbnb in Asheville in this sample is between $64.7 and $89.6.

Leveraging tidymodels tools further

Calculating the observed sample statistic:

obs_stat_abb <- abb |>
  specify(response = ppg) |>
  calculate(stat = "mean")  

Calculating the interval:

ci_95_abb <- boot_dist_abb |>
  get_confidence_interval(
    point_estimate = obs_stat_abb, 
    level = 0.95
  )

Visualizing the interval:

visualize(boot_dist_abb) +
  shade_confidence_interval(ci_95_abb)

Application exercise

Application exercise: ae-15-duke-forest-bootstrap

  • Go back to your project called ae.
  • If there are any uncommitted files, commit them, and push.
  • Work on ae-15-duke-forest-bootstrap.qmd.