Logistic regression

Lecture 19

2024-06-11

Announcements

  • Please keep working through all of the prepare materials. The class is built on the assumption that you have gone through them.

Predict interest rate…

from credit utilization and homeownership

library(tidymodels)

rate_util_home_fit <- linear_reg() |>
  fit(interest_rate ~ credit_util + homeownership, data = loans)
tidy(rate_util_home_fit)
# A tibble: 4 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)              9.93      0.140    70.8   0        
2 credit_util              5.34      0.207    25.7   2.20e-141
3 homeownershipMortgage    0.696     0.121     5.76  8.71e-  9
4 homeownershipOwn         0.128     0.155     0.827 4.08e-  1

Intercept

# A tibble: 4 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)              9.93      0.140    70.8   0        
2 credit_util              5.34      0.207    25.7   2.20e-141
3 homeownershipMortgage    0.696     0.121     5.76  8.71e-  9
4 homeownershipOwn         0.128     0.155     0.827 4.08e-  1
  • Intercept: Loan applicants who rent and have 0 credit utilization are predicted to receive an interest rate of 9.93%, on average.

Slopes

# A tibble: 4 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)              9.93      0.140    70.8   0        
2 credit_util              5.34      0.207    25.7   2.20e-141
3 homeownershipMortgage    0.696     0.121     5.76  8.71e-  9
4 homeownershipOwn         0.128     0.155     0.827 4.08e-  1
  • All else held constant, for each additional percentage point of credit utilization, the interest rate is predicted to be higher, on average, by 0.0534%.

  • All else held constant, the model predicts that loan applicants who have a mortgage on their home receive a 0.696% higher interest rate than those who rent their home, on average.

  • All else held constant, the model predicts that loan applicants who own their home receive a 0.128% higher interest rate than those who rent their home, on average. (These interpretations are checked with predict() below.)
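
As a sanity check, we can ask the fitted model for these predictions directly. A minimal sketch, assuming the homeownership levels are spelled "Rent", "Mortgage", and "Own" as the coefficient names above suggest, and using made-up applicant values:

new_applicants <- tibble(
  credit_util   = c(0, 0, 0.5),
  homeownership = c("Rent", "Mortgage", "Rent")
)

predict(rate_util_home_fit, new_data = new_applicants)
# Row 1 recovers the intercept (9.93), row 2 adds the Mortgage
# coefficient (9.93 + 0.696 = 10.6), row 3 adds 0.5 * 5.34 = 2.67
# to the intercept (9.93 + 2.67 = 12.6).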

Transformations

Predict log(interest rate)

rate_log_cc_fit <- linear_reg() |>
  fit(log(interest_rate) ~ credit_checks, data = loans)

tidy(rate_log_cc_fit)
# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     2.39     0.00512     468.  0       
2 credit_checks   0.0236   0.00166      14.2 2.39e-45

Model

# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     2.39     0.00512     468.  0       
2 credit_checks   0.0236   0.00166      14.2 2.39e-45

\[ \widehat{log(interest~rate)} = 2.39 + 0.0236 \times credit~checks \]
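
To see the transformation in action, here is a worked prediction by hand; the choice of 3 credit checks is purely illustrative:

log_rate <- 2.39 + 0.0236 * 3   # predicted log(interest rate) = 2.46
exp(log_rate)                   # ~11.7, i.e. a predicted interest rate of about 11.7%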

Slope

# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     2.39     0.00512     468.  0       
2 credit_checks   0.0236   0.00166      14.2 2.39e-45

For each additional credit check, the log of interest rate is predicted to be higher, on average, by 0.0236.

  • Note: this slope is on the log scale, so it is not directly a percent change; the next slide back-transforms it into a multiplicative statement about the interest rate itself.

Interpreting the coefficient on credit check

All else held constant, what is the effect of one additional credit check?

\[ \widehat{log(interest~rate)} = 2.39 + 0.0236 \times (credit~checks + 1) \]

\[ \widehat{log(interest~rate)} = 2.39 + 0.0236 \times credit~checks + \mathbf{0.0236 \times 1} \]

\[ \widehat{interest~rate} = e^{2.39} \times e^{0.0236 \times credit~checks} \times e^{\mathbf{0.0236 \times 1}} \]

\[ e^{\mathbf{0.0236 \times 1}} = 1.024 \]

For each additional credit check, interest rate is predicted to be higher, on average, by a factor of 1.024.
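
This factor comes directly from exponentiating the slope, which is easy to verify in R:

exp(0.0236)   # 1.024: each additional credit check multiplies the predicted rate by ~1.024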

Logistic regression

What is logistic regression?

  • Similar to linear regression… but

  • A modeling tool for when the response variable is categorical

Modeling binary outcomes

  • Variables with binary outcomes follow the Bernoulli distribution:

    • \(y_i \sim Bern(p)\)

    • \(p\): Probability of success

    • \(1-p\): Probability of failure

  • We can’t model \(y\) directly, so instead we model \(p\) (a quick simulation below shows what such outcomes look like)
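
A minimal simulation of Bernoulli outcomes in R, with a made-up success probability of 0.3:

set.seed(19)                       # for reproducibility
rbinom(10, size = 1, prob = 0.3)   # ten Bernoulli(0.3) draws: a string of 0s and 1s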

Linear model

\[ p_i = \beta_0 + \beta_1 \times X_{1i} + \cdots + \epsilon \]

  • But remember that \(p\) must be between 0 and 1

  • We need a link function that transforms the linear model to have an appropriate range

This isn’t exactly what we need though…

  • Recall, the goal is to take values between \(-\infty\) and \(\infty\) and map them to probabilities.

  • We need the opposite of the link function… or its inverse

  • Taking the inverse of the logit function maps arbitrary real values to the interval (0, 1) (see the sketch after this list)
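
A minimal sketch of both functions; base R’s qlogis() and plogis() implement the same logit and inverse logit:

logit     <- function(p) log(p / (1 - p))      # maps (0, 1) to the real line
inv_logit <- function(x) exp(x) / (1 + exp(x)) # maps the real line back to (0, 1)

logit(0.75)               # 1.099, same as qlogis(0.75)
inv_logit(c(-10, 0, 10))  # ~0, 0.5, ~1: extreme inputs land near 0 and 1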

Generalized linear model

  • We model the logit (log-odds) of \(p\) :

\[ logit(p) = log \bigg( \frac{p}{1 - p} \bigg) = \beta_0 + \beta_1 \times X_{1i} + \cdots + \epsilon \]

  • Then take the inverse to obtain the predicted \(p\):

\[ p_i = \frac{e^{\beta_0 + \beta_1 \times X_{1i} + \cdots + \epsilon}}{1 + e^{\beta_0 + \beta_1 \times X_{1i} + \cdots + \epsilon}} \]
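
A worked numeric instance of this formula, with made-up coefficients \(\beta_0 = -1\), \(\beta_1 = 0.5\), and \(X_{1i} = 3\):

eta <- -1 + 0.5 * 3         # linear predictor (log-odds) = 0.5
exp(eta) / (1 + exp(eta))   # predicted p = 0.622, same as plogis(eta)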

A logistic model visualized
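
The curve itself can be drawn with a short ggplot2 sketch (the axis labels are my own):

library(ggplot2)

# Inverse-logit curve: log-odds on the x-axis, probability on the y-axis
ggplot(data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = plogis) +
  labs(x = "Linear predictor (log-odds)", y = "p")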

Takeaways

  • Generalized linear models allow us to fit models to predict non-continuous outcomes

  • Predicting binary outcomes requires modeling the log-odds of success, where \(p\) is the probability of success (a minimal fitting sketch follows)
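
A minimal sketch of fitting such a model with tidymodels; the data frame df, binary factor outcome y, and predictors x1 and x2 are hypothetical placeholders:

library(tidymodels)

# logistic_reg() is the parsnip analogue of the linear_reg() used earlier
logit_fit <- logistic_reg() |>
  fit(y ~ x1 + x2, data = df)

tidy(logit_fit)   # estimates are on the log-odds (logit) scale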