Logistic regression

Lecture 19

2024-06-11

Announcements

  • Please keep working through all of the prepare materials. The class is built on the assumption that you have gone through them.

Predict interest rate…

from credit utilization and homeownership

library(tidymodels)

rate_util_home_fit <- linear_reg() |>
  fit(interest_rate ~ credit_util + homeownership, data = loans)
tidy(rate_util_home_fit)
# A tibble: 4 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)              9.93      0.140    70.8   0        
2 credit_util              5.34      0.207    25.7   2.20e-141
3 homeownershipMortgage    0.696     0.121     5.76  8.71e-  9
4 homeownershipOwn         0.128     0.155     0.827 4.08e-  1

Intercept

# A tibble: 4 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)              9.93      0.140    70.8   0        
2 credit_util              5.34      0.207    25.7   2.20e-141
3 homeownershipMortgage    0.696     0.121     5.76  8.71e-  9
4 homeownershipOwn         0.128     0.155     0.827 4.08e-  1
  • Intercept: Loan applicants who rent and have 0 credit utilization are predicted to receive an interest rate of 9.93%, on average.

Slopes

# A tibble: 4 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)              9.93      0.140    70.8   0        
2 credit_util              5.34      0.207    25.7   2.20e-141
3 homeownershipMortgage    0.696     0.121     5.76  8.71e-  9
4 homeownershipOwn         0.128     0.155     0.827 4.08e-  1
  • All else held constant, for each additional percentage point of credit utilization, the interest rate is predicted to be higher, on average, by 0.0534%.

  • All else held constant, the model predicts that loan applicants who have a mortgage on their home receive a 0.696% higher interest rate than those who rent their home, on average.

  • All else held constant, the model predicts that loan applicants who own their home receive a 0.128% higher interest rate than those who rent their home, on average. (These interpretations are checked with predict() below.)
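
As a sanity check, we can ask the fitted model for these predictions directly. A minimal sketch, assuming the homeownership levels are spelled "Rent", "Mortgage", and "Own" as the coefficient names above suggest, and using made-up applicant values:

new_applicants <- tibble(
  credit_util   = c(0, 0, 0.5),
  homeownership = c("Rent", "Mortgage", "Rent")
)

predict(rate_util_home_fit, new_data = new_applicants)
# Row 1 recovers the intercept (9.93), row 2 adds the Mortgage
# coefficient (9.93 + 0.696 = 10.6), row 3 adds 0.5 * 5.34 = 2.67
# to the intercept (9.93 + 2.67 = 12.6).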

Transformations

Predict log(interest rate)

rate_log_cc_fit <- linear_reg() |>
  fit(log(interest_rate) ~ credit_checks, data = loans)

tidy(rate_log_cc_fit)
# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     2.39     0.00512     468.  0       
2 credit_checks   0.0236   0.00166      14.2 2.39e-45

Model

# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     2.39     0.00512     468.  0       
2 credit_checks   0.0236   0.00166      14.2 2.39e-45

\[ \widehat{log(interest~rate)} = 2.39 + 0.0236 \times credit~checks \]
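
To see the transformation in action, here is a worked prediction by hand; the choice of 3 credit checks is purely illustrative:

log_rate <- 2.39 + 0.0236 * 3   # predicted log(interest rate) = 2.46
exp(log_rate)                   # ~11.7, i.e. a predicted interest rate of about 11.7%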

Slope

# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     2.39     0.00512     468.  0       
2 credit_checks   0.0236   0.00166      14.2 2.39e-45

For each additional credit check, the log of interest rate is predicted to be higher, on average, by 0.0236.

  • Note: this slope is on the log scale, so it is not directly a percent change; the next slide back-transforms it into a multiplicative statement about the interest rate itself.

Interpreting the coefficient on credit check

All else held constant, what is the effect of one additional credit check?

\[ \widehat{log(interest~rate)} = 2.39 + 0.0236 \times (credit~checks + 1) \]

\[ \widehat{log(interest~rate)} = 2.39 + 0.0236 \times credit~checks + \mathbf{0.0236 \times 1} \]

\[ \widehat{interest~rate} = e^{2.39} \times e^{0.0236 \times credit~checks} \times e^{\mathbf{0.0236 \times 1}} \]

\[ e^{\mathbf{0.0236 \times 1}} = 1.024 \]

For each additional credit check, interest rate is predicted to be higher, on average, by a factor of 1.024.
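
This factor comes directly from exponentiating the slope, which is easy to verify in R:

exp(0.0236)   # 1.024: each additional credit check multiplies the predicted rate by ~1.024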

Logistic regression

What is logistic regression?

  • Similar to linear regression… but

  • A modeling tool for when the response variable is categorical

Modeling binary outcomes

  • Variables with binary outcomes follow the Bernoulli distribution:

    • \(y_i \sim Bern(p)\)

    • \(p\): Probability of success

    • \(1-p\): Probability of failure

  • We can’t model \(y\) directly, so instead we model \(p\) (a quick simulation below shows what such outcomes look like)
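
A minimal simulation of Bernoulli outcomes in R, with a made-up success probability of 0.3:

set.seed(19)                       # for reproducibility
rbinom(10, size = 1, prob = 0.3)   # ten Bernoulli(0.3) draws: a string of 0s and 1s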

Linear model

\[ p_i = \beta_0 + \beta_1 \times X_{1i} + \cdots + \epsilon \]

  • But remember that \(p\) must be between 0 and 1

  • We need a link function that transforms the linear model to have an appropriate range

This isn’t exactly what we need though…

  • Recall, the goal is to take values between \(-\infty\) and \(\infty\) and map them to probabilities.

  • We need the opposite of the link function… or its inverse

  • Taking the inverse of the logit function maps arbitrary real values to the interval (0, 1) (see the sketch after this list)
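
A minimal sketch of both functions; base R’s qlogis() and plogis() implement the same logit and inverse logit:

logit     <- function(p) log(p / (1 - p))      # maps (0, 1) to the real line
inv_logit <- function(x) exp(x) / (1 + exp(x)) # maps the real line back to (0, 1)

logit(0.75)               # 1.099, same as qlogis(0.75)
inv_logit(c(-10, 0, 10))  # ~0, 0.5, ~1: extreme inputs land near 0 and 1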

Generalized linear model

  • We model the logit (log-odds) of \(p\) :

\[ logit(p) = log \bigg( \frac{p}{1 - p} \bigg) = \beta_0 + \beta_1 \times X_{1i} + \cdots + \epsilon \]

  • Then take the inverse to obtain the predicted \(p\):

\[ p_i = \frac{e^{\beta_0 + \beta_1 \times X_{1i} + \cdots + \epsilon}}{1 + e^{\beta_0 + \beta_1 \times X_{1i} + \cdots + \epsilon}} \]
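
A worked numeric instance of this formula, with made-up coefficients \(\beta_0 = -1\), \(\beta_1 = 0.5\), and \(X_{1i} = 3\):

eta <- -1 + 0.5 * 3         # linear predictor (log-odds) = 0.5
exp(eta) / (1 + exp(eta))   # predicted p = 0.622, same as plogis(eta)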

A logistic model visualized
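
The curve itself can be drawn with a short ggplot2 sketch (the axis labels are my own):

library(ggplot2)

# Inverse-logit curve: log-odds on the x-axis, probability on the y-axis
ggplot(data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = plogis) +
  labs(x = "Linear predictor (log-odds)", y = "p")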

Takeaways

  • Generalized linear models allow us to fit models to predict non-continuous outcomes

  • Predicting binary outcomes requires modeling the log-odds of success, where \(p\) is the probability of success (a minimal fitting sketch follows)
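
A minimal sketch of fitting such a model with tidymodels; the data frame df, binary factor outcome y, and predictors x1 and x2 are hypothetical placeholders:

library(tidymodels)

# logistic_reg() is the parsnip analogue of the linear_reg() used earlier
logit_fit <- logistic_reg() |>
  fit(y ~ x1 + x2, data = df)

tidy(logit_fit)   # estimates are on the log-odds (logit) scale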