Lecture 19
2024-06-11
Predicting interest rate from credit utilization and homeownership
```
# A tibble: 4 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)              9.93     0.140     70.8   0
2 credit_util              5.34     0.207     25.7   2.20e-141
3 homeownershipMortgage    0.696    0.121      5.76  8.71e-  9
4 homeownershipOwn         0.128    0.155      0.827 4.08e-  1
```
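A minimal sketch of how output like this is produced, assuming a data frame named `loans` with columns `interest_rate`, `credit_util`, and `homeownership` (the names are assumptions inferred from the coefficient labels above):

```r
library(broom)  # tidy() turns model output into a tibble

# Fit the main-effects model; `loans` is an assumed data frame name
rate_fit <- lm(interest_rate ~ credit_util + homeownership, data = loans)
tidy(rate_fit)
```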
All else held constant, for each additional percentage point of credit utilization, the interest rate is predicted to be higher, on average, by 0.0534%. (Credit utilization enters the model as a proportion, so a one percentage point increase is a 0.01 increase in the predictor: \(5.34 \times 0.01 = 0.0534\).)
All else held constant, the model predicts that loan applicants who have a mortgage on their home receive a 0.696% higher interest rate than those who rent their home, on average.
All else held constant, the model predicts that loan applicants who own their home receive a 0.128% higher interest rate than those who rent their home, on average.
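One way to see these coefficients in action is to compare predictions for hypothetical applicants who differ only in homeownership (a sketch; the level names `Rent`, `Mortgage`, and `Own` are inferred from the coefficient labels, and credit utilization is held fixed at an arbitrary 0.3):

```r
# Hypothetical applicants, identical except for homeownership
new_applicants <- data.frame(
  credit_util   = 0.3,
  homeownership = c("Rent", "Mortgage", "Own")  # level names assumed
)
predict(rate_fit, newdata = new_applicants)
# The predictions differ from the renter's by 0.696 (mortgage)
# and 0.128 (own), matching the coefficients above
```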
```
# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     2.39     0.00512     468.  0
2 credit_checks   0.0236   0.00166      14.2 2.39e-45
```
\[ \widehat{\log(interest~rate)} = 2.39 + 0.0236 \times credit~checks \]
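A hedged sketch of the corresponding fit, again assuming the `loans` data frame and a `credit_checks` column; the log transformation is applied to the response inside the model formula:

```r
# Model log(interest rate) as a linear function of credit checks
log_rate_fit <- lm(log(interest_rate) ~ credit_checks, data = loans)
tidy(log_rate_fit)
```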
For each additional credit check, the log of interest rate is predicted to be higher, on average, by 0.0236.
All else held constant, what is the effect of one additional credit check?
\[ \widehat{\log(interest~rate)} = 2.39 + 0.0236 \times (credit~checks + 1) \]
\[ \widehat{\log(interest~rate)} = 2.39 + 0.0236 \times credit~checks + \mathbf{0.0236 \times 1} \]
\[ \widehat{interest~rate} = e^{2.39} e^{0.0236 \times credit~checks} e^{\mathbf{0.0236 \times 1}} \]
\[ e^{\mathbf{0.0236 \times 1}} = 1.024 \]
For each additional credit check, interest rate is predicted to be higher, on average, by a factor of 1.024.
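We can check this back-transformation directly in R:

```r
exp(0.0236)  # multiplicative effect of one additional credit check
#> [1] 1.023881
```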
Similar to linear regression… but
Modeling tool when our response is categorical
Variables with binary outcomes follow the Bernoulli distribution:
\(y_i \sim Bern(p)\)
\(p\): Probability of success
\(1-p\): Probability of failure
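For instance, we can simulate Bernoulli outcomes in R with an arbitrary success probability:

```r
# 10 Bernoulli draws with p = 0.3 (0 = failure, 1 = success)
rbinom(10, size = 1, prob = 0.3)
```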
We can’t model \(y\) directly, so instead we model \(p\)
\[ p_i = \beta_0 + \beta_1 \times X_{1,i} + \cdots + \epsilon \]
But remember that \(p\) must be between 0 and 1
We need a link function that transforms the linear model to have an appropriate range
The logit function takes values between 0 and 1 (probabilities) and maps them to values ranging from \(-\infty\) to \(\infty\):
\[ logit(p) = \log \bigg( \frac{p}{1 - p} \bigg) \]
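A quick sketch of the logit in R (base R also provides this as `qlogis()`):

```r
logit <- function(p) log(p / (1 - p))
logit(c(0.1, 0.5, 0.9))
#> [1] -2.197225  0.000000  2.197225
```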
Recall, the goal is to take values between -\(\infty\) and \(\infty\) and map them to probabilities.
We need the opposite of the link function… or the inverse
Taking the inverse of the logit function will map arbitrary real values back into the range (0, 1)
\[ logit(p_i) = \log \bigg( \frac{p_i}{1 - p_i} \bigg) = \beta_0 + \beta_1 \times X_{1,i} + \cdots + \epsilon \]
\[ p_i = \frac{e^{\beta_0 + \beta_1 \times X_{1,i} + \cdots + \epsilon}}{1 + e^{\beta_0 + \beta_1 \times X_{1,i} + \cdots + \epsilon}} \]
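And a sketch of the inverse logit in R (equivalent to base R's `plogis()`), confirming that arbitrary real values land in (0, 1):

```r
inv_logit <- function(x) exp(x) / (1 + exp(x))
inv_logit(c(-5, 0, 5))
#> [1] 0.006692851 0.500000000 0.993307149
```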
Generalized linear models allow us to fit models to predict non-continuous outcomes
Predicting binary outcomes requires modeling the log-odds of success, where \(p\) is the probability of success
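As a minimal, self-contained sketch (simulated data, not the loans data), such a model is fit in R with `glm()` and `family = "binomial"`:

```r
set.seed(1234)                        # reproducible simulation
x <- rnorm(100)
p <- exp(-0.5 + 1.2 * x) / (1 + exp(-0.5 + 1.2 * x))  # inverse logit of the linear predictor
y <- rbinom(100, size = 1, prob = p)  # Bernoulli outcomes
logit_fit <- glm(y ~ x, family = "binomial")
broom::tidy(logit_fit)                # estimates are on the log-odds scale
```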