Chapter 15

Set up

pacman::p_load(dplyr,
               tidyr,
               broom,
               janitor,
               marginaleffects,
               ggeffects,
               tinyplot,
               ggplot2)
tinytheme("ipsum",
          family = "Roboto Condensed",
          palette.qualitative = "Tableau 10",
          palette.sequential = "agSunset")
# Match tinyplot ipsum style
theme_set(
  theme_minimal(base_family = "Roboto Condensed") +
    theme(panel.grid.minor = element_blank())
)

# Match Tableau 10 palette used by tinytheme (hex codes from ggthemes)
tableau10 <- c("#5778a4", "#e49444", "#d1615d", "#85b6b2", "#6a9f58",
               "#e7ca60", "#a87c9f", "#f1a2a9", "#967662", "#b8b0ac")

options(
  ggplot2.discrete.colour = \() scale_color_manual(values = tableau10),
  ggplot2.discrete.fill   = \() scale_fill_manual(values = tableau10)
)

Measures of dependence in 2x2 tables

Before diving into logistic regression, it helps to think carefully about what we’re trying to model when we have two binary variables. The fundamental question is: are they associated? And if so, how do we measure the strength of that association?

We’ll work through an example: sociology admissions at Berkeley in 2009, broken down by citizenship status.

berk <- matrix(c(93, 4, 212, 33),
               nrow = 2,
               dimnames = list(admitted = c("no", "yes"),
                               citizenship = c("Other", "U.S.")))
addmargins(berk)
        citizenship
admitted Other U.S. Sum
     no     93  212 305
     yes     4   33  37
     Sum    97  245 342

Marginal and conditional probabilities

The marginal probability of admission is the overall rate, ignoring citizenship:

p_admit <- rowSums(berk)["yes"] / sum(berk)
p_admit
      yes 
0.1081871 

Similarly, \(P(\text{U.S.}) = 245/342 = .716\). These are “marginal” because they come from the margins of the table—they don’t condition on anything.

More interesting are conditional probabilities—the probability of admission given citizenship status:

p_admit_us    <- berk["yes", "U.S."]  / colSums(berk)["U.S."]
p_admit_other <- berk["yes", "Other"] / colSums(berk)["Other"]
c(US = p_admit_us, Other = p_admit_other)
    US.U.S. Other.Other 
 0.13469388  0.04123711 

U.S. applicants are admitted at 13.5%; non-U.S. applicants at 4.1%. You’ll often see this written as \(\pi_{\text{admitted}|\text{U.S.}} = .135\) or \(\Pr(Y = 1 \mid X = 1) = .135\).

Independence and expected values

Two variables are independent if the conditional probabilities equal the marginal:

\[ \pi_{\text{admitted}|\text{U.S.}} = \pi_{\text{admitted}|\text{Other}} = \pi_{\text{admitted}} \]

We can think about this in terms of expected cell counts: if the two variables were unrelated, how many people would we expect in each cell?

\[ E_{ij} = \frac{(\text{row total}_i)(\text{col total}_j)}{n} \]

outer(rowSums(berk), colSums(berk)) / sum(berk)
       Other      U.S.
no  86.50585 218.49415
yes 10.49415  26.50585

The observed counts depart from these expected values, which tells us there’s some association. But how much? Four measures below.

Four measures of dependence

1. Difference in probabilities

\[ \pi_{\text{admitted}|\text{U.S.}} - \pi_{\text{admitted}|\text{Other}} = .135 - .041 = .094 \]

p_admit_us - p_admit_other
      U.S. 
0.09345676 

Simple and interpretable—a 9.4 percentage-point gap. The problem: the same difference means different things at different baselines. Going from .001 to .020 is a much bigger deal than going from .500 to .519, even though both are “.019 differences.”

2. Relative risk ratio (RRR)

\[ \text{RRR} = \frac{\pi_{\text{admitted}|\text{U.S.}}}{\pi_{\text{admitted}|\text{Other}}} = \frac{.135}{.041} = 3.3 \]

p_admit_us / p_admit_other
    U.S. 
3.266327 

More informative than a raw difference when probabilities are far from .5. But the RRR for rejection—keeping non-U.S. applicants in the numerator for consistency—is only:

(1 - p_admit_other) / (1 - p_admit_us)
   Other 
1.108004 

The same association (\(3.3\times\) vs \(1.1\times\)) looks very different depending on whether you frame it as success or failure. This asymmetry is a real problem.

3. Odds ratio

The odds of an event is \(\pi / (1 - \pi)\): probability of success divided by probability of failure. If you’ve ever heard “the odds are 5 to 1 against,” that’s what it means.

odds_us    <- p_admit_us    / (1 - p_admit_us)
odds_other <- p_admit_other / (1 - p_admit_other)
OR <- odds_us / odds_other
OR
    U.S. 
3.619104 

There’s a convenient shortcut for 2×2 tables using cell counts directly—the product of the two diagonals divided by the product of the other two:

(berk["no", "Other"] * berk["yes", "U.S."]) /
  (berk["no", "U.S."]  * berk["yes", "Other"])
[1] 3.619104

The odds ratio solves the success-failure asymmetry: the OR for rejection is also 3.6.1 But the OR is still asymmetric in one way: it ranges from 0 to \(\infty\), and the same relationship looks like 3.6 or 0.28 depending on which group you put in the numerator.

1 Odds of rejection for non-U.S.: \((.959/.041) = 23.4\). For U.S.: \((.865/.135) = 6.4\). Ratio: \(23.4/6.4 = 3.6\).

4. Log odds ratio

Taking the natural log solves the remaining problems:

log(OR)
    U.S. 
1.286226 

The log odds ratio:

  • Is centered at 0 when there’s no association (instead of 1 for the OR)
  • Is symmetric: flipping which group is the reference just changes the sign
  • Ranges over all of \((-\infty, +\infty)\)

This is why logistic regression is logistic regression—it models the log odds of the outcome as a linear function of predictors. Every coefficient you estimate is, at heart, a log odds ratio. The key transformation is the logit:

\[ \text{logit}(\pi) = \log\left(\frac{\pi}{1 - \pi}\right) \]

Show code
tibble(p = seq(.01, .99, .001)) |>
  mutate(log_odds = log(p / (1 - p))) |>
  ggplot(aes(p, log_odds)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray60") +
  geom_vline(xintercept = 0.5, linetype = "dashed", color = "gray60") +
  labs(x = "Probability", y = "Log odds",
       title = "The logit transformation")

The curve is symmetric around \(p = .5\), where \(\text{logit}(.5) = 0\), and it maps the bounded interval \((0, 1)\) to the full real line.

Summary of the four measures

Measure Formula Value Independence
Difference \(\pi_1 - \pi_2\) \(.094\) \(0\)
RRR \(\pi_1 / \pi_2\) \(3.3\) \(1\)
Odds ratio \((\pi_1/\bar\pi_1) / (\pi_2/\bar\pi_2)\) \(3.6\) \(1\)
Log odds ratio \(\log(\text{OR})\) \(1.28\) \(0\)

Practice

Here’s a table from the GSS on whether spanking is acceptable, by sex:

spank <- matrix(c(461, 226, 489, 132),
                nrow = 2,
                dimnames = list(spank_ok = c("yes", "no"),
                                sex = c("Male", "Female")))
addmargins(spank)
        sex
spank_ok Male Female  Sum
     yes  461    489  950
     no   226    132  358
     Sum  687    621 1308

Calculate all four measures of dependence. Which sex is more likely to say spanking is OK, and by how much?

Logistic regression

Let’s show how to do this using a model Let’s get the data, including a couple of binary variables:

  • `abany’: 1 = agrees that a woman should be able to have an abortion for any reason, 0 otherwise

  • notv: a variable built from tvhours that = 1 if the person watches no TV, 0 otherwise

d <- readRDS("data/gss2024.rds") |>
  haven::zap_labels() |> 
  select(sex, educ, age, abany, tvhours) |> 
  mutate(college = factor(if_else(educ >= 16, "Yes", "No")),
         abany = case_match(abany, 2 ~ 0, 1 ~ 1),
         notv = if_else(tvhours == 0, 1, 0),
         female = case_match(sex, 2 ~ 0, 1 ~ 1)) |> 
  drop_na()

Let’s look at the crosstab of college and abany.

d |> tabyl(college, abany)
 college   0   1
      No 283 326
     Yes 141 264

We can convert this into conditional probabilities.

d |> summarize(m_abany = mean(abany), .by = college)
# A tibble: 2 × 2
  college m_abany
  <fct>     <dbl>
1 No        0.535
2 Yes       0.652

We can model abany as a function of college using logistic regression.

m1 <- glm(abany ~ college,
          data = d,
          family = binomial(link = "logit"))
summary(m1)

Call:
glm(formula = abany ~ college, family = binomial(link = "logit"), 
    data = d)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.14145    0.08125   1.741 0.081684 .  
collegeYes   0.48574    0.13222   3.674 0.000239 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1378.4  on 1013  degrees of freedom
Residual deviance: 1364.7  on 1012  degrees of freedom
AIC: 1368.7

Number of Fisher Scoring iterations: 4

We can turn the college coefficient into an odds ratio by exponentiating it.

exp(coef(m1)[2])
collegeYes 
  1.625375 
d |> 
  summarize(m_abany = mean(abany), .by = college)
# A tibble: 2 × 2
  college m_abany
  <fct>     <dbl>
1 No        0.535
2 Yes       0.652
# get difference (default)
avg_comparisons(m1, variables = "college")

 Estimate Std. Error    z Pr(>|z|)    S  2.5 % 97.5 %
    0.117     0.0311 3.74   <0.001 12.4 0.0555  0.178

Term: college
Type: response
Comparison: Yes - No
# get risk ratio, not odds ratio!
avg_comparisons(m1, variables = "college", comparison = "ratio")

 Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
     1.22     0.0638 19.1   <0.001 267.4  1.09   1.34

Term: college
Type: response
Comparison: mean(Yes) / mean(No)
m2 <- glm(abany ~ college + age, # is this reasonable?
          data = d,
          family = binomial())
summary(m2)

Call:
glm(formula = abany ~ college + age, family = binomial(), data = d)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.616631   0.197385   3.124 0.001784 ** 
collegeYes   0.492372   0.132693   3.711 0.000207 ***
age         -0.009550   0.003604  -2.650 0.008060 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1378.4  on 1013  degrees of freedom
Residual deviance: 1357.6  on 1011  degrees of freedom
AIC: 1363.6

Number of Fisher Scoring iterations: 4
ggpredict(m2, terms = c("age [all]", "college")) |> plot()

avg_slopes(m2, variables = "college")

 Estimate Std. Error    z Pr(>|z|)    S  2.5 % 97.5 %
    0.117      0.031 3.78   <0.001 12.7 0.0565  0.178

Term: college
Type: response
Comparison: Yes - No
m3 <- glm(notv ~ college + age, # is this reasonable?
          data = d,
          family = binomial())
summary(m3)

Call:
glm(formula = notv ~ college + age, family = binomial(), data = d)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.576844   0.323784  -4.870 1.12e-06 ***
collegeYes   0.299481   0.222010   1.349  0.17735    
age         -0.018278   0.006461  -2.829  0.00467 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 612.34  on 1013  degrees of freedom
Residual deviance: 602.46  on 1011  degrees of freedom
AIC: 608.46

Number of Fisher Scoring iterations: 5
ggpredict(m3, terms = c("age [all]", "college"), back_transform = FALSE) |> plot()

slopes(m3)

    Term Contrast Estimate Std. Error     z Pr(>|z|)    S    2.5 %    97.5 %
 age     dY/dX    -0.00203   0.000993 -2.05   0.0408  4.6 -0.00398 -8.51e-05
 age     dY/dX    -0.00187   0.000869 -2.15   0.0313  5.0 -0.00357 -1.68e-04
 age     dY/dX    -0.00105   0.000296 -3.56   <0.001 11.4 -0.00163 -4.73e-04
 age     dY/dX    -0.00216   0.000951 -2.27   0.0234  5.4 -0.00402 -2.92e-04
 age     dY/dX    -0.00148   0.000580 -2.55   0.0107  6.6 -0.00262 -3.44e-04
--- 2018 rows omitted. See ?print.marginaleffects ---
 college Yes - No  0.02866   0.021761  1.32   0.1878  2.4 -0.01399  7.13e-02
 college Yes - No  0.02331   0.017650  1.32   0.1865  2.4 -0.01128  5.79e-02
 college Yes - No  0.01733   0.013283  1.30   0.1919  2.4 -0.00870  4.34e-02
 college Yes - No  0.01818   0.013888  1.31   0.1904  2.4 -0.00904  4.54e-02
 college Yes - No  0.01652   0.012707  1.30   0.1936  2.4 -0.00839  4.14e-02
Type: response