Chapter 9 (alt)

Goals

The goal of this chapter is show you how to use categorical predictors in a sensible way. We will also compare and contrast the book’s way of approaching coding to what is typical in sociology1.

1 The book makes a strong pitch for the value of orthogonal contrast coding versus dummy coding. This is the opposite of our convention in sociology and I admit I simply can’t understand the idea that contrast coding is easier. It must be cultural differences! I did try for a while but the availability of post-hoc test calculation functions like car::linearHypothesis() and marginaleffects::hypotheses() seem to make contrast coding mostly irrelevant to my use cases.

Set up

As usual, we’ll set packages and our theme.

pacman::p_load(dplyr,
               tidyr,
               broom,
               stringr,
               modelsummary,
               marginaleffects,
               tinyplot,
               ggplot2)
tinytheme("ipsum",
          family = "Roboto Condensed",
          palette.qualitative = "Tableau 10",
          palette.sequential = "agSunset")
# Match tinyplot ipsum style
theme_set(
  theme_minimal(base_family = "Roboto Condensed") +
    theme(panel.grid.minor = element_blank())
)

# Match Tableau 10 palette used by tinytheme (hex codes from ggthemes)
tableau10 <- c("#5778a4", "#e49444", "#d1615d", "#85b6b2", "#6a9f58",
               "#e7ca60", "#a87c9f", "#f1a2a9", "#967662", "#b8b0ac")

options(
  ggplot2.discrete.colour = \() scale_color_manual(values = tableau10),
  ggplot2.discrete.fill   = \() scale_fill_manual(values = tableau10)
)

We will use the GSS data. Our outcome will be tvhours and we will use various tranformations of education as the predictors.

gss2024 <- readRDS("data/gss2024.rds")

d <- gss2024 |> 
  select(tvhours, degree, educ) |> 
  drop_na() |>
  haven::zap_labels() |> 
  mutate(college_fac = if_else(degree > 3, "BA+", "No BA"),
         college = if_else(degree > 3, 1, 0),
         degree_fac = factor(
           case_match(degree,
                      0 ~ "None",
                      1 ~ "HS",
                      2 ~ "AA",
                      3 ~ "BA",
                      4 ~ "GradProf"),
           levels = c("None", "HS", "AA", "BA", "GradProf")))

Two-groups as t-test and model

Here is the comparison of the hours of TV watched per day between college-educated and non-college educated respondents.

Show plot code
plt(tvhours ~ college_fac,
    data = d,
    type = "barplot",
    xlab = "")

We can write this as a regression model where the outcome is a function of a dummy variable, college, where 1 means “B” and 0 means “no BA.”

m1 <- lm(tvhours ~ college,
         data = d)
Show code
msummary(list("Two groups" = m1),
         fmt = 3,
         estimate = "{estimate} ({std.error})",
         statistic = NULL,
         gof_map = c("nobs", "F", "r.squared"))
Two groups
(Intercept) 3.474 (0.076)
college -1.216 (0.199)
Num.Obs. 2140
F 37.426
R2 0.017

Many people would estimate this two-group comparison as a t-test. Here’s how you would do that in R.

t.test(tvhours ~ college,
       data = d,
       var.equal = TRUE) # assumes one sigma for both groups

    Two Sample t-test

data:  tvhours by college
t = 6.1177, df = 2138, p-value = 1.126e-09
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 0.8260789 1.6055561
sample estimates:
mean in group 0 mean in group 1 
       3.474493        2.258675 

You can see everything that is important is the same here between the two approaches.

One wrinkle here is that I had to set var.equal = TRUE here in order to make the lm() and the t.test() identical. That’s because the regression model only allows one \(\sigma\) (the SD of the residuals).

The default value of var.equal is FALSE, which allows the the two groups to have different within-group standard deviations. This is often called a Welch’s t-test.

t.test(tvhours ~ college,
       data = d,
       var.equal = FALSE) # allows two different sigmas

    Welch Two Sample t-test

data:  tvhours by college
t = 8.7244, df = 681.58, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 0.9421956 1.4894394
sample estimates:
mean in group 0 mean in group 1 
       3.474493        2.258675 

You can see why this might be better given that the two groups do seem to have different variances.

Show plot code
plt(~ tvhours | college_fac,
    data = d,
    type = "density",
    lw = 2,
    legend = "topright")

We don’t often do this in regression form but we absolutely can. We can modify the standard linear regression as follows:

\[\begin{align} Y_i &= \beta_0 + \beta_1 X_i + \varepsilon_i \\ X_i &\in \{0, 1\} \\ \varepsilon_i &\sim \mathcal{N}(0, \sigma^2_{X_i}) \end{align}\]

…and estimate it like this using generalized least squares.

m_welch <- nlme::gls(tvhours ~ college,
               data = d,
               weights = nlme::varIdent(form = ~ 1 | college),
               method = "ML")
Show code
summary(m_welch)
Generalized least squares fit by maximum likelihood
  Model: tvhours ~ college 
  Data: d 
       AIC      BIC    logLik
  11026.56 11049.24 -5509.282

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | college 
 Parameter estimates:
        0         1 
1.0000000 0.5888609 

Coefficients:
                Value  Std.Error  t-value p-value
(Intercept)  3.474493 0.08048258 43.17074       0
college     -1.215818 0.13926341 -8.73034       0

 Correlation: 
        (Intr)
college -0.578

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-1.1167316 -0.6223127 -0.1381457  0.1529983  8.7716463 

Residual standard error: 3.434727 
Degrees of freedom: 2140 total; 2138 residual

This looks pretty much like lm() output except for the Variance function part at the top. This tells you that the residual SD of the college group is 0.5888609 times the size of the non-college group (3.434727).

Putting this choice in a modeling context means that we can use all the tricks we normally use to decide if we need the more complicated model.

AIC(m1, m_welch)
        df      AIC
m1       3 11142.51
m_welch  4 11026.56
BIC(m1, m_welch)
        df      BIC
m1       3 11159.52
m_welch  4 11049.24

You can see here that both information criteria support the more complicated model with separate variances2.

2 We could also use a likelihood-ratio test here using anova(). But to do so, we’d need to reestimate the linear regression model using nlme::gls() where you just delete the weights line.

Multiple categories

We can extend the same idea to multiple groups. This is generally known as one-way ANOVA (“analysis of variance”) but we’re going to do this in the context of a regression model.

First let’s plot the group mean differences.

Show code
plt(tvhours ~ degree_fac,
    data = d,
    type = "bar")

One thing to watch out for is using a numeric variable. I have two versions of the degree variable: degree, which is numeric and degree_fac which is a factor variable. It’s easy to see if we make a frequency table.

Numeric version

d |> group_by(degree) |> 
  summarize(n = n())
# A tibble: 5 × 2
  degree     n
   <dbl> <int>
1      0   176
2      1   988
3      2   186
4      3   473
5      4   317

Factor version

d |> group_by(degree_fac) |> 
  summarize(n = n())
# A tibble: 5 × 2
  degree_fac     n
  <fct>      <int>
1 None         176
2 HS           988
3 AA           186
4 BA           473
5 GradProf     317

We want to estimate a separate mean for each group so we want to use the factor version.

If we accidentally use the numeric version, we get this:

m2_numeric <- lm(tvhours ~ degree,
               data = d)
tidy(m2_numeric)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    4.21     0.126      33.4  2.97e-197
2 degree        -0.485    0.0555     -8.74 4.43e- 18

This tells us that each additional degree predicts a -0.4849135 difference in tvhours. This treats degree as a continuous predictor, which is not how we implement a one-way ANOVA3.

3 As we’ll see below, this might actually be an reasonable model. But it’s not an ANOVA!

Here’s how we can estimate a linear model that is equivalent to a one-way ANOVA:

m2 <- lm(tvhours ~ degree_fac,
         data = d)

Dummy coding

Show code
msummary(list("ANOVA" = m2),
         fmt = 3,
         estimate = "{estimate} ({std.error})",
         statistic = NULL,
         gof_map = c("nobs", "F", "r.squared"))
ANOVA
(Intercept) 4.295 (0.244)
degree_facHS -0.548 (0.265)
degree_facAA -1.532 (0.340)
degree_facBA -1.416 (0.286)
degree_facGradProf -2.037 (0.304)
Num.Obs. 2140
F 20.362
R2 0.037

The intercept here is the expected value for those with no degree (the “reference category”). And each of the dummy variables indicate how different that category is from the reference.

Contrast coding

The book pushes orthogonal contrast coding rather than dummy coding. As I noted above, that’s not really the standard in sociology but I thought I would at least show you a few things.

Helmert coding

Helmert coding is for ordinal variables and compares each level of the factor to the preceding level. It’s pretty easy to implement in R:

contrasts(d$degree_fac) <- contr.helmert(5)
m2_helmert <- lm(tvhours ~ degree_fac, data = d)
tidy(m2_helmert)
# A tibble: 5 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    3.19     0.0852     37.4  3.93e-236
2 degree_fac1   -0.274    0.132      -2.07 3.84e-  2
3 degree_fac2   -0.419    0.0906     -4.63 3.88e-  6
4 degree_fac3   -0.181    0.0475     -3.80 1.50e-  4
5 degree_fac4   -0.233    0.0411     -5.65 1.79e-  8

“Theoretical” coding

If particular theoretical comparisons are of interest, these can be encoded through orthogonal contrasts. The book explains the rules here.

my_contrasts <- matrix(
  c( 4, -1, -1, -1, -1,   # LT HS vs. all others
     0,  3, -1, -1, -1,   # HS vs. AA/BA/Grad
     0,  0,  2, -1, -1,   # AA vs. BA/Grad
     0,  0,  0,  1, -1),  # BA vs. Grad
  nrow = 5
)
contrasts(d$degree_fac) <- my_contrasts
m2_ortho <- lm(tvhours ~ degree_fac, data = d)
tidy(m2_ortho)
# A tibble: 5 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   3.19      0.0852    37.4   3.93e-236
2 degree_fac1   0.277     0.0518     5.34  1.03e-  7
3 degree_fac2   0.278     0.0379     7.34  2.95e- 13
4 degree_fac3   0.0648    0.0882     0.734 4.63e-  1
5 degree_fac4   0.310     0.117      2.64  8.27e-  3

We’ll see below that there are easier ways to ask these kinds of questions.

Deviance, “effect”, or sum-to-zero coding

Another way to code these is to code their departures from the unweighted grand mean, which is what the mean would be if all groups were of equal size. We can get these deviations most easily using emmeans::emmeans():

emmeans::emmeans(m2, ~ degree_fac) |> 
  emmeans::contrast("eff",      # "effects" contrasts = deviation from grand mean
           adjust = "none")     # no p-value adjustment
 contrast        estimate    SE   df t.ratio p.value
 None effect        1.107 0.207 2135   5.340  <.0001
 HS effect          0.558 0.117 2135   4.783  <.0001
 AA effect         -0.425 0.203 2135  -2.100  0.0358
 BA effect         -0.309 0.143 2135  -2.158  0.0310
 GradProf effect   -0.930 0.165 2135  -5.653  <.0001

You can see these estimates sum to zero.

Plotting

In a case like this, regardless of how we code the predictors, plotting the estimated marginal means is probably the easiest thing to do visually.

Data prep
my_means <- emmeans::emmeans(m2, ~ degree_fac)
wgm <- my_means |> as_tibble() |> pull(emmean) |> mean()

preds <- my_means |> 
  as_tibble() |>
  mutate(ub = emmean + SE,
         lb = emmean - SE)
Show plot code
plt(emmean ~ degree_fac,
    data = preds,
    type = "pointrange",
    ymax = ub,
    ymin = lb,
    lw = 2,
    xlab = "Level of education",
    ylab = "Hours per day",
    main = "TV watching by education",
    sub = "2024 General Social Survey (estimates +/- 1 SE)")
plt_add(type = type_hline(h = wgm),
        lw = 2,
        lty = 3)

This gives you a sense of how different the groups are from each other in a general sense.

Post-hoc tests

The main reason the authors seem to like orthogonal contrast coding is that it allows you to make the comparisons you want. For example, in the dummy coding results above, you can see that both BA and GradProf are significantly different from the reference category but not whether they are different from each other.

Rather than requiring us to set up such contrasts in advance, we can simply get them after the model is estimated using marginaleffects:

marginaleffects::hypotheses(m2, 
                            "degree_facGradProf = degree_facBA")

                      Hypothesis Estimate Std. Error     z Pr(>|z|)   S 2.5 %
 degree_facGradProf=degree_facBA   -0.621      0.235 -2.64  0.00821 6.9 -1.08
 97.5 %
 -0.161

This gives you the difference and standard error of the difference between these groups.

Model comparison

In the spirit of what we’ve done so far, let’s think of how to compare these models. Consider three options for modeling educational differences on TV watching:

  • college vs. non-college
  • degree status as a continuous variable (the “accident” from above)
  • degree status as a nominal variable
  • years of education as a continuous variable
  • years of education as a nominal variable

First, we could visualize the raw data.

Show plot code
plt(tvhours ~ educ,
    data = d,
    type = "jitter",
    alpha = .15)

This isn’t very useful since we have so many points.

Let’s visualize the predictions these different models make to simplify this relationship.

Data prep
# do the missing models
m_educ <- lm(tvhours ~ educ,
             data = d)

m_educ_nom <- lm(tvhours ~ factor(educ),
                 data = d)

# get prediction grid
preds2 <- tibble(
  educ = 0:20) |> 
  mutate(college = if_else(educ > 15, 1, 0),
         degree = case_match(educ,
                             0:11 ~ 0,
                             12:13 ~ 1,
                             14:15 ~ 2,
                             16 ~ 3,
                             17:20 ~ 4),
         degree_fac = case_match(degree,
                                 0 ~ "None",
                                 1 ~ "HS",
                                 2 ~ "AA",
                                 3 ~ "BA",
                                 4 ~ "GradProf"))

# add predictions
preds2$preds_twogroup <- predict(m1, newdata = preds2)
preds2$preds_deg_ANOVA <- predict(m2, newdata = preds2)
preds2$preds_deg_numeric <- predict(m2_numeric, newdata = preds2)
preds2$preds_ed_numeric <- predict(m_educ, newdata = preds2)
preds2$preds_ed_ANOVA <- predict(m_educ_nom, newdata = preds2)

# reshape
preds2_long <- pivot_longer(preds2,
                            starts_with("preds"),
                            names_prefix = "preds_",
                            names_to = "model",
                            values_to = "prediction")
Show plot code
plt(prediction ~ educ,
    data = preds2_long,
    facet = model,
    type = "s",
    lw = 2,
    yaxb = 0:8,
    xlab = "Years of education",
    main = "Different models of TV by education")

These models differ in their level of complexity, their fit to the sample and their risk of overfitting the population data.

Let’s compare all these models using our old friends, AIC and BIC. We can organize all this information nicely using the {performance} package.

performance::compare_performance(
  list("t-test" = m1, 
        "cont. (degree)" = m2_numeric,
       "ANOVA (degree)" = m2, 
       "cont. (educ)" = m_educ, 
       "ANOVA (educ)" = m_educ_nom),
  metrics = list("AIC", "BIC", "R2")) |> 
  select(-Model) |> 
  tinytable::tt() |> 
  tinytable::format_tt(digits = 3,
                       num_fmt = "decimal",
                       num_zero = TRUE)
Name AIC AIC_wt BIC BIC_wt R2
t-test 11142.515 0.000 11159.520 0.000 0.017
cont. (degree) 11104.443 0.601 11121.449 0.922 0.035
ANOVA (degree) 11105.530 0.349 11139.541 0.000 0.037
cont. (educ) 11109.396 0.050 11126.402 0.078 0.032
ANOVA (educ) 11122.625 0.000 11247.333 0.000 0.043

You can see that the most flexible model—the ANOVA that treats every distinct year of education as its own category—has the highest PRE but isn’t preferred by AIC or BIC. It’s a classic case of overfitting.

The model that “wins” here is the model that one might fit by accident—the one that treats degree as continuous. This assumes that only degree transitions matter and that every transition size is the same. This is clearly a simplification but it’s one that that fits surprisingly well!