Chapter 10 (alt)

Goals

We’re going to look at models with multiple categorical predictors, including their interactions. I’m going to take a very different approach from the book, using dummy coding and model comparison rather than contrast coding.

Set up

pacman::p_load(dplyr,
               tidyr,
               broom,
               emmeans,
               modelsummary,
               marginaleffects,
               ggeffects,
               tinyplot,
               ggplot2)
tinytheme("ipsum",
          family = "Roboto Condensed",
          palette.qualitative = "Tableau 10",
          palette.sequential = "agSunset")
# Match tinyplot ipsum style
theme_set(
  theme_minimal(base_family = "Roboto Condensed") +
    theme(panel.grid.minor = element_blank())
)

# Match Tableau 10 palette used by tinytheme (hex codes from ggthemes)
tableau10 <- c("#5778a4", "#e49444", "#d1615d", "#85b6b2", "#6a9f58",
               "#e7ca60", "#a87c9f", "#f1a2a9", "#967662", "#b8b0ac")

options(
  ggplot2.discrete.colour = \() scale_color_manual(values = tableau10),
  ggplot2.discrete.fill   = \() scale_fill_manual(values = tableau10)
)

We are going to use the full GSS data for this example. The code below is how I originally got it and saved it to my machine…

data(gss_all, package = "gssr")
saveRDS(gss_all, file = "data/gss.rds")

… and this is me retrieving it for this session and cleaning things up a bit.

d <- readRDS("data/gss.rds") |> 
  haven::zap_labels() |> 
  select(tvhours, degree, year, sex) |> 
  drop_na() |> 
  mutate(female = if_else(sex == 2, 1, 0),    # female
         degree_fac = factor(degree),         # degree as factor
         year_fac = factor(year))             # svy year as factor

Multiple categorical predictors

We’re going to consider a few categorical predictors of tvhours:

  • female: whether the respondent is female (0, 1)
  • degree_fac: highest degree earned from none (0) to graduate degree (4); stored as a factor
  • year_fac: a factor encoding the survey year1 from 1975 to 2024 (with gaps)

1 year is clearly a “numeric” variable in the obvious sense. We have year_fac stored as a factor so that we make no assumptions about its functional (e.g., straight line) relationship to the outcome when we use it in a model.
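To see what treating year as a factor buys us, here’s a quick sketch (not part of the analysis above, and assuming the `d` data frame from the cleaning step): the numeric version forces a single straight-line trend, while the factor version estimates a separate mean shift for every survey year.

```r
# Numeric year: one slope, a straight-line trend across all surveys
m_lin <- lm(tvhours ~ year, data = d)

# Factor year: a separate shift for each survey year, no functional form assumed
m_fac <- lm(tvhours ~ year_fac, data = d)

# The factor model nests the linear one (the dummies can reproduce any
# straight line), so an F-test asks whether the extra flexibility helps
anova(m_lin, m_fac)
```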

Let’s consider a simple model that uses all of these additively.

m1 <- lm(tvhours ~ degree_fac + female + year_fac,
         data = d)

The output here is super unwieldy but I’ll put it here if you want to see the summary().

summary(m1)

Call:
lm(formula = tvhours ~ degree_fac + female + year_fac, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7364 -1.3895 -0.4297  0.9042 21.9266 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.54935    0.06894  51.484  < 2e-16 ***
degree_fac1  -0.72826    0.03205 -22.725  < 2e-16 ***
degree_fac2  -1.19014    0.05425 -21.937  < 2e-16 ***
degree_fac3  -1.61124    0.04060 -39.681  < 2e-16 ***
degree_fac4  -1.91856    0.04910 -39.074  < 2e-16 ***
female        0.18315    0.02345   7.810 5.83e-15 ***
year_fac1977 -0.10716    0.09099  -1.178 0.238899    
year_fac1978 -0.21179    0.09091  -2.330 0.019833 *  
year_fac1980 -0.04476    0.09208  -0.486 0.626872    
year_fac1982  0.18080    0.08694   2.079 0.037577 *  
year_fac1983  0.03034    0.09000   0.337 0.736058    
year_fac1985  0.05173    0.09099   0.569 0.569654    
year_fac1986  0.15821    0.09187   1.722 0.085063 .  
year_fac1988  0.29295    0.10274   2.851 0.004357 ** 
year_fac1989  0.14147    0.10232   1.383 0.166757    
year_fac1990  0.04205    0.10475   0.401 0.688087    
year_fac1991  0.24879    0.10182   2.444 0.014547 *  
year_fac1993  0.09034    0.09017   1.002 0.316402    
year_fac1994  0.04430    0.08605   0.515 0.606636    
year_fac1996  0.19042    0.08619   2.209 0.027160 *  
year_fac1998  0.09076    0.08311   1.092 0.274845    
year_fac2000  0.18342    0.08746   2.097 0.035989 *  
year_fac2002  0.20883    0.10536   1.982 0.047477 *  
year_fac2004  0.15770    0.10567   1.492 0.135608    
year_fac2006  0.21870    0.08590   2.546 0.010896 *  
year_fac2008  0.25951    0.09455   2.745 0.006061 ** 
year_fac2010  0.30289    0.09275   3.266 0.001092 ** 
year_fac2012  0.37884    0.09506   3.985 6.75e-05 ***
year_fac2014  0.30112    0.08934   3.371 0.000751 ***
year_fac2016  0.35250    0.08698   4.053 5.07e-05 ***
year_fac2018  0.25950    0.09086   2.856 0.004294 ** 
year_fac2021  1.00387    0.08194  12.251  < 2e-16 ***
year_fac2022  0.74754    0.08329   8.975  < 2e-16 ***
year_fac2024  0.72786    0.08482   8.582  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.492 on 45966 degrees of freedom
Multiple R-squared:  0.05467,   Adjusted R-squared:  0.05399 
F-statistic: 80.55 on 33 and 45966 DF,  p-value: < 2.2e-16

There are three sets of coefficients:

  • the degree_fac ones, which show how each level of degree_fac differs from the “none” reference category
  • the female one, which shows how female respondents differ from male respondents
  • the year_fac ones, which show how each survey year differs from the 1975 reference
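Those reference categories are just R’s default: the first factor level. As a sketch (the relabeled model isn’t used elsewhere in this chapter), you can pick a different baseline with relevel() — say, comparing everyone to graduate-degree holders instead:

```r
# Make graduate degree (level "4") the reference category
d2 <- d |>
  mutate(degree_fac = relevel(degree_fac, ref = "4"))

# Identical fit and predictions to m1; only the coefficient labels change
m1_alt <- lm(tvhours ~ degree_fac + female + year_fac, data = d2)
```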

This is easier to see in picture form. I will use plot_predictions() from the marginaleffects package. Using the newdata = "balanced" argument means that the predictions are averaged over a balanced grid of the other predictors (i.e., in the first plot: half male, equal representation from all survey years).

plot_predictions(m1, 
                 by = "degree_fac", 
                 newdata = "balanced")

And here it is for the other ones. (I messed with the angle of the x-axis labels to avoid a mess on year.)

Show code
plot_predictions(m1, 
                 by = "year_fac", 
                 newdata = "balanced") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Show code
plot_predictions(m1, 
                 by = "female", 
                 newdata = "balanced")
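If you want that last difference as a number rather than a picture, avg_comparisons() from the same package gives the average female-minus-male difference. A sketch; in an additive model like m1 this just reproduces the female coefficient (about 0.18 hours):

```r
# Average difference in predicted tvhours for female = 1 vs. 0
avg_comparisons(m1, variables = "female")
```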

Interactions

The model above says that there are educational differences and year differences and sex differences. But the model also assumes that those differences are constant. That is, m1 assumes that, for example, the educational differences don’t differ by sex or year.

We can relax that assumption in several ways and compare the results. We can allow any pair of those differences to moderate each other (3 options) or all three to affect each other.
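A quick note on the formula syntax before fitting: in R, `a * b` is shorthand for both main effects plus their interaction, so these two calls fit the identical model.

```r
# Equivalent ways to write the degree-by-sex interaction model
lm(tvhours ~ degree_fac * female + year_fac, data = d)
lm(tvhours ~ degree_fac + female + degree_fac:female + year_fac, data = d)
```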

m_deg_sex <- update(m1, tvhours ~ degree_fac * female + year_fac)
m_deg_yr <- update(m1, tvhours ~ degree_fac * year_fac + female)
m_sex_yr <- update(m1, tvhours ~ female * year_fac + degree_fac)
m_all <- update(m1, tvhours ~ degree_fac * female * year_fac)
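Since m1 is nested inside each of the pairwise-interaction models, one option (sketched here alongside the information-criterion comparison) is a classical F-test for whether each set of interaction terms improves the fit:

```r
# F-tests for the added interaction terms, one model at a time
anova(m1, m_deg_sex)
anova(m1, m_deg_yr)
anova(m1, m_sex_yr)
```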

Now we can compare their performance.

performance::compare_performance(list(m1, 
                                      m_deg_sex, 
                                      m_deg_yr, 
                                      m_sex_yr, 
                                      m_all)) |>
  select(AIC, AIC_wt, BIC, BIC_wt, PRE = R2) |> 
  tinytable::tt() |>
  tinytable::format_tt(digits = 3,
                       num_fmt = "decimal",
                       num_zero = TRUE)
       AIC  AIC_wt        BIC  BIC_wt   PRE
214593.707   0.000 214899.481   0.997 0.055
214570.267   1.000 214910.986   0.003 0.055
214657.462   0.000 215941.712   0.000 0.058
214602.943   0.000 215153.336   0.000 0.056
214766.603   0.000 217308.894   0.000 0.062

With 46,000 cases, it’s not too surprising that the BIC prefers the simple additive model. The AIC, however, prefers the model where degree and sex moderate each other (m_deg_sex, the second row). Let’s take a look at that.
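Before plotting, we can also put numbers on the interaction. A sketch using avg_comparisons() again: asking for the female difference by degree level shows how the sex gap varies across education.

```r
# The female-minus-male difference in predicted tvhours,
# estimated separately within each degree level
avg_comparisons(m_deg_sex,
                variables = "female",
                by = "degree_fac")
```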

preds <- avg_predictions(m_deg_sex,
                         variables = c("degree_fac", "female"), 
                         newdata = "balanced")
Show code
plt(estimate ~ degree_fac | factor(female),
    data = preds,
    type = "pointrange",
    ymax = estimate + 1.96 * std.error,
    ymin = estimate - 1.96 * std.error,
    lw = 2)
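The same figure can be drawn with ggplot2, which the setup chunk themed to match tinyplot. A sketch using the std.error column from preds, mirroring the 1.96 × SE intervals above:

```r
ggplot(preds, aes(x = degree_fac, y = estimate,
                  color = factor(female))) +
  geom_pointrange(aes(ymin = estimate - 1.96 * std.error,
                      ymax = estimate + 1.96 * std.error),
                  position = position_dodge(width = 0.3)) +
  labs(color = "female")
```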