library(gssr) # you may need to install
library(gssrdoc) # ditto
library(dplyr)
library(tidyr)
library(broom)
library(stringr)
library(tinyplot)
tinytheme("ipsum",
family = "Roboto Condensed",
palette.qualitative = "Tableau 10",
palette.sequential = "agSunset")
Chapter 2
Goals
We’ll go over some R code here to lock in the content from Chapter 2 and to see how to implement it.
Set up
Load packages here.
Get data from NORC using Kieran Healy’s {gssr} package. You only have to do this one time and then save it to your local machine. I have the chunk option set to #| eval: false so this doesn’t run in the future.
gss2024 <- gss_get_yr(2024)
saveRDS(gss2024, file = here::here("data", "gss2024.rds"))
Once it's saved locally, you can load it from disk in the future.
gss2024 <- readRDS(file = here::here("data", "gss2024.rds"))
We only need a small subset of variables to demonstrate today's material, so let's get that.1
1 This section uses haven::zap_labels() to get rid of the labels that come from the Stata import that is the origin of data in the {gssr} package. These can be useful sometimes but they can also get in the way of certain functions.
d <- gss2024 |>
select(polviews, sex) |>
drop_na() |>
haven::zap_labels()
Unconditional predictions and SSE
We will rely on the sum of squared errors (SSE) to quantify ERROR in this framework (at least for now). This is also sometimes called the sum of squared residuals (SSR) or the residual sum of squares (RSS); all three names refer to the same quantity.
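Before turning to the GSS data, here is the idea with a tiny made-up vector (hypothetical numbers, not from the GSS): SSE is just the sum of the squared gaps between each observation and its prediction.

```r
# Toy illustration of SSE with made-up data
y <- c(2, 4, 4, 6)   # observed values
pred <- 4            # one fixed guess for every observation
err <- y - pred      # residuals
sum(err^2)           # SSE: (-2)^2 + 0^2 + 0^2 + 2^2 = 8
```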
We can get R to estimate “model C” (our zero parameter guess).
d <- d |>
mutate(mod_c_pred = 4,
mod_c_err = polviews - mod_c_pred)
sse_c <- sum(d$mod_c_err^2) # SSE for model C
sse_c
[1] 7408
We can now get R to estimate “model A” (our one parameter estimate).
d <- d |>
mutate(mod_a_pred = mean(polviews),
mod_a_err = polviews - mod_a_pred)
sse_a <- sum(d$mod_a_err^2)
sse_a
[1] 7372.137
We can use these quantities to estimate PRE.
(sse_c - sse_a) / sse_c
[1] 0.004841087
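Since we'll be computing this ratio repeatedly, you could wrap it in a small helper function. This is just a convenience sketch (the function name `pre` is my own, not from any package):

```r
# Proportional reduction in error: what share of the compact
# model's SSE does the augmented model eliminate?
pre <- function(sse_compact, sse_augmented) {
  (sse_compact - sse_augmented) / sse_compact
}

pre(7408, 7372.137)  # matches the calculation above
```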
We can also estimate model C with lm(). First let’s make a version of polviews “centered” on 4.
d <- d |>
mutate(polviews_dev = polviews - 4)
Now we can estimate model C by regressing polviews_dev in a model with no constant (i.e., estimating no parameters at all).
mod_c <- lm(polviews_dev ~ 0,
data = d)
deviance(mod_c) # confirm this is the same as above
[1] 7408
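For a fitted lm object, deviance() returns the residual sum of squares, so it should agree with summing the squared residuals by hand. A quick sanity check, assuming the mod_c object from above:

```r
# Two routes to the same quantity (assumes mod_c from above)
deviance(mod_c)
sum(residuals(mod_c)^2)  # should print the same number
```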
Model A is similar, except we now estimate the "intercept" by changing the formula to ~ 1, which asks R to estimate that one parameter. Because of the way we transformed the outcome variable (deviating it from 4), the intercept answers the question "how different is the estimated mean from 4?"
mod_a <- lm(polviews_dev ~ 1,
data = d)
deviance(mod_a)
[1] 7372.137
(deviance(mod_c) - deviance(mod_a)) / deviance(mod_c)
[1] 0.004841087
This makes a very small difference, reducing the SSE/SSR/RSS by about 0.48%.
If we want, we can also use the predict() function after lm() to add the model estimate to the data frame. Here the same number will be added to each observation because the model is making the same guess for every row.
d$mod_a_pred <- predict(mod_a)
d$mod_c_pred <- predict(mod_c)
glance() is also a useful function to look at models. It's too early for us to dig into this too much, but sigma here (\(\sigma\)) is the RMSE (the square root of the mean squared error).
glance(mod_a)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 1.53 NA NA NA -5806. 11616. 11629.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
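If you want to see where sigma comes from, it is the square root of the SSE divided by the residual degrees of freedom (close to, though not exactly, dividing by n). A sketch assuming the mod_a object from above:

```r
# sigma from glance() is sqrt(SSE / residual df)
# (assumes mod_a from above)
sqrt(deviance(mod_a) / df.residual(mod_a))
glance(mod_a)$sigma  # should print the same number
```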
Conditional predictions
We won’t talk too much about this until later, but we can also use the data to make different predictions for different groups. For example, is the mean of polviews different for male and female respondents?
d2 <- d |>
group_by(sex) |>
summarize(mean_pv = mean(polviews))
d2
# A tibble: 2 × 2
sex mean_pv
<dbl> <dbl>
1 1 4.20
2 2 4.03
We can also do this with lm(). Let's start with the "bad way" of making conditional predictions. This is bad because it doesn't make sense to treat sex, coded 1 and 2, as if it were a numeric variable.
model3 <- lm(polviews ~ 1 + sex,
data = d)
tidy(model3)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.38 0.0890 49.3 0
2 sex -0.178 0.0547 -3.26 0.00113
One good way is to make a new variable with a meaningful zero. That means that the intercept is now the prediction for the “zero group” or reference category.
d <- d |>
mutate(male = if_else(sex == 1, 1, 0))
model3 <- lm(polviews ~ 1 + male,
data = d)
tidy(model3)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.03 0.0368 109. 0
2 male 0.178 0.0547 3.26 0.00113
Another way is using the factor() wrapper to force R to make a dummy variable. Here it chose 1 (male) as the reference category and 2 (female) as the "1" category, so the sign is flipped relative to the male dummy we made manually above.
model3 <- lm(polviews ~ 1 + factor(sex),
data = d)
tidy(model3)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.20 0.0405 104. 0
2 factor(sex)2 -0.178 0.0547 -3.26 0.00113
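If you want factor() but with a different reference category, relevel() lets you pick it. A sketch assuming the d data frame from above (sex_f is a name I'm introducing here for illustration):

```r
# Make 2 (female) the reference category instead of 1 (male)
# (assumes the d data frame from above)
d$sex_f <- relevel(factor(d$sex), ref = "2")
tidy(lm(polviews ~ 1 + sex_f, data = d))
```

This should reproduce the coefficients from the hand-made male dummy: an intercept for the female mean and a positive coefficient for male.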
In any case, we can compare one-parameter models that make unconditional predictions (i.e., the same prediction for everyone) to two-parameter models that make predictions conditional on the sex of the respondent.
deviance(mod_a) # just b0
[1] 7372.137
deviance(model3) # b0 and b1
[1] 7347.317
(deviance(mod_a) - deviance(model3)) / deviance(mod_a)
[1] 0.003366721
Here the conditional model reduces the error by about 0.34%. This isn't a lot, but how do we know if it's "worth it"? We'll talk about that later.