library(gssr) # you may need to install
library(gssrdoc) # ditto
library(dplyr)
library(tidyr)
library(broom)
library(stringr)
library(tinyplot)
tinytheme("ipsum",
family = "Roboto Condensed",
palette.qualitative = "Tableau 10",
palette.sequential = "agSunset")
Chapter 2
Goals
We’ll go over some R code here to lock in the content from Chapter 2 and to see how to implement it.
Set up
Load packages here.
Get data from NORC using Kieran Healy’s {gssr} package. You only have to do this one time and then save it to your local machine. I have the chunk option set to #| eval: false so this doesn’t run in the future.
gss2024 <- gss_get_yr(2024)
saveRDS(gss2024, file = here::here("data", "gss2024.rds"))
Once it's saved locally, you can load it from disk in the future.
gss2024 <- readRDS(file = here::here("data", "gss2024.rds"))
We only need a small subset of variables to demonstrate today's material, so let's get that.1
1 This section uses haven::zap_labels() to get rid of the labels that come from the Stata import that is the origin of data in the {gssr} package. These can be useful sometimes but they can also get in the way of certain functions.
d <- gss2024 |>
select(polviews, sex) |>
drop_na() |>
haven::zap_labels()
Unconditional predictions and SSE
We will rely on the sum of squared errors (SSE) to quantify ERROR in this framework (at least for now). This is also sometimes called the sum of squared residuals (SSR) or the residual sum of squares (RSS); all three names refer to the same quantity.
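Before turning to the GSS data, here is the idea with a tiny made-up vector (hypothetical numbers, not from the GSS): SSE is just the sum of the squared gaps between each observation and its prediction.

```r
# Toy illustration of SSE with made-up data
y <- c(2, 4, 4, 6)   # observed values
pred <- 4            # one fixed guess for every observation
err <- y - pred      # residuals
sum(err^2)           # SSE: (-2)^2 + 0^2 + 0^2 + 2^2 = 8
```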
We can get R to estimate “model C” (our zero parameter guess).
d <- d |>
mutate(mod_c_pred = 4,
mod_c_err = polviews - mod_c_pred)
sse_c <- sum(d$mod_c_err^2) # SSE for model C
sse_c
[1] 7408
We can now get R to estimate “model A” (our one parameter estimate).
d <- d |>
mutate(mod_a_pred = mean(polviews),
mod_a_err = polviews - mod_a_pred)
sse_a <- sum(d$mod_a_err^2)
sse_a
[1] 7372.137
We can use these quantities to estimate PRE.
(sse_c - sse_a) / sse_c
[1] 0.004841087
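Since we'll be computing this ratio repeatedly, you could wrap it in a small helper function. This is just a convenience sketch (the function name `pre` is my own, not from any package):

```r
# Proportional reduction in error: what share of the compact
# model's SSE does the augmented model eliminate?
pre <- function(sse_compact, sse_augmented) {
  (sse_compact - sse_augmented) / sse_compact
}

pre(7408, 7372.137)  # matches the calculation above
```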
We can also estimate model C with lm(). First let’s make a version of polviews “centered” on 4.
d <- d |>
mutate(polviews_dev = polviews - 4)
Now we can estimate model C by regressing polviews_dev in a model with no constant (i.e., estimating no parameters at all).
mod_c <- lm(polviews_dev ~ 0,
data = d)
deviance(mod_c) # confirm this is the same as above
[1] 7408
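For a fitted lm object, deviance() returns the residual sum of squares, so it should agree with summing the squared residuals by hand. A quick sanity check, assuming the mod_c object from above:

```r
# Two routes to the same quantity (assumes mod_c from above)
deviance(mod_c)
sum(residuals(mod_c)^2)  # should print the same number
```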
Model A is similar, except we now estimate the "intercept" by changing the formula to ~ 1, which asks R to estimate that one parameter. Because of the way we transformed the outcome variable (deviating it from 4), the intercept answers the question "how different is the estimated mean from 4?"
mod_a <- lm(polviews_dev ~ 1,
data = d)
deviance(mod_a)
[1] 7372.137
(deviance(mod_c) - deviance(mod_a)) / deviance(mod_c)
[1] 0.004841087
This makes a very small difference, reducing the SSE/SSR/RSS by about 0.48%.
If we want, we can also use the predict() function after lm() to add the model estimate to the data frame. Here the same number will be added to each observation because the model is making the same guess for every row.
d$mod_a_pred <- predict(mod_a)
d$mod_c_pred <- predict(mod_c)
glance() is also a useful function to look at models. It's too early for us to dig into this too much, but sigma here (\(\sigma\)) is the RMSE (the square root of the mean squared error).
glance(mod_a)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 1.53 NA NA NA -5806. 11616. 11629.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
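If you want to see where sigma comes from, it is the square root of the SSE divided by the residual degrees of freedom (close to, though not exactly, dividing by n). A sketch assuming the mod_a object from above:

```r
# sigma from glance() is sqrt(SSE / residual df)
# (assumes mod_a from above)
sqrt(deviance(mod_a) / df.residual(mod_a))
glance(mod_a)$sigma  # should print the same number
```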
Conditional predictions
We won’t talk too much about this until later, but we can also use the data to make different predictions for different groups. For example, is the mean of polviews different for male and female respondents?
d2 <- d |>
group_by(sex) |>
summarize(mean_pv = mean(polviews))
d2
# A tibble: 2 × 2
sex mean_pv
<dbl> <dbl>
1 1 4.20
2 2 4.03
We can also do this with lm(). Let's start with the "bad way" of making conditional predictions. This is bad because it doesn't make sense to treat sex, coded 1 and 2, as if it were a numeric variable.
model3 <- lm(polviews ~ 1 + sex,
data = d)
tidy(model3)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.38 0.0890 49.3 0
2 sex -0.178 0.0547 -3.26 0.00113
One good way is to make a new variable with a meaningful zero. That means that the intercept is now the prediction for the “zero group” or reference category.
d <- d |>
mutate(male = if_else(sex == 1, 1, 0))
model3 <- lm(polviews ~ 1 + male,
data = d)
tidy(model3)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.03 0.0368 109. 0
2 male 0.178 0.0547 3.26 0.00113
Another way is using the factor() wrapper to force R to make a dummy variable. Here it chose 1 (male) as the reference category and 2 (female) as the "1" category, so the sign is flipped relative to the male dummy we made manually above.
model3 <- lm(polviews ~ 1 + factor(sex),
data = d)
tidy(model3)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.20 0.0405 104. 0
2 factor(sex)2 -0.178 0.0547 -3.26 0.00113
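If you want factor() but with a different reference category, relevel() lets you pick it. A sketch assuming the d data frame from above (sex_f is a name I'm introducing here for illustration):

```r
# Make 2 (female) the reference category instead of 1 (male)
# (assumes the d data frame from above)
d$sex_f <- relevel(factor(d$sex), ref = "2")
tidy(lm(polviews ~ 1 + sex_f, data = d))
```

This should reproduce the coefficients from the hand-made male dummy: an intercept for the female mean and a positive coefficient for male.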
In any case, we can compare one-parameter models that make unconditional predictions (i.e., the same prediction for everyone) to two-parameter models that make predictions conditional on the sex of the respondent.
deviance(mod_a) # just b0
[1] 7372.137
deviance(model3) # b0 and b1
[1] 7347.317
(deviance(mod_a) - deviance(model3)) / deviance(mod_a)
[1] 0.003366721
Here the conditional model reduces the error by about 0.34%. This isn't a lot, but how do we know if it's "worth it"? We'll talk about that later.