Chapter 6
Overview
Goals
This is our first step into a larger world (of regression analysis). Here, instead of comparing zero-parameter models (MC) to one-parameter models (MA), we’re going to compare one-parameter models (MC) to two-parameter models (MA). The basic question is whether we want to make the same prediction for every observation or different predictions for different observations, conditional on their value of some predictor.
As before, we’ll augment the book by using simulations to understand null distributions, statistical inference, and power.
Set up
Here are the packages we’re going to need. We’ll also set our {tinyplot} theme.
library(dplyr)
library(tidyr)
library(broom)
library(stringr)
library(modelsummary)
library(tinyplot)
tinytheme("ipsum",
          family = "Roboto Condensed",
          palette.qualitative = "Tableau 10",
          palette.sequential = "agSunset")
Additional considerations
Before moving on to the chapter proper, I want to say something about the t-test. The book doesn’t properly cover it until Chapter 9, but we’ve been skirting around it and using the t distribution already, so I’d like to say a little about it here.
The t-test as a two-group lm() comparison*
Let’s look at Southern and non-Southern states in 1977 and compare them on life expectancy.
states <- bind_cols(state.x77,
region = state.division) |> # both are in base R
janitor::clean_names() |> # lower case and underscore
mutate(south = as.integer( # make a south dummy
str_detect(as.character(region),
"South")))We can write:
\[\begin{align} &\text{Model C: } \text{LIFEEXP}_i = \beta_0 + \epsilon_i \\ &\text{Model A: } \text{LIFEEXP}_i = \beta_0 + \beta_1 (\text{SOUTH}_i) + \epsilon_i \end{align}\]
Estimate the models.
mc <- lm(life_exp ~ 1,
data = states)
ma <- lm(life_exp ~ 1 + south,
data = states)

We can see the output using the {modelsummary} package, via its msummary() alias for short.
msummary(list("MC" = mc,
"MA" = ma),
fmt = 2,
estimate = "{estimate} ({std.error})",
statistic = NULL,
gof_map = c("nobs", "F", "r.squared"))| MC | MA | |
|---|---|---|
| (Intercept) | 70.88 (0.19) | 71.43 (0.19) |
| south | | -1.72 (0.33) |
| Num.Obs. | 50 | 50 |
| F | | 27.739 |
| R2 | 0.000 | 0.366 |
This “regression with a single dummy predictor” is exactly equivalent to a two-sample t-test that assumes the two groups have equal within-group variances. Again, we’ll talk more about that later.
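As a quick check (this code isn’t in the book), we can compare the pooled-variance t-test to the t statistic on the south coefficient; the absolute values should match, though the sign can flip depending on which group is treated as the baseline.
t.test(life_exp ~ south, data = states, var.equal = TRUE) # pooled-variance t-test
summary(ma)$coefficients["south", ] # same |t| and p-value as above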
Two approaches to coding a predictor
The coding of south above is called dummy coding or reference coding and is by far the most common in sociology. But sometimes (including in the book) you will see effect coding (also called deviation or contrast coding) that codes the predictor (-1, 1) instead of (0, 1).
states <- states |>
mutate(south_eff = if_else(south == 1, 1, -1))
ma_effect <- lm(life_exp ~ south_eff, data = states)
msummary(list("MA (dummy)" = ma,
"MA (effect)" = ma_effect),
fmt = 2,
estimate = "{estimate} ({std.error})",
statistic = NULL,
gof_map = c("nobs", "F", "r.squared"))| MA (dummy) | MA (effect) | |
|---|---|---|
| (Intercept) | 71.43 (0.19) | 70.57 (0.16) |
| south | -1.72 (0.33) | |
| south_eff | | -0.86 (0.16) |
| Num.Obs. | 50 | 50 |
| F | 27.739 | 27.739 |
| R2 | 0.366 | 0.366 |
With effect coding, the intercept is the grand mean: the overall mean you’d expect if both groups were the same size. And \(b_1\) is half the difference between the two group means.
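To make that concrete, here’s a small sketch (not in the original) recovering the two group means from the effect-coded coefficients and checking them against the group means computed directly.
b <- coef(ma_effect)
c(south = unname(b[1] + b[2]), # 70.57 + (-0.86) = ~69.7
  non_south = unname(b[1] - b[2])) # 70.57 - (-0.86) = ~71.4
# compare with the group means computed directly
states |>
  group_by(south) |>
  summarize(mean_life_exp = mean(life_exp))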
Defining a linear two-parameter model
As in the book, we’ll look at regression first through the lens of two continuous variables.
plt(life_exp ~ hs_grad,
data = states,
ylim = c(67, 74),
main = "Life expectancy and education",
sub = "US states, 1977",
xlab = "% high school graduates",
ylab = "Life expectancy at birth")
We can add a regression line to this using plt_add().
plt(life_exp ~ hs_grad,
data = states,
ylim = c(67, 74),
main = "Life expectancy and education",
sub = "US states, 1977",
xlab = "% high school graduates",
ylab = "Life expectancy at birth")
plt_add(type = "lm",
level = .99)
Contrast that to assuming the same prediction for each state.
states$mean_lex <- mean(states$life_exp)
width <- sd(states$life_exp) / sqrt(nrow(states)) * qt(.995, 49) # 99% CI half-width
plt(mean_lex ~ hs_grad,
data = states,
ymax = mean_lex + width,
ymin = mean_lex - width,
ylim = c(67, 74),
type = "ribbon",
main = "Life expectancy and education",
sub = "US states, 1977",
xlab = "% high school graduates",
ylab = "Life expectancy at birth")
plt_add(life_exp ~ hs_grad,
data = states,
type = "p")
Estimating a linear two-parameter model
Estimate the models.
mc <- lm(life_exp ~ 1, data = states)
ma <- lm(life_exp ~ hs_grad, data = states)
msummary(list("MC" = mc,
"MA" = ma),
fmt = 3,
estimate = "{estimate} ({std.error})",
statistic = NULL,
gof_map = c("nobs", "F", "r.squared"))| MC | MA | |
|---|---|---|
| (Intercept) | 70.879 (0.190) | 65.740 (1.047) |
| hs_grad | | 0.097 (0.020) |
| Num.Obs. | 50 | 50 |
| F | | 24.615 |
| R2 | 0.000 | 0.339 |
An alternative specification [or many…]
Centering
states <- states |>
mutate(c_hs_grad = hs_grad - mean(hs_grad))
fit_cent <- lm(life_exp ~ c_hs_grad,
data = states)
msummary(list("Centered" = fit_cent),
fmt = 3,
estimate = "{estimate} ({std.error})",
statistic = NULL,
gof_map = c("nobs", "F", "r.squared"))| Centered | |
|---|---|
| (Intercept) | 70.879 (0.156) |
| c_hs_grad | 0.097 (0.020) |
| Num.Obs. | 50 |
| F | 24.615 |
| R2 | 0.339 |
plt(life_exp ~ c_hs_grad,
data = states,
ylim = c(67, 74),
main = "Life expectancy and education",
sub = "US states, 1977",
xlab = "% high school graduates (centered)",
ylab = "Life expectancy at birth")
plt_add(type = "lm",
level = .99)
The intercept, \(b_0\), is now the expected value of the outcome when hs_grad is at its mean. But the slope, \(b_1\), is the same.
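A quick check (not in the book): the centered intercept is just the original model’s prediction at the sample mean of hs_grad.
predict(ma, newdata = data.frame(hs_grad = mean(states$hs_grad))) # ~70.879
coef(fit_cent)[1] # same value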
Normalization*
See Cohen et al., “The Problem of Units and the Circumstance for POMP”.
Create a normalized (or “POMP”, percent of maximum possible) predictor and visualize it.
states <- states |>
mutate(hs_grad01 = (hs_grad - min(hs_grad)) / # subtract min
(max(hs_grad) - min(hs_grad)), # div by range
hs_grad01 = scales::rescale(hs_grad)) # an easier way
plt(life_exp ~ hs_grad01,
data = states,
ylim = c(67, 74),
main = "Life expectancy and education",
sub = "US states, 1977",
xlab = "high school graduate rate (normalized)",
ylab = "Life expectancy at birth")
plt_add(type = "lm",
level = .99)
fit_pomp <- lm(life_exp ~ hs_grad01,
data = states)
msummary(list("Norm." = fit_pomp),
fmt = 3,
estimate = "{estimate} ({std.error})",
statistic = NULL,
gof_map = c("nobs", "F", "r.squared"))| Norm. | |
|---|---|
| (Intercept) | 69.397 (0.337) |
| hs_grad01 | 2.855 (0.575) |
| Num.Obs. | 50 |
| F | 24.615 |
| R2 | 0.339 |
Now \(b_0\) is the predicted value when X is at its sample minimum, and \(b_1\) is how much the prediction changes as X moves from its sample minimum to its sample maximum. So here, moving from the least- to the most-educated state, the model expects life expectancy to be about 2.9 years higher.
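As a quick check (my addition), the normalized slope is just the raw slope multiplied by the observed range of hs_grad.
coef(ma)["hs_grad"] * diff(range(states$hs_grad)) # ~2.855, matches the hs_grad01 slope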
X-standardization*
Sometimes, when Y is in naturally interpretable units (like years) and X is a bit more abstract, we can use x-standardization. This means converting X into a z-score before using it as a predictor.
\[X^*_i = \frac{X_i-\bar{X}}{s_X}\]
states <- states |>
mutate(z_hs_grad = (hs_grad - mean(hs_grad)) / sd(hs_grad))
plt(z_hs_grad ~ hs_grad,
data = states,
ylim = c(-2,2),
main = "Original and standardized values",
xlab = "% HS grads",
ylab = "z-score")
fit_stdx <- lm(life_exp ~ z_hs_grad,
data = states)
msummary(list("X std." = fit_stdx),
fmt = 3,
estimate = "{estimate} ({std.error})",
statistic = NULL,
gof_map = c("nobs", "F", "r.squared"))| X std. | |
|---|---|
| (Intercept) | 70.879 (0.156) |
| z_hs_grad | 0.782 (0.158) |
| Num.Obs. | 50 |
| F | 24.615 |
| R2 | 0.339 |
This now means that a one standard deviation increase in the % of high school graduates predicts about .78 additional years of life expectancy.
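Again as a quick check (my addition), the x-standardized slope is just the raw slope multiplied by the standard deviation of hs_grad.
coef(ma)["hs_grad"] * sd(states$hs_grad) # ~0.782, matches the z_hs_grad slope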
Full standardization*
Both X and Y can also be standardized in this way.
states <- states |>
mutate(z_life_exp = (life_exp - (mean(life_exp))) / sd(life_exp))
fit_std <- lm(z_life_exp ~ z_hs_grad,
data = states)
msummary(list("XY std." = fit_std),
fmt = 3,
estimate = "{estimate} ({std.error})",
statistic = NULL,
gof_map = c("nobs", "F", "r.squared"))| XY std. | |
|---|---|
| (Intercept) | -0.000 (0.116) |
| z_hs_grad | 0.582 (0.117) |
| Num.Obs. | 50 |
| F | 24.615 |
| R2 | 0.339 |
Here the intercept will always be zero, since simple regression always expects the outcome to be at its average when the predictor is at its average. (And here, both averages are zero.)
The slope here means that “in a state that is one SD more educated, we expect life expectancy to be .582 SD higher.”
A fully standardized simple regression coefficient is exactly the same as Pearson’s r or the correlation coefficient. And this value squared is exactly R2 (hence the name!), or PRE.
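We can verify both claims directly (this check isn’t in the book):
cor(states$hs_grad, states$life_exp) # ~0.582, the fully standardized slope
cor(states$hs_grad, states$life_exp)^2 # ~0.339
summary(ma)$r.squared # same value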
Comparing all versions*
Let’s compare all versions.
msummary(list("Original" = ma,
"Centered" = fit_cent,
"Norm." = fit_pomp,
"X-std." = fit_stdx,
"XY-std." = fit_std),
fmt = 3,
estimate = "{estimate}",
statistic = NULL,
gof_map = c("nobs", "F", "r.squared")) | Original | Centered | Norm. | X-std. | XY-std. | |
|---|---|---|---|---|---|
| (Intercept) | 65.740 | 70.879 | 69.397 | 70.879 | -0.000 |
| hs_grad | 0.097 | | | | |
| c_hs_grad | | 0.097 | | | |
| hs_grad01 | | | 2.855 | | |
| z_hs_grad | | | | 0.782 | 0.582 |
| Num.Obs. | 50 | 50 | 50 | 50 | 50 |
| F | 24.615 | 24.615 | 24.615 | 24.615 | 24.615 |
| R2 | 0.339 | 0.339 | 0.339 | 0.339 | 0.339 |
As you can see, the fit statistics (F, PRE) stay the same, but the interpretation of the coefficients changes. The choice among codings is about human interpretability, so use whichever makes the most sense to you and your likely readers.
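One way to convince yourself of this (my addition): every specification with the original outcome produces identical fitted values, so the underlying model is the same. (The fully standardized model’s fitted values match too, just expressed in SD units of Y.)
all.equal(fitted(ma), fitted(fit_cent))
all.equal(fitted(ma), fitted(fit_pomp))
all.equal(fitted(ma), fitted(fit_stdx))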
Statistical inference in two-parameter models
Testing the null
We’re going to do this with simulations. Specifically, we’re going to implement a permutation test. First let’s check out the vectors and see how extreme we could make the positive and negative slopes using the data we have.
# to make things a bit quicker
y <- states$life_exp
x <- states$hs_grad
# most extreme positive possible
plt(sort(y) ~ sort(x))
lm(sort(y) ~ sort(x))$coefficients[2]
  sort(x)
0.1629119
# most extreme negative possible
plt(sort(y) ~ sort(x, decreasing = TRUE))
lm(sort(y) ~ sort(x, decreasing = TRUE))$coefficients[2]
sort(x, decreasing = TRUE)
-0.1620876
Let’s make a function to randomly pair x and y values and then estimate a regression line.
# NB: we could make this faster but I want to make it clear...
get_coefs <- function(x, y) {
d <- tibble(
y = sample(y, length(y)),
x = sample(x, length(x))
)
fit <- lm(y ~ x, data = d)
tibble(
b0 = unname(coef(fit)[1]),
b1 = unname(coef(fit)[2])
)
}

Skeleton and apply.
mynulls <- tibble(id = 1:1000) |>
rowwise() |>
mutate(coefs = list(get_coefs(x, y))) |>
tidyr::unnest_wider(coefs) |>
ungroup()

Plot these null estimates along with the real value.
est_b1 <- ma$coefficients[2]
plt(~ b1,
data = mynulls,
type = "hist",
main = "Observed and null estimates of b1",
xlim = c(min(mynulls$b1) * 1.05 , est_b1 * 1.05))
plt_add(type = type_vline(v = est_b1),
col = "#F28E2B",
lwd = 2)
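From these nulls we can also compute a rough two-sided permutation p-value (this step isn’t in the book): the proportion of shuffled slopes at least as extreme as the observed slope.
mean(abs(mynulls$b1) >= abs(est_b1))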
Power analysis
Two-parameter model comparison
[That W3 = W1, not father son]