Chapter 3
Goals
We will use simulations to understand sampling distributions. This is a supplement to the more theory-based discussion in the book.
Set up
Load packages and set theme.

library(dplyr)
library(tidyr)
library(broom)
library(stringr)
library(tinyplot)
tinytheme("ipsum",
          family = "Roboto Condensed",
          palette.qualitative = "Tableau 10",
          palette.sequential = "agSunset")

Get data. (I showed how to download this dataset in the Chapter 2 notes.)
gss2024 <- readRDS(file = here::here("data", "gss2024.rds"))

Sampling distributions
We need some skewed data to show that the sampling distribution of the mean is approximately normal (given a big enough sample) no matter what the shape of the underlying distribution. We’ll use the GSS classic, tvhours.
d <- gss2024 |>
  select(tvhours) |>
  drop_na()
plt(~ tvhours,
    data = d,
    type = type_hist(breaks = seq(-.5, 24.5, 1)),
    main = "Daily hours of television among US adults",
    sub = "2024 General Social Survey",
    xaxt = "n",
    xlim = c(-.6, 24.6))
axis(1, at = seq(0, 24, 4),
     labels = seq(0, 24, 4),
     tck = 0,
     lwd = 0)
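
For reference, the result this section illustrates can be written compactly. The notation is introduced here for illustration and is not taken from the book: assume independent draws from a population with mean \mu and finite standard deviation \sigma, and let \bar{X}_n be the mean of a sample of size n. Then, for large enough n, roughly:

```latex
\bar{X}_n \;\overset{\text{approx.}}{\sim}\; \mathcal{N}\!\left(\mu,\ \frac{\sigma^2}{n}\right),
\qquad
\operatorname{SE}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}
```

The shape of the population affects how large n has to be before the approximation is good, not the limiting result itself.
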
Treat this sample as the population for now and take repeated samples from it. The first step is to write a function that draws a sample of size n and computes its mean.
get_sample_mean <- function(n) {
  d |>
    slice_sample(n = n, replace = TRUE) |>  # draw n rows with replacement
    summarize(m = mean(tvhours)) |>         # mean of the drawn values
    as.numeric()                            # 1x1 tibble -> plain number
}

Now I like to make a simulation “skeleton” that I can plug results into.
sims <- tibble(
  sim_number = 1:1000
)

Now I add the sampled means to the skeleton. If you’re trying this at home, it’s good to vary the sample size so you can see how the simulated sampling distribution behaves.
sims <- sims |>
  rowwise() |>                           # do separately by row
  mutate(m = get_sample_mean(n = 2152))  # vary the N

Now I can plot the results.
plt(~ m,
    data = sims,
    type = type_hist(freq = FALSE),
    ylim = c(0, 6),
    main = "Simulated sampling distribution",
    sub = "N = 2152",
    ylab = "Density",
    xlab = "Mean estimate")
plt_add(~ m,
        data = sims,
        type = type_density(bw = "SJ"),
        col = "black")
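
If you don't have the GSS file handy, here is a minimal base-R sketch of the same idea that also varies the sample size. Everything in it is a stand-in: pop is a simulated right-skewed vector (not tvhours), and sample_mean(), sim_se(), and the chosen sample sizes are hypothetical names and values picked for illustration. It compares the standard deviation of the simulated means with the theoretical standard error, the population SD divided by the square root of n.

```r
# Stand-in "population": right-skewed whole numbers capped at 24.
# This is simulated data, NOT the GSS tvhours variable.
set.seed(2024)
pop <- pmin(rgeom(5000, prob = 0.25), 24)

# Draw one sample of size n (with replacement) and return its mean.
sample_mean <- function(n) mean(sample(pop, size = n, replace = TRUE))

# Simulated standard error: the SD of many sample means.
sim_se <- function(n, reps = 1000) sd(replicate(reps, sample_mean(n)))

# Population SD (divide by N, since pop is treated as the whole population).
pop_sd <- sqrt(mean((pop - mean(pop))^2))

# Compare simulated and theoretical standard errors across sample sizes.
ns <- c(10, 100, 1000)
results <- data.frame(
  n = ns,
  simulated_se = sapply(ns, sim_se),
  theoretical_se = pop_sd / sqrt(ns)
)
print(results)
```

The two standard error columns should track each other closely, and each tenfold increase in n should shrink both by a factor of about the square root of 10. Swapping pop for d$tvhours would apply the same check to the real data.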