Chapter 1

Goals

Practice some basic skills before getting into the content.

Set up

Load packages and set your graph theme.

pacman::p_load(dplyr,
               tidyr,
               broom,
               stringr,
               tinyplot,
               WDI,
               ggplot2)
tinytheme("ipsum",
          family = "Roboto Condensed",
          palette.qualitative = "Tableau 10",
          palette.sequential = "agSunset")
theme_set(
  theme_minimal(base_family = "Roboto Condensed") +
    theme(panel.grid.minor = element_blank())
)

tableau10 <- c("#5778a4", "#e49444", "#d1615d", "#85b6b2", "#6a9f58",
               "#e7ca60", "#a87c9f", "#f1a2a9", "#967662", "#b8b0ac")

options(
  ggplot2.discrete.colour = \() scale_color_manual(values = tableau10),
  ggplot2.discrete.fill   = \() scale_fill_manual(values = tableau10)
)

Practice

Let’s get (more or less) the same data the authors are using. You will need the {WDI} package from CRAN.¹ I will fetch the data once and then save it locally. The code below shows how I did it.

¹ Pulling the WDI data directly gives us many more countries than the authors use.

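# Fetch from the World Bank API only if there is no local copy yet
if (!file.exists("data/WDI.rds")) {
  rawdata <- WDI(indicator = "IT.NET.USER.ZS",
                 start = 2021,
                 end = 2021,
                 extra = TRUE)
  saveRDS(rawdata, file = "data/WDI.rds")
}
rawdata <- readRDS("data/WDI.rds")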

d <- rawdata |> 
  filter(region != "Aggregates") |> 
  select(country, 
         iso = iso3c, 
         intpct = IT.NET.USER.ZS,
         income) |> 
  drop_na() |> 
  mutate(myguess = 70,                 # unconditional guess: 70% everywhere
         residual = intpct - myguess)  # actual value minus the guess

Let’s make different guesses for high-income and not high-income countries.

d <- d |> 
  mutate(highinc = if_else(income == "High income", 1, 0),
         my_cond_guess = if_else(highinc == 1, 90, 70),
         my_cond_resid = intpct - my_cond_guess)

sum(abs(d$residual))
[1] 3681.513
sum(abs(d$my_cond_resid))
[1] 2804.545

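Notice that making separate guesses makes the total error go down, from about 3,682 to about 2,805. That’s an improvement. Here the error is the sum of absolute residuals; the closely related sum of squared residuals goes by SSR, RSS, or SSE.²

² See the next chapter for what these terms mean in practice.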
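To see the mechanics on a smaller scale, here is a minimal sketch using five made-up values (the `toy` data frame and its numbers are hypothetical, not WDI data). It computes both the sum of absolute residuals used above and the sum of squared residuals for each set of guesses.

```r
# Hypothetical toy data: five made-up countries, not real WDI values
toy <- data.frame(
  intpct  = c(95, 88, 60, 40, 75),
  highinc = c(1, 1, 0, 0, 0)
)

# Unconditional guess: 70 for everyone
resid_uncond <- toy$intpct - 70

# Conditional guess: 90 for high-income countries, 70 otherwise
cond_guess <- ifelse(toy$highinc == 1, 90, 70)
resid_cond <- toy$intpct - cond_guess

# Sum of absolute residuals
sae_uncond <- sum(abs(resid_uncond))  # 88
sae_cond   <- sum(abs(resid_cond))    # 52

# Sum of squared residuals
ssr_uncond <- sum(resid_uncond^2)  # 1974
ssr_cond   <- sum(resid_cond^2)    # 1054
```

In this toy case both measures fall when the guess depends on income group. Squaring penalizes large misses more heavily, so the two measures need not always agree on which set of guesses is better.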

Let’s make some visualizations.

Here’s a histogram, first with {tinyplot} and then with {ggplot2}.

plt(~ intpct,
    data = d,
    type = type_hist(breaks = "Sturges"),
    main = "Internet access by country, 2021",
    sub = "World Development Indicators data",
    xlab = "% of population using the internet")

ggplot(d, aes(x = intpct)) +
  geom_histogram(bins = nclass.Sturges(d$intpct),
                 fill = tableau10[1], color = "white", alpha = 0.8) +
  labs(
    title = "Internet access by country, 2021",
    subtitle = "World Development Indicators data",
    x = "% of population using the internet",
    y = "Count")
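Both versions lean on Sturges' rule to pick the number of bins. As a rough sketch of what that rule does (the vector `x` below is simulated, not the WDI data), `nclass.Sturges()` returns ceiling(log2(n) + 1) for n observations:

```r
# Simulated data for illustration only (not the WDI values)
set.seed(1)
x <- runif(100, min = 0, max = 100)

# Sturges' rule: number of classes = ceiling(log2(n) + 1)
k_builtin <- nclass.Sturges(x)
k_manual  <- ceiling(log2(length(x)) + 1)

k_builtin == k_manual  # TRUE; both are 8 for n = 100

# hist() treats this number as a suggestion and passes it through pretty(),
# so the number of bins actually drawn can differ from k_builtin
h <- hist(x, breaks = "Sturges", plot = FALSE)
length(h$counts)
```

By contrast, `geom_histogram(bins = k)` asks for exactly k bins. Assuming `type_hist()` defers to `hist()`, the two histograms may therefore not share identical bin edges even though both start from the same rule.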

Here’s a dot plot with the countries sorted from lowest to highest internet use.

plt(~ sort(intpct),
    data = d,
    main = "Internet access by country, 2021",
    sub = "World Development Indicators data",
    ylab = "% of population using the internet",
    xaxt = "n",
    xlab = "")

ggplot(d |> arrange(intpct) |> mutate(rank = row_number()),
       aes(x = rank, y = intpct)) +
  geom_point(color = tableau10[1]) +
  labs(
    title = "Internet access by country, 2021",
    subtitle = "World Development Indicators data",
    y = "% of population using the internet",
    x = NULL) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
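The dot plot places the sorted values on the y-axis against each country's position in that ordering. A minimal base-R sketch with made-up values (the country labels and numbers are hypothetical) shows how the sorted values and rank positions relate:

```r
# Hypothetical values for illustration only
intpct <- c(A = 60, B = 95, C = 40, D = 88, E = 75)

# sort() orders the values from lowest to highest and keeps the names
sorted <- sort(intpct)
names(sorted)  # "C" "A" "E" "D" "B"

# rank() gives each country's position in the sorted order
pos <- rank(intpct, ties.method = "first")
pos  # A = 2, B = 5, C = 1, D = 4, E = 3

# The plotted points are (position, value) pairs in sorted order
plot_data <- data.frame(
  country = names(sorted),
  rank    = seq_along(sorted),
  intpct  = unname(sorted)
)
```

Because the x-axis carries only the position, hiding its tick labels loses nothing; the shape of the curve is what the plot is meant to convey.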