Chapter 1

Goals

Practice some basic skills before getting into the content.

Set up

Load packages and set your graph theme.

library(dplyr)
library(tidyr)
library(broom)
library(stringr)
library(tinyplot)

library(WDI)

tinytheme("ipsum",
          family = "Roboto Condensed",
          palette.qualitative = "Tableau 10",
          palette.sequential = "agSunset")

Practice

Let’s get (more or less) the same data the authors are using. You will need the {WDI} package from CRAN.[1] I fetch the data once and then save it locally, so later runs can read the local copy. The code below shows how I did it.

[1] Using the WDI data directly, we will have a lot more countries.

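# Fetch the indicator once and save a local copy.
# After the first run, the WDI() and saveRDS() calls can be skipped.
rawdata <- WDI(indicator = "IT.NET.USER.ZS", # individuals using the internet (% of population)
               start = 2021,
               end = 2021,
               extra = TRUE)                  # adds region and income columns
saveRDS(rawdata, file = "data/WDI.rds")      # assumes a data/ directory exists
rawdata <- readRDS("data/WDI.rds")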

d <- rawdata |> 
  filter(region != "Aggregates") |> 
  select(country, 
         iso = iso3c, 
         intpct = IT.NET.USER.ZS,
         income) |> 
  drop_na() |> 
  mutate(myguess = 70,                  # unconditional guess for every country
         residual = intpct - myguess)   # actual value minus the guess

Let’s make different guesses for high-income and not high-income countries.

d <- d |> 
  mutate(highinc = if_else(income == "High income", 1, 0),
         my_cond_guess = if_else(highinc == 1, 90, 70),
         my_cond_resid = intpct - my_cond_guess)

sum(abs(d$residual))
[1] 3681.513
sum(abs(d$my_cond_resid))
[1] 2804.545

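Notice that making separate guesses makes the total error go down, from about 3,682 to about 2,805. That’s an improvement. Here the error is the sum of absolute residuals; the sum of squared residuals (SSR, also called RSS or SSE) is built the same way but squares each residual first.[2]

[2] See the next chapter for what these terms mean in practice.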
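
To make the two error measures concrete, here is a small sketch on made-up numbers. The five values below are hypothetical, not WDI data, and the object names (toy, sae_*, ssr_*) are mine. It scores the same two guesses (70 for everyone versus 90/70 by income group) with both the sum of absolute residuals and the sum of squared residuals. It uses base R only, so it runs without any packages.

```r
# Hypothetical toy data: five made-up countries, not taken from WDI
toy <- data.frame(
  intpct  = c(95, 88, 60, 40, 75),
  highinc = c(1, 1, 0, 0, 0)
)

# Unconditional guess: 70 for everyone
uncond_resid <- toy$intpct - 70

# Conditional guess: 90 for high-income countries, 70 otherwise
cond_guess <- ifelse(toy$highinc == 1, 90, 70)
cond_resid <- toy$intpct - cond_guess

# Sum of absolute residuals
sae_uncond <- sum(abs(uncond_resid))
sae_cond   <- sum(abs(cond_resid))

# Sum of squared residuals
ssr_uncond <- sum(uncond_resid^2)
ssr_cond   <- sum(cond_resid^2)

c(sae_uncond = sae_uncond, sae_cond = sae_cond)
c(ssr_uncond = ssr_uncond, ssr_cond = ssr_cond)
```

In this toy case both measures fall when the guess depends on income group. Squaring weights large misses more heavily than small ones, so the two measures will not always rank guesses the same way.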

Let’s make some visualizations.

Here’s a histogram.

plt(~ intpct,
    data = d,
    type = type_hist(),
    main = "Internet access by country, 2021",
    sub = "World Development Indicators data",
    xlab = "% of population using the internet")

Here’s a dotplot with the countries sorted from lowest to highest.

plt(~ sort(intpct),
    data = d,
    main = "Internet access by country, 2021",
    sub = "World Development Indicators data",
    ylab = "% of population using the internet",
    xaxt = "n",
    xlab = "")
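
One thing to keep in mind with this plot: sort() returns only the sorted values, so the link between each point and its country is lost. If you want to know which country sits where, one option is to reorder the whole data frame with order() instead. The sketch below uses a made-up four-row data frame (hypothetical values, not WDI data) and base R only; with {dplyr} loaded, arrange() would do the same job.

```r
# Hypothetical toy data, not taken from WDI
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  intpct  = c(75, 40, 95, 60)
)

# sort() keeps the values only; the country names are gone
sorted_values <- sort(toy$intpct)

# order() returns row positions, so whole rows can be rearranged together
ranked <- toy[order(toy$intpct), ]
ranked$rank <- seq_len(nrow(ranked))  # 1 = lowest value

ranked
```

The rank column can then serve as the x position, and the country column stays available for labelling points.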