Tidying data

Tidying data

  • Data rarely come to us as we want to use them.
  • Before we can do analysis, typically have organizing to do.
  • This is typical of ANOVA-type data, “wide format”:
     pig feed1 feed2 feed3 feed4
       1  60.8  68.7  92.6  87.9
       2  57.0  67.7  92.1  84.2
       3  65.0  74.0  90.2  83.1
       4  58.6  66.3  96.5  85.7
       5  61.7  69.8  99.1  90.3
  • 20 pigs randomly allocated to one of four feeds. At end of study, weight of each pig is recorded.
  • Are any differences in mean weights among the feeds?
  • Problem: want all weights in one column, with 2nd column labelling which feed. Untidy!

Tidy and untidy data (Wickham)

  • Data set easier to deal with if:
    • each observation is one row
    • each variable is one column
    • each type of observation unit is one table
  • Data arranged this way called “tidy”; otherwise called “untidy”.
  • For the pig data:
    • response variable is weight, but scattered over 4 columns, which are levels of a factor feed.
    • Want all the weights in one column, with a second column feed saying which feed that weight goes with.
    • Then we can run aov.

Packages for this section

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Reading in the pig data

my_url <- "http://datafiles.ritsokiguess.site/pigs1.txt"
pigs1 <- read_delim(my_url, " ")
pigs1
# A tibble: 5 × 5
    pig feed1 feed2 feed3 feed4
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1  60.8  68.7  92.6  87.9
2     2  57    67.7  92.1  84.2
3     3  65    74    90.2  83.1
4     4  58.6  66.3  96.5  85.7
5     5  61.7  69.8  99.1  90.3

Making it longer

  • We wanted all the weights in one column, labelled by which feed they went with.
  • This is a very common reorganization, and the magic “verb” is pivot_longer:
pigs1 %>% pivot_longer(feed1:feed4, names_to="feed", 
                       values_to="weight") -> pigs2

The long dataframe pigs2

# A tibble: 20 × 3
     pig feed  weight
   <dbl> <chr>  <dbl>
 1     1 feed1   60.8
 2     1 feed2   68.7
 3     1 feed3   92.6
 4     1 feed4   87.9
 5     2 feed1   57  
 6     2 feed2   67.7
 7     2 feed3   92.1
 8     2 feed4   84.2
 9     3 feed1   65  
10     3 feed2   74  
11     3 feed3   90.2
12     3 feed4   83.1
13     4 feed1   58.6
14     4 feed2   66.3
15     4 feed3   96.5
16     4 feed4   85.7
17     5 feed1   61.7
18     5 feed2   69.8
19     5 feed3   99.1
20     5 feed4   90.3

Inputs to pivot_longer:

  • columns to combine
  • a name for column that will contain groups (“names”)
  • a name for column that will contain measurements (“values”)

Alternatives

Any way of choosing the columns to pivot longer is good, eg:

pigs1 %>% pivot_longer(-pig, 
                       names_to = "feed", 
                       values_to = "weight") -> pigs2

or

pigs1 %>% pivot_longer(starts_with("feed"), 
                       names_to = "feed", 
                       values_to = "weight") -> pigs2
  • pigs2 now in “long” format, ready for analysis.

Tuberculosis

  • The World Health Organization keeps track of number of cases of various diseases, eg. tuberculosis.
  • Some data:
my_url <- "http://datafiles.ritsokiguess.site/tb.csv"
tb <- read_csv(my_url)
Rows: 5769 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): iso2
dbl (21): year, m04, m514, m014, m1524, m2534, m3544, m4554, m5564, m65, mu,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data (10 randomly chosen rows)

tb 
# A tibble: 5,769 × 22
   iso2   year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65
   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 AD     1989    NA    NA    NA    NA    NA    NA    NA    NA    NA
 2 AD     1990    NA    NA    NA    NA    NA    NA    NA    NA    NA
 3 AD     1991    NA    NA    NA    NA    NA    NA    NA    NA    NA
 4 AD     1992    NA    NA    NA    NA    NA    NA    NA    NA    NA
 5 AD     1993    NA    NA    NA    NA    NA    NA    NA    NA    NA
 6 AD     1994    NA    NA    NA    NA    NA    NA    NA    NA    NA
 7 AD     1996    NA    NA     0     0     0     4     1     0     0
 8 AD     1997    NA    NA     0     0     1     2     2     1     6
 9 AD     1998    NA    NA     0     0     0     1     0     0     0
10 AD     1999    NA    NA     0     0     0     1     1     0     0
# ℹ 5,759 more rows
# ℹ 11 more variables: mu <dbl>, f04 <dbl>, f514 <dbl>, f014 <dbl>,
#   f1524 <dbl>, f2534 <dbl>, f3544 <dbl>, f4554 <dbl>, f5564 <dbl>,
#   f65 <dbl>, fu <dbl>

Many rows and columns

nrow(tb)
[1] 5769
ncol(tb)
[1] 22

What we have

  • Variables: country (abbreviated), year. Then number of cases for each gender and age group, eg. m1524 is males aged 15–24. Also mu and fu, where age is unknown.
  • Lots of missings. Want to get rid of.
  • Abbreviations here.
tb %>% 
  pivot_longer(m04:fu, 
               names_to = "genage", 
               values_to = "freq", 
               values_drop_na = TRUE) 
  • Code for pivot_longer:
    • columns to make longer
    • column to contain the names (categorical)
    • column to contain the values (quantitative)
    • drop missings in the values

Results (some)

tb %>% 
  pivot_longer(m04:fu, 
               names_to = "genage", 
               values_to = "freq", 
               values_drop_na = TRUE) 
# A tibble: 35,750 × 4
   iso2   year genage  freq
   <chr> <dbl> <chr>  <dbl>
 1 AD     1996 m014       0
 2 AD     1996 m1524      0
 3 AD     1996 m2534      0
 4 AD     1996 m3544      4
 5 AD     1996 m4554      1
 6 AD     1996 m5564      0
 7 AD     1996 m65        0
 8 AD     1996 f014       0
 9 AD     1996 f1524      1
10 AD     1996 f2534      1
# ℹ 35,740 more rows

Examine

  • Not quite right, though:

    • column genage contains both gender and age
    • we want two columns, one containing gender and the other containing age group.
  • Idea:

    • put two things in names_to
    • then add a names_sep to say where one ends and the other starts: in this case, after the first character, so a number “1”.

The improved pivot_longer:

tb %>% 
  pivot_longer(m04:fu,
               names_to = c("gender", "age"),
               names_sep = 1,
               values_to = "frequency",
               values_drop_na = TRUE)
  • Tip: the number of things in names_sep should be one fewer than the number of things in names_to (if you have two things to separate, you need one thing to separate them with).

… with result

# A tibble: 35,750 × 5
   iso2   year gender age   frequency
   <chr> <dbl> <chr>  <chr>     <dbl>
 1 AD     1996 m      014           0
 2 AD     1996 m      1524          0
 3 AD     1996 m      2534          0
 4 AD     1996 m      3544          4
 5 AD     1996 m      4554          1
 6 AD     1996 m      5564          0
 7 AD     1996 m      65            0
 8 AD     1996 f      014           0
 9 AD     1996 f      1524          1
10 AD     1996 f      2534          1
# ℹ 35,740 more rows

Save it

This looks tidy, so save it:

tb %>% 
  pivot_longer(m04:fu,
               names_to = c("gender", "age"),
               names_sep = 1,
               values_to = "frequency",
               values_drop_na = TRUE) -> tb_tidy

Comments

  • You can split the R code over as many lines as you like, as long as each line is incomplete, so that R knows more is to come.
  • I like to put the pipe symbol on the end of the line.
  • Sometimes one function call gets very long, in which case “one thing per line” is often the easiest to read.

Aside

If we knew the age groups were always four digits (first two as the lower age limit, last two as upper), we could do even better (but in real data we were not so lucky):

tb %>% 
  select(iso2, year, m1524:m5564, f1524:f5564) -> tb_aside
tb_aside
# A tibble: 5,769 × 12
   iso2   year m1524 m2534 m3544 m4554 m5564 f1524 f2534 f3544 f4554
   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 AD     1989    NA    NA    NA    NA    NA    NA    NA    NA    NA
 2 AD     1990    NA    NA    NA    NA    NA    NA    NA    NA    NA
 3 AD     1991    NA    NA    NA    NA    NA    NA    NA    NA    NA
 4 AD     1992    NA    NA    NA    NA    NA    NA    NA    NA    NA
 5 AD     1993    NA    NA    NA    NA    NA    NA    NA    NA    NA
 6 AD     1994    NA    NA    NA    NA    NA    NA    NA    NA    NA
 7 AD     1996     0     0     4     1     0     1     1     0     0
 8 AD     1997     0     1     2     2     1     1     2     3     0
 9 AD     1998     0     0     1     0     0    NA    NA    NA    NA
10 AD     1999     0     0     1     1     0     0     0     1     0
# ℹ 5,759 more rows
# ℹ 1 more variable: f5564 <dbl>

Get gender and lower and upper ends of age group:

tb_aside %>% 
  pivot_longer(m1524:f5564,
               names_to = c("gender", "age_low", "age_high"),
               names_sep = c(1, 3),
               values_to = "frequency",
               values_drop_na = TRUE) 

… with result

# A tibble: 24,052 × 6
   iso2   year gender age_low age_high frequency
   <chr> <dbl> <chr>  <chr>   <chr>        <dbl>
 1 AD     1996 m      15      24               0
 2 AD     1996 m      25      34               0
 3 AD     1996 m      35      44               4
 4 AD     1996 m      45      54               1
 5 AD     1996 m      55      64               0
 6 AD     1996 f      15      24               1
 7 AD     1996 f      25      34               1
 8 AD     1996 f      35      44               0
 9 AD     1996 f      45      54               0
10 AD     1996 f      55      64               1
# ℹ 24,042 more rows

End of aside.

Total tuberculosis cases by year (some of the years)

tb_tidy %>%
  filter(between(year, 1991, 1998)) %>% 
  group_by(year) %>% 
  summarize(total_freq = sum(frequency))
# A tibble: 8 × 2
   year total_freq
  <dbl>      <dbl>
1  1991        544
2  1992        512
3  1993        492
4  1994        750
5  1995     513971
6  1996     635705
7  1997     733204
8  1998     840389
  • Something very interesting happened between 1994 and 1995.

To find out what

  • try counting up total cases by country:
tb_tidy %>% 
  group_by(iso2) %>% 
  summarize(total_freq = sum(frequency)) %>% 
  arrange(desc(total_freq))
# A tibble: 213 × 2
   iso2  total_freq
   <chr>      <dbl>
 1 CN       4065174
 2 IN       3966169
 3 ID       1129015
 4 ZA        900349
 5 BD        758008
 6 VN        709695
 7 CD        603095
 8 PH        490040
 9 BR        440609
10 KE        431523
# ℹ 203 more rows

What years do I have for China?

China started recording in 1995, which is at least part of the problem:

tb_tidy %>% filter(iso2 == "CN") %>% 
  group_by(year) %>% 
  summarize(total_freq = sum(frequency))
# A tibble: 14 × 2
    year total_freq
   <dbl>      <dbl>
 1  1995     131194
 2  1996     168270
 3  1997     195895
 4  1998     214404
 5  1999     212258
 6  2000     213766
 7  2001     212766
 8  2002     194972
 9  2003     267280
10  2004     384886
11  2005     472719
12  2006     468291
13  2007     465877
14  2008     462596

First year of recording by country?

  • A lot of countries started recording in about 1995, in fact:
tb_tidy %>% group_by(iso2) %>% 
  summarize(first_year = min(year)) %>% 
  count(first_year)
# A tibble: 14 × 2
   first_year     n
        <dbl> <int>
 1       1980     2
 2       1994     2
 3       1995   130
 4       1996    31
 5       1997    17
 6       1998     6
 7       1999    10
 8       2000     4
 9       2001     1
10       2002     3
11       2003     2
12       2004     2
13       2005     2
14       2007     1

Comment

  • So the reason for the big jump in cases is that so many countries started recording then, not that there really were more cases.

Some Toronto weather data

my_url <- "http://datafiles.ritsokiguess.site/toronto_weather.csv"
weather <- read_csv(my_url)
Rows: 24 Columns: 35
── Column specification ────────────────────────────────────────────────
Delimiter: ","
chr  (3): station, Month, element
dbl (32): Year, d01, d02, d03, d04, d05, d06, d07, d08, d09, d10, d1...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weather
# A tibble: 24 × 35
   station  Year Month element   d01   d02   d03   d04   d05   d06   d07
   <chr>   <dbl> <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 TORONT…  2018 01    tmax     -7.9  -7.1  -5.3  -7.7 -14.7 -15.4  -1  
 2 TORONT…  2018 01    tmin    -18.6 -12.5 -11.2 -19.7 -20.6 -22.3 -17.5
 3 TORONT…  2018 02    tmax      5.6  -8.6   0.4   1.8  -6.6  -3.2  -4.1
 4 TORONT…  2018 02    tmin     -8.9 -15    -9.7  -8.8 -12    -8.2  -8.7
 5 TORONT…  2018 03    tmax     NA    NA    NA    NA    NA    NA     3.1
 6 TORONT…  2018 03    tmin     NA    -0.5  NA    -3.1  NA    -1.4   0.4
 7 TORONT…  2018 04    tmax      4.5   6.5   5     5.7   2.9   5.4   2  
 8 TORONT…  2018 04    tmin     -2.6  -1.2   2.4  -3.2  -3.9  -2.6  -4.4
 9 TORONT…  2018 05    tmax     23.5  26.3  23    24    24.1  17.4  15.9
10 TORONT…  2018 05    tmin      8.5  14.4  11.4   9.2   8.5  13.3  10.6
# ℹ 14 more rows
# ℹ 24 more variables: d08 <dbl>, d09 <dbl>, d10 <dbl>, d11 <dbl>,
#   d12 <dbl>, d13 <dbl>, d14 <dbl>, d15 <dbl>, d16 <dbl>, d17 <dbl>,
#   d18 <dbl>, d19 <dbl>, d20 <dbl>, d21 <dbl>, d22 <dbl>, d23 <dbl>,
#   d24 <dbl>, d25 <dbl>, d26 <dbl>, d27 <dbl>, d28 <dbl>, d29 <dbl>,
#   d30 <dbl>, d31 <dbl>

The columns

  • Daily weather records for “Toronto City” weather station in 2018:

    • station: identifier for this weather station (always same here)
    • Year, Month
    • element: whether temperature given was daily max or daily min
    • d01, d02,… d31: day of the month from 1st to 31st.

Off we go

Numbers in data frame all temperatures (for different days of the month), so first step is

weather %>% 
  pivot_longer(d01:d31, names_to="day", 
               values_to="temperature", 
               values_drop_na = TRUE)
# A tibble: 703 × 6
   station       Year Month element day   temperature
   <chr>        <dbl> <chr> <chr>   <chr>       <dbl>
 1 TORONTO CITY  2018 01    tmax    d01          -7.9
 2 TORONTO CITY  2018 01    tmax    d02          -7.1
 3 TORONTO CITY  2018 01    tmax    d03          -5.3
 4 TORONTO CITY  2018 01    tmax    d04          -7.7
 5 TORONTO CITY  2018 01    tmax    d05         -14.7
 6 TORONTO CITY  2018 01    tmax    d06         -15.4
 7 TORONTO CITY  2018 01    tmax    d07          -1  
 8 TORONTO CITY  2018 01    tmax    d08           3  
 9 TORONTO CITY  2018 01    tmax    d09           1.6
10 TORONTO CITY  2018 01    tmax    d10           5.9
# ℹ 693 more rows

Element

  • Column element contains names of two different variables, that should each be in separate column.
  • Distinct from eg. m1524 in tuberculosis data, that contained levels of two different factors, handled by separate.
  • Untangling names of variables handled by pivot_wider.

Handling element

weather %>%
  pivot_longer(d01:d31, names_to="day", 
               values_to="temperature", 
               values_drop_na = TRUE) %>% 
  pivot_wider(names_from=element, 
                values_from=temperature) 
# A tibble: 355 × 6
   station       Year Month day    tmax  tmin
   <chr>        <dbl> <chr> <chr> <dbl> <dbl>
 1 TORONTO CITY  2018 01    d01    -7.9 -18.6
 2 TORONTO CITY  2018 01    d02    -7.1 -12.5
 3 TORONTO CITY  2018 01    d03    -5.3 -11.2
 4 TORONTO CITY  2018 01    d04    -7.7 -19.7
 5 TORONTO CITY  2018 01    d05   -14.7 -20.6
 6 TORONTO CITY  2018 01    d06   -15.4 -22.3
 7 TORONTO CITY  2018 01    d07    -1   -17.5
 8 TORONTO CITY  2018 01    d08     3    -1.7
 9 TORONTO CITY  2018 01    d09     1.6  -0.6
10 TORONTO CITY  2018 01    d10     5.9  -1.3
# ℹ 345 more rows

Further improvements

  • We have tidy data now, but can improve things further.
  • Station name has no value to us.
  • Would like to make actual dates.
  • Our pivot_longer trick works again to get rid of the “d” on the day number.

Further improvements

weather %>%
  pivot_longer(d01:d31, 
               names_to = c("ddd", "Day"), 
               names_sep = 1,
               values_to = "temperature", 
               values_drop_na = TRUE) %>% 
  pivot_wider(names_from = element, 
              values_from = temperature) %>% 
  select(-station)

Result

# A tibble: 355 × 6
    Year Month ddd   Day    tmax  tmin
   <dbl> <chr> <chr> <chr> <dbl> <dbl>
 1  2018 01    d     01     -7.9 -18.6
 2  2018 01    d     02     -7.1 -12.5
 3  2018 01    d     03     -5.3 -11.2
 4  2018 01    d     04     -7.7 -19.7
 5  2018 01    d     05    -14.7 -20.6
 6  2018 01    d     06    -15.4 -22.3
 7  2018 01    d     07     -1   -17.5
 8  2018 01    d     08      3    -1.7
 9  2018 01    d     09      1.6  -0.6
10  2018 01    d     10      5.9  -1.3
# ℹ 345 more rows

Final step(s)

  • Make year-month-day into proper date.
  • Keep only date, tmax, tmin:
weather %>%
  pivot_longer(d01:d31, 
               names_to = c("ddd", "Day"), 
               names_sep = 1,
               values_to = "temperature", 
               values_drop_na = TRUE) %>% 
  pivot_wider(names_from = element, 
              values_from = temperature) %>% 
  select(-station) %>% 
  unite(datestr, c(Year, Month, Day), sep = "-") %>%
  mutate(date = as.Date(datestr)) %>%
  select(date, tmax, tmin) -> weather_tidy

Our tidy data frame

weather_tidy
# A tibble: 355 × 3
   date        tmax  tmin
   <date>     <dbl> <dbl>
 1 2018-01-01  -7.9 -18.6
 2 2018-01-02  -7.1 -12.5
 3 2018-01-03  -5.3 -11.2
 4 2018-01-04  -7.7 -19.7
 5 2018-01-05 -14.7 -20.6
 6 2018-01-06 -15.4 -22.3
 7 2018-01-07  -1   -17.5
 8 2018-01-08   3    -1.7
 9 2018-01-09   1.6  -0.6
10 2018-01-10   5.9  -1.3
# ℹ 345 more rows

Plotting the temperatures

  • Plot temperature against date joined by lines, but with separate lines for max and min. ggplot requires something like
ggplot(..., aes(x = date, y = temperature)) + geom_point() + 
  geom_line()

only we have two temperatures, one a max and one a min, that we want to keep separate.

  • The trick: combine tmax and tmin together into one column, keeping track of what kind of temp they are. (This actually same format as untidy weather.) Are making weather_tidy untidy for purposes of drawing graph only.
  • Then can do something like
ggplot(d, aes(x = date, y = temperature, colour = maxmin)) 
  + geom_point() + geom_line()

to distinguish max and min on graph.

Setting up plot

  • Since we only need data frame for plot, we can do the column-creation and plot in a pipeline.
  • For a ggplot in a pipeline, the initial data frame is omitted, because it is whatever came out of the previous step.
  • To make those “one column”s: pivot_longer. I save the graph to show overleaf:
weather_tidy %>%
  pivot_longer(tmax:tmin, names_to="maxmin", 
               values_to="temperature") %>%
  ggplot(aes(x = date, y = temperature, colour = maxmin)) + 
      geom_point() + geom_line() -> g

The plot

g
Warning: Removed 7 rows containing missing values or values outside the scale
range (`geom_point()`).