Tidying data

Data rarely come to us as we want to use them.
Before we can do an analysis, we typically have some organizing to do.
This is typical of ANOVA-type data, “wide format”:

     pig feed1 feed2 feed3 feed4
       1  60.8  68.7  92.6  87.9
       2  57.0  67.7  92.1  84.2
       3  65.0  74.0  90.2  83.1
       4  58.6  66.3  96.5  85.7
       5  61.7  69.8  99.1  90.3

20 pigs randomly allocated to one of four feeds. At end of study, weight of each pig is recorded.
Are there any differences in mean weights among the feeds?
Problem: want all weights in one column, with 2nd column labelling which feed. Untidy!

Tidy and untidy data (Wickham)

Data set easier to deal with if:
- each observation is one row
- each variable is one column
- each type of observation unit is one table
Data arranged this way called “tidy”; otherwise called “untidy”.
For the pig data:
- response variable is weight, but scattered over 4 columns, which are levels of a factor feed.
- Want all the weights in one column, with a second column feed saying which feed that weight goes with.
- Then we can run aov.

Packages for this section

library(tidyverse)

Reading in the pig data

my_url <- "http://datafiles.ritsokiguess.site/pigs1.txt"
pigs1 <- read_delim(my_url, " ")
pigs1

Making it longer

We wanted all the weights in one column, labelled by which feed they went with.
This is a very common reorganization, and the magic “verb” is pivot_longer:

pigs1 %>% pivot_longer(feed1:feed4, names_to="feed", 
                       values_to="weight") -> pigs2

The long dataframe `pigs2`

Inputs to `pivot_longer`:

columns to combine
a name for column that will contain groups (“names”)
a name for column that will contain measurements (“values”)

Alternatives

Any way of choosing the columns to pivot longer is good, eg:

pigs1 %>% pivot_longer(-pig, 
                       names_to = "feed", 
                       values_to = "weight") -> pigs2

pigs1 %>% pivot_longer(starts_with("feed"), 
                       names_to = "feed", 
                       values_to = "weight") -> pigs2

pigs2 now in “long” format, ready for analysis.
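A minimal sketch of the whole wide-to-long-to-analysis path, typing in the five rows from the listing at the top of this section instead of reading the file. The names `pigs_wide`, `pigs_long` and `pigs_aov` are made up for this illustration, and the model formula `weight ~ feed` is an assumption about what "run aov" means here:

```r
library(tidyverse)

# The wide-format values shown at the start of this section
pigs_wide <- tribble(
  ~pig, ~feed1, ~feed2, ~feed3, ~feed4,
  1, 60.8, 68.7, 92.6, 87.9,
  2, 57.0, 67.7, 92.1, 84.2,
  3, 65.0, 74.0, 90.2, 83.1,
  4, 58.6, 66.3, 96.5, 85.7,
  5, 61.7, 69.8, 99.1, 90.3
)

# Three inputs: columns to combine, name for the groups column,
# name for the measurements column
pigs_long <- pigs_wide %>%
  pivot_longer(feed1:feed4, names_to = "feed", values_to = "weight")
pigs_long

# One column of weights and one of feeds is the layout aov expects
# (formula is an assumption, not taken from the original)
pigs_aov <- aov(weight ~ feed, data = pigs_long)
summary(pigs_aov)
```

The 5 rows by 4 feed columns become 20 rows, one per pig.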

Tuberculosis

The World Health Organization keeps track of number of cases of various diseases, eg. tuberculosis.
Some data:

my_url <- "http://datafiles.ritsokiguess.site/tb.csv"
tb <- read_csv(my_url)

The data (10 randomly chosen rows)

tb

Many rows and columns

nrow(tb)

[1] 5769

ncol(tb)

[1] 22

What we have

Variables: country (abbreviated), year. Then number of cases for each gender and age group, eg. m1524 is males aged 15–24. Also mu and fu, where age is unknown.
Lots of missing values, which we want to get rid of.
Abbreviations here.

tb %>% 
  pivot_longer(m04:fu, 
               names_to = "genage", 
               values_to = "freq", 
               values_drop_na = TRUE)

Code for pivot_longer:
- columns to make longer
- column to contain the names (categorical)
- column to contain the values (quantitative)
- drop missings in the values

Results (some)

tb %>% 
  pivot_longer(m04:fu, 
               names_to = "genage", 
               values_to = "freq", 
               values_drop_na = TRUE)

Examine

Not quite right, though:
- column genage contains both gender and age
- we want two columns, one containing gender and the other containing age group.
Idea:
- put two things in names_to
- then add a names_sep to say where one ends and the other starts: in this case, after the first character, so names_sep is the number 1.

The improved `pivot_longer`:

tb %>% 
  pivot_longer(m04:fu,
               names_to = c("gender", "age"),
               names_sep = 1,
               values_to = "frequency",
               values_drop_na = TRUE)

Tip: the number of things in names_sep should be one fewer than the number of things in names_to (if you have two things to separate, you need one thing to separate them with).
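To see what a numeric names_sep does, here is a small sketch on a made-up data frame. The data frame `toy` and all its counts are hypothetical (not WHO data); only the column-name pattern is copied from the tuberculosis data:

```r
library(tidyverse)

# Hypothetical counts, for illustration only
toy <- tribble(
  ~iso2, ~year, ~m04, ~m1524, ~f04, ~fu,
  "AA",  2000,  3,    10,     2,    NA,
  "BB",  2000,  NA,   7,      1,    4
)

toy_long <- toy %>%
  pivot_longer(m04:fu,
               names_to = c("gender", "age"),
               names_sep = 1,            # split after the 1st character
               values_to = "frequency",
               values_drop_na = TRUE)    # the two NA cells disappear
toy_long
```

Two names in names_to, so one split position in names_sep: `m1524` becomes gender `m` and age `1524`, and `fu` becomes gender `f` and age `u`.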

… with result

Save it

This looks tidy, so save it:

tb %>% 
  pivot_longer(m04:fu,
               names_to = c("gender", "age"),
               names_sep = 1,
               values_to = "frequency",
               values_drop_na = TRUE) -> tb_tidy

Comments

You can split the R code over as many lines as you like, as long as each line is incomplete, so that R knows more is to come.
I like to put the pipe symbol on the end of the line.
Sometimes one function call gets very long, in which case “one thing per line” is often the easiest to read.

Aside

If we knew the age groups were always four digits (first two as the lower age limit, last two as upper), we could do even better (but in real data we were not so lucky):

tb %>% 
  select(iso2, year, m1524:m5564, f1524:f5564) -> tb_aside
tb_aside

Get gender and lower and upper ends of age group:

tb_aside %>% 
  pivot_longer(m1524:f5564,
               names_to = c("gender", "age_low", "age_high"),
               names_sep = c(1, 3),
               values_to = "frequency",
               values_drop_na = TRUE)

… with result
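The effect of a two-element names_sep can be sketched on made-up data. The data frame `toy_aside` and its counts are hypothetical, and the final `mutate` that converts the age limits to numbers is an extra step added here, not part of the original code:

```r
library(tidyverse)

# Hypothetical counts, for illustration only
toy_aside <- tribble(
  ~iso2, ~year, ~m1524, ~m2534, ~f1524, ~f2534,
  "AA",  2000,  10,     12,     8,      9
)

toy_ages <- toy_aside %>%
  pivot_longer(m1524:f2534,
               names_to = c("gender", "age_low", "age_high"),
               names_sep = c(1, 3),      # split after characters 1 and 3
               values_to = "frequency") %>%
  # the pieces come out as text; convert the age limits to numbers
  mutate(age_low = as.numeric(age_low),
         age_high = as.numeric(age_high))
toy_ages
```

Three names in names_to need two split positions: `m1524` is cut into `m`, `15` and `24`.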

End of aside.

Total tuberculosis cases by year (some of the years)

tb_tidy %>%
  filter(between(year, 1991, 1998)) %>% 
  group_by(year) %>% 
  summarize(total_freq = sum(frequency))

Something very interesting happened between 1994 and 1995.

To find out what

try counting up total cases by country:

tb_tidy %>% 
  group_by(iso2) %>% 
  summarize(total_freq = sum(frequency)) %>% 
  arrange(desc(total_freq))

What years do I have for China?

China started recording in 1995, which is at least part of the problem:

tb_tidy %>% filter(iso2 == "CN") %>% 
  group_by(year) %>% 
  summarize(total_freq = sum(frequency))

First year of recording by country?

A lot of countries started recording in about 1995, in fact:

tb_tidy %>% group_by(iso2) %>% 
  summarize(first_year = min(year)) %>% 
  count(first_year)

Comment

So the reason for the big jump in cases is that so many countries started recording then, not that there really were more cases.
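The same reasoning on a tiny made-up example, where the jump in the yearly total is caused entirely by new countries starting to report. The data frame `toy_tidy`, its country codes and its counts are all hypothetical:

```r
library(tidyverse)

# Hypothetical tidy data: AA reports from 1994, BB and CC only from 1995
toy_tidy <- tribble(
  ~iso2, ~year, ~frequency,
  "AA",  1994,  100,
  "AA",  1995,  110,
  "BB",  1995,  500,
  "BB",  1996,  520,
  "CC",  1995,  300
)

# Totals by year: a big jump from 1994 to 1995
by_year <- toy_tidy %>%
  group_by(year) %>%
  summarize(total_freq = sum(frequency))
by_year

# First year of recording for each country, then how many started when
first_years <- toy_tidy %>%
  group_by(iso2) %>%
  summarize(first_year = min(year)) %>%
  count(first_year)
first_years
```

Country AA hardly changes between 1994 and 1995; the total rises because BB and CC appear.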

Some Toronto weather data

my_url <- "http://datafiles.ritsokiguess.site/toronto_weather.csv"
weather <- read_csv(my_url)
weather

The columns

Daily weather records for “Toronto City” weather station in 2018:
- station: identifier for this weather station (always same here)
- Year, Month
- element: whether temperature given was daily max or daily min
- d01, d02,… d31: day of the month from 1st to 31st.

Off we go

Numbers in data frame all temperatures (for different days of the month), so first step is

weather %>% 
  pivot_longer(d01:d31, names_to="day", 
               values_to="temperature", 
               values_drop_na = TRUE)

`Element`

Column element contains names of two different variables, which should each be in their own column.
Distinct from eg. m1524 in tuberculosis data, which contained levels of two different factors, handled by names_sep in pivot_longer.
Untangling names of variables is handled by pivot_wider.

Handling `element`

weather %>%
  pivot_longer(d01:d31, names_to="day", 
               values_to="temperature", 
               values_drop_na = TRUE) %>% 
  pivot_wider(names_from=element, 
                values_from=temperature)
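A sketch of the longer-then-wider idea on two made-up rows, so the intermediate shapes are easy to see. The data frame `toy_weather`, the station code and the temperatures are hypothetical; only the column layout follows the weather data:

```r
library(tidyverse)

# Hypothetical: one month, two days, one row per element
toy_weather <- tribble(
  ~station, ~Year, ~Month, ~element, ~d01,  ~d02,
  "S1",     2018,  "01",   "tmax",   -7.9,  -7.1,
  "S1",     2018,  "01",   "tmin",   -18.6, -12.5
)

# Step 1: all temperatures in one column, labelled by day
toy_long <- toy_weather %>%
  pivot_longer(d01:d02, names_to = "day",
               values_to = "temperature",
               values_drop_na = TRUE)
toy_long

# Step 2: element holds variable names, so spread it into columns
toy_wide <- toy_long %>%
  pivot_wider(names_from = element,
              values_from = temperature)
toy_wide
```

After step 1 there are four rows (two days times two elements); after step 2 there is one row per day, with tmax and tmin side by side.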

Further improvements

We have tidy data now, but can improve things further.
Station name has no value to us.
Would like to make actual dates.
Our names_sep trick in pivot_longer works again to get rid of the “d” on the day number.

Further improvements

weather %>%
  pivot_longer(d01:d31, 
               names_to = c("ddd", "Day"), 
               names_sep = 1,
               values_to = "temperature", 
               values_drop_na = TRUE) %>% 
  pivot_wider(names_from = element, 
              values_from = temperature) %>% 
  select(-station)

Result

Final step(s)

Make year-month-day into proper date.
Keep only date, tmax, tmin:

weather %>%
  pivot_longer(d01:d31, 
               names_to = c("ddd", "Day"), 
               names_sep = 1,
               values_to = "temperature", 
               values_drop_na = TRUE) %>% 
  pivot_wider(names_from = element, 
              values_from = temperature) %>% 
  select(-station) %>% 
  unite(datestr, c(Year, Month, Day), sep = "-") %>%
  mutate(date = as.Date(datestr)) %>%
  select(date, tmax, tmin) -> weather_tidy
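The date-making step on its own, as a sketch on two made-up rows that are already in the shape reached after `select(-station)`. The data frame `toy_days` and its temperatures are hypothetical:

```r
library(tidyverse)

# Hypothetical rows: year, month and day still in separate columns
toy_days <- tribble(
  ~Year, ~Month, ~Day, ~tmax, ~tmin,
  2018,  "01",   "01", -7.9,  -18.6,
  2018,  "01",   "02", -7.1,  -12.5
)

toy_dates <- toy_days %>%
  # glue the three pieces into text like "2018-01-01"
  unite(datestr, c(Year, Month, Day), sep = "-") %>%
  # year-month-day text is the format as.Date reads by default
  mutate(date = as.Date(datestr)) %>%
  select(date, tmax, tmin)
toy_dates
```

`unite` removes the three original columns by default, which is why they do not need to be dropped separately.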

Our tidy data frame

weather_tidy

Plotting the temperatures, the “Excel way”

ggplot(weather_tidy, aes(x = date)) +
  geom_line(aes(y = tmin), colour = "blue") +
  geom_line(aes(y = tmax), colour = "red")

The graph

Comments

Here, plotting two “series”, one at a time, rather than one collection of temperatures coloured according to whether they are a max or a min.
I only specify the x in the first aes because the y is going to be different according to what I am plotting.
In each geom_line, I add another aes to say what is different about that geom_line (first one plots min temperatures, second one max)
The colour is outside the aes, because I want all lines blue in the first one, red in second, not coloured by a categorical variable.

Alternatively, the “ggplot” way

I recognize that ggplot works more smoothly with one column of temperatures, with a second column saying what kind of temperatures they are:

weather_tidy %>% 
  pivot_longer(starts_with("t"), 
               names_to = "what_temp",
               values_to = "temperature") %>% 
  ggplot(aes(x = date, y = temperature, 
             colour = what_temp)) +
    geom_line()

The plotting code is much simpler, at the expense of doing some “retidying” first.

The graph

The pig feed data again: pivoting wider

pigs1

Make longer (as before)

pigs1 %>% pivot_longer(-pig, names_to="feed", 
                      values_to="weight") -> pigs_longer
pigs_longer

Make wider two ways 1/2

pivot_wider is inverse of pivot_longer:

pigs_longer %>% 
  pivot_wider(names_from=feed, values_from=weight)

we are back where we started.

Make wider 2/2

pigs_longer %>% 
  pivot_wider(names_from=pig, values_from=weight)

but…

pigs_longer %>% 
  pivot_wider(names_from=pig, values_from=weight) %>% 
  select(2)

this has selected the column called 1, which is the column numbered 2.

To get the column we want

pigs_longer %>% 
  pivot_wider(names_from=pig, values_from=weight) %>% 
  select(`2`)
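The difference between a column number and a column name that happens to be a number, as a small sketch. The data frame `toy_wide` is made up for this illustration, using the feed1 and feed2 weights of pigs 1 and 2 from the listing at the top of the section:

```r
library(tidyverse)

# Columns named `1` and `2` need backticks when typed in code
toy_wide <- tibble(
  feed = c("feed1", "feed2"),
  `1` = c(60.8, 68.7),
  `2` = c(57.0, 67.7)
)

by_position <- toy_wide %>% select(2)    # 2nd column, which is called 1
by_name     <- toy_wide %>% select(`2`)  # the column called 2

names(by_position)
names(by_name)
```

A bare number in `select` means a position; backticks make it a name.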

Disease presence and absence at two locations

Frequencies of plants observed with and without disease at two locations:

Species     Disease present         Disease absent
       Location X Location Y  Location X Location Y
A            44         12          38        10
B            28         22          20        18

This has two rows of headers, so I rewrote the data file:

Species  present_x present_y    absent_x  absent_y
A            44         12          38        10
B            28         22          20        18

Read in

… into data frame called prevalence:

my_url <- 
  "http://datafiles.ritsokiguess.site/disease_prevalence.txt"
prevalence <- read_table(my_url)
prevalence

Comments

the columns we are going to pivot longer encode two things: disease status and location
so we need the version of pivot_longer with two things in names_to, and a names_sep to say what they are separated by.

Making longer

prevalence %>% 
  pivot_longer(-Species, names_to=c("disease", "location"),
               names_sep="_", 
               values_to="frequency") -> prevalence_longer 
prevalence_longer

Making wider, different ways 1/2

prevalence_longer %>% 
  pivot_wider(names_from=c(Species, location), 
              values_from=frequency)

Making wider, different ways 2/2

prevalence_longer %>% 
  pivot_wider(names_from=location, values_from=frequency)

A hairy one

18 people receive one of three treatments. At 3 different times (pre, post, followup) two variables y and z are measured on each person:

my_url <- "http://datafiles.ritsokiguess.site/repmes.txt"
repmes0 <- read_table(my_url)
repmes0

Create unique ids

repmes0 %>% mutate(id=str_c(treatment, ".", rep)) %>% 
  select(-rep) %>% 
  select(id, everything()) -> repmes
repmes

Attempt 1

repmes %>% pivot_longer(contains("_"),
                        names_to=c("time", "var"),
                        names_sep="_",
                        values_to = "vvv"
                         )

Comment

This is too long! We wanted a column called y and a column called z, but they have been pivoted-longer too.

Attempt 2

repmes %>% pivot_longer(contains("_"),
                        names_to=c("time", ".value"),
                        names_sep="_"
                        ) -> repmes3
repmes3

Comment

This has done what we wanted.
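A sketch comparing the two attempts on made-up data. The data frame `toy_rep` and its values are hypothetical, and the column names `pre_y`, `pre_z` and so on are an assumption about how the real file is laid out (time, then underscore, then variable):

```r
library(tidyverse)

# Hypothetical repeated-measures data, two people, two times
toy_rep <- tribble(
  ~id,   ~treatment, ~pre_y, ~pre_z, ~post_y, ~post_z,
  "A.1", "A",        3,      10,     5,       12,
  "B.1", "B",        4,      11,     4,       11
)

# Attempt 1 style: y and z get stacked on top of each other
too_long <- toy_rep %>%
  pivot_longer(contains("_"),
               names_to = c("time", "var"),
               names_sep = "_",
               values_to = "vvv")

# Attempt 2 style: ".value" says the second piece of each name is
# itself the name of a column to create
just_right <- toy_rep %>%
  pivot_longer(contains("_"),
               names_to = c("time", ".value"),
               names_sep = "_")
just_right
```

With `.value` there is no values_to: the new columns take their names (`y` and `z`) from the old column names.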

Make a graph

ggplot(repmes3, aes(x=fct_inorder(time), y=y, 
                    colour=treatment, group = id)) +
  geom_point() + geom_line()

Comment

A so-called “spaghetti plot”:
- The three measurements for each person are joined by lines
- The lines are coloured by treatment.

Or do the plot with means

repmes3 %>% group_by(treatment, 
                     ftime = fct_inorder(time)) %>% 
  summarize(mean_y = mean(y)) %>% 
  ggplot(aes(x = ftime, y = mean_y, 
             colour = treatment, group = treatment)) + 
    geom_point() + geom_line()

The graph of means

Comment

On average, the two real treatments go up and level off
but the control group is very different.

When pivot-wider goes wrong

Some long data that should be wide:

Six observations of variable y, but three measured before some treatment and three measured after.
Really matched pairs, so want one column of y-values for pre and one for post.
This is a job for pivot_wider.

What happens here?

d %>% pivot_wider(names_from = time, values_from = y)

Should be three pre values and three post. Why did this happen?
pivot_wider needs to know which row to put each observation in.
Uses combo of columns not named in pivot_wider, here obs (only).

The problem

d %>% pivot_wider(names_from = time, values_from = y)

There are 6 different obs values, so 6 different rows.
No data for obs B and pre, so that cell missing (NA).
Not enough data (6 obs) to fill 12 (\(= 2 \times 6\)) cells.
obs needs to say which subject provided which 2 observations.
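The problem can be reproduced with a made-up version of the data. The values in this `d` are hypothetical; only the structure (six different obs labels, three pre and three post) follows the description above:

```r
library(tidyverse)

# Hypothetical data: every row has its own obs label
d <- tribble(
  ~obs, ~time,  ~y,
  "A",  "pre",  10,
  "B",  "post", 13,
  "C",  "pre",  12,
  "D",  "post", 15,
  "E",  "pre",  11,
  "F",  "post", 14
)

d_wide <- d %>% pivot_wider(names_from = time, values_from = y)
d_wide
```

Six rows come out, one per obs value, and each row has a value in only one of pre and post; the other cell is NA.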

Fixing it up

d2

column subject shows which subject provided each pre and post.
when we do pivot_wider, now only 3 rows, one per subject.

Coming out right

d2 %>% pivot_wider(names_from = time, values_from = y)

the row each observation goes to is determined by the other column, subject, and now there is a pre and a post for each subject.
this is the right layout for a matched pairs \(t\)-test, or for making differences for a sign test or normal quantile plot.
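A sketch of the fixed layout, again with made-up numbers. The values in this `d2` are hypothetical, and the `diff` column is added here only to illustrate "making differences":

```r
library(tidyverse)

# Hypothetical data: subject says which pre goes with which post
d2 <- tribble(
  ~subject, ~time,  ~y,
  1,        "pre",  10,
  1,        "post", 13,
  2,        "pre",  12,
  2,        "post", 14,
  3,        "pre",  11,
  3,        "post", 16
)

d2_wide <- d2 %>%
  pivot_wider(names_from = time, values_from = y) %>%
  mutate(diff = post - pre)   # one difference per subject
d2_wide
```

Three rows, no missing values, and the differences are ready for whatever matched-pairs analysis comes next.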

Another example

Two independent samples this time:

These should be arranged like this
but what if we make them wider?

Wider

d3 %>% pivot_wider(names_from = group, values_from = y)

row determined by what not used for pivot_wider: nothing!
everything smooshed into one row!
this time, too much data for the layout.
Four data values squeezed into each of the two cells: “list-columns”.
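The list-column behaviour can be seen on made-up data. The group labels and values in this `d3` are hypothetical; the structure (two groups of four values, no other columns) follows the description above:

```r
library(tidyverse)

# Hypothetical two-sample data with nothing to identify rows
d3 <- tribble(
  ~group,      ~y,
  "control",   8,
  "control",   11,
  "control",   13,
  "control",   14,
  "treatment", 12,
  "treatment", 15,
  "treatment", 16,
  "treatment", 17
)

# pivot_wider warns that the values are not uniquely identified
d3_wide <- d3 %>% pivot_wider(names_from = group, values_from = y)
d3_wide

# each cell holds a whole vector of four values
d3_wide$control[[1]]
```

The warning that pivot_wider gives here is the signal that something is missing to tell it which row each value belongs in.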

To make it “work”

make sure everything goes to the right row by explicitly labelling the observations within each group:

d4

Making it work

d4 %>% 
  pivot_wider(names_from = group, values_from = y)
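One possible way to build such a labelled data frame, sketched on made-up data: number the observations within each group with `row_number()`. The values, the group labels and the column name `z` are all assumptions for this illustration, not necessarily how the original `d4` was made:

```r
library(tidyverse)

# Hypothetical two-sample data, as in the previous example
d3 <- tribble(
  ~group,      ~y,
  "control",   8,
  "control",   11,
  "control",   13,
  "control",   14,
  "treatment", 12,
  "treatment", 15,
  "treatment", 16,
  "treatment", 17
)

# Label observations 1, 2, 3, 4 within each group
d4 <- d3 %>%
  group_by(group) %>%
  mutate(z = row_number()) %>%
  ungroup()

d4_wide <- d4 %>%
  pivot_wider(names_from = group, values_from = y)
d4_wide
```

Now the column `z` tells pivot_wider which row each value goes in, so there are four rows and ordinary numeric columns. The pairing across a row is arbitrary, since these are independent samples.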
