Recoding, replacing, and looking up

Packages and data

library(tidyverse)
my_url <- "http://datafiles.ritsokiguess.site/ais.txt"
athletes <- read_tsv(my_url)

Recoding

When you want to create a new column based on an old column, in a way beyond a simple formula, you are “recoding”
If you recode a categorical variable, recode_values will do the job for you.
Make numeric codes: female = 1, male = 2:

athletes %>% 
  mutate(sex_code = recode_values(
    Sex,
    "female" ~ 1,
    "male"   ~ 2
  )) %>% 
  select(Sex, sex_code) -> d

The result

d %>% slice_sample(n = 10)

Recoding another categorical variable

Categorize the sports by where they are played: a court, as part of track and field, on or in water:

athletes %>% 
  mutate(where_played = recode_values(
    Sport,
    c("BBall", "Netball", "Tennis") ~ "court",
    c("Row", "Swim", "WPolo")       ~ "water",
    c("Field", "T400m", "TSprnt")   ~ "track&field",
    default                         = Sport
  )) %>% 
  select(Sport, where_played) -> d

The result

d %>% slice_sample(n = 10)

Recoding a quantitative variable 1/2

for example, creating a new column categorizing the RCC value as low or high, use case_when:

athletes %>% 
  mutate(rcc_status = case_when(
    RCC < 4.5  ~ "rcc_low",
    RCC >= 4.5 ~ "rcc_high"
  )) %>% 
  select(RCC, rcc_status) -> d

Result 1

d %>% slice_sample(n = 10)

Recoding a quantitative variable 2/2

There is also between, if we instead wished to do this:

athletes %>% 
  mutate(rcc_status = case_when(
    RCC < 4.0              ~ "rcc_low",
    between(RCC, 4.0, 4.5) ~ "rcc_medium",
    RCC > 4.5              ~ "rcc_high"
  )) %>% 
  select(RCC, rcc_status) -> d

Result 2

d %>% slice_sample(n = 10)

Replacing a few values in a column

To make a few small changes to a column (rather than create a whole new column):
- replace_values (like recode_values)
- replace_when (like case_when).
These change a few values and leave everything else unchanged.

Replacing a few values in a categorical column

For the athletes, let’s change the two track running events to “Track”:

athletes %>% 
  mutate(Sport = replace_values(
    Sport,
    c("T400m", "TSprnt") ~ "Track"
    )) -> d

The result

d %>% slice_sample(n = 10)

Making small changes in a quantitative column

Let’s suppose we know that an RCC value less than 4 cannot be correct, and we want to replace all such values by NA (“missing”). Suppose the highest possible value is 4.5, and we also want to replace values higher than that by 4.5. This is what replace_when is for:

athletes %>% 
  mutate(RCC = replace_when(
    RCC,
    RCC < 4   ~ NA,
    RCC > 4.5 ~ 4.5
  )) -> d

The result

d %>% slice_sample(n = 10)

Making small changes to columns using a lookup table

Example: 12 cities and states, but some of the states are wrong (should not be Wisconsin):

my_url <- "https://datafiles.ritsokiguess.site/wisc_wrong.csv"
wisc <- read_csv(my_url)

The dataframe

wisc

Setting up corrections

Some of these cities are in the wrong state: Dubuque is in Iowa (IA), St. Paul in Minnesota (MN), and Chicago is in Illinois (IL). First make a “lookup table” with the corrections we want to make:

corrections <- tribble(
  ~location, ~state,
  "Dubuque", "IA",
  "St.Paul", "MN",
  "Chicago", "IL"
)
corrections

Note: columns of this dataframe have same names as the ones in wisc.

Applying the corrections

wisc %>% 
  rows_update(corrections, by = "location")

Addendum: does this work with `replace_when` ?

Yes, but you have to specify the corrections by hand:

wisc %>% 
  mutate(state = replace_when(
    state,
    location == "Dubuque" ~ "IA",
    location == "St.Paul" ~ "MN",
    location == "Chicago" ~ "IL"
  )) -> d

The result

Looking things up in another data frame

Suppose you are working in the nails department of a hardware store and you find that you have sold these items:

my_url <- "http://ritsokiguess.site/datafiles/nail_sales.csv"
sales <- read_csv(my_url)
sales

Product descriptions and prices

but you don’t remember what these product codes are, and you would like to know the total revenue from these sales.
Fortunately you found a list of product descriptions and prices:

my_url <- "http://ritsokiguess.site/datafiles/nail_desc.csv"
desc <- read_csv(my_url)
desc

The lookup

How do you “look up” the product codes to find the product descriptions and prices?
left_join.

sales %>% left_join(desc)

What we have

this looks up all the rows in the first dataframe that are also in the second.
by default matches all columns with same name in two dataframes (product_code here)
get all columns in both dataframes. The rows are the ones for that product_code.

So now can work out how much the total revenue was:

sales %>% left_join(desc) %>% 
  mutate(product_revenue = sales*price) %>% 
  summarize(total_revenue = sum(product_revenue))

More comments

if any product codes are not matched, you get NA in the added columns
anything in the second dataframe that was not in the first does not appear (here, any products that were not sold)
other variations (examples follow):
- if there are two columns with the same name in the two dataframes, and you only want to match on one, use by with one column name
- if the columns you want to look up have different names in the two dataframes, use by with a “named list”

Matching on only some matching names

Suppose the sales dataframe also had a column qty (which was the quantity sold):

sales %>% rename("qty"="sales") -> sales1
sales1

The qty in sales1 is the quantity sold, but the qty in desc is the number of nails in a package. These should not be matched: they are different things.

Matching only on product code

sales1 %>% 
  left_join(desc, join_by(product_code))

Get qty.x (from sales1) and qty.y (from desc).

Matching on different names 1/2

Suppose the product code in sales was just code:

sales %>% rename("code" = "product_code") -> sales2
sales2

How to match the two product codes that have different names?

Matching on different names 2/2

Use join_by, but like this:

sales2 %>% 
  left_join(desc, join_by(code == product_code))

Other types of join

right_join: interchanges roles, looking up keys from second dataframe in first.
anti_join: give me all the rows in the first dataframe that are not in the second. (Use this eg. to see whether the product descriptions are incomplete.)
full_join: give me all the rows in both dataframes, with missings as needed.

Full join here

sales %>% full_join(desc)

The missing sales for “masonry nail” says that it was in the lookup table desc, but we didn’t sell any.

The same thing, but with `anti_join`

Anything in first df but not in second?

desc %>% anti_join(sales)

Masonry nails are the only thing in our product description file that we did not sell any of.

The other way around

sales %>% anti_join(desc)

There was nothing we sold that was not in the description file.

Recoding, replacing, and looking up

Packages and data

Recoding

The result

Recoding another categorical variable

The result

Recoding a quantitative variable 1/2

Result 1

Recoding a quantitative variable 2/2

Result 2

Replacing a few values in a column

Replacing a few values in a categorical column

The result

Making small changes in a quantitative column

The result

Making small changes to columns using a lookup table

The dataframe

Setting up corrections

Applying the corrections

Addendum: does this work with replace_when ?

The result

Looking things up in another data frame

Product descriptions and prices

The lookup

What we have

More comments

Matching on only some matching names

Matching only on product code

Matching on different names 1/2

Matching on different names 2/2

Other types of join

Full join here

The same thing, but with anti_join

The other way around

Addendum: does this work with `replace_when` ?

The same thing, but with `anti_join`