Recoding, replacing, and looking up

Packages and data

library(tidyverse)
my_url <- "http://datafiles.ritsokiguess.site/ais.txt"
athletes <- read_tsv(my_url)

Recoding

  • When you want to create a new column based on an old column, in a way beyond a simple formula, you are “recoding”
  • If you recode a categorical variable, recode_values will do the job for you.
  • Make numeric codes: female = 1, male = 2:
athletes %>% 
  mutate(sex_code = recode_values(
    Sex,
    "female" ~ 1,
    "male"   ~ 2
  )) %>% 
  select(Sex, sex_code) -> d

The result

d %>% slice_sample(n = 10)

Recoding another categorical variable

Categorize the sports by where they are played: a court, as part of track and field, on or in water:

athletes %>% 
  mutate(where_played = recode_values(
    Sport,
    c("BBall", "Netball", "Tennis") ~ "court",
    c("Row", "Swim", "WPolo")       ~ "water",
    c("Field", "T400m", "TSprnt")   ~ "track&field",
    default                         = Sport
  )) %>% 
  select(Sport, where_played) -> d

The result

d %>% slice_sample(n = 10)

Recoding a quantitative variable 1/2

for example, creating a new column categorizing the RCC value as low or high, use case_when:

athletes %>% 
  mutate(rcc_status = case_when(
    RCC < 4.5  ~ "rcc_low",
    RCC >= 4.5 ~ "rcc_high"
  )) %>% 
  select(RCC, rcc_status) -> d

Result 1

d %>% slice_sample(n = 10)

Recoding a quantitative variable 2/2

There is also between, if we instead wished to do this:

athletes %>% 
  mutate(rcc_status = case_when(
    RCC < 4.0              ~ "rcc_low",
    between(RCC, 4.0, 4.5) ~ "rcc_medium",
    RCC > 4.5              ~ "rcc_high"
  )) %>% 
  select(RCC, rcc_status) -> d

Result 2

d %>% slice_sample(n = 10)

Replacing a few values in a column

  • To make a few small changes to a column (rather than create a whole new column):
    • replace_values (like recode_values)
    • replace_when (like case_when).
  • These change a few values and leave everything else unchanged.

Replacing a few values in a categorical column

For the athletes, let’s change the two track running events to “Track”:

athletes %>% 
  mutate(Sport = replace_values(
    Sport,
    c("T400m", "TSprnt") ~ "Track"
    )) -> d

The result

d %>% slice_sample(n = 10)

Making small changes in a quantitative column

Let’s suppose we know that an RCC value less than 4 cannot be correct, and we want to replace all such values by NA (“missing”). Suppose the highest possible value is 4.5, and we also want to replace values higher than that by 4.5. This is what replace_when is for:

athletes %>% 
  mutate(RCC = replace_when(
    RCC,
    RCC < 4   ~ NA,
    RCC > 4.5 ~ 4.5
  )) -> d

The result

d %>% slice_sample(n = 10)

Making small changes to columns using a lookup table

Example: 12 cities and states, but some of the states are wrong (should not be Wisconsin):

my_url <- "https://datafiles.ritsokiguess.site/wisc_wrong.csv"
wisc <- read_csv(my_url)

The dataframe

wisc

Setting up corrections

Some of these cities are in the wrong state: Dubuque is in Iowa (IA), St. Paul in Minnesota (MN), and Chicago is in Illinois (IL). First make a “lookup table” with the corrections we want to make:

corrections <- tribble(
  ~location, ~state,
  "Dubuque", "IA",
  "St.Paul", "MN",
  "Chicago", "IL"
)
corrections

Note: columns of this dataframe have same names as the ones in wisc.

Applying the corrections

wisc %>% 
  rows_update(corrections, by = "location")

Addendum: does this work with replace_when ?

Yes, but you have to specify the corrections by hand:

wisc %>% 
  mutate(state = replace_when(
    state,
    location == "Dubuque" ~ "IA",
    location == "St.Paul" ~ "MN",
    location == "Chicago" ~ "IL"
  )) -> d

The result

d

Looking things up in another data frame

  • Suppose you are working in the nails department of a hardware store and you find that you have sold these items:
my_url <- "http://ritsokiguess.site/datafiles/nail_sales.csv"
sales <- read_csv(my_url)
sales

Product descriptions and prices

  • but you don’t remember what these product codes are, and you would like to know the total revenue from these sales.

  • Fortunately you found a list of product descriptions and prices:

my_url <- "http://ritsokiguess.site/datafiles/nail_desc.csv"
desc <- read_csv(my_url)
desc

The lookup

  • How do you “look up” the product codes to find the product descriptions and prices?
  • left_join.
sales %>% left_join(desc)

What we have

  • this looks up all the rows in the first dataframe that are also in the second.
  • by default matches all columns with same name in two dataframes (product_code here)
  • get all columns in both dataframes. The rows are the ones for that product_code.

So now can work out how much the total revenue was:

sales %>% left_join(desc) %>% 
  mutate(product_revenue = sales*price) %>% 
  summarize(total_revenue = sum(product_revenue))

More comments

  • if any product codes are not matched, you get NA in the added columns
  • anything in the second dataframe that was not in the first does not appear (here, any products that were not sold)
  • other variations (examples follow):
    • if there are two columns with the same name in the two dataframes, and you only want to match on one, use by with one column name
    • if the columns you want to look up have different names in the two dataframes, use by with a “named list”

Matching on only some matching names

  • Suppose the sales dataframe also had a column qty (which was the quantity sold):
sales %>% rename("qty"="sales") -> sales1
sales1
  • The qty in sales1 is the quantity sold, but the qty in desc is the number of nails in a package. These should not be matched: they are different things.

Matching only on product code

sales1 %>% 
  left_join(desc, join_by(product_code))
  • Get qty.x (from sales1) and qty.y (from desc).

Matching on different names 1/2

  • Suppose the product code in sales was just code:
sales %>% rename("code" = "product_code") -> sales2
sales2
  • How to match the two product codes that have different names?

Matching on different names 2/2

  • Use join_by, but like this:
sales2 %>% 
  left_join(desc, join_by(code == product_code))

Other types of join

  • right_join: interchanges roles, looking up keys from second dataframe in first.
  • anti_join: give me all the rows in the first dataframe that are not in the second. (Use this eg. to see whether the product descriptions are incomplete.)
  • full_join: give me all the rows in both dataframes, with missings as needed.

Full join here

sales %>% full_join(desc)
  • The missing sales for “masonry nail” says that it was in the lookup table desc, but we didn’t sell any.

The same thing, but with anti_join

Anything in first df but not in second?

desc %>% anti_join(sales)

Masonry nails are the only thing in our product description file that we did not sell any of.

The other way around

sales %>% anti_join(desc)

There was nothing we sold that was not in the description file.