--- title: "Recoding, replacing, and looking up" editor: markdown: wrap: 72 --- ## Packages and data ```{r} library(tidyverse) my_url <- "http://datafiles.ritsokiguess.site/ais.txt" athletes <- read_tsv(my_url) ``` ## Recoding - When you want to create a new column based on an old column, in a way beyond a simple formula, you are "recoding" - If you recode a categorical variable, `recode_values` will do the job for you. - Make numeric codes: female = 1, male = 2: ```{r} athletes %>% mutate(sex_code = recode_values( Sex, "female" ~ 1, "male" ~ 2 )) %>% select(Sex, sex_code) -> d ``` ## The result ```{r} d %>% slice_sample(n = 10) ``` ## Recoding another categorical variable Categorize the sports by where they are played: a court, as part of track and field, on or in water: ```{r} athletes %>% mutate(where_played = recode_values( Sport, c("BBall", "Netball", "Tennis") ~ "court", c("Row", "Swim", "WPolo") ~ "water", c("Field", "T400m", "TSprnt") ~ "track&field", default = Sport )) %>% select(Sport, where_played) -> d ``` ## The result ```{r} d %>% slice_sample(n = 10) ``` ## Recoding a quantitative variable 1/2 for example, creating a new column categorizing the `RCC` value as low or high, use `case_when`: ```{r} athletes %>% mutate(rcc_status = case_when( RCC < 4.5 ~ "rcc_low", RCC >= 4.5 ~ "rcc_high" )) %>% select(RCC, rcc_status) -> d ``` ## Result 1 ```{r} d %>% slice_sample(n = 10) ``` ## Recoding a quantitative variable 2/2 There is also `between`, if we instead wished to do this: ```{r} athletes %>% mutate(rcc_status = case_when( RCC < 4.0 ~ "rcc_low", between(RCC, 4.0, 4.5) ~ "rcc_medium", RCC > 4.5 ~ "rcc_high" )) %>% select(RCC, rcc_status) -> d ``` ## Result 2 ```{r} d %>% slice_sample(n = 10) ``` ## Replacing a few values in a column - To make a few small changes to a column (rather than create a whole new column): - `replace_values` (like `recode_values`) - `replace_when` (like `case_when`). - These change a few values and leave everything else unchanged. ## Replacing a few values in a categorical column For the athletes, let's change the two track running events to "Track": ```{r} athletes %>% mutate(Sport = replace_values( Sport, c("T400m", "TSprnt") ~ "Track" )) -> d ``` ## The result ```{r} d %>% slice_sample(n = 10) ``` ## Making small changes in a quantitative column Let's suppose we know that an `RCC` value less than 4 cannot be correct, and we want to replace all such values by `NA` ("missing"). Suppose the highest possible value is 4.5, and we also want to replace values higher than that by 4.5. This is what `replace_when` is for: ```{r} athletes %>% mutate(RCC = replace_when( RCC, RCC < 4 ~ NA, RCC > 4.5 ~ 4.5 )) -> d ``` ## The result ```{r} d %>% slice_sample(n = 10) ``` ## Making small changes to columns using a lookup table Example: 12 cities and states, but some of the states are wrong (should not be Wisconsin): ```{r} my_url <- "https://datafiles.ritsokiguess.site/wisc_wrong.csv" wisc <- read_csv(my_url) ``` ## The dataframe ```{r} wisc ``` ## Setting up corrections Some of these cities are in the wrong state: Dubuque is in Iowa (IA), St. Paul in Minnesota (MN), and Chicago is in Illinois (IL). First make a "lookup table" with the corrections we want to make: \footnotesize ```{r} corrections <- tribble( ~location, ~state, "Dubuque", "IA", "St.Paul", "MN", "Chicago", "IL" ) corrections ``` \normalsize Note: columns of this dataframe have *same names* as the ones in `wisc`. ## Applying the corrections \small ```{r} wisc %>% rows_update(corrections, by = "location") ``` \normalsize ## Addendum: does this work with `replace_when` ? Yes, but you have to specify the corrections by hand: ```{r} wisc %>% mutate(state = replace_when( state, location == "Dubuque" ~ "IA", location == "St.Paul" ~ "MN", location == "Chicago" ~ "IL" )) -> d ``` ## The result \small ```{r} d ``` \normalsize ## Looking things up in another data frame - Suppose you are working in the nails department of a hardware store and you find that you have sold these items: ```{r choosing-R-28, message=FALSE} my_url <- "http://ritsokiguess.site/datafiles/nail_sales.csv" sales <- read_csv(my_url) sales ``` ## Product descriptions and prices - but you don't remember what these product codes are, and you would like to know the total revenue from these sales. - Fortunately you found a list of product descriptions and prices: \small ```{r choosing-R-29, message=FALSE} my_url <- "http://ritsokiguess.site/datafiles/nail_desc.csv" desc <- read_csv(my_url) desc ``` \normalsize ## The lookup - How do you "look up" the product codes to find the product descriptions and prices? - `left_join`. ```{r choosing-R-30} sales %>% left_join(desc) ``` ## What we have - this looks up all the rows in the *first* dataframe that are also in the *second*. - by default matches all columns with same name in two dataframes (`product_code` here) - get *all* columns in *both* dataframes. The rows are the ones for that `product_code`. So now can work out how much the total revenue was: ```{r choosing-R-31} sales %>% left_join(desc) %>% mutate(product_revenue = sales*price) %>% summarize(total_revenue = sum(product_revenue)) ``` ## More comments - if any product codes are not matched, you get NA in the added columns - anything in the *second* dataframe that was not in the first does not appear (here, any products that were not sold) - other variations (examples follow): - if there are two columns with the same name in the two dataframes, and you only want to match on one, use `by` with one column name - if the columns you want to look up have different names in the two dataframes, use `by` with a "named list" ## Matching on only some matching names \small - Suppose the `sales` dataframe *also* had a column `qty` (which was the quantity sold): ```{r choosing-R-32} sales %>% rename("qty"="sales") -> sales1 sales1 ``` - The `qty` in `sales1` is the quantity sold, but the `qty` in `desc` is the number of nails in a package. These should *not* be matched: they are different things. \normalsize ## Matching only on product code ```{r choosing-R-33} sales1 %>% left_join(desc, join_by(product_code)) ``` - Get `qty.x` (from `sales1`) and `qty.y` (from `desc`). ## Matching on different names 1/2 - Suppose the product code in `sales` was just `code`: ```{r choosing-R-34} sales %>% rename("code" = "product_code") -> sales2 sales2 ``` - How to match the two product codes that have different names? ## Matching on different names 2/2 - Use `join_by`, but like this: ```{r choosing-R-35} sales2 %>% left_join(desc, join_by(code == product_code)) ``` ## Other types of join - `right_join`: interchanges roles, looking up keys from second dataframe in first. - `anti_join`: give me all the rows in the first dataframe that are *not* in the second. (Use this eg. to see whether the product descriptions are incomplete.) - `full_join`: give me all the rows in both dataframes, with missings as needed. ## Full join here ```{r choosing-R-36} sales %>% full_join(desc) ``` - The missing `sales` for "masonry nail" says that it was in the lookup table `desc`, but we didn't sell any. ## The same thing, but with `anti_join` Anything in first df but not in second? ```{r} desc %>% anti_join(sales) ``` Masonry nails are the only thing in our product description file that we did not sell any of. ## The other way around ```{r} sales %>% anti_join(desc) ``` There was nothing we sold that was not in the description file.