---
title: "Recoding, replacing, and looking up"
editor: 
  markdown: 
    wrap: 72
---

## Packages and data

```{r}
library(tidyverse)
my_url <- "http://datafiles.ritsokiguess.site/ais.txt"
athletes <- read_tsv(my_url)
```

## Recoding

- When you want to create a new column based on an old column, in a way beyond a simple formula, you are "recoding"
- If you recode a categorical variable, `recode_values` will do the job for you. 
- Make numeric codes: female = 1, male = 2:

```{r}
athletes %>% 
  mutate(sex_code = recode_values(
    Sex,
    "female" ~ 1,
    "male"   ~ 2
  )) %>% 
  select(Sex, sex_code) -> d
```

## The result

```{r}
d %>% slice_sample(n = 10)
```


## Recoding another categorical variable

Categorize the sports by where they are played: a court, as part of track and field, on or in water:

```{r}
athletes %>% 
  mutate(where_played = recode_values(
    Sport,
    c("BBall", "Netball", "Tennis") ~ "court",
    c("Row", "Swim", "WPolo")       ~ "water",
    c("Field", "T400m", "TSprnt")   ~ "track&field",
    default                         = Sport
  )) %>% 
  select(Sport, where_played) -> d
```

## The result

```{r}
d %>% slice_sample(n = 10)
```


## Recoding a quantitative variable 1/2

for example, creating a new column categorizing the `RCC` value as low or high, use `case_when`:

```{r}
athletes %>% 
  mutate(rcc_status = case_when(
    RCC < 4.5  ~ "rcc_low",
    RCC >= 4.5 ~ "rcc_high"
  )) %>% 
  select(RCC, rcc_status) -> d
```

## Result 1

```{r}
d %>% slice_sample(n = 10)
```


## Recoding a quantitative variable 2/2

There is also `between`, if we instead wished to do this:

```{r}
athletes %>% 
  mutate(rcc_status = case_when(
    RCC < 4.0              ~ "rcc_low",
    between(RCC, 4.0, 4.5) ~ "rcc_medium",
    RCC > 4.5              ~ "rcc_high"
  )) %>% 
  select(RCC, rcc_status) -> d
```

## Result 2

```{r}
d %>% slice_sample(n = 10)
```


## Replacing a few values in a column

- To make a few small changes to a column (rather than create a whole new column):
  - `replace_values` (like `recode_values`) 
  - `replace_when` (like `case_when`). 
- These change a few values and leave everything else unchanged.

## Replacing a few values in a categorical column

For the athletes, let's change the two track running events to "Track":

```{r}
athletes %>% 
  mutate(Sport = replace_values(
    Sport,
    c("T400m", "TSprnt") ~ "Track"
    )) -> d
```

## The result

```{r}
d %>% slice_sample(n = 10)
```


## Making small changes in a quantitative column

Let's suppose we know that an `RCC` value less than 4 cannot be correct, and we want to replace all such values by `NA` ("missing"). Suppose the highest possible value is 4.5, and we also want to replace values higher than that by 4.5. This is what `replace_when` is for:

```{r}
athletes %>% 
  mutate(RCC = replace_when(
    RCC,
    RCC < 4   ~ NA,
    RCC > 4.5 ~ 4.5
  )) -> d
```

## The result

```{r}
d %>% slice_sample(n = 10)
```


## Making small changes to columns using a lookup table

Example: 12 cities and states, but some of the states are wrong (should not be Wisconsin):


```{r}
my_url <- "https://datafiles.ritsokiguess.site/wisc_wrong.csv"
wisc <- read_csv(my_url)
```

## The dataframe

```{r}
wisc
```


## Setting up corrections

Some of these cities are in the wrong state: Dubuque is in Iowa (IA), St. Paul in Minnesota (MN), and Chicago is in Illinois (IL). First make a "lookup table" with the corrections we want to make:

\footnotesize

```{r}
corrections <- tribble(
  ~location, ~state,
  "Dubuque", "IA",
  "St.Paul", "MN",
  "Chicago", "IL"
)
corrections
```

\normalsize


Note: columns of this dataframe have *same names* as the ones in `wisc`.

## Applying the corrections

\small

```{r}
wisc %>% 
  rows_update(corrections, by = "location")
```

\normalsize

## Addendum: does this work with `replace_when` ?

Yes, but you have to specify the corrections by hand:

```{r}
wisc %>% 
  mutate(state = replace_when(
    state,
    location == "Dubuque" ~ "IA",
    location == "St.Paul" ~ "MN",
    location == "Chicago" ~ "IL"
  )) -> d
```

## The result

\small

```{r}
d
```

\normalsize


## Looking things up in another data frame

-   Suppose you are working in the nails department of a hardware store
    and you find that you have sold these items:

```{r choosing-R-28, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/nail_sales.csv"
sales <- read_csv(my_url)
sales
```

## Product descriptions and prices

-   but you don't remember what these product codes are, and you would
    like to know the total revenue from these sales.

-   Fortunately you found a list of product descriptions and prices:

\small

```{r choosing-R-29, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/nail_desc.csv"
desc <- read_csv(my_url)
desc
```

\normalsize

## The lookup

-   How do you "look up" the product codes to find the product
    descriptions and prices?
-   `left_join`.

```{r choosing-R-30}
sales %>% left_join(desc)
```

## What we have

-   this looks up all the rows in the *first* dataframe that are also in
    the *second*.
-   by default matches all columns with same name in two dataframes
    (`product_code` here)
-   get *all* columns in *both* dataframes. The rows are the ones for
    that `product_code`.

So now can work out how much the total revenue was:

```{r choosing-R-31}
sales %>% left_join(desc) %>% 
  mutate(product_revenue = sales*price) %>% 
  summarize(total_revenue = sum(product_revenue))
```

## More comments

-   if any product codes are not matched, you get NA in the added
    columns
-   anything in the *second* dataframe that was not in the first does
    not appear (here, any products that were not sold)
-   other variations (examples follow):
    -   if there are two columns with the same name in the two
        dataframes, and you only want to match on one, use `by` with one
        column name
    -   if the columns you want to look up have different names in the
        two dataframes, use `by` with a "named list"

## Matching on only some matching names

\small

-   Suppose the `sales` dataframe *also* had a column `qty` (which was
    the quantity sold):

```{r choosing-R-32}
sales %>% rename("qty"="sales") -> sales1
sales1
```

-   The `qty` in `sales1` is the quantity sold, but the `qty` in `desc`
    is the number of nails in a package. These should *not* be matched:
    they are different things.
    
\normalsize

## Matching only on product code

```{r choosing-R-33}
sales1 %>% 
  left_join(desc, join_by(product_code))
```

-   Get `qty.x` (from `sales1`) and `qty.y` (from `desc`).

## Matching on different names 1/2

-   Suppose the product code in `sales` was just `code`:

```{r choosing-R-34}
sales %>% rename("code" = "product_code") -> sales2
sales2
```

-   How to match the two product codes that have different names?

## Matching on different names 2/2

-   Use `join_by`, but like this:

```{r choosing-R-35}
sales2 %>% 
  left_join(desc, join_by(code == product_code))
```

## Other types of join

-   `right_join`: interchanges roles, looking up keys from second
    dataframe in first.
-   `anti_join`: give me all the rows in the first dataframe that are
    *not* in the second. (Use this eg. to see whether the product
    descriptions are incomplete.)
-   `full_join`: give me all the rows in both dataframes, with missings
    as needed.

## Full join here

```{r choosing-R-36}
sales %>% full_join(desc)
```

-   The missing `sales` for "masonry nail" says that it was in the
    lookup table `desc`, but we didn't sell any.

## The same thing, but with `anti_join`

Anything in first df but not in second?

```{r}
desc %>% anti_join(sales)
```

Masonry nails are the only thing in our product description file that we
did not sell any of.

## The other way around

```{r}
sales %>% anti_join(desc)
```

There was nothing we sold that was not in the description file.