To make a few small changes to a column (rather than create a whole new column):
replace_values (like recode_values)
replace_when (like case_when).
These change a few values and leave everything else unchanged.
Replacing a few values in a categorical column
For the athletes, let’s change the two track running events to “Track”:
athletes %>%mutate(Sport =replace_values( Sport,c("T400m", "TSprnt") ~"Track" )) -> d
The result
d %>%slice_sample(n =10)
Making small changes in a quantitative column
Let’s suppose we know that an RCC value less than 4 cannot be correct, and we want to replace all such values by NA (“missing”). Suppose the highest possible value is 4.5, and we also want to replace values higher than that by 4.5. This is what replace_when is for:
Some of these cities are in the wrong state: Dubuque is in Iowa (IA), St. Paul in Minnesota (MN), and Chicago is in Illinois (IL). First make a “lookup table” with the corrections we want to make:
if any product codes are not matched, you get NA in the added columns
anything in the second dataframe that was not in the first does not appear (here, any products that were not sold)
other variations (examples follow):
if there are two columns with the same name in the two dataframes, and you only want to match on one, use by with one column name
if the columns you want to look up have different names in the two dataframes, use by with a “named list”
Matching on only some matching names
Suppose the sales dataframe also had a column qty (which was the quantity sold):
sales %>%rename("qty"="sales") -> sales1sales1
The qty in sales1 is the quantity sold, but the qty in desc is the number of nails in a package. These should not be matched: they are different things.
right_join: interchanges roles, looking up keys from second dataframe in first.
anti_join: give me all the rows in the first dataframe that are not in the second. (Use this eg. to see whether the product descriptions are incomplete.)
full_join: give me all the rows in both dataframes, with missings as needed.
Full join here
sales %>%full_join(desc)
The missing sales for “masonry nail” says that it was in the lookup table desc, but we didn’t sell any.
The same thing, but with anti_join
Anything in first df but not in second?
desc %>%anti_join(sales)
Masonry nails are the only thing in our product description file that we did not sell any of.
The other way around
sales %>%anti_join(desc)
There was nothing we sold that was not in the description file.