Choosing things in dataframes

Packages

The usual:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Doing things with data frames

Let’s go back to our Australian athletes:

athletes
# A tibble: 202 × 13
   Sex    Sport   RCC   WCC    Hc    Hg  Ferr   BMI   SSF `%Bfat`
   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
 1 female Netb…  4.56  13.3  42.2  13.6    20  19.2  49      11.3
 2 female Netb…  4.15   6    38    12.7    59  21.2 110.     25.3
 3 female Netb…  4.16   7.6  37.5  12.3    22  21.4  89      19.4
 4 female Netb…  4.32   6.4  37.7  12.3    30  21.0  98.3    19.6
 5 female Netb…  4.06   5.8  38.7  12.8    78  21.8 122.     23.1
 6 female Netb…  4.12   6.1  36.6  11.8    21  21.4  90.4    16.9
 7 female Netb…  4.17   5    37.4  12.7   109  21.5 107.     21.3
 8 female Netb…  3.8    6.6  36.5  12.4   102  24.4 157.     26.6
 9 female Netb…  3.96   5.5  36.3  12.4    71  22.6 101.     17.9
10 female Netb…  4.44   9.7  41.4  14.1    64  22.8 126.     25.0
# ℹ 192 more rows
# ℹ 3 more variables: LBM <dbl>, Ht <dbl>, Wt <dbl>

Choosing a column

athletes %>% select(Sport)
# A tibble: 202 × 1
   Sport  
   <chr>  
 1 Netball
 2 Netball
 3 Netball
 4 Netball
 5 Netball
 6 Netball
 7 Netball
 8 Netball
 9 Netball
10 Netball
# ℹ 192 more rows

Choosing several columns

athletes %>% select(Sport, Hg, BMI)
# A tibble: 202 × 3
   Sport      Hg   BMI
   <chr>   <dbl> <dbl>
 1 Netball  13.6  19.2
 2 Netball  12.7  21.2
 3 Netball  12.3  21.4
 4 Netball  12.3  21.0
 5 Netball  12.8  21.8
 6 Netball  11.8  21.4
 7 Netball  12.7  21.5
 8 Netball  12.4  24.4
 9 Netball  12.4  22.6
10 Netball  14.1  22.8
# ℹ 192 more rows

Choosing consecutive columns

athletes %>% select(Sex:WCC, BMI)
# A tibble: 202 × 5
   Sex    Sport     RCC   WCC   BMI
   <chr>  <chr>   <dbl> <dbl> <dbl>
 1 female Netball  4.56  13.3  19.2
 2 female Netball  4.15   6    21.2
 3 female Netball  4.16   7.6  21.4
 4 female Netball  4.32   6.4  21.0
 5 female Netball  4.06   5.8  21.8
 6 female Netball  4.12   6.1  21.4
 7 female Netball  4.17   5    21.5
 8 female Netball  3.8    6.6  24.4
 9 female Netball  3.96   5.5  22.6
10 female Netball  4.44   9.7  22.8
# ℹ 192 more rows

Choosing all-but some columns

athletes %>% select(-(RCC:LBM))
# A tibble: 202 × 4
   Sex    Sport      Ht    Wt
   <chr>  <chr>   <dbl> <dbl>
 1 female Netball  177.  59.9
 2 female Netball  173.  63  
 3 female Netball  176   66.3
 4 female Netball  170.  60.7
 5 female Netball  183   72.9
 6 female Netball  178.  67.9
 7 female Netball  177.  67.5
 8 female Netball  174.  74.1
 9 female Netball  174.  68.2
10 female Netball  174.  68.8
# ℹ 192 more rows

Select-helpers

Other ways to select columns: those whose name:

  • starts_with something
  • ends_with something
  • contains something
  • matches a “regular expression”
  • everything() select all the columns

Columns whose names begin with S

athletes %>% select(starts_with("S"))
# A tibble: 202 × 3
   Sex    Sport     SSF
   <chr>  <chr>   <dbl>
 1 female Netball  49  
 2 female Netball 110. 
 3 female Netball  89  
 4 female Netball  98.3
 5 female Netball 122. 
 6 female Netball  90.4
 7 female Netball 107. 
 8 female Netball 157. 
 9 female Netball 101. 
10 female Netball 126. 
# ℹ 192 more rows

Columns whose names end with C

either uppercase or lowercase:

athletes %>% select(ends_with("c"))
# A tibble: 202 × 3
     RCC   WCC    Hc
   <dbl> <dbl> <dbl>
 1  4.56  13.3  42.2
 2  4.15   6    38  
 3  4.16   7.6  37.5
 4  4.32   6.4  37.7
 5  4.06   5.8  38.7
 6  4.12   6.1  36.6
 7  4.17   5    37.4
 8  3.8    6.6  36.5
 9  3.96   5.5  36.3
10  4.44   9.7  41.4
# ℹ 192 more rows

Case-sensitive

This works with any of the select-helpers:

athletes %>% select(ends_with("C", ignore.case=FALSE))
# A tibble: 202 × 2
     RCC   WCC
   <dbl> <dbl>
 1  4.56  13.3
 2  4.15   6  
 3  4.16   7.6
 4  4.32   6.4
 5  4.06   5.8
 6  4.12   6.1
 7  4.17   5  
 8  3.8    6.6
 9  3.96   5.5
10  4.44   9.7
# ℹ 192 more rows

Column names containing letter R

athletes %>% select(contains("r"))
# A tibble: 202 × 3
   Sport     RCC  Ferr
   <chr>   <dbl> <dbl>
 1 Netball  4.56    20
 2 Netball  4.15    59
 3 Netball  4.16    22
 4 Netball  4.32    30
 5 Netball  4.06    78
 6 Netball  4.12    21
 7 Netball  4.17   109
 8 Netball  3.8    102
 9 Netball  3.96    71
10 Netball  4.44    64
# ℹ 192 more rows

Exactly two characters, ending with T

In regular expression terms, this is ^.t$:

  • ^ means “start of text”
  • . means “exactly one character, but could be anything”
  • t means a literal letter t (uppercase or lowercase)
  • $ means “end of text”.

Matching a regular expression

athletes %>% select(matches("^.t$"))
# A tibble: 202 × 2
      Ht    Wt
   <dbl> <dbl>
 1  177.  59.9
 2  173.  63  
 3  176   66.3
 4  170.  60.7
 5  183   72.9
 6  178.  67.9
 7  177.  67.5
 8  174.  74.1
 9  174.  68.2
10  174.  68.8
# ℹ 192 more rows

Choosing columns by property

  • Use where as with summarizing several columns
  • eg, to choose text columns:
athletes %>% select(where(is.character))
# A tibble: 202 × 2
   Sex    Sport  
   <chr>  <chr>  
 1 female Netball
 2 female Netball
 3 female Netball
 4 female Netball
 5 female Netball
 6 female Netball
 7 female Netball
 8 female Netball
 9 female Netball
10 female Netball
# ℹ 192 more rows

Choosing rows by number

athletes %>% slice(16:25)
# A tibble: 10 × 13
   Sex    Sport   RCC   WCC    Hc    Hg  Ferr   BMI   SSF `%Bfat`
   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
 1 female Netb…  4.25  10.7  39.5  13.2   127  24.5 157.     26.5
 2 female Netb…  4.46  10.9  39.7  13.7   102  24.0 116.     23.0
 3 female Netb…  4.4    9.3  40.4  13.6    86  26.2 182.     30.1
 4 female Netb…  4.83   8.4  41.8  13.4    40  20.0  71.6    13.9
 5 female Netb…  4.23   6.9  38.3  12.6    50  25.7 144.     26.6
 6 female Netb…  4.24   8.4  37.6  12.5    58  25.6 201.     35.5
 7 female Netb…  3.95   6.6  38.4  12.8    33  19.9  68.9    15.6
 8 female Netb…  4.03   8.5  37.7  13      51  23.4 104.     19.6
 9 female BBall  3.96   7.5  37.5  12.3    60  20.6 109.     19.8
10 female BBall  4.41   8.3  38.2  12.7    68  20.7 103.     21.3
# ℹ 3 more variables: LBM <dbl>, Ht <dbl>, Wt <dbl>

Non-consecutive rows

athletes %>% 
  slice(10, 13, 17, 42)
# A tibble: 4 × 13
  Sex    Sport    RCC   WCC    Hc    Hg  Ferr   BMI   SSF `%Bfat`
  <chr>  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
1 female Netba…  4.44   9.7  41.4  14.1    64  22.8  126.    25.0
2 female Netba…  4.02   9.1  37.7  12.7   107  23.0   77     18.1
3 female Netba…  4.46  10.9  39.7  13.7   102  24.0  116.    23.0
4 female Row     4.37   8.1  41.8  14.3    53  23.5   98     21.8
# ℹ 3 more variables: LBM <dbl>, Ht <dbl>, Wt <dbl>

A random sample of rows

athletes %>% slice_sample(n=8)
# A tibble: 8 × 13
  Sex    Sport    RCC   WCC    Hc    Hg  Ferr   BMI   SSF `%Bfat`
  <chr>  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
1 male   T400m   5.03   6.6  44.7  15.9   191  19.8  30.9    6.53
2 male   Row     5.04   7.1  44    14.8    64  25.8  61.8   12.6 
3 female T400m   4.32   6.8  40.6  13.7    46  17.5  54.6   12.2 
4 male   Row     5.4    6.8  49.5  17.3   183  26.1  44.7    8.61
5 male   Field   5.48   6.2  48.2  16.3    94  34.4  82.7   13.9 
6 female Field   4.48   9.5  36.5  13.3    54  20.1  49.9   11.8 
7 male   WPolo   4.63  14.3  44.8  15     133  25.4  49.5    8.97
8 female Netba…  4.27  10.6  37.7  12.5    68  23.6 114     22.6 
# ℹ 3 more variables: LBM <dbl>, Ht <dbl>, Wt <dbl>

Rows for which something is true

athletes %>% filter(Sport == "Tennis")
# A tibble: 11 × 13
   Sex    Sport   RCC   WCC    Hc    Hg  Ferr   BMI   SSF `%Bfat`
   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
 1 female Tenn…  4      4.2  36.6  12      57  25.4 109     20.9 
 2 female Tenn…  4.4    4    40.8  13.9    73  22.1  98.1   19.6 
 3 female Tenn…  4.38   7.9  39.8  13.5    88  21.2  80.6   17.1 
 4 female Tenn…  4.08   6.6  37.8  12.1   182  20.5  68.3   15.3 
 5 female Tenn…  4.98   6.4  44.8  14.8    80  17.1  47.6   11.1 
 6 female Tenn…  5.16   7.2  44.3  14.5    88  18.3  61.9   12.9 
 7 female Tenn…  4.66   6.4  40.9  13.9   109  18.4  38.2    8.45
 8 male   Tenn…  5.66   8.3  50.2  17.7    38  23.8  56.5   10.0 
 9 male   Tenn…  5.03   6.4  42.7  14.3   122  22.0  47.6    8.51
10 male   Tenn…  4.97   8.8  43    14.9   233  22.3  60.4   11.5 
11 male   Tenn…  5.38   6.3  46    15.7    32  21.1  34.9    6.26
# ℹ 3 more variables: LBM <dbl>, Ht <dbl>, Wt <dbl>

Rows for which something is not true

athletes %>% filter_out(Sport == "Tennis")
# A tibble: 191 × 13
   Sex    Sport   RCC   WCC    Hc    Hg  Ferr   BMI   SSF `%Bfat`
   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
 1 female Netb…  4.56  13.3  42.2  13.6    20  19.2  49      11.3
 2 female Netb…  4.15   6    38    12.7    59  21.2 110.     25.3
 3 female Netb…  4.16   7.6  37.5  12.3    22  21.4  89      19.4
 4 female Netb…  4.32   6.4  37.7  12.3    30  21.0  98.3    19.6
 5 female Netb…  4.06   5.8  38.7  12.8    78  21.8 122.     23.1
 6 female Netb…  4.12   6.1  36.6  11.8    21  21.4  90.4    16.9
 7 female Netb…  4.17   5    37.4  12.7   109  21.5 107.     21.3
 8 female Netb…  3.8    6.6  36.5  12.4   102  24.4 157.     26.6
 9 female Netb…  3.96   5.5  36.3  12.4    71  22.6 101.     17.9
10 female Netb…  4.44   9.7  41.4  14.1    64  22.8 126.     25.0
# ℹ 181 more rows
# ℹ 3 more variables: LBM <dbl>, Ht <dbl>, Wt <dbl>

More complicated selections

Athletes who are tennis players and whose RCC is less than 5:

athletes %>% filter(when_all(
  Sport == "Tennis",
  RCC < 5
))
# A tibble: 7 × 13
  Sex    Sport    RCC   WCC    Hc    Hg  Ferr   BMI   SSF `%Bfat`
  <chr>  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
1 female Tennis  4      4.2  36.6  12      57  25.4 109     20.9 
2 female Tennis  4.4    4    40.8  13.9    73  22.1  98.1   19.6 
3 female Tennis  4.38   7.9  39.8  13.5    88  21.2  80.6   17.1 
4 female Tennis  4.08   6.6  37.8  12.1   182  20.5  68.3   15.3 
5 female Tennis  4.98   6.4  44.8  14.8    80  17.1  47.6   11.1 
6 female Tennis  4.66   6.4  40.9  13.9   109  18.4  38.2    8.45
7 male   Tennis  4.97   8.8  43    14.9   233  22.3  60.4   11.5 
# ℹ 3 more variables: LBM <dbl>, Ht <dbl>, Wt <dbl>

Either/Or

Athletes that are either tennis players or whose RCC is greater than 5:

athletes %>% filter(when_any(
  Sport == "Tennis",
  RCC > 5
))
# A tibble: 66 × 13
   Sex    Sport   RCC   WCC    Hc    Hg  Ferr   BMI   SSF `%Bfat`
   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
 1 female Row    5.02   6.4  44.8  15.2    48  19.8  91      19.2
 2 female T400m  5.31   9.5  47.1  15.9    29  21.4  57.9    11.1
 3 female Field  5.33   9.3  47    15      62  25.3 103.     19.5
 4 female TSpr…  5.16   8.2  45.3  14.7    34  20.3  46.1    10.2
 5 female Tenn…  4      4.2  36.6  12      57  25.4 109      20.9
 6 female Tenn…  4.4    4    40.8  13.9    73  22.1  98.1    19.6
 7 female Tenn…  4.38   7.9  39.8  13.5    88  21.2  80.6    17.1
 8 female Tenn…  4.08   6.6  37.8  12.1   182  20.5  68.3    15.3
 9 female Tenn…  4.98   6.4  44.8  14.8    80  17.1  47.6    11.1
10 female Tenn…  5.16   7.2  44.3  14.5    88  18.3  61.9    12.9
# ℹ 56 more rows
# ℹ 3 more variables: LBM <dbl>, Ht <dbl>, Wt <dbl>

Even more complicated

Females whose RCC is bigger than 5, or males whose RCC is bigger than 6:

athletes %>% 
  filter(when_any(
    when_all(
      Sex == "female",
      RCC > 5
    ),
    when_all(
      Sex == "male",
      RCC > 6
    )
  ))
# A tibble: 6 × 13
  Sex    Sport    RCC   WCC    Hc    Hg  Ferr   BMI   SSF `%Bfat`
  <chr>  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
1 female Row     5.02   6.4  44.8  15.2    48  19.8  91     19.2 
2 female T400m   5.31   9.5  47.1  15.9    29  21.4  57.9   11.1 
3 female Field   5.33   9.3  47    15      62  25.3 103.    19.5 
4 female TSprnt  5.16   8.2  45.3  14.7    34  20.3  46.1   10.2 
5 female Tennis  5.16   7.2  44.3  14.5    88  18.3  61.9   12.9 
6 male   TSprnt  6.72   7.1  59.7  19.2    76  24.8  44.8    9.56
# ℹ 3 more variables: LBM <dbl>, Ht <dbl>, Wt <dbl>

Sorting into order

athletes %>% arrange(RCC)
# A tibble: 202 × 13
   Sex    Sport   RCC   WCC    Hc    Hg  Ferr   BMI   SSF `%Bfat`
   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
 1 female Netb…  3.8    6.6  36.5  12.4   102  24.4 157.     26.6
 2 female Netb…  3.9    6.3  35.9  12.1    78  20.1  70      15.0
 3 female T400m  3.9    6    38.9  13.5    16  19.4  48.4    10.5
 4 female Row    3.91   7.3  37.6  12.9    43  22.3 126.     25.2
 5 female Netb…  3.95   6.6  38.4  12.8    33  19.9  68.9    15.6
 6 female Row    3.95   3.3  36.9  12.5    40  24.5  74.9    16.4
 7 female Netb…  3.96   5.5  36.3  12.4    71  22.6 101.     17.9
 8 female BBall  3.96   7.5  37.5  12.3    60  20.6 109.     19.8
 9 female Tenn…  4      4.2  36.6  12      57  25.4 109      20.9
10 female Netb…  4.02   9.1  37.7  12.7   107  23.0  77      18.1
# ℹ 192 more rows
# ℹ 3 more variables: LBM <dbl>, Ht <dbl>, Wt <dbl>

Breaking ties by another variable

athletes %>% arrange(RCC, BMI)
# A tibble: 202 × 13
   Sex    Sport   RCC   WCC    Hc    Hg  Ferr   BMI   SSF `%Bfat`
   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
 1 female Netb…  3.8    6.6  36.5  12.4   102  24.4 157.     26.6
 2 female T400m  3.9    6    38.9  13.5    16  19.4  48.4    10.5
 3 female Netb…  3.9    6.3  35.9  12.1    78  20.1  70      15.0
 4 female Row    3.91   7.3  37.6  12.9    43  22.3 126.     25.2
 5 female Netb…  3.95   6.6  38.4  12.8    33  19.9  68.9    15.6
 6 female Row    3.95   3.3  36.9  12.5    40  24.5  74.9    16.4
 7 female BBall  3.96   7.5  37.5  12.3    60  20.6 109.     19.8
 8 female Netb…  3.96   5.5  36.3  12.4    71  22.6 101.     17.9
 9 female Tenn…  4      4.2  36.6  12      57  25.4 109      20.9
10 female Netb…  4.02   9.1  37.7  12.7   107  23.0  77      18.1
# ℹ 192 more rows
# ℹ 3 more variables: LBM <dbl>, Ht <dbl>, Wt <dbl>

Descending order

athletes %>% arrange(desc(BMI))
# A tibble: 202 × 13
   Sex    Sport   RCC   WCC    Hc    Hg  Ferr   BMI   SSF `%Bfat`
   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
 1 male   Field  5.48   6.2  48.2  16.3    94  34.4  82.7   13.9 
 2 male   Field  4.96   8.3  45.3  15.7   141  33.7 114.    17.4 
 3 male   Field  5.48   4.6  49.4  18     132  32.5  55.7    8.51
 4 female Field  4.75   7.5  43.8  15.2    90  31.9 132.    23.0 
 5 male   Field  5.01   8.9  46    15.9   212  30.2 112.    19.9 
 6 male   Field  5.01   8.9  46    15.9   212  30.2  96.9   18.1 
 7 male   Field  5.09   8.9  46.3  15.4    44  30.0  71.1   14.0 
 8 female Field  4.58   5.8  42.1  14.7   164  28.6 110.    21.3 
 9 female Field  4.51   9    39.7  14.3    36  28.1 136.    24.9 
10 male   WPolo  5.34   6.2  49.8  17.2   143  27.8  75.7   13.5 
# ℹ 192 more rows
# ℹ 3 more variables: LBM <dbl>, Ht <dbl>, Wt <dbl>

“The top ones”

athletes %>%
  arrange(desc(Wt)) %>%
  slice(1:7) %>%
  select(Sport, Wt)
# A tibble: 7 × 2
  Sport    Wt
  <chr> <dbl>
1 Field  123.
2 BBall  114.
3 Field  111.
4 Field  108.
5 Field  103.
6 WPolo  101 
7 BBall  100.

Another way

athletes %>% 
  slice_max(order_by = Wt, n=7) %>% 
  select(Sport, Wt)
# A tibble: 7 × 2
  Sport    Wt
  <chr> <dbl>
1 Field  123.
2 BBall  114.
3 Field  111.
4 Field  108.
5 Field  103.
6 WPolo  101 
7 BBall  100.

Create new variables from old ones

athletes %>% 
  mutate(wt_lb = Wt * 2.2) %>% 
  select(Sport, Sex, Wt, wt_lb) %>% 
  arrange(Wt) 
# A tibble: 202 × 4
   Sport   Sex       Wt wt_lb
   <chr>   <chr>  <dbl> <dbl>
 1 Gym     female  37.8  83.2
 2 Gym     female  43.8  96.4
 3 Gym     female  45.1  99.2
 4 Tennis  female  45.8 101. 
 5 Tennis  female  47.4 104. 
 6 Gym     female  47.8 105. 
 7 T400m   female  49.2 108. 
 8 Row     female  49.8 110. 
 9 T400m   female  50.9 112. 
10 Netball female  51.9 114. 
# ℹ 192 more rows

Turning the result into a number

Output is always data frame unless you explicitly turn it into something else, eg. the weight of the heaviest athlete, as a number:

athletes %>% arrange(desc(Wt)) %>% pluck("Wt", 1) -> heavy
heavy
[1] 123.2

Or the 20 heaviest weights in descending order:

athletes %>%
  arrange(desc(Wt)) %>%
  slice(1:20) %>%
  pluck("Wt")
 [1] 123.20 113.70 111.30 108.20 102.70 101.00 100.20  98.00  97.90  97.90
[11]  97.00  96.90  96.30  94.80  94.80  94.70  94.70  94.60  94.25  94.20

To find the mean height of the women athletes

Two ways:

athletes %>% group_by(Sex) %>% summarize(m = mean(Ht))
# A tibble: 2 × 2
  Sex        m
  <chr>  <dbl>
1 female  175.
2 male    186.
athletes %>%
  filter(Sex == "female") %>%
  summarize(w_mean = mean(Ht))
# A tibble: 1 × 1
  w_mean
   <dbl>
1   175.

Summary of data selection/arrangement “verbs”

Verb Purpose
select Choose columns
slice Choose rows by number
slice_sample Choose random rows
slice_max Choose rows with largest values on a variable (also slice_min)
filter Choose rows satisfying conditions
arrange Sort in order by column(s)
mutate Create new variables
group_by Create groups to work with
summarize Calculate summary statistics (by groups if defined)
pluck Extract items from data frame
pull Extract a single column from a data frame as a vector