Numerical Summaries

Summarizing data

  • A graph is a summary of data that is a picture
  • but how do we get summaries of data that are numbers?
    • a summary or two (e.g. mean, median, SD) of just one column
    • a count of observations in each category of a categorical variable
    • summaries by group (defined by a categorical variable)
    • a summary of each of several columns.
  • To do this, meet the pipe operator %>%. It takes an input data frame, does something to it, and outputs the result. (Keyboard shortcut in RStudio: Ctrl-Shift-M.)
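A minimal sketch of what the pipe does, using a small made-up data frame called toy (hypothetical values, not the athletes data). The pipe hands the data frame on its left to the function on its right as that function's first argument:

```r
library(dplyr)

# Hypothetical toy data (not the athletes data)
toy <- tibble(Ht = c(170, 180, 190, 175))

# The pipe passes the data frame as the first argument,
# so these two lines do the same thing:
piped  <- toy %>% head(2)
direct <- head(toy, 2)
identical(piped, direct)

# A sequence of pipes: keep the first two rows, then count the rows
n_rows <- toy %>% head(2) %>% nrow()
n_rows
```

Reading a pipeline left to right as "take this, then do that, then do that" is usually easier than reading nested function calls from the inside out.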

cont’d

  • Output from one pipe can be used as input to the next step, so you can chain a sequence of pipes.
  • Summary functions include: mean, median, min, max, sd, IQR, quantile (for quartiles or any other percentile), and n (for counting observations).
  • We use our Australian athletes data again.
  • To begin, as usual:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The athletes, again

Rows: 202
Columns: 13
$ Sex     <chr> "female", "female", "female", "female", "female", "female", "f…
$ Sport   <chr> "Netball", "Netball", "Netball", "Netball", "Netball", "Netbal…
$ RCC     <dbl> 4.56, 4.15, 4.16, 4.32, 4.06, 4.12, 4.17, 3.80, 3.96, 4.44, 4.…
$ WCC     <dbl> 13.3, 6.0, 7.6, 6.4, 5.8, 6.1, 5.0, 6.6, 5.5, 9.7, 10.6, 6.3, …
$ Hc      <dbl> 42.2, 38.0, 37.5, 37.7, 38.7, 36.6, 37.4, 36.5, 36.3, 41.4, 37…
$ Hg      <dbl> 13.6, 12.7, 12.3, 12.3, 12.8, 11.8, 12.7, 12.4, 12.4, 14.1, 12…
$ Ferr    <dbl> 20, 59, 22, 30, 78, 21, 109, 102, 71, 64, 68, 78, 107, 39, 58,…
$ BMI     <dbl> 19.16, 21.15, 21.40, 21.03, 21.77, 21.38, 21.47, 24.45, 22.63,…
$ SSF     <dbl> 49.0, 110.2, 89.0, 98.3, 122.1, 90.4, 106.9, 156.6, 101.1, 126…
$ `%Bfat` <dbl> 11.29, 25.26, 19.39, 19.63, 23.11, 16.86, 21.32, 26.57, 17.93,…
$ LBM     <dbl> 53.14, 47.09, 53.44, 48.78, 56.05, 56.45, 53.11, 54.41, 55.97,…
$ Ht      <dbl> 176.8, 172.6, 176.0, 169.9, 183.0, 178.2, 177.3, 174.1, 173.6,…
$ Wt      <dbl> 59.90, 63.00, 66.30, 60.70, 72.90, 67.90, 67.50, 74.10, 68.20,…

Summarizing one column

  • Mean height:
athletes %>% summarize(mean_ht = mean(Ht))
# A tibble: 1 × 1
  mean_ht
    <dbl>
1    180.

Or, to get the mean and SD of BMI:

athletes %>% summarize(mean_bmi = mean(BMI), 
                       sd_bmi = sd(BMI))
# A tibble: 1 × 2
  mean_bmi sd_bmi
     <dbl>  <dbl>
1     23.0   2.86

A warning

This doesn’t work:

mean(BMI)
Error:
! object 'BMI' not found

because R needs to know which data frame BMI lives in: BMI is a column of athletes, not an object in its own right.
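For comparison, here is a sketch of some ways to tell R where a column lives, again on a made-up data frame toy (hypothetical values). The first is the tidyverse way used in these slides; the other two are base R alternatives:

```r
library(dplyr)

# Hypothetical toy data (not the athletes data)
toy <- tibble(BMI = c(20, 22, 24))

# Tidyverse way: summarize looks for BMI inside the piped-in data frame
toy %>% summarize(mean_bmi = mean(BMI))

# Base R: name the data frame explicitly with $
m_dollar <- mean(toy$BMI)

# Base R: evaluate the expression "inside" the data frame
m_with <- with(toy, mean(BMI))

c(m_dollar, m_with)
```

All three find the same column; they differ only in how the data frame is named.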

Quartiles

  • quantile calculates percentiles (“fractiles”); the first and third quartiles are the 25th and 75th percentiles:
athletes %>% summarize( Q1_wt = quantile(Wt, 0.25),
                        Q3_wt = quantile(Wt, 0.75))
# A tibble: 1 × 2
  Q1_wt Q3_wt
  <dbl> <dbl>
1  66.5  84.1
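The same idea works for any percentile, not just the quartiles. A small sketch on made-up data (toy is hypothetical): the 50th percentile is the median, and 0.90 gives the 90th percentile.

```r
library(dplyr)

# Hypothetical toy data: the whole numbers 1 to 11
toy <- tibble(Wt = 1:11)

res <- toy %>% 
  summarize(med = median(Wt),
            p50 = quantile(Wt, 0.50),   # same as the median
            p90 = quantile(Wt, 0.90))   # 90th percentile
res
```

The second input to quantile is a proportion between 0 and 1, so the 90th percentile is 0.90, not 90.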

Creating new columns

  • These weights are in kilograms. Maybe we want to summarize the weights in pounds.
  • Convert kg to lb by multiplying by 2.2 (an approximation: 1 kg is about 2.205 lb).
  • Create a new column with mutate and summarize that:
athletes %>% mutate(wt_lb = Wt * 2.2) %>%
  summarize(Q1_lb=quantile(wt_lb, 0.25),
            Q3_lb=quantile(wt_lb, 0.75)) 
# A tibble: 1 × 2
  Q1_lb Q3_lb
  <dbl> <dbl>
1  146.  185.

Counting how many

For example, the number of athletes in each sport:

athletes %>% count(Sport)
# A tibble: 10 × 2
   Sport       n
   <chr>   <int>
 1 BBall      25
 2 Field      19
 3 Gym         4
 4 Netball    23
 5 Row        37
 6 Swim       22
 7 T400m      29
 8 TSprnt     15
 9 Tennis     11
10 WPolo      17

Summaries by group

  • We might want separate summaries for each “group”, e.g. the mean and SD of height for males and females. The strategy is group_by (to define the groups) and then summarize:
athletes %>% 
  group_by(Sex) %>% 
  summarize(mean_ht = mean(Ht), sd_ht = sd(Ht))
# A tibble: 2 × 3
  Sex    mean_ht sd_ht
  <chr>    <dbl> <dbl>
1 female    175.  8.24
2 male      186.  7.90

Counting how many, variation 2:

athletes %>% group_by(Sport) %>%
  summarize(count = n())
# A tibble: 10 × 2
   Sport   count
   <chr>   <int>
 1 BBall      25
 2 Field      19
 3 Gym         4
 4 Netball    23
 5 Row        37
 6 Swim       22
 7 T400m      29
 8 TSprnt     15
 9 Tennis     11
10 WPolo      17

Count plus stats

  • If you want the number of observations per group plus some other statistics, count on its own will not do it; use n() inside summarize:
athletes %>% 
  group_by(Sex) %>%
  summarize(n = n(), mean_ht = mean(Ht), sd_ht = sd(Ht))
# A tibble: 2 × 4
  Sex        n mean_ht sd_ht
  <chr>  <int>   <dbl> <dbl>
1 female   100    175.  8.24
2 male     102    186.  7.90

Summarizing several columns 1/2

  • Standard deviation of each (numeric) column:
athletes %>% 
  summarize(across(where(is.numeric), 
                   \(x) sd(x))) 
# A tibble: 1 × 11
    RCC   WCC    Hc    Hg  Ferr   BMI   SSF `%Bfat`   LBM    Ht    Wt
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>
1 0.458  1.80  3.66  1.36  47.5  2.86  32.6    6.19  13.1  9.73  13.9
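The \(x) sd(x) piece is an anonymous function: "take a column, call it x, and work out sd(x)". As a sketch on made-up data (toy is hypothetical), passing the function name sd on its own gives the same answer here:

```r
library(dplyr)

# Hypothetical toy data: one text column, two numeric columns
toy <- tibble(grp = c("a", "b", "c"),
              x   = c(1, 2, 3),
              y   = c(10, 20, 40))

# Anonymous-function form, as in the slide
with_lambda <- toy %>% 
  summarize(across(where(is.numeric), \(x) sd(x)))

# Bare function name: same result in this case
with_name <- toy %>% 
  summarize(across(where(is.numeric), sd))

with_lambda
all.equal(with_lambda, with_name)
```

The anonymous-function form earns its keep when the function needs extra inputs, for example \(x) sd(x, na.rm = TRUE) if there are missing values.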

Summarizing several columns 2/2

  • Median of all columns whose name starts with H:
athletes %>% 
  summarize(across(starts_with("H"), 
                   \(x) median(x)))
# A tibble: 1 × 3
     Hc    Hg    Ht
  <dbl> <dbl> <dbl>
1  43.5  14.7  180.

Same thing by group

athletes %>% 
  group_by(Sex) %>% 
  summarize(across(starts_with("H"), 
                   \(x) median(x)))
# A tibble: 2 × 4
  Sex       Hc    Hg    Ht
  <chr>  <dbl> <dbl> <dbl>
1 female  40.6  13.5  175 
2 male    45.5  15.5  186.

… another one, getting two summaries

athletes %>% 
  group_by(Sex) %>% 
  summarize(across(ends_with("C"), 
                   list(med = \(x) median(x), 
                        iqr = \(x) IQR(x))))
# A tibble: 2 × 7
  Sex    RCC_med RCC_iqr WCC_med WCC_iqr Hc_med Hc_iqr
  <chr>    <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
1 female    4.38   0.370     6.7    2.15   40.6   4.03
2 male      5.01   0.315     7.1    2.35   45.5   2.57
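The column names in that output are built from the column name and the name given in the list, joined by an underscore. A sketch on made-up data (toy is hypothetical), also showing the .names input to across in case a different naming pattern is wanted:

```r
library(dplyr)

# Hypothetical toy data with two columns ending in "C"
toy <- tibble(RCC = c(4.0, 4.5, 5.0),
              WCC = c(6.0, 7.0, 9.0))

# Default naming: column name, underscore, summary name
res <- toy %>% 
  summarize(across(ends_with("C"), 
                   list(med = \(x) median(x), 
                        iqr = \(x) IQR(x))))
names(res)

# Summary name first instead, via .names
res2 <- toy %>% 
  summarize(across(ends_with("C"), 
                   list(med = \(x) median(x), 
                        iqr = \(x) IQR(x)),
                   .names = "{.fn}_{.col}"))
names(res2)
```

The names used inside list (here med and iqr) are your own choice; pick ones that make the output columns easy to read.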