Mood’s Median Test

Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(smmr)

Two-sample test: What to do if normality fails

  • If normality fails (for one or both of the groups), what do we do then?
  • Again, can compare medians: use the thought process of the sign test, which does not depend on normality and is not damaged by outliers.
  • A suitable test called Mood’s median test.

Data: exposure of infants to tobacco smoke

  • Random samples were taken of infants who had been exposed to tobacco smoke at home, and those that had not (“unexposed”).
  • A major metabolite of nicotine is cotanine, which can be measured in an infant’s urine.
  • Does exposure to tobacco smoke have an effect on cotanine levels?

Read in data

my_url <- "http://datafiles.ritsokiguess.site/smoke-exposure.csv"
smoke_exposure <- read_csv(my_url)
Rows: 15 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): smoke
dbl (1): cotanine

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
smoke_exposure
# A tibble: 15 × 2
   smoke     cotanine
   <chr>        <dbl>
 1 Unexposed        8
 2 Exposed         35
 3 Unexposed       11
 4 Exposed         56
 5 Unexposed       12
 6 Exposed         83
 7 Unexposed       14
 8 Exposed         92
 9 Unexposed       20
10 Exposed        128
11 Unexposed       43
12 Exposed        150
13 Unexposed      111
14 Exposed        176
15 Exposed        208

A graph

specifically, a boxplot:

ggplot(smoke_exposure, aes(x = smoke, y = cotanine)) + geom_boxplot()

Comments on boxplot

  • Distribution of cotanine in unexposed group has upper outlier and appears right-skewed.
  • Samples are small, so should not use \(t\)-test.
  • Looks as if average (median) cotanine is higher in the exposed group, so hope to find a significant difference.

Mimicking the sign test

  • In the sign test (for one sample), we tested for a median by categorizing each observation as above or below the null median.
  • In a two-sample test, the null hypothesis is that means/medians are same, so we don’t have a value to use for above/below.
  • So, work out overall median (of cotanine, here) and use that as dividing line:
smoke_exposure %>% 
  summarize(overall_median = median(cotanine))
# A tibble: 1 × 1
  overall_median
           <dbl>
1             56

The median of all the observations is 56.

Another graph

Plot observations as points, with overall median marked:

ggplot(smoke_exposure, aes(x = smoke, y = cotanine, 
                           colour = smoke)) + 
  geom_point() +
  geom_hline(yintercept = 56)

Above and below

Instead of using actual cotanine values, define categorical column that indicates above, below, or equal to overall median:

smoke_exposure %>% 
  mutate(vs_median = case_when(
    cotanine > 56  ~ "above",
    cotanine < 56  ~ "below",
    cotanine == 56 ~ "equal"
  )) -> smoke_exposure

With the new column

smoke_exposure 
# A tibble: 15 × 3
   smoke     cotanine vs_median
   <chr>        <dbl> <chr>    
 1 Unexposed        8 below    
 2 Exposed         35 below    
 3 Unexposed       11 below    
 4 Exposed         56 equal    
 5 Unexposed       12 below    
 6 Exposed         83 above    
 7 Unexposed       14 below    
 8 Exposed         92 above    
 9 Unexposed       20 below    
10 Exposed        128 above    
11 Unexposed       43 below    
12 Exposed        150 above    
13 Unexposed      111 above    
14 Exposed        176 above    
15 Exposed        208 above    

Making a test

  • Looking for an association between smoke and vs_median, both categorical.
  • Suggestion: count up number of values above and below in each group, and then test, but how?
  • Chi-squared test for independence (as it turns out).
  • Aside: explore chi-squared test, then apply here.

The chi-squared test for independence

Suppose we want to know whether people are in favour of having daylight savings time all year round. We ask 20 males and 20 females whether they each agree with having DST all year round (“yes”) or not (“no”). Some randomly chosen rows of data:

my_url <- "http://datafiles.ritsokiguess.site/dst.txt"
dst <- read_delim(my_url," ")
dst %>% slice_sample(n = 8)
# A tibble: 8 × 2
  gender agree
  <chr>  <chr>
1 female no   
2 female no   
3 male   yes  
4 male   yes  
5 female no   
6 male   no   
7 female yes  
8 female yes  

… continued

Count up individuals in each category combination, and arrange in contingency table:

tab <- with(dst, table(gender, agree))
tab
        agree
gender   no yes
  female 11   9
  male    3  17
  • Most of the males say “yes”, but the females are about evenly split.
  • Looks like males more likely to say “yes”, ie. an association between gender and agreement.
  • Test an \(H_0\) of “no association” (“independence”) vs. alternative that there is really some association.
  • Done with chisq.test.

…And finally

chisq.test(tab, correct=FALSE)

    Pearson's Chi-squared test

data:  tab
X-squared = 7.033, df = 1, p-value = 0.008002
  • Reject null hypothesis of no association (P-value 0.008)
  • therefore there is a difference in rates of agreement between (all) males and females (or that gender and agreement are associated).
  • This calculation gives same answers as you would get by hand. (Omitting correct = FALSE uses “Yates correction”).
  • Chi-squared test for independence is always two-sided.

Mood’s median test

  • Earlier: compare medians of two groups.
  • Sign test: count number of values above and below something (there, hypothesized median).
  • Mood’s median test:
    • Find “grand median” of all the data, regardless of group
    • Count data values in each group above/below grand median.
    • Make contingency table of group vs. above/below.
    • Test for association.
  • If group medians equal, each group should have about half its observations above/below grand median. If not, one group will be mostly above grand median and other below.

Mood’s median test for smoke exposure data

  • Any data values equal to overall median have no information about rejecting null, so remove first.
smoke_exposure %>% 
  filter_out(vs_median == "equal") -> d
tab <- with(d, table(vs_median, smoke))
tab
         smoke
vs_median Exposed Unexposed
    above       6         1
    below       1         6

… and then

chisq.test(tab, correct = FALSE)
Warning in chisq.test(tab, correct = FALSE): Chi-squared approximation may be
incorrect

    Pearson's Chi-squared test

data:  tab
X-squared = 7.1429, df = 1, p-value = 0.007526
  • Association between exposure group and being above or below the overall median
  • therefore the median cotanine level between the two groups is different.

Or by smmr

  • median_test does the whole thing:
median_test(smoke_exposure, cotanine, smoke)
$grand_median
[1] 56

$table
           above
group       above below
  Exposed       6     1
  Unexposed     1     6

$test
       what       value
1 statistic 7.142857143
2        df 1.000000000
3   P-value 0.007526315
  • Same P-value and conclusion.

Comments

  • Like the sign test, Mood’s median test doesn’t use the data very efficiently (only, is each value above or below grand median).
  • Thus, if we can justify doing t-test, we should do it, but was not the case here.
  • When both tests apply, the t-test will usually give smaller P-value because it uses the data more efficiently.
  • The time to use Mood’s median test is if we are definitely unhappy with the normality assumption (and thus the t-test P-value is not to be trusted).