Analysis of Covariance

Analysis of covariance

  • ANOVA: explanatory variables categorical (divide data into groups)

  • traditionally, analysis of covariance has categorical \(x\)’s plus one numerical \(x\) (“covariate”) to be adjusted for.

  • lm handles this too.

  • Simple example: two treatments (drugs) (a and b), with before and after scores.

  • Does knowing before score and/or treatment help to predict after score?

  • Is after score different by treatment/before score?

Data: treatment, before, after

a 5 20
a 10 23
a 12 30
a 9 25
a 23 34
a 21 40
a 14 27
a 18 38
a 6 24
a 13 31
b 7 19
b 12 26
b 27 33
b 24 35
b 18 30
b 22 31
b 26 34
b 21 28
b 14 23
b 9 22

Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(marginaleffects)

the last of these for predictions.

Read in data

url <- "http://ritsokiguess.site/datafiles/ancova.txt"
prepost <- read_delim(url, " ")
prepost
# A tibble: 20 × 3
   drug  before after
   <chr>  <dbl> <dbl>
 1 a          5    20
 2 a         10    23
 3 a         12    30
 4 a          9    25
 5 a         23    34
 6 a         21    40
 7 a         14    27
 8 a         18    38
 9 a          6    24
10 a         13    31
11 b          7    19
12 b         12    26
13 b         27    33
14 b         24    35
15 b         18    30
16 b         22    31
17 b         26    34
18 b         21    28
19 b         14    23
20 b          9    22

Making a plot

ggplot(prepost, aes(x = before, y = after, colour = drug)) +
  geom_point() + geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

Comments

  • As before score goes up, after score goes up.

  • Red points (drug A) generally above blue points (drug B), for comparable before score.

  • Suggests before score effect and drug effect.

The means

prepost %>%
  group_by(drug) %>%
  summarize(
    before_mean = mean(before),
    after_mean = mean(after)
  )
# A tibble: 2 × 3
  drug  before_mean after_mean
  <chr>       <dbl>      <dbl>
1 a            13.1       29.2
2 b            18         28.1
  • Mean “after” score slightly higher for treatment A.

  • Mean “before” score much higher for treatment B.

  • Greater improvement on treatment A.

Testing for interaction

prepost.1 <- lm(after ~ before * drug, data = prepost)
drop1(prepost.1, test = "F")
Single term deletions

Model:
after ~ before * drug
            Df Sum of Sq    RSS    AIC F value Pr(>F)
<none>                   109.98 42.092               
before:drug  1    12.337 122.32 42.218  1.7948 0.1991
  • Interaction not significant. Will remove later.

Predictions

Set up values to predict for, median and quartiles for before, the two drugs:

new <- datagrid(before = c(9.75, 14, 21.25), 
                drug = c("a", "b"), model = prepost.1)
new
  before drug rowid
1   9.75    a     1
2   9.75    b     2
3  14.00    a     3
4  14.00    b     4
5  21.25    a     5
6  21.25    b     6

and then

cbind(predictions(prepost.1, newdata = new)) %>% 
  select(drug, before, estimate, conf.low, conf.high)
  drug before estimate conf.low conf.high
1    a   9.75 25.93250 24.05059  27.81442
2    b   9.75 22.14565 19.58681  24.70450
3    a  14.00 30.07784 28.43296  31.72271
4    b  14.00 25.21304 23.32649  27.09959
5    a  21.25 37.14929 34.32557  39.97300
6    b  21.25 30.44565 28.64373  32.24758

Predictions (with interaction included), plotted

plot_predictions(model = prepost.1, 
                 condition = c("before", "drug"))
Ignoring unknown labels:
• linetype : "drug"

Lines almost parallel, but not quite.

Taking out interaction

prepost.2 <- update(prepost.1, . ~ . - before:drug)
drop1(prepost.2, test = "F")
Single term deletions

Model:
after ~ before + drug
       Df Sum of Sq    RSS    AIC F value    Pr(>F)    
<none>              122.32 42.218                      
before  1    540.18 662.50 74.006  75.074 1.211e-07 ***
drug    1    115.31 237.63 53.499  16.025 0.0009209 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Take out non-significant interaction.

  • before and drug strongly significant.

  • Do predictions again and plot them.

Predictions

cbind(predictions(prepost.2, newdata = new)) %>% 
  select(drug, before, estimate)
  drug before estimate
1    a   9.75 26.42794
2    b   9.75 21.27328
3    a  14.00 29.94473
4    b  14.00 24.79007
5    a  21.25 35.94397
6    b  21.25 30.78931

Plot of predicted values

plot_predictions(prepost.2, condition = c("before", "drug"))
Ignoring unknown labels:
• linetype : "drug"

This time the lines are exactly parallel. No-interaction model forces them to have the same slope.

Different look at model output

  • drop1(prepost.2) tests for significant effect of before score and of drug, but doesn’t help with interpretation.

  • summary(prepost.2) views as regression with slopes:

summary(prepost.2)

Call:
lm(formula = after ~ before + drug, data = prepost)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6348 -2.5099 -0.2038  1.8871  4.7453 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  18.3600     1.5115  12.147 8.35e-10 ***
before        0.8275     0.0955   8.665 1.21e-07 ***
drugb        -5.1547     1.2876  -4.003 0.000921 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.682 on 17 degrees of freedom
Multiple R-squared:  0.817, Adjusted R-squared:  0.7955 
F-statistic: 37.96 on 2 and 17 DF,  p-value: 5.372e-07

Understanding those slopes

tidy(prepost.2)
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   18.4      1.51       12.1  8.35e-10
2 before         0.827    0.0955      8.66 1.21e- 7
3 drugb         -5.15     1.29       -4.00 9.21e- 4
  • before ordinary numerical variable; drug categorical.

  • lm uses first category druga as baseline.

  • Intercept is prediction of after score for before score 0 and drug A.

  • before slope is predicted change in after score when before score increases by 1 (usual slope)

  • Slope for drugb is change in predicted after score for being on drug B rather than drug A. Same for any before score (no interaction).

Summary

  • ANCOVA model: fits different regression line for each group, predicting response from covariate.

  • ANCOVA model with interaction between factor and covariate allows different slopes for each line.

  • Sometimes those lines can cross over!

  • If interaction not significant, take out. Lines then parallel.

  • With parallel lines, groups have consistent effect regardless of value of covariate.

A second example

An ice cream seller is considering two locations in Amsterdam to sell ice cream: Dappermarkt and Oosterpark. The ice cream seller has data on weekly sales and average weekly temperature at those two locations.

The data

my_url <- "http://datafiles.ritsokiguess.site/icecream_sales.csv"
icecream <- read_csv(my_url)
Rows: 40 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): location
dbl (2): temperature, sales

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
icecream
# A tibble: 40 × 3
   temperature sales location  
         <dbl> <dbl> <chr>     
 1        15.3  601. Oosterpark
 2        17.4  658. Oosterpark
 3        21.5  747. Oosterpark
 4        28.2  855. Oosterpark
 5        14.0  549. Oosterpark
 6        28.0  937. Oosterpark
 7        28.9  960. Oosterpark
 8        23.2  730. Oosterpark
 9        22.6  756. Oosterpark
10        11.2  456. Oosterpark
# ℹ 30 more rows

Sales by location

icecream %>% 
  group_by(location) %>% 
  summarize(mean_sales = mean(sales))
# A tibble: 2 × 2
  location    mean_sales
  <chr>            <dbl>
1 Dappermarkt       660.
2 Oosterpark        733.

… so Oosterpark is better, right?

Fit a model including temperature and interaction as well

icecream.1 <- lm(sales ~ temperature * location, 
                 data = icecream)
drop1(icecream.1, test = "F")
Single term deletions

Model:
sales ~ temperature * location
                     Df Sum of Sq   RSS    AIC F value    Pr(>F)    
<none>                            48903 292.35                      
temperature:location  1     33457 82360 311.20   24.63 1.683e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction is significant and must be kept.

What does a significant interaction mean here?

ggplot(icecream, aes(x = temperature, y = sales, 
                     colour = location)) + 
  geom_point() + geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

It means that the lines have significantly different slopes.

Summary output

summary(icecream.1)

Call:
lm(formula = sales ~ temperature * location, data = icecream)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.018 -32.222   3.449  28.157  83.390 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      78.407     27.717   2.829  0.00759 ** 
temperature                      37.481      1.704  21.995  < 2e-16 ***
locationOosterpark               99.644     42.527   2.343  0.02477 *  
temperature:locationOosterpark  -11.194      2.255  -4.963 1.68e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 36.86 on 36 degrees of freedom
Multiple R-squared:  0.9588,    Adjusted R-squared:  0.9554 
F-statistic: 279.6 on 3 and 36 DF,  p-value: < 2.2e-16

Conclusion

  • The slope for Oosterpark is 11.2 less than for the baseline Dappermarkt.

  • That is, as temperature rises, sales increase faster at Dappermarkt.

  • So, what should the ice cream seller do?

The graph again

ggplot(icecream, aes(x = temperature, y = sales, 
                     colour = location)) + 
  geom_point() + geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

What to do

  • At any temperature above about 12.5, sales are higher at Dappermarkt, so that’s where the ice cream seller should go
  • This is the opposite to a decision purely based on sales
  • At temperatures below about 12.5, it makes no difference where the seller goes
  • Because there is an interaction, the best choice of location depends on the temperature.
  • Sales depend on temperature, location, and their combination, so it was good to include temperature and the interaction in the model.