ANOVA: explanatory variables categorical (divide data into groups)
Traditionally, analysis of covariance (ANCOVA) has categorical \(x\)’s plus one numerical \(x\) (the “covariate”) to be adjusted for.
lm handles this too.
Simple example: two treatments (drugs) (a and b), with before and after scores.
Does knowing before score and/or treatment help to predict after score?
Is after score different by treatment/before score?
Data: treatment, before, after
a 5 20
a 10 23
a 12 30
a 9 25
a 23 34
a 21 40
a 14 27
a 18 38
a 6 24
a 13 31
b 7 19
b 12 26
b 27 33
b 24 35
b 18 30
b 22 31
b 26 34
b 21 28
b 14 23
b 9 22
Packages
library(marginaleffects) # for datagrid(), used in the Predictions section
library(tidyverse)
# A tibble: 20 × 3
drug before after
<chr> <dbl> <dbl>
1 a 5 20
2 a 10 23
3 a 12 30
4 a 9 25
5 a 23 34
6 a 21 40
7 a 14 27
8 a 18 38
9 a 6 24
10 a 13 31
11 b 7 19
12 b 12 26
13 b 27 33
14 b 24 35
15 b 18 30
16 b 22 31
17 b 26 34
18 b 21 28
19 b 14 23
20 b 9 22
# A tibble: 2 × 3
drug before_mean after_mean
<chr> <dbl> <dbl>
1 a 13.1 29.2
2 b 18 28.1
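The code that produced the table of means is not shown here. A minimal sketch that reproduces it, assuming the data frame is called prepost (the name used in the model code) and typing the data in by hand rather than reading it from a file:

```r
library(dplyr)

# Data typed in from the listing above: 10 rows per drug.
# Building the data frame by hand is an assumption; the original
# presumably read it from a file.
prepost <- tibble(
  drug = rep(c("a", "b"), each = 10),
  before = c(5, 10, 12, 9, 23, 21, 14, 18, 6, 13,
             7, 12, 27, 24, 18, 22, 26, 21, 14, 9),
  after = c(20, 23, 30, 25, 34, 40, 27, 38, 24, 31,
            19, 26, 33, 35, 30, 31, 34, 28, 23, 22)
)

# Mean before and after score for each drug
means <- prepost %>%
  group_by(drug) %>%
  summarize(before_mean = mean(before), after_mean = mean(after))
means
```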
Mean “after” score slightly higher for treatment A.
Mean “before” score much higher for treatment B.
Greater improvement on treatment A: mean gain of 16.1 (29.2 - 13.1), against 10.1 (28.1 - 18) on B.
Testing for interaction
prepost.1 <- lm(after ~ before * drug, data = prepost)
drop1(prepost.1, test = "F")
Single term deletions
Model:
after ~ before * drug
            Df Sum of Sq    RSS    AIC F value Pr(>F)
<none>                   109.98 42.092
before:drug  1    12.337 122.32 42.218  1.7948 0.1991
Interaction not significant (P-value 0.199). Will remove later.
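The same test can be seen as a comparison of two nested models, with and without the interaction. A sketch under that view, with the data typed in by hand (an assumption) and using anova() instead of drop1():

```r
# Data typed in from the listing (assumption: original read it from a file)
prepost <- data.frame(
  drug = rep(c("a", "b"), each = 10),
  before = c(5, 10, 12, 9, 23, 21, 14, 18, 6, 13,
             7, 12, 27, 24, 18, 22, 26, 21, 14, 9),
  after = c(20, 23, 30, 25, 34, 40, 27, 38, 24, 31,
            19, 26, 33, 35, 30, 31, 34, 28, 23, 22)
)

# Model with interaction (different slopes) and without (parallel lines)
prepost.1 <- lm(after ~ before * drug, data = prepost)
prepost.2 <- lm(after ~ before + drug, data = prepost)

# F-test for whether the extra interaction term is needed;
# this is the same test as the before:drug line of drop1()
cmp <- anova(prepost.2, prepost.1)
cmp

# Pull out the F statistic and P-value for the interaction
f_int <- cmp[["F"]][2]
p_int <- cmp[["Pr(>F)"]][2]
```

With only one term dropped, the two approaches agree; drop1() is more convenient when several terms could be removed.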
Predictions
Set up values to predict for: the quartiles and median of before, for each of the two drugs:
new <- datagrid(before = c(9.75, 14, 21.25), drug = c("a", "b"), model = prepost.1)
new
before drug rowid
1 9.75 a 1
2 9.75 b 2
3 14.00 a 3
4 14.00 b 4
5 21.25 a 5
6 21.25 b 6
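One way to obtain predictions for such a grid is sketched below. It uses base R's expand.grid() and predict() as stand-ins, which is an assumption and not necessarily how the predictions were originally made; the data are typed in by hand.

```r
# Data typed in from the listing (assumption: original read it from a file)
prepost <- data.frame(
  drug = rep(c("a", "b"), each = 10),
  before = c(5, 10, 12, 9, 23, 21, 14, 18, 6, 13,
             7, 12, 27, 24, 18, 22, 26, 21, 14, 9),
  after = c(20, 23, 30, 25, 34, 40, 27, 38, 24, 31,
            19, 26, 33, 35, 30, 31, 34, 28, 23, 22)
)

prepost.1 <- lm(after ~ before * drug, data = prepost)

# Quartiles and median of the before scores (R's default quantile method)
q <- quantile(prepost$before, c(0.25, 0.5, 0.75))

# All combinations of the two drugs and the three before scores
new <- expand.grid(
  drug = c("a", "b"),
  before = unname(q),
  stringsAsFactors = FALSE
)

# Predicted after score for each combination, from the interaction model
new$predicted <- predict(prepost.1, newdata = new)
new
```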
prepost.2 <- update(prepost.1, . ~ . - before:drug)
drop1(prepost.2, test = "F")
Single term deletions
Model:
after ~ before + drug
       Df Sum of Sq    RSS    AIC F value    Pr(>F)
<none>              122.32 42.218
before  1    540.18 662.50 74.006  75.074 1.211e-07 ***
drug    1    115.31 237.63 53.499  16.025 0.0009209 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
before is an ordinary numerical variable; drug is categorical.
lm uses the first category, drug a, as the baseline.
Intercept is prediction of after score for before score 0 and drug A.
before slope is predicted change in after score when before score increases by 1 (usual slope).
Coefficient for drugb is change in predicted after score for being on drug B rather than drug A. Same for any before score (no interaction).
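The interpretation of the drugb coefficient can be checked numerically. A sketch, with the data typed in by hand (an assumption) and two arbitrary before scores, 10 and 20, chosen for illustration:

```r
# Data typed in from the listing (assumption: original read it from a file)
prepost <- data.frame(
  drug = rep(c("a", "b"), each = 10),
  before = c(5, 10, 12, 9, 23, 21, 14, 18, 6, 13,
             7, 12, 27, 24, 18, 22, 26, 21, 14, 9),
  after = c(20, 23, 30, 25, 34, 40, 27, 38, 24, 31,
            19, 26, 33, 35, 30, 31, 34, 28, 23, 22)
)

# Parallel-lines model: no interaction
prepost.2 <- lm(after ~ before + drug, data = prepost)
b <- coef(prepost.2)
b

# Predictions for each drug at two illustrative before scores
new <- data.frame(
  before = c(10, 10, 20, 20),
  drug = c("a", "b", "a", "b")
)
pred <- unname(predict(prepost.2, newdata = new))

# Drug B minus drug A at before = 10 and at before = 20:
# both differences equal the drugb coefficient
diff_at_10 <- pred[2] - pred[1]
diff_at_20 <- pred[4] - pred[3]
c(diff_at_10, diff_at_20, b[["drugb"]])
```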
Summary
ANCOVA model: fits different regression line for each group, predicting response from covariate.
ANCOVA model with interaction between factor and covariate allows different slopes for each line.
Sometimes those lines can cross over!
If interaction not significant, take out. Lines then parallel.
With parallel lines, groups have consistent effect regardless of value of covariate.
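To see the different slopes concretely, the sketch below recovers each drug's slope from the interaction model and compares it with a regression fitted to that drug alone. The data are typed in by hand (an assumption), and the names slope_a and slope_b are illustrative.

```r
# Data typed in from the listing (assumption: original read it from a file)
prepost <- data.frame(
  drug = rep(c("a", "b"), each = 10),
  before = c(5, 10, 12, 9, 23, 21, 14, 18, 6, 13,
             7, 12, 27, 24, 18, 22, 26, 21, 14, 9),
  after = c(20, 23, 30, 25, 34, 40, 27, 38, 24, 31,
            19, 26, 33, 35, 30, 31, 34, 28, 23, 22)
)

# Interaction model: a separate line for each drug
prepost.1 <- lm(after ~ before * drug, data = prepost)
b <- coef(prepost.1)

# Baseline (drug a) slope, and drug b slope = baseline + interaction term
slope_a <- b[["before"]]
slope_b <- b[["before"]] + b[["before:drugb"]]

# Same slopes from fitting each drug's data on its own
sep_a <- coef(lm(after ~ before, data = subset(prepost, drug == "a")))[["before"]]
sep_b <- coef(lm(after ~ before, data = subset(prepost, drug == "b")))[["before"]]

c(slope_a = slope_a, slope_b = slope_b, sep_a = sep_a, sep_b = sep_b)
```

Removing the interaction replaces these two slopes with a single common one, which is what makes the lines parallel.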
A second example
An ice cream seller is considering two locations in Amsterdam to sell ice cream: Dappermarkt and Oosterpark. The ice cream seller has data on weekly sales and average weekly temperature at those two locations.
Data: 40 rows, 3 columns: location (text); temperature and sales (numeric).
Comments (on the plot of after against before for the drug example)
As before score goes up, after score goes up.
Red points (drug A) generally above blue points (drug B), for comparable before score.
Suggests before score effect and drug effect.
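The plot these comments refer to is not reproduced here. A sketch of how a similar one could be drawn, assuming ggplot2 with its default colours (drug a reddish, drug b bluish) and the data typed in by hand; the added regression lines are an illustrative extra:

```r
library(ggplot2)

# Data typed in from the listing (assumption: original read it from a file)
prepost <- data.frame(
  drug = rep(c("a", "b"), each = 10),
  before = c(5, 10, 12, 9, 23, 21, 14, 18, 6, 13,
             7, 12, 27, 24, 18, 22, 26, 21, 14, 9),
  after = c(20, 23, 30, 25, 34, 40, 27, 38, 24, 31,
            19, 26, 33, 35, 30, 31, 34, 28, 23, 22)
)

# After score against before score, points coloured by drug,
# with a separate least-squares line for each drug
p <- ggplot(prepost, aes(x = before, y = after, colour = drug)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p
```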