ANOVA and MANOVA: predict a (counted/measured) response from group membership.
Discriminant analysis: predict group membership based on counted/measured variables.
Covers same ground as logistic regression (and its variations), but emphasis on classifying observed data into correct groups.
Does so by searching for linear combination of original variables that best separates data into groups (canonical variables).
Assumption here that groups are known (for data we have). If trying to “best separate” data into unknown groups, see cluster analysis.
ggrepel allows labelling points on a plot so they don’t overwrite each other.selectBoth dplyr (in tidyverse) and MASS have a function called select, and they do different things.
When you load MASS, make sure to load it without its select, so that when you use select, you get the one you’re used to.
If you forget, and you intend to use the tidyverse select, you will get a problem that is almost impossible to debug unless you have seen it before.
ggordggord (the package) contains a function, also called ggord, that makes a nice picture of a discriminant analysis.r-universe, so to install:Recall data from MANOVA: needed a multivariate analysis to find difference in seed yield and weight based on whether they were high or low fertilizer.
Uses lda from package MASS.
“Predicting” group membership from measured variables.
hilo.1)Call:
lda(fertilizer ~ yield + weight, data = hilo)
Prior probabilities of groups:
high low
0.5 0.5
Group means:
yield weight
high 35.0 13.25
low 32.5 12.00
Coefficients of linear discriminants:
LD1
yield -0.7666761
weight -1.2513563
Group means: high-fertilizer plants have (slightly) higher mean yield and weight than low-fertilizer plants.
“Coefficients of linear discriminants”: are scores constructed from observed variables that best separate the groups.
For any plant, get LD1 score by taking \(-0.76\) times yield plus \(-1.25\) times weight, add up, standardize.
the LD1 coefficients are like slopes:
High-fertilizer plants have higher yield and weight, thus low (negative) LD1 score. Low-fertilizer plants have low yield and weight, thus high (positive) LD1 score.
One LD1 score for each observation. Plot with actual groups, because the LD1 scale should do the best job of distinguishing the fertilizer groups.
Smaller of these:
Number of variables
Number of groups minus 1
Seed yield and weight: 2 variables, 2 groups, \(\min(2,2-1)=1\).
Feed output from LDA into predict:
fertilizer yield weight class posterior.high posterior.low LD1
1 low 34 10 low 2.108619e-05 9.999789e-01 3.0931414
2 low 29 14 low 1.245320e-03 9.987547e-01 1.9210963
3 low 35 11 low 2.315016e-02 9.768498e-01 1.0751090
4 low 32 13 low 4.579036e-02 9.542096e-01 0.8724245
5 high 33 14 high 9.817958e-01 1.820422e-02 -1.1456079
6 high 38 12 high 9.998195e-01 1.804941e-04 -2.4762756
7 high 34 13 high 9.089278e-01 9.107216e-02 -0.6609276
8 high 35 14 high 9.999109e-01 8.914534e-05 -2.6789600
Most positive LD1 score is most obviously low fertilizer, most negative is most obviously high.
High fertilizer have yield and weight high, negative LD1 scores.
With one LD score, plot against (true) groups, eg. boxplot:
hilo.2?class: predicted fertilizer level (based on values of yield and weight).
posterior: predicted probability of being low or high fertilizer given yield and weight.
LD1: scores for (each) linear discriminant (here is only LD1) on each observation.
based on yield and weight:
Each predicted fertilizer level is exactly same as observed one (perfect prediction).
Table shows no errors: all values on top-left to bottom-right diagonal.
show how clear-cut the classification decisions were:
fertilizer yield weight class posterior.high posterior.low
1 low 34 10 low 0.0000 1.0000
2 low 29 14 low 0.0012 0.9988
3 low 35 11 low 0.0232 0.9768
4 low 32 13 low 0.0458 0.9542
5 high 33 14 high 0.9818 0.0182
6 high 38 12 high 0.9998 0.0002
7 high 34 13 high 0.9089 0.0911
8 high 35 14 high 0.9999 0.0001
244 people who do one of three different jobs also took a survey that gave them scores on three different traits called Outdoor, Social, and Conservative. Can we use these survey scores to distinguish the people who do the three different jobs?
── Column specification ────────────────────────────────────────────────────────
cols(
outdoor = col_double(),
social = col_double(),
conservative = col_double(),
job = col_double(),
id = col_double()
)
# A tibble: 10 × 5
outdoor social conservative job id
<dbl> <dbl> <dbl> <dbl> <dbl>
1 14 16 6 3 48
2 15 25 11 1 73
3 20 20 9 2 45
4 13 27 7 1 8
5 24 14 7 2 25
6 20 19 11 2 82
7 13 25 14 1 82
8 15 16 14 3 60
9 20 13 18 3 61
10 25 16 12 3 46
The jobs are numbered 1, 2, and 3, but we actually know that 1 is customer service, 2 is mechanic, 3 is dispatcher. It would be better to have those names in the dataset.
recode_values:
Call:
lda(job ~ outdoor + social + conservative, data = jobs)
Prior probabilities of groups:
cs disp mech
0.3483607 0.2704918 0.3811475
Group means:
outdoor social conservative
cs 12.51765 24.22353 9.023529
disp 15.57576 15.45455 13.242424
mech 18.53763 21.13978 10.139785
Coefficients of linear discriminants:
LD1 LD2
outdoor -0.09198065 -0.22501431
social 0.19427415 -0.04978105
conservative -0.15499199 0.08734288
Proportion of trace:
LD1 LD2
0.7712 0.2288
social, low on other two.outdoor.predict:Rows: 244
Columns: 11
$ outdoor <dbl> 10, 14, 19, 14, 14, 20, 6, 13, 18, 16, 17, 10, 17, 10, …
$ social <dbl> 22, 17, 33, 29, 25, 25, 18, 27, 31, 35, 25, 29, 25, 22,…
$ conservative <dbl> 5, 6, 7, 12, 7, 12, 4, 7, 9, 13, 8, 11, 7, 13, 13, 5, 1…
$ job <chr> "cs", "cs", "cs", "cs", "cs", "cs", "cs", "cs", "cs", "…
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ class <fct> cs, mech, cs, cs, cs, mech, cs, cs, cs, cs, cs, cs, cs,…
$ posterior.cs <dbl> 0.9037622, 0.3677743, 0.7302117, 0.8100756, 0.7677607, …
$ posterior.disp <dbl> 7.289988e-03, 1.432468e-01, 3.186265e-04, 7.751215e-03,…
$ posterior.mech <dbl> 0.0889478485, 0.4889789008, 0.2694697105, 0.1821731894,…
$ x.LD1 <dbl> 1.6423155, 0.1480302, 2.6415213, 1.5493681, 1.5472314, …
$ x.LD2 <dbl> 0.71477348, 0.15096436, -1.68326115, 0.07764901, -0.159…
Individuals and original variables on same plot:
social, low conservativeoutdoorsocial, low on conservativeoutdoor pred
obs cs disp mech
cs 68 4 13
disp 3 50 13
mech 16 10 67
job class posterior.cs posterior.disp posterior.mech
172 mech disp 0.03434966 0.6633133705 0.3023370
191 disp mech 0.01290640 0.1738047250 0.8132889
94 mech cs 0.51762760 0.1503272946 0.3320451
90 mech cs 0.85070895 0.0222471944 0.1270439
65 cs mech 0.37698626 0.0041498345 0.6188639
6 cs mech 0.16825208 0.0469230463 0.7848249
102 mech cs 0.72743507 0.0114338064 0.2611311
88 mech cs 0.65707077 0.0220178411 0.3209114
107 mech cs 0.47000750 0.1664738951 0.3635186
168 mech cs 0.75341766 0.0001880719 0.2463943
15 individuals from three different professions (politicians, administrators and belly dancers) each participate in four different leisure activities: reading, dancing, TV watching and skiing. After each activity they rate it on a 0–10 scale.
How can we best use the scores on the activities to predict a person’s profession?
Or, what combination(s) of scores best separate data into profession groups?
# A tibble: 15 × 5
job reading dance tv ski
<chr> <dbl> <dbl> <dbl> <dbl>
1 bellydancer 7 10 6 5
2 bellydancer 8 9 5 7
3 bellydancer 5 10 5 8
4 bellydancer 6 10 6 8
5 bellydancer 7 8 7 9
6 politician 4 4 4 4
7 politician 6 4 5 3
8 politician 5 5 5 6
9 politician 6 6 6 7
10 politician 4 5 6 5
11 admin 3 1 1 2
12 admin 5 3 1 5
13 admin 4 2 2 5
14 admin 7 1 2 4
15 admin 6 3 3 3
Call:
lda(job ~ reading + dance + tv + ski, data = active)
Prior probabilities of groups:
admin bellydancer politician
0.3333333 0.3333333 0.3333333
Group means:
reading dance tv ski
admin 5.0 2.0 1.8 3.8
bellydancer 6.6 9.4 5.8 7.4
politician 5.0 4.8 5.2 5.0
Coefficients of linear discriminants:
LD1 LD2
reading -0.01297465 -0.4748081
dance -0.95212396 -0.4614976
tv -0.47417264 1.2446327
ski 0.04153684 -0.2033122
Proportion of trace:
LD1 LD2
0.8917 0.1083
Two discriminants, first fair bit more important than second.
LD1 depends (negatively) most on dance, a bit on tv.
LD2 depends mostly (positively) on tv.
job reading dance tv ski class posterior.admin
1 bellydancer 7 10 6 5 bellydancer 1.483562e-20
2 bellydancer 8 9 5 7 bellydancer 8.933185e-15
3 bellydancer 5 10 5 8 bellydancer 3.897923e-18
4 bellydancer 6 10 6 8 bellydancer 5.036835e-20
5 bellydancer 7 8 7 9 bellydancer 1.728855e-14
6 politician 4 4 4 4 politician 2.833006e-03
7 politician 6 4 5 3 politician 7.505931e-05
8 politician 5 5 5 6 politician 1.197099e-05
9 politician 6 6 6 7 politician 2.441050e-08
10 politician 4 5 6 5 politician 8.017357e-09
11 admin 3 1 1 2 admin 9.999999e-01
12 admin 5 3 1 5 admin 9.999997e-01
13 admin 4 2 2 5 admin 9.999855e-01
14 admin 7 1 2 4 admin 1.000000e+00
15 admin 6 3 3 3 admin 9.820948e-01
posterior.bellydancer posterior.politician x.LD1 x.LD2
1 9.999999e-01 5.151376e-08 -5.2373137 -0.5805858
2 9.999994e-01 6.238602e-07 -3.7409181 -2.2451535
3 9.999999e-01 8.410701e-08 -4.6125812 -1.4855391
4 9.999999e-01 6.790980e-08 -5.0997285 -0.7157145
5 9.972915e-01 2.708544e-03 -3.6410910 0.7737931
6 7.140225e-09 9.971670e-01 1.4211625 1.3268707
7 2.321289e-08 9.999249e-01 0.8795037 1.8251996
8 4.652250e-06 9.999834e-01 0.0649649 1.2285733
9 2.140482e-03 9.978595e-01 -1.3327695 1.3335881
10 3.005184e-07 9.999997e-01 -0.4377699 3.1513263
11 1.760462e-22 1.256896e-07 5.6299532 -0.1411021
12 1.200738e-15 3.226100e-07 3.8243665 -2.6236501
13 1.855048e-17 1.446596e-05 4.3152925 -0.4427117
14 8.094462e-21 4.682044e-08 5.1869557 -1.2023261
15 1.395626e-11 1.790523e-02 2.7799729 -0.2025683
pred
obs admin bellydancer politician
admin 5 0 0
bellydancer 0 5 0
politician 0 0 5
Everyone correctly classified.
Groups well separated: bellydancers bottom left, administrators bottom right, politicians upper middle.
Bellydancers most negative on LD1: like dancing most.
Administrators most positive on LD1: like dancing least.
Politicians most negative on LD2: like TV-watching most.
job class p.admin p.bellydancer p.politician
1 bellydancer bellydancer 0.000 1.000 0.000
2 bellydancer bellydancer 0.000 1.000 0.000
3 bellydancer bellydancer 0.000 1.000 0.000
4 bellydancer bellydancer 0.000 1.000 0.000
5 bellydancer bellydancer 0.000 0.997 0.003
6 politician politician 0.003 0.000 0.997
7 politician politician 0.000 0.000 1.000
8 politician politician 0.000 0.000 1.000
9 politician politician 0.000 0.002 0.998
10 politician politician 0.000 0.000 1.000
11 admin admin 1.000 0.000 0.000
12 admin admin 1.000 0.000 0.000
13 admin admin 1.000 0.000 0.000
14 admin admin 1.000 0.000 0.000
15 admin admin 0.982 0.000 0.018
Not much doubt.
View 25 crops from air, measure 4 variables x1-x4.
Go back and record what each crop was.
Can we use the 4 variables to distinguish crops?
# A tibble: 25 × 6
crop x1 x2 x3 x4 cr
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Corn 16 27 31 33 r
2 Corn 15 23 30 30 r
3 Corn 16 27 27 26 r
4 Corn 18 20 25 23 r
5 Corn 15 15 31 32 r
6 Corn 15 32 32 15 r
7 Corn 12 15 16 73 r
8 Soybeans 20 23 23 25 y
9 Soybeans 24 24 25 32 y
10 Soybeans 21 25 23 24 y
# ℹ 15 more rows
Call:
lda(crop ~ x1 + x2 + x3 + x4, data = crops)
Prior probabilities of groups:
Corn Cotton Soybeans Sugarbeets
0.28 0.24 0.24 0.24
Group means:
x1 x2 x3 x4
Corn 15.28571 22.71429 27.42857 33.14286
Cotton 34.50000 32.66667 35.00000 39.16667
Soybeans 21.00000 27.00000 23.50000 29.66667
Sugarbeets 31.00000 32.16667 20.00000 40.50000
Coefficients of linear discriminants:
LD1 LD2 LD3
x1 0.14077479 0.007780184 -0.0312610362
x2 0.03006972 0.007318386 0.0085401510
x3 -0.06363974 -0.099520895 -0.0005309869
x4 -0.00677414 -0.035612707 0.0577718649
Proportion of trace:
LD1 LD2 LD3
0.8044 0.1832 0.0124
3 LDs (four variables, four groups).
1st two important.
LD1 mostly x1 (plus)
LD2 x3 (minus)
pred
obs Corn Cotton Soybeans Sugarbeets
Corn 6 0 1 0
Cotton 0 4 2 0
Soybeans 2 0 3 1
Sugarbeets 0 0 3 3
Corn (red) mostly left, cotton (green) sort of right, soybeans and sugarbeets (blue and purple) mixed up.
Corn low on LD1 (left), hence low on x1
Cotton tends to be high on LD1 (high x1)
one cotton very low on LD2 (high x3?)
Rather mixed up.
crop class p.Corn p.Cotton p.Soybeans p.Sugarbeets
4 Corn Soybeans 0.443 0.034 0.494 0.029
11 Soybeans Sugarbeets 0.010 0.107 0.299 0.584
12 Soybeans Corn 0.684 0.009 0.296 0.011
13 Soybeans Corn 0.467 0.199 0.287 0.047
15 Cotton Soybeans 0.056 0.241 0.379 0.324
17 Cotton Soybeans 0.066 0.138 0.489 0.306
20 Sugarbeets Soybeans 0.381 0.146 0.395 0.078
21 Sugarbeets Soybeans 0.106 0.144 0.518 0.232
24 Sugarbeets Soybeans 0.088 0.207 0.489 0.216
These were the misclassified ones, but the posterior probability of being correct was not usually too low.
The correctly-classified ones are not very clear-cut either.
Comments
Only obs. 7 has any doubt:
yieldlow for a high-fertilizer, but highweightmakes up for it.