Discriminant Analysis

Discriminant analysis

  • ANOVA and MANOVA: predict a (counted/measured) response from group membership.

  • Discriminant analysis: predict group membership based on counted/measured variables.

  • Covers same ground as logistic regression (and its variations), but emphasis on classifying observed data into correct groups.

… continued

  • Does so by searching for linear combination of original variables that best separates data into groups (canonical variables).

  • Assumption here that groups are known (for data we have). If trying to “best separate” data into unknown groups, see cluster analysis.

Packages

library(MASS, exclude = "select")
library(tidyverse)
library(ggrepel)
library(ggord) # installation instructions below
library(MVTests) # for Box M test
  • ggrepel allows labelling points on a plot so they don’t overwrite each other.

About select

  • Both dplyr (in tidyverse) and MASS have a function called select, and they do different things.

  • When you load MASS, make sure to load it without its select, so that when you use select, you get the one you’re used to.

  • If you forget, and you intend to use the tidyverse select, you will get a problem that is almost impossible to debug unless you have seen it before.

Installing ggord

  • ggord (the package) contains a function, also called ggord, that makes a nice picture of a discriminant analysis.
  • It lives on r-universe, so to install:
install.packages("ggord", repos = "https://fawda123.r-universe.dev")

Example 1: seed yields and weights

my_url <- "http://ritsokiguess.site/datafiles/manova1.txt"
hilo <- read_delim(my_url, " ")
g <- ggplot(hilo, aes(x = yield, y = weight,
  colour = fertilizer)) + geom_point(size = 4)

Recall data from MANOVA: needed a multivariate analysis to find difference in seed yield and weight based on whether they were high or low fertilizer.

Basic discriminant analysis

hilo.1 <- lda(fertilizer ~ yield + weight, data = hilo)
  • Uses lda from package MASS.

  • “Predicting” group membership from measured variables.

Output (in hilo.1)

Call:
lda(fertilizer ~ yield + weight, data = hilo)

Prior probabilities of groups:
high  low 
 0.5  0.5 

Group means:
     yield weight
high  35.0  13.25
low   32.5  12.00

Coefficients of linear discriminants:
              LD1
yield  -0.7666761
weight -1.2513563

Things to take from output 1/2

  • Group means: high-fertilizer plants have (slightly) higher mean yield and weight than low-fertilizer plants.

  • “Coefficients of linear discriminants”: are scores constructed from observed variables that best separate the groups.

  • For any plant, get LD1 score by taking \(-0.76\) times yield plus \(-1.25\) times weight, add up, standardize.

Things to take from output 1/2

  • the LD1 coefficients are like slopes:

    • if yield higher, LD1 score for a plant lower
    • if weight higher, LD1 score for a plant lower
  • High-fertilizer plants have higher yield and weight, thus low (negative) LD1 score. Low-fertilizer plants have low yield and weight, thus high (positive) LD1 score.

  • One LD1 score for each observation. Plot with actual groups, because the LD1 scale should do the best job of distinguishing the fertilizer groups.

How many linear discriminants?

  • Smaller of these:

    • Number of variables

    • Number of groups minus 1

  • Seed yield and weight: 2 variables, 2 groups, \(\min(2,2-1)=1\).

Getting LD scores

Feed output from LDA into predict:

p <- predict(hilo.1)
hilo.2 <- cbind(hilo, p)

the LD scores

hilo.2
  fertilizer yield weight class posterior.high posterior.low        LD1
1        low    34     10   low   2.108619e-05  9.999789e-01  3.0931414
2        low    29     14   low   1.245320e-03  9.987547e-01  1.9210963
3        low    35     11   low   2.315016e-02  9.768498e-01  1.0751090
4        low    32     13   low   4.579036e-02  9.542096e-01  0.8724245
5       high    33     14  high   9.817958e-01  1.820422e-02 -1.1456079
6       high    38     12  high   9.998195e-01  1.804941e-04 -2.4762756
7       high    34     13  high   9.089278e-01  9.107216e-02 -0.6609276
8       high    35     14  high   9.999109e-01  8.914534e-05 -2.6789600

LD1 scores in order

hilo.2 %>% select(fertilizer, yield, weight, LD1) %>% 
  arrange(desc(LD1))
  fertilizer yield weight        LD1
1        low    34     10  3.0931414
2        low    29     14  1.9210963
3        low    35     11  1.0751090
4        low    32     13  0.8724245
7       high    34     13 -0.6609276
5       high    33     14 -1.1456079
6       high    38     12 -2.4762756
8       high    35     14 -2.6789600

LD1 scores and fertilizer

Most positive LD1 score is most obviously low fertilizer, most negative is most obviously high.

High fertilizer have yield and weight high, negative LD1 scores.

Plotting LD1 scores

With one LD score, plot against (true) groups, eg. boxplot:

ggplot(hilo.2, aes(x = fertilizer, y = LD1)) + geom_boxplot()

What else is in hilo.2?

  • class: predicted fertilizer level (based on values of yield and weight).

  • posterior: predicted probability of being low or high fertilizer given yield and weight.

  • LD1: scores for (each) linear discriminant (here is only LD1) on each observation.

Predictions and predicted groups

based on yield and weight:

hilo.2 %>% select(yield, weight, fertilizer, class)
  yield weight fertilizer class
1    34     10        low   low
2    29     14        low   low
3    35     11        low   low
4    32     13        low   low
5    33     14       high  high
6    38     12       high  high
7    34     13       high  high
8    35     14       high  high

Count up correct and incorrect classification

with(hilo.2, table(obs = fertilizer, pred = class))
      pred
obs    high low
  high    4   0
  low     0   4
  • Each predicted fertilizer level is exactly same as observed one (perfect prediction).

  • Table shows no errors: all values on top-left to bottom-right diagonal.

Posterior probabilities

show how clear-cut the classification decisions were:

hilo.2 %>% 
  mutate(across(starts_with("posterior"), \(p) round(p, 4))) %>% 
  select(-LD1)
  fertilizer yield weight class posterior.high posterior.low
1        low    34     10   low         0.0000        1.0000
2        low    29     14   low         0.0012        0.9988
3        low    35     11   low         0.0232        0.9768
4        low    32     13   low         0.0458        0.9542
5       high    33     14  high         0.9818        0.0182
6       high    38     12  high         0.9998        0.0002
7       high    34     13  high         0.9089        0.0911
8       high    35     14  high         0.9999        0.0001

Comments

Only obs. 7 has any doubt: yield low for a high-fertilizer, but high weight makes up for it.

Example 2: jobs and survey scores

244 people who do one of three different jobs also took a survey that gave them scores on three different traits called Outdoor, Social, and Conservative. Can we use these survey scores to distinguish the people who do the three different jobs?

Data in https://datafiles.ritsokiguess.site/jobs.txt.

Read in

my_url <- "https://datafiles.ritsokiguess.site/jobs.txt"
jobs0 <- read_table(my_url)

── Column specification ────────────────────────────────────────────────────────
cols(
  outdoor = col_double(),
  social = col_double(),
  conservative = col_double(),
  job = col_double(),
  id = col_double()
)
jobs0 %>% slice_sample(n = 10)
# A tibble: 10 × 5
   outdoor social conservative   job    id
     <dbl>  <dbl>        <dbl> <dbl> <dbl>
 1      14     16            6     3    48
 2      15     25           11     1    73
 3      20     20            9     2    45
 4      13     27            7     1     8
 5      24     14            7     2    25
 6      20     19           11     2    82
 7      13     25           14     1    82
 8      15     16           14     3    60
 9      20     13           18     3    61
10      25     16           12     3    46

Problem

  • The jobs are numbered 1, 2, and 3, but we actually know that 1 is customer service, 2 is mechanic, 3 is dispatcher. It would be better to have those names in the dataset.

  • recode_values:

jobs0 %>% 
  mutate(job = recode_values(
    job,
    1 ~ "cs",
    2 ~ "mech",
    3 ~ "disp"
  )) -> jobs

Check

jobs %>% slice_sample(n = 10)
# A tibble: 10 × 5
   outdoor social conservative job      id
     <dbl>  <dbl>        <dbl> <chr> <dbl>
 1      17     12           17 disp      7
 2      15     14           17 disp     11
 3      14     21           16 disp      8
 4      22     19           10 mech     54
 5       9     13           16 disp     64
 6      16     28           13 mech     11
 7       6     18            6 cs       62
 8      12     14            8 disp     51
 9      22     27           12 cs       55
10      20     15            7 disp     12

Discriminant analysis

jobs.1 <- lda(job ~ outdoor + social + conservative, data = jobs)
jobs.1
Call:
lda(job ~ outdoor + social + conservative, data = jobs)

Prior probabilities of groups:
       cs      disp      mech 
0.3483607 0.2704918 0.3811475 

Group means:
      outdoor   social conservative
cs   12.51765 24.22353     9.023529
disp 15.57576 15.45455    13.242424
mech 18.53763 21.13978    10.139785

Coefficients of linear discriminants:
                     LD1         LD2
outdoor      -0.09198065 -0.22501431
social        0.19427415 -0.04978105
conservative -0.15499199  0.08734288

Proportion of trace:
   LD1    LD2 
0.7712 0.2288 

Comments

  • Now have 3 variables and 3 groups, so \(\min(3, 3-1) = 2\) LDs.
  • “Proportion of trace” says LD1 does best at separating jobs, but LD2 might do something as well.
  • High score on LD1: high on social, low on other two.
  • High score on LD2: low on outdoor.

Discriminant scores

  • via predict:
p <- predict(jobs.1)
jobs.2 <- cbind(jobs, p)
glimpse(jobs.2)
Rows: 244
Columns: 11
$ outdoor        <dbl> 10, 14, 19, 14, 14, 20, 6, 13, 18, 16, 17, 10, 17, 10, …
$ social         <dbl> 22, 17, 33, 29, 25, 25, 18, 27, 31, 35, 25, 29, 25, 22,…
$ conservative   <dbl> 5, 6, 7, 12, 7, 12, 4, 7, 9, 13, 8, 11, 7, 13, 13, 5, 1…
$ job            <chr> "cs", "cs", "cs", "cs", "cs", "cs", "cs", "cs", "cs", "…
$ id             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ class          <fct> cs, mech, cs, cs, cs, mech, cs, cs, cs, cs, cs, cs, cs,…
$ posterior.cs   <dbl> 0.9037622, 0.3677743, 0.7302117, 0.8100756, 0.7677607, …
$ posterior.disp <dbl> 7.289988e-03, 1.432468e-01, 3.186265e-04, 7.751215e-03,…
$ posterior.mech <dbl> 0.0889478485, 0.4889789008, 0.2694697105, 0.1821731894,…
$ x.LD1          <dbl> 1.6423155, 0.1480302, 2.6415213, 1.5493681, 1.5472314, …
$ x.LD2          <dbl> 0.71477348, 0.15096436, -1.68326115, 0.07764901, -0.159…

Plotting discriminant scores

  • now have two of them (quantitative), plus true job (categorical):
ggplot(jobs.2, aes(x = x.LD1, y = x.LD2, colour = job)) + geom_point()

The bi-plot

Individuals and original variables on same plot:

ggord(jobs.1, jobs$job, ellipse = FALSE, vec_ext = 5)

Comments

  • on first plot:
    • customer service mostly to right (high LD1)
    • dispatcher mostly to left (low LD1)
    • mechanic mostly at bottom (low LD2)
    • but some mixing
  • on bi-plot:
    • similar story, but also:
    • high LD1 goes with high social, low conservative
    • low LD1 goes with high outdoor

Inference from bi-plot

  • customer service high on social, low on conservative
  • dispatchers opposite of customer service
  • mechanics high on outdoor

How good was the classification?

with(jobs.2, table(obs = job, pred = class))
      pred
obs    cs disp mech
  cs   68    4   13
  disp  3   50   13
  mech 16   10   67
  • Most of the jobs were gotten correct
  • but mechanics in particular were often confused with people from other jobs

Some posterior probabilities for people gotten wrong

jobs.2 %>% 
  filter_out(job == class) %>% 
  select(job, class, starts_with("posterior")) %>% 
  slice_sample(n = 10)
     job class posterior.cs posterior.disp posterior.mech
172 mech  disp   0.03434966   0.6633133705      0.3023370
191 disp  mech   0.01290640   0.1738047250      0.8132889
94  mech    cs   0.51762760   0.1503272946      0.3320451
90  mech    cs   0.85070895   0.0222471944      0.1270439
65    cs  mech   0.37698626   0.0041498345      0.6188639
6     cs  mech   0.16825208   0.0469230463      0.7848249
102 mech    cs   0.72743507   0.0114338064      0.2611311
88  mech    cs   0.65707077   0.0220178411      0.3209114
107 mech    cs   0.47000750   0.1664738951      0.3635186
168 mech    cs   0.75341766   0.0001880719      0.2463943

Example 3: professions and leisure activities

  • 15 individuals from three different professions (politicians, administrators and belly dancers) each participate in four different leisure activities: reading, dancing, TV watching and skiing. After each activity they rate it on a 0–10 scale.

  • How can we best use the scores on the activities to predict a person’s profession?

  • Or, what combination(s) of scores best separate data into profession groups?

The data

my_url <- "http://ritsokiguess.site/datafiles/profile.txt"
active <- read_delim(my_url, " ")
active
# A tibble: 15 × 5
   job         reading dance    tv   ski
   <chr>         <dbl> <dbl> <dbl> <dbl>
 1 bellydancer       7    10     6     5
 2 bellydancer       8     9     5     7
 3 bellydancer       5    10     5     8
 4 bellydancer       6    10     6     8
 5 bellydancer       7     8     7     9
 6 politician        4     4     4     4
 7 politician        6     4     5     3
 8 politician        5     5     5     6
 9 politician        6     6     6     7
10 politician        4     5     6     5
11 admin             3     1     1     2
12 admin             5     3     1     5
13 admin             4     2     2     5
14 admin             7     1     2     4
15 admin             6     3     3     3

Discriminant analysis

active.1 <- lda(job ~ reading + dance + tv + ski, data = active)
active.1
Call:
lda(job ~ reading + dance + tv + ski, data = active)

Prior probabilities of groups:
      admin bellydancer  politician 
  0.3333333   0.3333333   0.3333333 

Group means:
            reading dance  tv ski
admin           5.0   2.0 1.8 3.8
bellydancer     6.6   9.4 5.8 7.4
politician      5.0   4.8 5.2 5.0

Coefficients of linear discriminants:
                LD1        LD2
reading -0.01297465 -0.4748081
dance   -0.95212396 -0.4614976
tv      -0.47417264  1.2446327
ski      0.04153684 -0.2033122

Proportion of trace:
   LD1    LD2 
0.8917 0.1083 

Comments

  • Two discriminants, first fair bit more important than second.

  • LD1 depends (negatively) most on dance, a bit on tv.

  • LD2 depends mostly (positively) on tv.

Misclassification

p <- predict(active.1)
active.2 <- cbind(active, p)
active.2
           job reading dance tv ski       class posterior.admin
1  bellydancer       7    10  6   5 bellydancer    1.483562e-20
2  bellydancer       8     9  5   7 bellydancer    8.933185e-15
3  bellydancer       5    10  5   8 bellydancer    3.897923e-18
4  bellydancer       6    10  6   8 bellydancer    5.036835e-20
5  bellydancer       7     8  7   9 bellydancer    1.728855e-14
6   politician       4     4  4   4  politician    2.833006e-03
7   politician       6     4  5   3  politician    7.505931e-05
8   politician       5     5  5   6  politician    1.197099e-05
9   politician       6     6  6   7  politician    2.441050e-08
10  politician       4     5  6   5  politician    8.017357e-09
11       admin       3     1  1   2       admin    9.999999e-01
12       admin       5     3  1   5       admin    9.999997e-01
13       admin       4     2  2   5       admin    9.999855e-01
14       admin       7     1  2   4       admin    1.000000e+00
15       admin       6     3  3   3       admin    9.820948e-01
   posterior.bellydancer posterior.politician      x.LD1      x.LD2
1           9.999999e-01         5.151376e-08 -5.2373137 -0.5805858
2           9.999994e-01         6.238602e-07 -3.7409181 -2.2451535
3           9.999999e-01         8.410701e-08 -4.6125812 -1.4855391
4           9.999999e-01         6.790980e-08 -5.0997285 -0.7157145
5           9.972915e-01         2.708544e-03 -3.6410910  0.7737931
6           7.140225e-09         9.971670e-01  1.4211625  1.3268707
7           2.321289e-08         9.999249e-01  0.8795037  1.8251996
8           4.652250e-06         9.999834e-01  0.0649649  1.2285733
9           2.140482e-03         9.978595e-01 -1.3327695  1.3335881
10          3.005184e-07         9.999997e-01 -0.4377699  3.1513263
11          1.760462e-22         1.256896e-07  5.6299532 -0.1411021
12          1.200738e-15         3.226100e-07  3.8243665 -2.6236501
13          1.855048e-17         1.446596e-05  4.3152925 -0.4427117
14          8.094462e-21         4.682044e-08  5.1869557 -1.2023261
15          1.395626e-11         1.790523e-02  2.7799729 -0.2025683
with(active.2, table(obs = job, pred = class))
             pred
obs           admin bellydancer politician
  admin           5           0          0
  bellydancer     0           5          0
  politician      0           0          5

Everyone correctly classified.

Plotting LDs

g <- ggplot(active.2, aes(x = x.LD1, y = x.LD2, colour = job, label = job)) + 
  geom_point() + geom_text_repel() + guides(colour = "none")
g

Biplot

# ggbiplot(active.1, groups = active$job)
ggord(active.1, active$job, ellipse = FALSE, vec_ext = 2)

Comments on plot

  • Groups well separated: bellydancers bottom left, administrators bottom right, politicians upper middle.

  • Bellydancers most negative on LD1: like dancing most.

  • Administrators most positive on LD1: like dancing least.

  • Politicians most negative on LD2: like TV-watching most.

Posterior probabilities

active.2 %>% mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>% 
  rename_with(\(p) str_remove(p, "osterior"), 
              starts_with("posterior")) %>% 
  select(job, class, starts_with("p."))
           job       class p.admin p.bellydancer p.politician
1  bellydancer bellydancer   0.000         1.000        0.000
2  bellydancer bellydancer   0.000         1.000        0.000
3  bellydancer bellydancer   0.000         1.000        0.000
4  bellydancer bellydancer   0.000         1.000        0.000
5  bellydancer bellydancer   0.000         0.997        0.003
6   politician  politician   0.003         0.000        0.997
7   politician  politician   0.000         0.000        1.000
8   politician  politician   0.000         0.000        1.000
9   politician  politician   0.000         0.002        0.998
10  politician  politician   0.000         0.000        1.000
11       admin       admin   1.000         0.000        0.000
12       admin       admin   1.000         0.000        0.000
13       admin       admin   1.000         0.000        0.000
14       admin       admin   1.000         0.000        0.000
15       admin       admin   0.982         0.000        0.018

Not much doubt.

Example 4: remote-sensing data

  • View 25 crops from air, measure 4 variables x1-x4.

  • Go back and record what each crop was.

  • Can we use the 4 variables to distinguish crops?

The data

my_url <- "http://ritsokiguess.site/datafiles/remote-sensing.txt"
crops <- read_table(my_url)
crops
# A tibble: 25 × 6
   crop        x1    x2    x3    x4 cr   
   <chr>    <dbl> <dbl> <dbl> <dbl> <chr>
 1 Corn        16    27    31    33 r    
 2 Corn        15    23    30    30 r    
 3 Corn        16    27    27    26 r    
 4 Corn        18    20    25    23 r    
 5 Corn        15    15    31    32 r    
 6 Corn        15    32    32    15 r    
 7 Corn        12    15    16    73 r    
 8 Soybeans    20    23    23    25 y    
 9 Soybeans    24    24    25    32 y    
10 Soybeans    21    25    23    24 y    
# ℹ 15 more rows

Discriminant analysis

crops.1 <- lda(crop ~ x1 + x2 + x3 + x4, data = crops)
crops.1
Call:
lda(crop ~ x1 + x2 + x3 + x4, data = crops)

Prior probabilities of groups:
      Corn     Cotton   Soybeans Sugarbeets 
      0.28       0.24       0.24       0.24 

Group means:
                 x1       x2       x3       x4
Corn       15.28571 22.71429 27.42857 33.14286
Cotton     34.50000 32.66667 35.00000 39.16667
Soybeans   21.00000 27.00000 23.50000 29.66667
Sugarbeets 31.00000 32.16667 20.00000 40.50000

Coefficients of linear discriminants:
           LD1          LD2           LD3
x1  0.14077479  0.007780184 -0.0312610362
x2  0.03006972  0.007318386  0.0085401510
x3 -0.06363974 -0.099520895 -0.0005309869
x4 -0.00677414 -0.035612707  0.0577718649

Proportion of trace:
   LD1    LD2    LD3 
0.8044 0.1832 0.0124 

Assessing

  • 3 LDs (four variables, four groups).

  • 1st two important.

  • LD1 mostly x1 (plus)

  • LD2 x3 (minus)

Predictions

  • Thus:
p <- predict(crops.1)
crops.2 <- cbind(crops, p)
with(crops.2, table(obs = crop, pred = class))
            pred
obs          Corn Cotton Soybeans Sugarbeets
  Corn          6      0        1          0
  Cotton        0      4        2          0
  Soybeans      2      0        3          1
  Sugarbeets    0      0        3          3
  • Not very good, eg. only half the Soybeans and Sugarbeets classified correctly.

Plotting the LDs

ggplot(crops.2, aes(x = x.LD1, y = x.LD2, colour = crop)) +
  geom_point()

Corn (red) mostly left, cotton (green) sort of right, soybeans and sugarbeets (blue and purple) mixed up.

Biplot

ggord(crops.1, crops$crop, ellipse = FALSE, vec_ext = 10)
# ggbiplot(crops.1, groups = crops$crop)

Comments

  • Corn low on LD1 (left), hence low on x1

  • Cotton tends to be high on LD1 (high x1)

  • one cotton very low on LD2 (high x3?)

  • Rather mixed up.

Posterior probs (some)

crops.2 %>% mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>% 
  rename_with(\(p) str_remove(p, "osterior"), 
              starts_with("posterior")) %>% 
  filter(crop != class) %>% 
  select(crop, class, starts_with("p."))
         crop      class p.Corn p.Cotton p.Soybeans p.Sugarbeets
4        Corn   Soybeans  0.443    0.034      0.494        0.029
11   Soybeans Sugarbeets  0.010    0.107      0.299        0.584
12   Soybeans       Corn  0.684    0.009      0.296        0.011
13   Soybeans       Corn  0.467    0.199      0.287        0.047
15     Cotton   Soybeans  0.056    0.241      0.379        0.324
17     Cotton   Soybeans  0.066    0.138      0.489        0.306
20 Sugarbeets   Soybeans  0.381    0.146      0.395        0.078
21 Sugarbeets   Soybeans  0.106    0.144      0.518        0.232
24 Sugarbeets   Soybeans  0.088    0.207      0.489        0.216

Comments

  • These were the misclassified ones, but the posterior probability of being correct was not usually too low.

  • The correctly-classified ones are not very clear-cut either.