Discriminant Analysis

ANOVA and MANOVA: predict a (counted/measured) response from group membership.
Discriminant analysis: predict group membership based on counted/measured variables.
Covers same ground as logistic regression (and its variations), but emphasis on classifying observed data into correct groups.

… continued

Does so by searching for linear combination of original variables that best separates data into groups (canonical variables).
Assumption here that groups are known (for data we have). If trying to “best separate” data into unknown groups, see cluster analysis.

Packages

library(MASS, exclude = "select")
library(tidyverse)
library(ggrepel)
library(ggord) # installation instructions below
library(MVTests) # for Box M test

ggrepel allows labelling points on a plot so they don’t overwrite each other.

About `select`

Both dplyr (in tidyverse) and MASS have a function called select, and they do different things.
When you load MASS, make sure to load it without its select, so that when you use select, you get the one you’re used to.
If you forget, and you intend to use the tidyverse select, you will get a problem that is almost impossible to debug unless you have seen it before.
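A minimal sketch of what the `exclude` argument does, assuming a fresh R session in which MASS has not already been attached. It only checks what is visible on the search path; it is an illustration, not part of the analysis itself.

```r
# Attach MASS but leave its select() off the search path
library(MASS, exclude = "select")

# select is not among the attached MASS functions ...
"select" %in% ls("package:MASS")

# ... but MASS's version is still reachable by its full name if ever needed
is.function(MASS::select)
```

With this in place, a bare `select` in a pipeline refers to the tidyverse one once tidyverse is loaded.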

Installing `ggord`

ggord (the package) contains a function, also called ggord, that makes a nice picture of a discriminant analysis.
It lives on r-universe, so to install:

install.packages("ggord", repos = "https://fawda123.r-universe.dev")

Example 1: seed yields and weights

my_url <- "http://ritsokiguess.site/datafiles/manova1.txt"
hilo <- read_delim(my_url, " ")
g <- ggplot(hilo, aes(x = yield, y = weight,
  colour = fertilizer)) + geom_point(size = 4)

Recall data from MANOVA: needed a multivariate analysis to find a difference in seed yield and weight according to whether the plants received high or low fertilizer.

Basic discriminant analysis

hilo.1 <- lda(fertilizer ~ yield + weight, data = hilo)

Uses lda from package MASS.
“Predicting” group membership from measured variables.

Output (in `hilo.1`)

Call:
lda(fertilizer ~ yield + weight, data = hilo)

Prior probabilities of groups:
high  low 
 0.5  0.5 

Group means:
     yield weight
high  35.0  13.25
low   32.5  12.00

Coefficients of linear discriminants:
              LD1
yield  -0.7666761
weight -1.2513563

Things to take from output 1/2

Group means: high-fertilizer plants have (slightly) higher mean yield and weight than low-fertilizer plants.
“Coefficients of linear discriminants”: the weights that turn the observed variables into scores (here LD1) that best separate the groups.
For any plant, get LD1 score by taking \(-0.77\) times yield plus \(-1.25\) times weight, add up, standardize.

Things to take from output 2/2

the LD1 coefficients are like slopes:
- if yield higher, LD1 score for a plant lower
- if weight higher, LD1 score for a plant lower
High-fertilizer plants have higher yield and weight, thus low (negative) LD1 score. Low-fertilizer plants have low yield and weight, thus high (positive) LD1 score.
One LD1 score for each observation. Plot with actual groups, because the LD1 scale should do the best job of distinguishing the fertilizer groups.
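The recipe for an LD1 score can be checked by hand. This is a sketch on made-up data (the data frame `toy` and its columns `x1`, `x2` are hypothetical, not the seed data), and it assumes that the "standardize" step amounts to centring each variable at the prior-weighted average of the group means, which is how predict appears to do it for an lda fit.

```r
library(MASS, exclude = "select")

# Made-up illustrative data (NOT the seed data): two groups, two variables
toy <- data.frame(
  group = rep(c("a", "b"), each = 5),
  x1 = c(5.1, 4.8, 5.6, 5.0, 5.3, 6.2, 6.8, 6.1, 6.5, 7.0),
  x2 = c(2.0, 2.4, 1.9, 2.6, 2.2, 3.1, 2.7, 3.4, 2.9, 3.3)
)
toy.1 <- lda(group ~ x1 + x2, data = toy)

# Centre each variable at the prior-weighted average of the group means
centre <- colSums(toy.1$prior * toy.1$means)
X <- as.matrix(toy[, c("x1", "x2")])

# Multiply the centred variables by the LD1 coefficients and add up
by_hand <- scale(X, center = centre, scale = FALSE) %*% toy.1$scaling

# Compare with the scores that predict produces
all.equal(as.vector(by_hand), as.vector(predict(toy.1)$x))
```

If the two sets of scores agree, the coefficients really are acting like slopes on the centred variables.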

How many linear discriminants?

Smaller of these:
- Number of variables
- Number of groups minus 1
Seed yield and weight: 2 variables, 2 groups, \(\min(2,2-1)=1\).
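The same rule can be checked on a dataset with more than two groups. This sketch uses R's built-in `iris` data (an assumption for illustration, not one of the examples in these notes), which has 4 measured variables and 3 species.

```r
library(MASS, exclude = "select")

# iris: 4 measured variables, 3 groups (species)
iris.1 <- lda(Species ~ ., data = iris)

n_vars <- ncol(iris) - 1            # measured variables
n_groups <- nlevels(iris$Species)   # groups
n_ld <- min(n_vars, n_groups - 1)   # smaller of the two
n_ld

# One column of coefficients per linear discriminant
ncol(iris.1$scaling)
```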

Getting LD scores

Feed output from LDA into predict:

p <- predict(hilo.1)
hilo.2 <- cbind(hilo, p)

the LD scores

hilo.2

LD1 scores in order

hilo.2 %>% select(fertilizer, yield, weight, LD1) %>% 
  arrange(desc(LD1))

LD1 scores and fertilizer

Most positive LD1 score is most obviously low fertilizer, most negative is most obviously high.

High-fertilizer plants have high yield and weight, hence negative LD1 scores.

Plotting LD1 scores

With one LD score, plot against (true) groups, e.g. boxplot:

ggplot(hilo.2, aes(x = fertilizer, y = LD1)) + geom_boxplot()

What else is in `hilo.2`?

class: predicted fertilizer level (based on values of yield and weight).
posterior (columns posterior.high and posterior.low): predicted probability of being high or low fertilizer given yield and weight.
LD1: scores for (each) linear discriminant (here there is only LD1) on each observation.
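To see where these columns come from, it can help to look at the pieces of the predict output before binding them to the data. A sketch using the built-in `iris` data (an assumption for illustration; `iris.1`, `p`, and `iris.2` are hypothetical names):

```r
library(MASS, exclude = "select")

iris.1 <- lda(Species ~ ., data = iris)
p <- predict(iris.1)

# The three components: predicted class, posterior probabilities, LD scores
names(p)

# posterior and x are matrices with one row per observation
dim(p$posterior)
dim(p$x)

# cbind flattens the matrices into ordinary columns next to the data
iris.2 <- cbind(iris, p)
names(iris.2)
```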

Predictions and predicted groups

based on yield and weight:

hilo.2 %>% select(yield, weight, fertilizer, class)

Count up correct and incorrect classification

with(hilo.2, table(obs = fertilizer, pred = class))

      pred
obs    high low
  high    4   0
  low     0   4

Each predicted fertilizer level is exactly same as observed one (perfect prediction).
Table shows no errors: all values on top-left to bottom-right diagonal.
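When the table is not perfect, the proportion correctly classified is the sum of the diagonal divided by the total. A sketch on the built-in `iris` data (an assumption for illustration, not one of the examples here):

```r
library(MASS, exclude = "select")

iris.1 <- lda(Species ~ ., data = iris)
iris.pred <- predict(iris.1)

# Observed groups in rows, predicted groups in columns
tab <- table(obs = iris$Species, pred = iris.pred$class)
tab

# Correct classifications are on the diagonal
prop_correct <- sum(diag(tab)) / sum(tab)
prop_correct
```

Anything off the diagonal is a misclassification, and its row and column say which group was mistaken for which.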

Posterior probabilities

show how clear-cut the classification decisions were:

hilo.2 %>% 
  mutate(across(starts_with("posterior"), \(p) round(p, 4))) %>% 
  select(-LD1)

Comments

Only obs. 7 has any doubt: yield low for a high-fertilizer plant, but high weight makes up for it.
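With more observations, picking out the doubtful ones by eye gets harder. One possible approach, sketched on the built-in `iris` data, is to flag observations whose largest posterior probability falls below some cutoff. The cutoff of 0.9 is an arbitrary choice for illustration.

```r
library(MASS, exclude = "select")

iris.1 <- lda(Species ~ ., data = iris)
post <- predict(iris.1)$posterior

# Largest posterior probability for each observation
top <- apply(post, 1, max)

# Observations where no group is a clear winner (cutoff 0.9 is arbitrary)
doubtful <- which(top < 0.9)
round(post[doubtful, , drop = FALSE], 4)
```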

Example 2: jobs and survey scores

244 people who do one of three different jobs also took a survey that gave them scores on three different traits called Outdoor, Social, and Conservative. Can we use these survey scores to distinguish the people who do the three different jobs?

Data in https://datafiles.ritsokiguess.site/jobs.txt.

Read in

my_url <- "https://datafiles.ritsokiguess.site/jobs.txt"
jobs0 <- read_table(my_url)
jobs0 %>% slice_sample(n = 10)

Problem

The jobs are numbered 1, 2, and 3, but we actually know that 1 is customer service, 2 is mechanic, 3 is dispatcher. It would be better to have those names in the dataset.
Use recode_values to turn the numbers into names:

jobs0 %>% 
  mutate(job = recode_values(
    job,
    1 ~ "cs",
    2 ~ "mech",
    3 ~ "disp"
  )) -> jobs

Check

jobs %>% slice_sample(n = 10)
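Another way to check a recoding like this is to cross-tabulate the old codes against the new names: each code should line up with exactly one name. The sketch below uses base R and a short made-up vector of codes (`job_code` and `lookup` are hypothetical names, not the survey data).

```r
# Made-up job codes standing in for the numeric job column
job_code <- c(1, 3, 2, 2, 1, 3)

# Lookup vector: position 1 is customer service, 2 mechanic, 3 dispatcher
lookup <- c("cs", "mech", "disp")
job_name <- lookup[job_code]

# Each row (code) should have counts in only one column (name)
table(code = job_code, name = job_name)
```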