Logistic Regression

Logistic regression

When response variable is measured/counted, regression can work well.
But what if response is yes/no, lived/died, success/failure?
Model probability of success.
Probability must be between 0 and 1; need method that ensures this.
Logistic regression does this. In R, is a generalized linear model with binomial “family”:

glm(y ~ x, family="binomial")

Begin with simplest case.

Packages

library(MASS, exclude = "select")
library(tidyverse)
library(marginaleffects)
library(broom)
library(nnet)
# library(conflicted)
# conflict_prefer("select", "dplyr")
# conflict_prefer("filter", "dplyr")
# conflict_prefer("rename", "dplyr")
# conflict_prefer("summarize", "dplyr")

The rats, part 1

Rats given dose of some poison; either live or die:

dose status
0 lived
1 died
2 lived
3 lived
4 died
5 died

Read in:

my_url <- "http://ritsokiguess.site/datafiles/rat.txt"
rats <- read_delim(my_url, " ")
rats

This doesn’t work

status.0 <- glm(status ~ dose, family = "binomial", data = rats)

Error in eval(family$initialize): y values must be 0 <= y <= 1

Values of response variable (here status) must be either:
- 1 = “success”, 0 = “failure”
- a factor (not text) with two levels.
The error message doesn’t say that the second is a possibility.

Basic logistic regression

So, make response into a factor first:

rats2 <- rats %>% mutate(status = factor(status))
rats2

then fit model:

status.1 <- glm(status ~ dose, family = "binomial", 
                data = rats2)

Output

summary(status.1)


Call:
glm(formula = status ~ dose, family = "binomial", data = rats2)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.6841     1.7979   0.937    0.349
dose         -0.6736     0.6140  -1.097    0.273

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8.3178  on 5  degrees of freedom
Residual deviance: 6.7728  on 4  degrees of freedom
AIC: 10.773

Number of Fisher Scoring iterations: 4

Interpreting the output

Like (multiple) regression, get tests of significance of individual \(x\)’s
Here not significant (only 6 observations).
“Slope” for dose is negative, meaning that as dose increases, probability of event modelled (survival) decreases.

Output part 2: predicted survival probs

cbind(predictions(status.1)) %>% 
  select(dose, estimate, conf.low, conf.high)

On a graph

plot_predictions(status.1, condition = "dose")

The rats, more

More realistic: more rats at each dose (say 10).
Listing each rat on one line makes a big data file.
Use format below: dose, number of survivals, number of deaths.


dose lived died
0    10    0
1     7    3 
2     6    4 
3     4    6 
4     2    8 
5     1    9

6 lines of data correspond to 60 actual rats.
Saved in rat2.txt.

These data

my_url <- "http://ritsokiguess.site/datafiles/rat2.txt"
rat2 <- read_delim(my_url, " ")
rat2

Response matrix:

Each row contains multiple observations.
Create two-column response with cbind:
- #survivals in first column,
- #deaths in second.

with(rat2, cbind(lived, died))

     lived died
[1,]    10    0
[2,]     7    3
[3,]     6    4
[4,]     4    6
[5,]     2    8
[6,]     1    9

Fit logistic regression

constructing the response in the glm:

rat2.1 <- glm(cbind(lived, died) ~ dose, 
              family = "binomial", data = rat2)

Output

Significant effect of dose now:

summary(rat2.1)


Call:
glm(formula = cbind(lived, died) ~ dose, family = "binomial", 
    data = rat2)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.3619     0.6719   3.515 0.000439 ***
dose         -0.9448     0.2351  -4.018 5.87e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 27.530  on 5  degrees of freedom
Residual deviance:  2.474  on 4  degrees of freedom
AIC: 18.94

Number of Fisher Scoring iterations: 4

Predicted survival probs

new <- datagrid(model = rat2.1, dose = 0:5)
new

cbind(predictions(rat2.1, newdata = new)) %>% 
  select(estimate, dose, conf.low, conf.high)

On a picture

plot_predictions(rat2.1, condition = "dose")

Dose and predicted log-odds

plot_predictions(rat2.1, condition = "dose", type = "link")

Comments

Significant effect of dose.
Effect of larger dose is to decrease survival probability (“slope” negative; also see in decreasing predictions.)
Confidence intervals around prediction narrower (more data).

Multiple logistic regression

With more than one \(x\), works much like multiple regression.
Example: study of patients with blood poisoning severe enough to warrant surgery. Relate survival to other potential risk factors.
Variables, 1=present, 0=absent:
- survival (death from sepsis=1), response
- shock
- malnutrition
- alcoholism
- age (as numerical variable)
- bowel infarction
See what relates to death.

Read in data

my_url <- 
  "http://ritsokiguess.site/datafiles/sepsis.txt"
sepsis <- read_delim(my_url, " ")
sepsis

Make sure categoricals really are

sepsis %>% 
  mutate(across(-age, \(x) factor(x))) -> sepsis

The data (some)

sepsis

Fit model

sepsis.1 <- glm(death ~ shock + malnut + alcohol + age +
  bowelinf, family = "binomial", data = sepsis
)

Output part 1

summary(sepsis.1)


Call:
glm(formula = death ~ shock + malnut + alcohol + age + bowelinf, 
    family = "binomial", data = sepsis)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -9.75391    2.54170  -3.838 0.000124 ***
shock1       3.67387    1.16481   3.154 0.001610 ** 
malnut1      1.21658    0.72822   1.671 0.094798 .  
alcohol1     3.35488    0.98210   3.416 0.000635 ***
age          0.09215    0.03032   3.039 0.002374 ** 
bowelinf1    2.79759    1.16397   2.403 0.016240 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 105.528  on 105  degrees of freedom
Residual deviance:  53.122  on 100  degrees of freedom
AIC: 65.122

Number of Fisher Scoring iterations: 7

Or, with `tidy` (from `broom`)

tidy(sepsis.1)

All P-values fairly small
but malnut not significant: remove.

Removing `malnut`

sepsis.2 <- update(sepsis.1, . ~ . - malnut)
summary(sepsis.2)


Call:
glm(formula = death ~ shock + alcohol + age + bowelinf, family = "binomial", 
    data = sepsis)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.89459    2.31689  -3.839 0.000124 ***
shock1       3.70119    1.10353   3.354 0.000797 ***
alcohol1     3.18590    0.91725   3.473 0.000514 ***
age          0.08983    0.02922   3.075 0.002106 ** 
bowelinf1    2.38647    1.07227   2.226 0.026039 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 105.528  on 105  degrees of freedom
Residual deviance:  56.073  on 101  degrees of freedom
AIC: 66.073

Number of Fisher Scoring iterations: 7

Comments

Everything significant now.
Most of the original \(x\)’s helped predict death. Only malnut seemed not to add anything.
Removed malnut and tried again.
Everything remaining is significant (though bowelinf actually became less significant).
All coefficients are positive, so having any of the risk factors (or being older) increases risk of death.

Predictions from model without “malnut” 1/2

A few (rows of original dataframe) chosen “at random”:

sepsis %>% slice(c(4, 1, 2, 11, 32)) -> new
new

Predictions from model without “malnut” 2/2

cbind(predictions(sepsis.2, newdata = new)) %>% 
  select(estimate, shock:bowelinf)

Comments

Survival chances pretty good if no risk factors, though decreasing with age.
Having more than one risk factor reduces survival chances dramatically.
Usually good job of predicting survival; sometimes death predicted to survive.

Another way to assess effects

of age:

new <- datagrid(model = sepsis.2, age = seq(30, 70, 10))
new

Assessing age effect

cbind(predictions(sepsis.2, newdata = new)) %>% 
  select(estimate, shock:age)

Assessing shock effect

new <- datagrid(shock = c(0, 1), model = sepsis.2)
new

cbind(predictions(sepsis.2, newdata = new)) %>%
 select(estimate, alcohol:shock)

Assessing proportionality of odds for age

An assumption we made is that log-odds of survival depends linearly on age.
Hard to get your head around, but basic idea is that survival chances go continuously up (or down) with age, instead of (for example) going up and then down.
In this case, seems reasonable, but should check:

Residuals vs. age

sepsis.2 %>% augment(sepsis) %>% 
  ggplot(aes(x = age, y = .resid, colour = death)) +
  geom_point()

Comments

No apparent problems overall.
Confusing “line” across: no risk factors, survived.

Probability and odds

For probability \(p\), odds is \(p/(1-p)\):

Prob	Odds	Log-odds	Words
0.5	0.5 / 0.5 = 1.00	0.00	even money
0.1	0.1 / 0.9 = 0.11	-2.20	9 to 1
0.4	0.4 / 0.6 = 0.67	-0.41	1.5 to 1
0.8	0.8 / 0.2 = 4.00	1.39	4 to 1 on

Gamblers use odds: if you win at 9 to 1 odds, get original stake back plus 9 times the stake.
Probability has to be between 0 and 1
Odds between 0 and infinity
Log-odds can be anything: any log-odds corresponds to valid probability.
Thus, predict log-odds of probability from explanatory variable(s), rather than probability itself.

Table of log-odds

We can make a table showing log-odds for various probabilities:

tibble(prob = c(0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 
                0.7, 0.8, 0.9, 0.95, 0.99)) %>% 
  mutate(odds = prob / (1 - prob)) %>% 
  mutate(log_odds = log(odds)) -> tab

The table

tab

Comments

The relationship between probability and log-odds is non-linear: the same change in probability can go with a different change in log-odds
In particular, the log-odds changes faster when the probability is near 0 or 1, and slower when it is near 0.5
A logistic regression gives you the change in log-odds when the explanatory variable changes by 1.
A change in log-odds of 1, say, is a substantial change in probability.

Sepsis data again

Recall prediction of probability of death from risk factors:

summary(sepsis.2)


Call:
glm(formula = death ~ shock + alcohol + age + bowelinf, family = "binomial", 
    data = sepsis)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.89459    2.31689  -3.839 0.000124 ***
shock1       3.70119    1.10353   3.354 0.000797 ***
alcohol1     3.18590    0.91725   3.473 0.000514 ***
age          0.08983    0.02922   3.075 0.002106 ** 
bowelinf1    2.38647    1.07227   2.226 0.026039 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 105.528  on 105  degrees of freedom
Residual deviance:  56.073  on 101  degrees of freedom
AIC: 66.073

Number of Fisher Scoring iterations: 7

Effects of the risk factors

Being alcoholic vs not increases log odds of dying by 3.1859. This is huge increase in probability: from 0.05 to 0.5, or from 0.1 to 0.8.
The other categorical risk factors have effects of similar size.
Being 10 years older increases log odds of dying by \(10 \times 0.08983 \simeq 0.9\). This is increase in probability from 0.1 to 0.2, or from 0.4 to 0.6: smaller, but not trivial.

More than 2 response categories

With 2 response categories, model the probability of one, and prob of other is one minus that. So doesn’t matter which category you model.
With more than 2 categories, have to think more carefully about the categories: are they
ordered: you can put them in a natural order (like low, medium, high)
nominal: ordering the categories doesn’t make sense (like red, green, blue).
R handles both kinds of response; learn how.

Ordinal response: the miners

Model probability of being in given category or lower.
Example: coal-miners often suffer disease pneumoconiosis. Likelihood of disease believed to be greater among miners who have worked longer.
Severity of disease measured on categorical scale: none, moderate, severe.

Miners data

Data are frequencies:

Exposure None Moderate Severe
5.8       98      0       0
15.0      51      2       1
21.5      34      6       3
27.5      35      5       8
33.5      32     10       9
39.5      23      7       8
46.0      12      6      10
51.5       4      2       5

Reading the data

Data in aligned columns with more than one space between, so:

my_url <- "http://ritsokiguess.site/datafiles/miners-tab.txt"
freqs <- read_table(my_url)

The data

freqs

Tidying

freqs %>%
  pivot_longer(-Exposure, names_to = "Severity", values_to = "Freq") %>%
  mutate(Severity = fct_inorder(Severity)) -> miners

Result

miners

levels(miners$Severity)

[1] "None"     "Moderate" "Severe"

Plot proportions against exposure 1/2

miners %>% 
  group_by(Exposure) %>% 
  mutate(proportion = Freq / sum(Freq)) -> prop
prop

Plot proportions against exposure 2/2

ggplot(prop, aes(x = Exposure, y = proportion,
                   colour = Severity)) + 
  geom_point() + geom_smooth(se = F)

Reminder of data setup

miners

Fitting ordered logistic model

Use function polr from package MASS. Like glm.

sev.1 <- polr(Severity ~ Exposure,
  weights = Freq,
  data = miners
)

Output: not very illuminating

sev.1 <- polr(Severity ~ Exposure,
  weights = Freq,
  data = miners,
)

summary(sev.1)

Call:
polr(formula = Severity ~ Exposure, data = miners, weights = Freq)

Coefficients:
          Value Std. Error t value
Exposure 0.0959    0.01194   8.034

Intercepts:
                Value   Std. Error t value
None|Moderate    3.9558  0.4097     9.6558
Moderate|Severe  4.8690  0.4411    11.0383

Residual Deviance: 416.9188 
AIC: 422.9188

Does exposure have an effect?

Fit model without Exposure, and compare using anova. Note 1 for model with just intercept:

sev.0 <- polr(Severity ~ 1, weights = Freq, data = miners)
anova(sev.0, sev.1)

Exposure definitely has effect on severity of disease.

Another way

What (if anything) can we drop from model with exposure?

drop1(sev.1, test = "Chisq")

Nothing. Exposure definitely has effect.

Predicted probabilities 1/2

freqs %>% select(Exposure) -> new
new

Predicted probabilities 2/2

cbind(predictions(sev.1, newdata = new)) %>%
  select(group, estimate, Exposure) %>% 
  pivot_wider(names_from = group, values_from = estimate)

Plot of predicted probabilities

Wider for looking at, longer for graph:

plot_predictions(model = sev.1, 
                 condition = c("Exposure", "group"), 
                 type = "probs") + 
  geom_point(data = prop, aes(x = Exposure, y = proportion, 
                              colour = Severity)) -> ggg

The graph

ggg

Comments

Model appears to match data well enough.
As exposure goes up, prob of None goes down, Severe goes up (sharply for high exposure).
So more exposure means worse disease.

Unordered responses

With unordered (nominal) responses, can use generalized logit.
Example: 735 people, record age and sex (male 0, female 1), which of 3 brands of some product preferred.
Data in mlogit.csv separated by commas (so read_csv will work):

my_url <- "http://ritsokiguess.site/datafiles/mlogit.csv"
brandpref <- read_csv(my_url)

The data (some)

brandpref

Bashing into shape

sex and brand not meaningful as numbers, so turn into factors:

brandpref %>%
  mutate(sex = ifelse(sex == 1, "female", "male"), 
         sex = factor(sex),
         brand = factor(brand)
         ) -> brandpref

brandpref

Fitting model

We use multinom from package nnet. Works like polr.

library(nnet)
# levels(brandpref$sex)

brands.1 <- multinom(brand ~ age + sex, data = brandpref)

# weights:  12 (6 variable)
initial  value 807.480032 
iter  10 value 702.990572
final  value 702.970704 
converged

summary output not helpful.

Can we drop anything?

Unfortunately drop1 seems not to work:

drop1(brands.1, test = "Chisq", trace = 0)

trying - age

Error in if (trace) {: argument is not interpretable as logical

So, fall back on fitting model without what you want to test, and comparing using anova.

Do age/sex help predict brand? 1/3

Fit models without each of age and sex:

brands.2 <- multinom(brand ~ age, data = brandpref)

# weights:  9 (4 variable)
initial  value 807.480032 
iter  10 value 706.796323
iter  10 value 706.796322
final  value 706.796322 
converged

brands.3 <- multinom(brand ~ sex, data = brandpref)

# weights:  9 (4 variable)
initial  value 807.480032 
final  value 791.861266 
converged

Do age/sex help predict brand? 2/3

anova(brands.2, brands.1)

Do age/sex help predict brand? 3/3

anova(brands.3, brands.1)

Comments

age definitely significant (second anova)
sex significant also (first anova), though P-value less dramatic
Keep both.
Expect to see a large effect of age, and a smaller one of sex.

Another way to build model

Start from model with everything and run step:

step(brands.1, trace = 0)

trying - age 
trying - sex

Call:
multinom(formula = brand ~ age + sex)

Coefficients:
  (Intercept)       age    sexmale
2   -11.25127 0.3682202 -0.5237736
3   -22.25571 0.6859149 -0.4658215

Residual Deviance: 1405.941 
AIC: 1417.941

Final model contains both age and sex so neither could be removed.

Making predictions

Find age 5-number summary, and the two sexes:

summary(brandpref)

 brand       sex           age      
 1:207   female:466   Min.   :24.0  
 2:307   male  :269   1st Qu.:32.0  
 3:221                Median :32.0  
                      Mean   :32.9  
                      3rd Qu.:34.0  
                      Max.   :38.0

Space the ages out a bit for prediction (see over).

Combinations

new <- datagrid(age = seq(24, 40, 4), # cover age range 
                sex = c("female", "male"), model = brands.1)
new

The predictions

cbind(predictions(brands.1, newdata = new)) %>%
  select(group, estimate, age, sex) %>% 
  pivot_wider(names_from = group, values_from = estimate)

Comments

Young males prefer brand 1, but older males prefer brand 3.
Females similar, but like brand 1 less and brand 2 more.
A clear brand effect, but the sex effect is less clear.

Making a plot

I thought plot_predictions doesn’t work as we want, but I was (sort of) wrong about that:

plot_predictions(brands.1, condition = c("age", "sex"), 
         type = "probs")

Making it go

We have to include group in the condition:

plot_predictions(brands.1, condition = c("age", "group"))

Comments

This picks the most common sex in the data (females).
See younger females prefer brand 1, older ones preferring brand 3.

For each `sex`

If we add the other variable to the end, we get facets for sex:

plot_predictions(brands.1, condition = c("age", "group", "sex"))

Not actually much difference between males and females.

A better graph

but the male-female difference was significant. How?
don’t actually plot the graph, then plot the right things:

plot_predictions(brands.1, 
                 condition = c("age", "brand", "sex"), 
                 type = "probs", draw = FALSE) %>% 
  ggplot(aes(x = age, y = estimate, colour = group, 
             linetype = sex)) +
  geom_line() -> g

The graph

Digesting the plot

Brand vs. age: younger people (of both genders) prefer brand 1, but older people (of both genders) prefer brand 3. (Explains significant age effect.)
Brand vs. sex: females (solid) like brand 1 less than males (dashed), like brand 2 more (for all ages).
Not much brand difference between genders (solid and dashed lines of same colours close), but enough to be significant.
Model didn’t include interaction, so modelled effect of gender on brand same for each age, modelled effect of age same for each gender. (See also later.)

Alternative data format

Summarize all people of same brand preference, same sex, same age on one line of data file with frequency on end:

Whole data set in 65 lines not 735! But how?

Getting alternative data format

brandpref %>%
  group_by(age, sex, brand) %>%
  summarize(Freq = n()) %>%
  ungroup() -> b
b

Fitting models, almost the same

Just have to remember weights to incorporate frequencies.
Otherwise multinom assumes you have just 1 obs on each line!
Again turn (numerical) sex and brand into factors:

b %>%
  mutate(sex = factor(sex)) %>%
  mutate(brand = factor(brand)) -> bf
b.1 <- multinom(brand ~ age + sex, data = bf, weights = Freq)
b.2 <- multinom(brand ~ age, data = bf, weights = Freq)

P-value for `sex` identical

anova(b.2, b.1)

Same P-value as before, so we haven’t changed anything important.

Trying interaction between age and sex

brands.4 <- update(brands.1, . ~ . + age:sex)

anova(brands.1, brands.4)

No evidence that effect of age on brand preference differs for the two genders.

Make graph again

plot_predictions(brands.4, 
                 condition = c("age", "brand", "sex"), 
                 type = "probs", draw = FALSE)  %>% 
  ggplot(aes(x = age, y = estimate, colour = group, 
             linetype = sex)) +
  geom_line() -> g4

Not much difference in the graph

g4

Logistic Regression