--- title: "Case study: Alcohol metabolism" editor: markdown: wrap: 72 --- ## Packages for this section ```{r} library(tidyverse) library(broom) library(MASS, exclude = "select") # explained later ``` ## Metabolizing alcohol - It is believed that women have a lower tolerance for alcohol compared to men, and they develop alcohol-related liver disease more easily than men (on average). - Italian researchers developed a theory about why this is. To test their theory, they collected some data (variables on next page). ## The variables - `Subject`: subject number (ignored by us) - `Metabol`: first-pass metabolism of alcohol in stomach (millimoles per litre-hour), response - `Gastric`: gastric alcohol dehydrogenase activity in stomach (micromoles per minute per gram of tissue) - `Sex`: whether subject was `Female` or `Male` (categorical) - `Alcohol`: whether subject was an alcoholic or not (assessed by how much alcohol they drank and how frequently). Values `Alcoholic`, `Non-alcoholic` (categorical). ## Comments - one response variable, `Metabol` - *several* explanatory variables - hence, *multiple* regression - some of them are categorical. - according to theory, more gastric activity should mean higher metabolism - metabolism is expected to be higher for males than females even at same level of gastric activity - may be an effect of alcoholism, but direction not predicted by theory. ## The data \small ```{r} #| message: false my_url <- "http://datafiles.ritsokiguess.site/alcohol-metabolism.csv" alc <- read_csv(my_url) alc ``` \normalsize ## A graph - Have two quantitative variables (suggests scatterplot), but also two categorical ones (suggests colours, and ???) - Can also use `shape` as well as colour to distinguish observations. - Hence: ```{r} #| eval: false #| echo: true ggplot(alc, aes(x = Gastric, y = Metabol, colour = Sex, shape = Alcohol)) + geom_point() ``` ## The graph ```{r} #| echo: false ggplot(alc, aes(x = Gastric, y = Metabol, colour = Sex, shape = Alcohol, size = 2)) + geom_point() ``` ## Comments - As gastric activity increases, metabolism also increases. - Trend is apparently linear and of moderate strength. - Two points unusually high on both variables, but seem to be on the trend. - Male points (blue) mostly above female ones (red) for similar `Gastric` - No apparent effect of `Alcohol`. ## Fit regression \footnotesize ```{r} alc.1 <- lm(Metabol ~ Gastric + Sex + Alcohol, data = alc) summary(alc.1) ``` \normalsize ## Comments - no effect of `Alcohol` (suggests removing) - significant effects of both `Gastric` and `Sex` - both slopes positive (as expected) - but, before we go further, check residuals: - vs. fitted values - normal quantile plot - *also*, vs. each explanatory variable (if any trends, have form of relationship with that explanatory variable wrong). ## Interpretation of Estimates - There is one intercept, and *each* explanatory variable has a slope that expresses effect of that variable, *all else equal*. - Slope for `Gastric` (quantitative) says that if `Gastric` increases by 1, and everything else same, `Metabol` predicted to increase by 1.95. - Each categorical variable has a "baseline" category, the first one alphabetically: `Female`, `Alcoholic`. - For categorical explanatory variables, the slope says how `Metabol` compares for the named category vs. the baseline: - for males, `Metabol` is 1.65 higher than for females, all else equal - for non-alcoholics, `Metabol` is 0.12 higher than for alcoholics, all else equal - Intercept is the predicted value of `Metabol` when all the quantitative variables are 0 and all the categorical variables are at baseline (not usually very interesting). ## Create dataframe for plots \small ```{r} #| echo: false w <- getOption("width") options(width = 55) ``` ```{r} #| echo: true alc.1 %>% augment(alc) -> alc.1a glimpse(alc.1a) ``` ```{r} #| echo: false options(width = w) ``` \normalsize ## Residuals vs fitted \small ```{r} #| fig-height: 5 #| echo: true ggplot(alc.1a, aes(x = .fitted, y = .resid)) + geom_point() ``` \normalsize ## Normal quantile plot \small ```{r} #| fig-height: 5 #| echo: true ggplot(alc.1a, aes(sample = .resid)) + stat_qq() + stat_qq_line() ``` \normalsize ## Residuals vs `Gastric` \small ```{r} #| fig-height: 5 #| echo: true ggplot(alc.1a, aes(x = Gastric, y = .resid)) + geom_point() ``` \normalsize ## Residuals vs `Sex` \small ```{r} #| fig-height: 5 #| echo: true ggplot(alc.1a, aes(x = Sex, y = .resid)) + geom_point() ``` \normalsize ## Residuals vs `Alcohol` \small ```{r} #| fig-height: 5 #| echo: true ggplot(alc.1a, aes(x = Alcohol, y = .resid)) + geom_point() ``` \normalsize ## Comments - residuals vs fitted and vs `Gastric`: suggestion of curve? - residuals vs `Sex`: males more spread out - residuals vs `Alcohol`: non-alcoholics more spread out ## What to do next? - curves plus unequal spreads suggest transformation of response - problems with only one or two explanatory variables suggest transforming just those (our problems here are bigger than this) - if we have some theory about relationship, can use that to guide transformation (eg. take logs, but don't have that here) - can use *Box-Cox* to suggest good transformation. - aims to find transformation of response that promotes straightness plus equal spread (will also often reduce influence of large values). ## Ladder of powers from [here](https://www.slideserve.com/laurel-strong/chapter-10-re-expressing-the-data-powerpoint-ppt-presentation): ![](Screenshot from 2026-06-19 13-41-32.png) ## Box-Cox - Box and Cox developed a method for choosing a transformation from the ladder of powers. - They worked out how to estimate the transformation as a power, but making zero be log (since power zero makes no sense otherwise) - In package `MASS` as `boxcox` - Technical detail: `MASS` also has a `select` that we don't want to have interfere with the `tidyverse` `select`, so load `MASS` as shown: ```{r} #| eval: false library(MASS, exclude = "select") ``` ## Running Box-Cox - use `boxcox` with a model formula (such as you would use in `lm`): ```{r} #| fig-height: 5 #| echo: true boxcox(Metabol ~ Gastric + Sex + Alcohol, data = alc) ``` ## Comments - Output is a graph. - Peak of curve is single best power transformation - Outer vertical lines mark 95% CI for transformation power - Goal: find value from ladder of powers that is within CI - Here that is 0.5, square root - Note that 1, "do nothing", and 0, "take logs", are not supported by data. ## Re-do regression - with square root of `Metabol` as response: ```{r} #| echo: true alc.2 <- lm(sqrt(Metabol) ~ Gastric + Sex + Alcohol, data = alc) ``` ## Output \footnotesize ```{r} summary(alc.2) ``` \normalsize ## Remove `Alcohol`: \footnotesize ```{r} #| echo: true alc.3 <- lm(sqrt(Metabol) ~ Gastric + Sex, data = alc) summary(alc.3) ``` \normalsize ## Comments; set up to check residuals again - `Alcohol` was not significant in the regression, so removed it. If you have more than one non-significant explanatory variable, remove only the *one* least significant one (highest P-value), refit, re-evaluate. - Everything else significantly adds to the regression, so cannot remove anything else. ```{r} #| echo: true alc.3 %>% augment(alc) -> alc.3a ``` - Expect the residual plots to be a lot better. - In principle, check again: - residuals vs. fitted - normal quantile plot of residuals - residuals against each explanatory ## Residuals against fitted ```{r} #| fig-height: 5 #| echo: true ggplot(alc.3a, aes(x = .fitted, y = .resid)) + geom_point() ``` ## Normal quantile plot of residuals ```{r} #| fig-height: 5 #| echo: true ggplot(alc.3a, aes(sample = .resid)) + stat_qq() + stat_qq_line() ``` ## Residuals against `Sex` ```{r} #| fig-height: 5 #| echo: true ggplot(alc.3a, aes(x = Sex, y = .resid)) + geom_point() ``` ## Comments - Fitted vs residual looks less curved - Residuals look much more normal (high values brought down) - Residuals vs `Sex` have no outliers and look equally spread ## Back to regression output \footnotesize ```{r} #| echo: true glance(alc.3) tidy(alc.3) ``` \normalsize ## Comments - R-squared is reasonably high - Slope for `Gastric` is significantly positive: as gastric activity increases, metabolism increases (for everybody) - Slope for `SexMale` positive: compared to baseline `Sex` (Female), metabolism is higher for Males *even allowing for the effect of `Gastric`*. - We removed `Alcohol`: being alcoholic does not affect metabolism over and above the other explanatory variables. - This all seems to support the researchers' theory. ## More details of researchers' theory xxx (see Sleuth 3, case 11.1.1)