---
title: "Regression revisited"
editor: 
  markdown: 
    wrap: 72
---

## Regression

-   Use regression when one variable is an outcome (*response*, $y$).
-   See if/how response depends on other variable(s), *explanatory*, $x_1, x_2,\ldots$.
-   Can have *one* or *more than one* explanatory variable, but always one response.
-   Assumes a *straight-line* relationship between response and explanatory.
-   Ask:
    -   *is there* a relationship between $y$ and $x$'s, and if so, which ones?
    -   what does the relationship look like?

## Packages

```{r bRegression-1, results='hide'}
library(MASS, exclude = "select") # for Box-Cox, later
library(tidyverse)
library(broom)
library(marginaleffects)
```

## A regression with one $x$

13 children, measure average total sleep time (ATST, mins) and age (years) for each. See if ATST depends on age.

Data in `sleep.txt`, ATST then age. Read in data:

```{r bRegression-2 }
my_url <- "http://ritsokiguess.site/datafiles/sleep.txt"
sleep <- read_delim(my_url, " ")
```

## Check data

\footnotesize

```{r}
sleep
```

\normalsize

Make a scatter plot of ATST (response) vs. age (explanatory) using the code overleaf.

## The scatterplot

```{r suggo, fig.height=6}
ggplot(sleep, aes(x = age, y = atst)) + geom_point()
```

## Correlation

-   Measures how well a straight line fits the data:

```{r bRegression-4 }
with(sleep, cor(atst, age))
```

-   $1$ is perfect upward trend, $-1$ is perfect downward trend, 0 is no trend.
-   This one close to perfect downward trend.
-   Can do correlations of all pairs of variables:

```{r bRegression-5 }
cor(sleep)
```

## Lowess curve

-   Sometimes nice to guide the eye: is the trend straight, or not?
-   Idea: *lowess curve*. "Locally weighted least squares", not affected by outliers, not constrained to be linear.
-   Lowess is a *guide*: even if a straight line is appropriate, it may wiggle/bend a little. Looking for *serious* problems with linearity.
-   Add lowess curve to plot using `geom_smooth`:

## Plot with lowess curve

```{r icko,fig.height=6}
ggplot(sleep, aes(x = age, y = atst)) + geom_point() +
  geom_smooth()
```

## The regression

Scatterplot shows no obvious curve, and a pretty clear downward trend. So we can run the regression:

```{r bRegression-6, echo=-1}
options(width = 60)
sleep.1 <- lm(atst ~ age, data = sleep)
```

## The output

```{r bRegression-7}
summary(sleep.1)
```

## Conclusions

-   The relationship appears to be a straight line, with a downward trend.
-   $F$-test for the model as a whole and $t$-test for the slope (the same test here) both confirm this (P-value $5.7\times 10^{-7}=0.00000057$).
-   Slope is $-14$, so a 1-year increase in age goes with a 14-minute decrease in ATST on average.
-   R-squared is correlation squared (when there is one $x$, anyway), between 0 and 1 (1 good, 0 bad).
-   Here R-squared is 0.9054, pleasantly high.

## Doing things with the regression output

-   Output from regression (and e.g. $t$-test) is all right to look at, but hard to extract and re-use information from.
-   Package `broom` extracts info from model output in a way that can be used in a pipe (later):

```{r bRegression-8, size="footnotesize"}
tidy(sleep.1)
```

## Also a one-line summary of the model:

```{r bRegression-9}
glance(sleep.1)
```

## Broom part 2

\footnotesize

```{r bRegression-10,warning=F}
sleep.1 %>% augment(sleep)
```

\normalsize

Useful for plotting residuals against an $x$-variable.

## CI for mean response and prediction intervals

Once a useful regression exists, use it for prediction:

-   To get a single number for a prediction at a given $x$, substitute into the regression equation, e.g. age 10: predicted ATST is $646.48-14.04(10)=506$ minutes (a quick code check of this is sketched below).
-   To express uncertainty of this prediction:
    -   *CI for mean response* expresses uncertainty about mean ATST for all children aged 10, based on data.
    -   *Prediction interval* expresses uncertainty about predicted ATST for a new child aged 10 whose ATST is not known. More uncertain.
-   Also do the above for a child aged 5.
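A quick check of the hand calculation above, using the fitted coefficients stored in `sleep.1` (a sketch; the intercept and slope quoted above are rounded versions of these, so the answer should agree with 506 up to rounding, and with the `marginaleffects` output next):

```{r}
# predicted mean ATST at age 10, straight from the fitted coefficients
b <- coef(sleep.1)
b[1] + b[2] * 10
```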
## The `marginaleffects` package 1/2

To get predictions for specific values, set up a dataframe with those values first:

```{r}
new <- datagrid(model = sleep.1, age = c(10, 5))
new
```

Any variables in the dataframe that you don't specify are set to their mean values (quantitative) or most common category (categorical).

## The `marginaleffects` package 2/2

Then feed into `newdata` in `predictions`. This contains a lot of columns, so you probably want to display only the ones you care about:

```{r}
cbind(predictions(sleep.1, newdata = new)) %>% 
  select(estimate, conf.low, conf.high, age)
```

The confidence limits are a 95% confidence interval for the mean response at that `age`.

## Prediction intervals

These are obtained (instead) with `predict` as below. Use the same dataframe `new` as before:

```{r bRegression-12 }
pp <- predict(sleep.1, new, interval = "p")
pp
cbind(new, pp)
```

## Plotting the confidence intervals for mean response again:

```{r}
#| fig-height: 5
plot_predictions(sleep.1, condition = "age")
```

## Comments

-   Age 10 is closer to the centre of the data, so both of its intervals are narrower than those for age 5.
-   Prediction intervals bigger than CI for mean (additional uncertainty).
-   Technical note: output from `predict` is an R `matrix`, not a data frame, so Tidyverse `bind_cols` does not work. Use base R `cbind`.

## That grey envelope

Marks confidence interval for the mean for all $x$:

```{r bRegression-15, fig.height=4.5}
ggplot(sleep, aes(x = age, y = atst)) + geom_point() +
  geom_smooth(method = "lm") +
  scale_y_continuous(breaks = seq(420, 600, 20))
```

## Diagnostics

How to tell whether a straight-line regression is appropriate?

-   Before: check scatterplot for straight trend.
-   After: plot *residuals* (observed minus predicted response) against predicted values. Aim: a plot with no pattern.

## Residual plot

Not much pattern here --- regression appropriate.

```{r akjhkadjfhjahnkkk,fig.height=4.5}
ggplot(sleep.1, aes(x = .fitted, y = .resid)) + geom_point()
```

## An inappropriate regression

Different data:

```{r curvy}
my_url <- "http://ritsokiguess.site/datafiles/curvy.txt"
curvy <- read_delim(my_url, " ")
```

## Scatterplot

```{r bRegression-16, fig.height=5}
ggplot(curvy, aes(x = xx, y = yy)) + geom_point()
```

## Regression line, anyway

\scriptsize

```{r bRegression-17}
curvy.1 <- lm(yy ~ xx, data = curvy)
summary(curvy.1)
```

\normalsize

## Residual plot

```{r altoadige,fig.height=5}
ggplot(curvy.1, aes(x = .fitted, y = .resid)) + geom_point()
```

## No good: fixing it up

-   Residual plot has a *curve*: middle residuals positive, high and low ones negative. Bad.
-   Fitting a curve would be better. Try this:

```{r bRegression-18 }
curvy.2 <- lm(yy ~ xx + I(xx^2), data = curvy)
```

-   Adding an `xx`-squared term, to allow for a curve.
-   Another way to do the same thing: specify how the model *changes*:

```{r bRegression-19 }
curvy.2a <- update(curvy.1, . ~ . + I(xx^2))
```

## Regression 2

\footnotesize

```{r bRegression-20 }
summary(curvy.2)
```

\normalsize

## Comments

-   `xx`-squared term definitely significant (P-value 0.000182), so we need this curve to describe the relationship (a direct comparison of the two models is sketched below).
-   Adding the squared term has made R-squared go up from 0.5848 to 0.9502: great improvement.
-   This is a definite curve!
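As a check on the comments above, the straight-line and curved fits can also be compared directly with `anova`, in the same spirit as the model comparisons later in these notes (a sketch; the P-value should match the one for the squared term, since only one term was added):

```{r}
# does adding the xx-squared term improve the fit significantly?
anova(curvy.1, curvy.2)
```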
## The residual plot now

No apparent problems any more:

```{r bRegression-21, size="small", fig.height=4.5}
ggplot(curvy.2, aes(x = .fitted, y = .resid)) + geom_point()
```

## Another way to handle curves

-   Above, we saw that changing $x$ (adding $x^2$) was a way of handling curved relationships.
-   Another way: change $y$ (transformation).
-   Can guess how to change $y$, or there might be theory:
    -   example: relationship $y=ae^{bx}$ (exponential growth):
    -   take logs to get $\ln y=\ln a + bx$.
    -   Taking logs has made the relationship linear ($\ln y$ as response).
-   Or, *estimate* the transformation, using the Box-Cox method.

## Box-Cox

-   Install package `MASS` via `install.packages("MASS")` (only need to do this *once*).
-   In every R session where you want to use something in `MASS`, type `library(MASS)` first.

## Some made-up data

```{r bRegression-22, message=F}
my_url <- "http://ritsokiguess.site/datafiles/madeup2.csv"
madeup <- read_csv(my_url)
madeup
```

Seems to be faster-than-linear growth, maybe exponential growth.

## Scatterplot: faster than linear growth

```{r dsljhsdjlhf,fig.height=3.2}
ggplot(madeup, aes(x = x, y = y)) + geom_point() + geom_smooth()
```

## Running Box-Cox

-   `library(MASS)` first.
-   Feed `boxcox` a model formula with a squiggle in it, such as you would use for `lm`.
-   Output: a graph (next page):

```{r bRegression-23, eval=F}
boxcox(y ~ x, data = madeup)
```

## The Box-Cox output

```{r trento,echo=F, fig.height=6}
boxcox(y ~ x, data = madeup)
```

## Comments

-   $\lambda$ (lambda) is the power by which you should transform $y$ to get the relationship straight (or at least straighter). Power 0 is "take logs".
-   Middle dotted line marks the best single value of $\lambda$ (here about 0.1).
-   Outer dotted lines mark a 95% CI for $\lambda$, here $-0.3$ to 0.7, approx. (Rather uncertain about the best transformation.)
-   Any power transformation within the CI is supported by the data. In this case, log ($\lambda=0$) and square root ($\lambda=0.5$) are good, but no transformation ($\lambda=1$) is not.
-   Pick a "round-number" value of $\lambda$ like $2, 1, 0.5, 0, -0.5, -1$. Here 0 and 0.5 are good values to pick.

## Did transformation straighten things?

-   Plot transformed $y$ against $x$. Here, log:

```{r bRegression-24, message=FALSE, fig.height=3.2 }
ggplot(madeup, aes(x = x, y = log(y))) + geom_point() + geom_smooth()
```

Looks much straighter.

## Regression with transformed $y$

\footnotesize

```{r bRegression-25}
madeup.1 <- lm(log(y) ~ x, data = madeup)
summary(madeup.1)
```

\normalsize

R-squared now decently high.

## Multiple regression

-   What if more than one $x$? Extra issues:
    -   Now one intercept and a slope for each $x$: how to interpret? (The model for the example below is written out in symbols after it.)
    -   Which $x$-variables actually help to predict $y$?
    -   Different interpretations of "global" $F$-test and individual $t$-tests.
    -   R-squared no longer correlation squared, but still interpreted as "higher better".
-   In the `lm` line, add the extra $x$s after `~`, joined by `+`.
-   Interpretation not so easy (and other problems can occur).

## Multiple regression example

Study of women and visits to health professionals, and how the number of visits might be related to other variables:

```{=tex}
\begin{description}
\item[timedrs:] number of visits to health professionals (over course of study)
\item[phyheal:] number of physical health problems
\item[menheal:] number of mental health problems
\item[stress:] result of questionnaire about number and type of life changes
\end{description}
```
`timedrs` response, others explanatory.
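In symbols, with one intercept and one slope per explanatory variable (this is just the model that `lm` fits on the next pages, written out):

$$\text{timedrs} = \beta_0 + \beta_1\,\text{phyheal} + \beta_2\,\text{menheal} + \beta_3\,\text{stress} + e$$

Each slope describes how `timedrs` changes with that $x$ *with the other $x$s held constant*, which is why the individual $t$-tests below can tell a different story from a regression with only one $x$.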
## The data

```{r bRegression-26 }
my_url <- "http://ritsokiguess.site/datafiles/regressx.txt"
visits <- read_delim(my_url, " ")
```

## Check data

```{r bRegression-27}
visits
```

## Fit multiple regression

\scriptsize

```{r bRegression-28}
visits.1 <- lm(timedrs ~ phyheal + menheal + stress, data = visits)
summary(visits.1)
```

\normalsize

## The slopes

-   Model as a whole strongly significant even though R-sq not very big (lots of data). At least one of the $x$'s predicts `timedrs`.
-   The physical health and stress variables definitely help to predict the number of visits, but *with those in the model* we don't need `menheal`. However, look at prediction of `timedrs` from `menheal` by itself:

## Just `menheal`

\footnotesize

```{r bRegression-30 }
visits.2 <- lm(timedrs ~ menheal, data = visits)
summary(visits.2)
```

\normalsize

## `menheal` by itself

-   `menheal` by itself *does* significantly help to predict `timedrs`.
-   But the R-sq is much less (6.5% vs. 22%).
-   So the other two variables do a better job of prediction.
-   With those variables in the regression (`phyheal` and `stress`), we don't need `menheal` *as well*.

## Investigating via correlation

Leave out the first column (`subjno`):

```{r bRegression-31 }
visits %>% select(-subjno) %>% cor()
```

-   `phyheal` most strongly correlated with `timedrs`.
-   Not much to choose between the other two.
-   But `menheal` has a higher correlation with `phyheal`, so it has not as much to *add* to the prediction as `stress`.
-   Goes to show that things are more complicated in multiple regression.

## Residual plot (from `timedrs` on all)

```{r iffy8,fig.height=3.5}
ggplot(visits.1, aes(x = .fitted, y = .resid)) + geom_point()
```

Apparently random. But...

## Normal quantile plot of residuals

```{r bRegression-32, fig.height=3.5}
ggplot(visits.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```

Not normal at all; the upper tail is way too long.

## Absolute residuals

Is there a trend in the *size* of the residuals (fan-out)? Plot the *absolute value* of residual against fitted value:

```{r bRegression-33, fig.height=2.8, message=FALSE}
ggplot(visits.1, aes(x = .fitted, y = abs(.resid))) + geom_point() + geom_smooth()
```

## Comments

-   On the normal quantile plot:
    -   highest (most positive) residuals are *way* too high
    -   distribution of residuals skewed to right (not normal at all)
-   On the plot of absolute residuals:
    -   size of residuals getting bigger as fitted values increase
    -   predictions getting more variable as fitted values increase
    -   that is, predictions getting *less accurate* as fitted values increase, but predictions should be equally accurate all the way along.
-   Both indicate problems with the regression, of the kind that a transformation of the response often fixes: that is, predict a *function* of the response `timedrs` instead of `timedrs` itself.

## Box-Cox transformations

-   Taking the log of `timedrs` and having it work: lucky guess. How to find a good transformation?
-   Box-Cox again.
-   Extra problem: some of the `timedrs` values are 0, but Box-Cox expects all values to be positive. Note the response used for `boxcox`:

```{r bRegression-35, eval=F}
boxcox(timedrs + 1 ~ phyheal + menheal + stress, data = visits)
```

## Try 1

```{r bRegression-36, echo=F}
boxcox(timedrs + 1 ~ phyheal + menheal + stress, data = visits)
```

## Comments on try 1

-   Best: $\lambda$ just less than zero.
-   Hard to see the scale.
-   Focus on $\lambda$ in $(-0.3,0.1)$ (a numerical shortcut is also sketched below):

```{r bRegression-37, size="footnotesize"}
my.lambda <- seq(-0.3, 0.1, 0.01)
my.lambda
```
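Rather than reading the best $\lambda$ off the graph, you can also ask `boxcox` for its grid of values directly: with `plotit = FALSE` it returns the $\lambda$ values and their log-likelihoods instead of drawing the plot. A sketch, using the same zoomed-in grid `my.lambda` (the answer should match the peak of the plot on the next page, about $-0.07$):

```{r}
bc <- boxcox(timedrs + 1 ~ phyheal + menheal + stress,
             lambda = my.lambda, data = visits, plotit = FALSE)
bc$x[which.max(bc$y)] # lambda with the largest log-likelihood
```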
## Try 2

```{r bRegression-38, fig.height=5}
boxcox(timedrs + 1 ~ phyheal + menheal + stress,
  lambda = my.lambda,
  data = visits
)
```

## Comments

-   Best: $\lambda$ just about $-0.07$.
-   CI for $\lambda$ about $(-0.14, 0.01)$.
-   Only nearby round number: $\lambda=0$, the log transformation.

## Fixing the problems

-   Try the regression again, with the transformed response instead of the original one.
-   Then check the residual plot to see that it is OK now.

```{r bRegression-39 }
visits.3 <- lm(log(timedrs + 1) ~ phyheal + menheal + stress,
  data = visits
)
```

-   `timedrs + 1` because some `timedrs` values are 0, and we can't take the log of 0.
-   Won't usually need to worry about this, but when the response could be zero/negative, fix that before transformation.

## Output

\scriptsize

```{r bRegression-40 }
summary(visits.3)
```

## Comments

-   Model as a whole strongly significant again.
-   R-sq higher than before (37% vs. 22%), suggesting things are more linear now.
-   Same conclusion re `menheal`: can take it out of the regression.
-   Should look at residual plots (next pages). Have we fixed the problems?

## Residuals against fitted values

```{r bRegression-41, fig.height=3.5}
ggplot(visits.3, aes(x = .fitted, y = .resid)) + geom_point()
```

## Normal quantile plot of residuals

```{r bRegression-42, fig.height=3.8}
ggplot(visits.3, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```

## Absolute residuals against fitted

```{r bRegression-43, fig.height=3.5, message=F}
ggplot(visits.3, aes(x = .fitted, y = abs(.resid))) + geom_point() + geom_smooth()
```

## Comments

-   Residuals vs. fitted looks a lot more random.
-   Normal quantile plot looks a lot more normal (though there is still a little right-skewness).
-   Absolute residuals: not so much trend (though still some).
-   Not perfect, but much improved.

## Testing more than one $x$ at once

-   The $t$-tests test only whether one variable could be taken out of the regression you're looking at.
-   To test the significance of more than one variable at once:
    -   fit the model with and without those variables,
    -   then use `anova` to compare the fit of the two models:

```{r bRegression-44 }
visits.5 <- lm(log(timedrs + 1) ~ phyheal + menheal + stress, data = visits)
visits.6 <- lm(log(timedrs + 1) ~ stress, data = visits)
```

## Results of tests

```{r bRegression-45}
anova(visits.6, visits.5)
```

-   The models don't fit equally well, so the bigger one fits better.
-   Or: "taking both variables out makes the fit worse, so don't do it".
-   Taking out those $x$'s is a mistake. Or, putting them in is a good idea.

## The punting data

Data set `punting.txt` contains 4 variables for 13 right-footed football kickers (punters): left leg and right leg strength (lbs), distance punted (ft), and another variable called "fred". Predict punting distance from the other variables.

## Reading in

-   Separated by *multiple spaces* with *columns lined up*:

```{r bRegression-46 }
my_url <- "http://ritsokiguess.site/datafiles/punting.txt"
punting <- read_table(my_url)
```

## The data

\small

```{r bRegression-47}
punting
```

\normalsize

## Regression and output

\footnotesize

```{r bRegression-48}
punting.1 <- lm(punt ~ left + right + fred, data = punting)
summary(punting.1)
```

\normalsize

## Comments

-   Overall regression strongly significant, R-sq high.
-   None of the $x$'s significant! Why?
-   The $t$-tests only say that you could take any *one* of the $x$'s out without damaging the fit; it doesn't matter which one. (A sketch showing this all at once is below.)
-   Explanation: look at the *correlations*.
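One way to see this all at once is base R's `drop1`, which re-tests each explanatory variable as if it were the last one added to the model (a sketch; this is the same information as the individual $t$-tests above, not part of the original analysis):

```{r}
# F-test for dropping each x, one at a time, from the full model
drop1(punting.1, test = "F")
```

Each P-value here should echo the corresponding $t$-test: no single variable is needed, given the other two.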
## The correlations

```{r bRegression-49 }
cor(punting)
```

-   *All* the correlations are high: the $x$'s with `punt` (good) and with each other (bad, or at least confusing).
-   What to do? Probably do just as well to pick one variable, say `right`, since the kickers are right-footed.

## Just `right`

```{r bRegression-50 }
punting.2 <- lm(punt ~ right, data = punting)
anova(punting.2, punting.1)
```

No significant loss by dropping the other two variables.

## Comparing R-squareds

-   All three $x$-variables:

```{r bRegression-51 }
summary(punting.1)$r.squared
```

-   Only `right`:

```{r}
summary(punting.2)$r.squared
```

-   Basically no difference. In the regression (over), `right` is significant:

## Regression results

\footnotesize

```{r bRegression-52 }
summary(punting.2)
```

\normalsize

## But...

-   Maybe we got the *form* of the relationship with `left` wrong.
-   Check: plot the *residuals* from the previous regression (without `left`) against `left`.
-   Residuals here are "punting distance adjusted for right leg strength".
-   If there is some kind of relationship with `left`, we should include it in the model.
-   Plot of residuals against an original variable: `augment` from `broom`.

## Augmenting `punting.2`

\footnotesize

```{r bRegression-53 }
punting.2 %>% augment(punting) -> punting.2.aug
punting.2.aug
```

\normalsize

## Residuals against `left`

```{r basingstoke,fig.height=4.5}
ggplot(punting.2.aug, aes(x = left, y = .resid)) + geom_point()
```

## Comments

-   There is a *curved* relationship with `left`.
-   We should add `left`-squared to the regression (and therefore put `left` back in when we do that):

```{r bRegression-54 }
punting.3 <- lm(punt ~ left + I(left^2) + right,
  data = punting
)
```

## Regression with `left`-squared

\scriptsize

```{r bRegression-55}
summary(punting.3)
```

## Comments

-   This was definitely a good idea (R-squared has clearly increased).
-   We would never have seen it without plotting the residuals from `punting.2` (without `left`) against `left`.
-   The negative slope on the `left`-squared term means that increased left-leg strength only increases punting distance up to a point: beyond that, it decreases again (the peak is worked out below).
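To make "up to a point" concrete: for a fitted quadratic $b_1\,\text{left} + b_2\,\text{left}^2$ with $b_2 < 0$, the predicted distance is largest at $\text{left} = -b_1/(2b_2)$. A quick sketch using the fitted coefficients (the names are as R stores them in `punting.3`; this is an illustration, not part of the original analysis):

```{r}
b <- coef(punting.3)
-b["left"] / (2 * b["I(left^2)"]) # left-leg strength where predicted punt distance peaks
```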