Use regression when one variable is an outcome (response, \(y\)).
See if/how response depends on other variable(s), explanatory, \(x_1, x_2,\ldots\).
Can have one or more than one explanatory variable, but always one response.
Assumes a straight-line relationship between response and explanatory.
Ask:
is there a relationship between \(y\) and \(x\)’s, and if so, which ones?
what does the relationship look like?
Packages
library(MASS, exclude = "select") # for Box-Cox, later
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(marginaleffects)
A regression with one \(x\)
13 children; measure average total sleep time (ATST, minutes) and age (years) for each. See if ATST depends on age. Data in sleep.txt, ATST then age. Read in the data (see https://programs-courses.ritsokiguess.site/faq.html):
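A sketch of the read-in, assuming the dataframe is called sleep (the name and the file location are not shown in this excerpt, so the URL below is a placeholder):
my_url <- "sleep.txt" # placeholder: replace with the actual URL or path to sleep.txt
sleep <- read_delim(my_url, delim = " ")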
Rows: 13 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
dbl (2): atst, age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
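The regression used below (sleep.1) is not shown in this excerpt; a minimal sketch of the fit, assuming the dataframe read above is called sleep:
sleep.1 <- lm(atst ~ age, data = sleep)
summary(sleep.1) # intercept and slope quoted below are about 646.48 and -14.04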
CI for mean response and prediction intervals
Once useful regression exists, use it for prediction:
To get a single number for prediction at a given \(x\), substitute into the regression equation, e.g. age 10: predicted ATST is \(646.48 - 14.04(10) = 506\) minutes.
To express uncertainty of this prediction:
CI for mean response expresses uncertainty about mean ATST for all children aged 10, based on data.
Prediction interval expresses uncertainty about predicted ATST for a new child aged 10 whose ATST not known. More uncertain.
Also do above for a child aged 5.
The marginaleffects package 1/2
To get predictions for specific values, set up a dataframe with those values first:
new <- datagrid(model = sleep.1, age = c(10, 5))
new
age rowid
1 10 1
2 5 2
Any variables in the dataframe that you don’t specify are set to their mean values (quantitative) or most common category (categorical).
The marginaleffects package 2/2
Then feed into newdata in predictions. The output contains a lot of columns, so you probably want to display only the ones you care about:
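A sketch, continuing with the dataframe new from the previous page (the column names estimate, conf.low, conf.high are what current versions of marginaleffects return):
preds <- predictions(sleep.1, newdata = new)
preds %>% select(age, estimate, conf.low, conf.high)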
How to tell whether a straight-line regression is appropriate?
Before: check scatterplot for straight trend.
After: plot residuals (observed minus predicted response) against predicted values. Aim: a plot with no pattern.
Residual plot
Not much pattern here — regression appropriate.
ggplot(sleep.1, aes(x = .fitted, y = .resid)) + geom_point()
Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggplot2 package.
Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.
An example where a straight line does not work: ten observations with columns xx and yy, read into dataframe curvy:
Rows: 10 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
dbl (2): xx, yy
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Scatterplot
ggplot(curvy, aes(x = xx, y = yy)) + geom_point()
Regression line, anyway
curvy.1 <- lm(yy ~ xx, data = curvy)
summary(curvy.1)
Call:
lm(formula = yy ~ xx, data = curvy)
Residuals:
Min 1Q Median 3Q Max
-3.582 -2.204 0.000 1.514 3.509
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.5818 1.5616 4.855 0.00126 **
xx 0.9818 0.2925 3.356 0.00998 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.657 on 8 degrees of freedom
Multiple R-squared: 0.5848, Adjusted R-squared: 0.5329
F-statistic: 11.27 on 1 and 8 DF, p-value: 0.009984
Residual plot
ggplot(curvy.1, aes(x = .fitted, y = .resid)) + geom_point()
No good: fixing it up
Residual plot has curve: middle residuals positive, high and low ones negative. Bad.
Fitting a curve would be better. Try this:
curvy.2 <- lm(yy ~ xx + I(xx^2), data = curvy)
Adding xx-squared term, to allow for curve.
Another way to do same thing: specify how model changes:
curvy.2a <- update(curvy.1, . ~ . + I(xx^2))
Regression 2
summary(curvy.2)
Call:
lm(formula = yy ~ xx + I(xx^2), data = curvy)
Residuals:
Min 1Q Median 3Q Max
-1.2091 -0.3602 -0.2364 0.8023 1.2636
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.90000 0.77312 5.045 0.001489 **
xx 3.74318 0.40006 9.357 3.31e-05 ***
I(xx^2) -0.30682 0.04279 -7.170 0.000182 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9833 on 7 degrees of freedom
Multiple R-squared: 0.9502, Adjusted R-squared: 0.936
F-statistic: 66.83 on 2 and 7 DF, p-value: 2.75e-05
Comments
xx-squared term definitely significant (P-value 0.000182), so need this curve to describe relationship.
Adding squared term has made R-squared go up from 0.5848 to 0.9502: great improvement.
This is a definite curve!
The residual plot now
No apparent problems any more:
ggplot(curvy.2, aes(x = .fitted, y = .resid)) + geom_point()
Another way to handle curves
Above, saw that changing \(x\) (adding \(x^2\)) was a way of handling curved relationships.
Another way: change \(y\) (transformation).
Can guess how to change \(y\), or there might be theory.
The example below (made-up data, in dataframe madeup) seems to show faster-than-linear growth, maybe exponential growth.
Scatterplot: faster than linear growth
ggplot(madeup, aes(x = x, y = y)) + geom_point() + geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Running Box-Cox
library(MASS) first.
Feed boxcox a model formula with a squiggle in it, such as you would use for lm.
Output: a graph (next page):
boxcox(y ~ x, data = madeup)
The Box-Cox output
Comments
\(\lambda\) (lambda) is the power by which you should transform \(y\) to get the relationship straight (or straighter). Power 0 means “take logs”.
Middle dotted line marks best single value of \(\lambda\) (here about 0.1).
Outer dotted lines mark 95% CI for \(\lambda\), here \(-0.3\) to 0.7, approx. (Rather uncertain about best transformation.)
Any power transformation within the CI is supported by the data. In this case, log (\(\lambda=0\)) and square root (\(\lambda=0.5\)) are good, but no transformation (\(\lambda=1\)) is not.
Pick a “round-number” value of \(\lambda\) like \(2,1,0.5,0,-0.5,-1\). Here 0 and 0.5 good values to pick.
Did transformation straighten things?
Plot transformed \(y\) against \(x\). Here, log:
ggplot(madeup, aes(x = x, y = log(y))) + geom_point() + geom_smooth()
Looks much straighter.
Regression with transformed \(y\)
madeup.1 <- lm(log(y) ~ x, data = madeup)
summary(madeup.1)
Call:
lm(formula = log(y) ~ x, data = madeup)
Residuals:
Min 1Q Median 3Q Max
-0.9688 -0.2577 0.1663 0.3881 0.5534
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.02884 0.37935 7.984 0.000206 ***
x 0.46006 0.09068 5.073 0.002281 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5877 on 6 degrees of freedom
Multiple R-squared: 0.811, Adjusted R-squared: 0.7794
F-statistic: 25.74 on 1 and 6 DF, p-value: 0.002281
R-squared now decently high.
Multiple regression
What if more than one \(x\)? Extra issues:
Now one intercept and a slope for each \(x\): how to interpret?
Which \(x\)-variables actually help to predict \(y\)?
Different interpretations of “global” \(F\)-test and individual \(t\)-tests.
R-squared no longer correlation squared, but still interpreted as “higher better”.
In lm line, add extra \(x\)s after ~.
Interpretation not so easy (and other problems that can occur).
Multiple regression example
Study of women and visits to health professionals (timedrs): how might the number of visits be related to physical health (phyheal), mental health (menheal), and stress (stress)?
Rows: 465 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
dbl (5): subjno, timedrs, phyheal, menheal, stress
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
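The fit whose summary follows, stored (as in the residual plots later) in visits.1:
visits.1 <- lm(timedrs ~ phyheal + menheal + stress, data = visits)
summary(visits.1)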
Call:
lm(formula = timedrs ~ phyheal + menheal + stress, data = visits)
Residuals:
Min 1Q Median 3Q Max
-14.792 -4.353 -1.815 0.902 65.886
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.704848 1.124195 -3.296 0.001058 **
phyheal 1.786948 0.221074 8.083 5.6e-15 ***
menheal -0.009666 0.129029 -0.075 0.940318
stress 0.013615 0.003612 3.769 0.000185 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.708 on 461 degrees of freedom
Multiple R-squared: 0.2188, Adjusted R-squared: 0.2137
F-statistic: 43.03 on 3 and 461 DF, p-value: < 2.2e-16
The slopes
Model as a whole strongly significant even though R-sq not very big (lots of data). At least one of the \(x\)’s predicts timedrs.
The physical health and stress variables definitely help to predict the number of visits, but with those in the model we don’t need menheal. However, look at prediction of timedrs from menheal by itself:
Just menheal
visits.2 <- lm(timedrs ~ menheal, data = visits)
summary(visits.2)
Call:
lm(formula = timedrs ~ menheal, data = visits)
Residuals:
Min 1Q Median 3Q Max
-13.826 -5.150 -2.818 1.177 72.513
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.8159 0.8702 4.385 1.44e-05 ***
menheal 0.6672 0.1173 5.688 2.28e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.6 on 463 degrees of freedom
Multiple R-squared: 0.06532, Adjusted R-squared: 0.0633
F-statistic: 32.35 on 1 and 463 DF, p-value: 2.279e-08
menheal by itself
menheal by itself does significantly help to predict timedrs.
But the R-sq is much less (6.5% vs. 22%).
So other two variables do a better job of prediction.
With those variables in the regression (phyheal and stress), don’t need menheal as well.
Is there trend in size of residuals (fan-out)? Plot absolute value of residual against fitted value:
ggplot(visits.1, aes(x = .fitted, y = abs(.resid))) + geom_point() + geom_smooth()
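The comments below also refer to a normal quantile plot of the residuals, not shown in this excerpt; a sketch of how to draw one:
ggplot(visits.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()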
Comments
On the normal quantile plot:
highest (most positive) residuals are way too high
distribution of residuals skewed to right (not normal at all)
On plot of absolute residuals:
size of residuals getting bigger as fitted values increase
predictions getting more variable as fitted values increase
that is, predictions getting less accurate as fitted values increase, but predictions should be equally accurate all the way along.
Both indicate problems with the regression, of a kind that a transformation of the response often fixes: that is, predict a function of the response timedrs instead of timedrs itself.
Box-Cox transformations
Taking the log of timedrs might work, but that would be a lucky guess. How to find a good transformation systematically?
Box-Cox again.
Extra problem: some of the timedrs values are 0, but Box-Cox needs all response values to be positive. Fix: add 1 first. Note the response in the boxcox call:
boxcox(timedrs + 1 ~ phyheal + menheal + stress, data = visits)
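The model visits.3 used below is not defined in this excerpt; presumably (an assumption here) it is the regression with the log-transformed response that Box-Cox supports:
visits.3 <- lm(log(timedrs + 1) ~ phyheal + menheal + stress, data = visits)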
ggplot(visits.3, aes(x = .fitted, y = abs(.resid))) + geom_point() + geom_smooth()
Comments
Residuals vs. fitted looks a lot more random.
Normal quantile plot looks a lot more normal (though still a little right-skewness)
Absolute residuals: not so much trend (though still some).
Not perfect, but much improved.
Testing more than one \(x\) at once
The \(t\)-tests test only whether one variable could be taken out of the regression you’re looking at.
To test significance of more than one variable at once, fit model with and without variables
then use anova to compare fit of models:
visits.5 <- lm(log(timedrs + 1) ~ phyheal + menheal + stress, data = visits)
visits.6 <- lm(log(timedrs + 1) ~ stress, data = visits)
Results of tests
anova(visits.6, visits.5)
Analysis of Variance Table
Model 1: log(timedrs + 1) ~ stress
Model 2: log(timedrs + 1) ~ phyheal + menheal + stress
Res.Df RSS Df Sum of Sq F Pr(>F)
1 463 371.47
2 461 268.01 2 103.46 88.984 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Models don’t fit equally well, so bigger one fits better.
Or “taking both variables out makes the fit worse, so don’t do it”.
Taking out those \(x\)’s is a mistake. Or putting them in is a good idea.
The punting data
Data set punting.txt contains 4 variables for 13 right-footed football kickers (punters): left leg and right leg strength (lbs), distance punted (ft), another variable called “fred”. Predict punting distance from other variables.
Reading in
Separated by multiple spaces with columns lined up:
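A sketch of the read-in; read_table handles aligned columns separated by multiple spaces (the file location below is a placeholder):
my_url <- "punting.txt" # placeholder: replace with the actual URL or path to punting.txt
punting <- read_table(my_url)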
── Column specification ────────────────────────────────────────────────────────
cols(
left = col_double(),
right = col_double(),
punt = col_double(),
fred = col_double()
)
punting.1 <- lm(punt ~ left + right + fred, data = punting)
summary(punting.1)
Call:
lm(formula = punt ~ left + right + fred, data = punting)
Residuals:
Min 1Q Median 3Q Max
-14.9325 -11.5618 -0.0315 9.0415 20.0886
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.6855 29.1172 -0.161 0.876
left 0.2679 2.1111 0.127 0.902
right 1.0524 2.1477 0.490 0.636
fred -0.2672 4.2266 -0.063 0.951
Residual standard error: 14.68 on 9 degrees of freedom
Multiple R-squared: 0.7781, Adjusted R-squared: 0.7042
F-statistic: 10.52 on 3 and 9 DF, p-value: 0.00267
\(t\)-tests only say that you could take any one of the \(x\)’s out without damaging the fit; doesn’t matter which one.
Explanation: look at correlations.
The correlations
cor(punting)
left right punt fred
left 1.0000000 0.8957224 0.8117368 0.9722632
right 0.8957224 1.0000000 0.8805469 0.9728784
punt 0.8117368 0.8805469 1.0000000 0.8679507
fred 0.9722632 0.9728784 0.8679507 1.0000000
All correlations are high: \(x\)’s with punt (good) and with each other (bad, at least confusing).
What to do? Probably do just as well to pick one variable, say right since kickers are right-footed.
Just right
punting.2 <- lm(punt ~ right, data = punting)
anova(punting.2, punting.1)
Analysis of Variance Table
Model 1: punt ~ right
Model 2: punt ~ left + right + fred
Res.Df RSS Df Sum of Sq F Pr(>F)
1 11 1962.5
2 9 1938.2 2 24.263 0.0563 0.9456
No significant loss by dropping other two variables.
Comparing R-squareds
All three \(x\)-variables:
summary(punting.1)$r.squared
[1] 0.7781401
Only right:
summary(punting.2)$r.squared
[1] 0.7753629
Basically no difference. In the regression (next page), right is significant:
Regression results
summary(punting.2)
Call:
lm(formula = punt ~ right, data = punting)
Residuals:
Min 1Q Median 3Q Max
-15.7576 -11.0611 0.3656 7.8890 19.0423
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.6930 25.2649 -0.146 0.886
right 1.0427 0.1692 6.162 7.09e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.36 on 11 degrees of freedom
Multiple R-squared: 0.7754, Adjusted R-squared: 0.7549
F-statistic: 37.97 on 1 and 11 DF, p-value: 7.088e-05
But…
Maybe we got the form of the relationship with left wrong.
Check: plot residuals from previous regression (without left) against left.
Residuals here are “punting distance adjusted for right leg strength”.
If there is some kind of relationship with left, we should include in model.
Plot of residuals against original variable: augment from broom.
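A sketch of that plot, assuming the dataframe is called punting (as in the lm calls above):
punting.2 %>%
  augment(punting) %>%
  ggplot(aes(x = left, y = .resid)) + geom_point() + geom_smooth()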
Comments (on the intervals for ages 10 and 5, from the prediction section earlier)
Age 10 closer to centre of data, so intervals are both narrower than those for age 5.
Prediction intervals bigger than CI for mean (additional uncertainty).
Technical note: the output from predict is an R matrix, not a data frame, so tidyverse bind_cols does not work. Use base R cbind instead.
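A sketch of what that looks like, reusing the dataframe new (ages 10 and 5) and the model sleep.1 from earlier:
pp <- predict(sleep.1, new, interval = "prediction") # a matrix with columns fit, lwr, upr
cbind(new, pp) # base R cbind works here; bind_cols(new, pp) would fail because pp is a matrix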