--- title: "Bootstrap for sampling distribution of sample mean" editor: markdown: wrap: 72 --- ## Assessing assumptions - Our $t$-tests assume normality of variable being tested - but, Central Limit Theorem says that normality matters less if sample is "large" - in practice "approximate normality" is enough, but how do we assess whether what we have is normal enough? - so far, use histogram/boxplot and make a call, allowing for sample size. ## What actually has to be normal - is: **sampling distribution of sample mean** - the distribution of sample mean over *all possible samples* - but we only have *one* sample! - Idea: assume our sample is representative of the population, and draw samples from our sample (!), with replacement. - This gives an idea of what different samples from the population might look like. - Called *bootstrap*, after expression "to pull yourself up by your own bootstraps". ## Packages ```{r} library(tidyverse) ``` ## Blue Jays attendances ```{r bootstrap-R-1, echo=FALSE, message=FALSE} jays <- read_csv("http://ritsokiguess.site/datafiles/jays15-home.csv") set.seed(457299) ``` \small ```{r} #| echo: false wid <- getOption("width") options(width = 60) ``` ```{r bootstrap-R-2} jays$attendance ``` - A bootstrap sample: ```{r bootstrap-R-3} s <- sample(jays$attendance, replace = TRUE) s ``` \normalsize ## Sorting - It is easier to see what is happening if we sort both the actual attendances and the bootstrap sample: \small ```{r} sort(jays$attendance) sort(s) ``` \normalsize ```{r} #| echo: false options(width = wid) ``` ## Getting mean of bootstrap sample - A bootstrap sample is same size as original, but contains repeated values (eg. 15062) and missing ones (42917). - We need the mean of our bootstrap sample: ```{r bootstrap-R-4} mean(s) ``` - This is a little different from the mean of our actual sample: ```{r bootstrap-R-5} mean(jays$attendance) ``` - Want a sense of how the sample mean might vary, if we were able to take repeated samples from our population. - Idea: take lots of *bootstrap* samples, and see how *their* sample means vary. ## Setting up bootstrap sampling - Begin by setting up a dataframe that contains a row for each bootstrap sample. I usually call this column `sim`. Do just 4 to get the idea: ```{r bootstrap-R-6} tibble(sim = 1:4) ``` ## Drawing the bootstrap samples - Then set up to work one row at a time, and draw a bootstrap sample of the attendances in each row: ```{r bootstrap-R-7} tibble(sim = 1:4) %>% rowwise() %>% mutate(sample = list(sample(jays$attendance, replace = TRUE))) ``` - Each row of our dataframe contains *all* of a bootstrap sample of 25 observations drawn with replacement from the attendances. ## Sample means - Find the mean of each sample: ```{r bootstrap-R-8} tibble(sim = 1:4) %>% rowwise() %>% mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% mutate(my_mean = mean(sample)) ``` - These are (four simulated values of) the bootstrapped sampling distribution of the sample mean. ## Make a normal quantile plot of them - rather pointless here, but to get the idea: ```{r bootstrap-R-9} tibble(sim = 1:4) %>% rowwise() %>% mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% mutate(my_mean = mean(sample)) %>% ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line() -> g ``` ## The (pointless) plot ```{r bootstrap-R-10} #| fig-height: 5 g ``` ## Now do again with a decent number of bootstrap samples - say 10000, to get a good look at the tails: ```{r bootstrap-R-11} tibble(sim = 1:10000) %>% rowwise() %>% mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% mutate(my_mean = mean(sample)) %>% ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line() -> g ``` ## The (better) plot ```{r bootstrap-R-12} #| fig-height: 5 g ``` ## Comments - This is very close to normal (only very slightly right-skewed) - The bootstrap says that the sampling distribution of the sample mean is close to normal, even though the distribution of the data is not - A sample size of 25 is big enough to overcome the skewness that we saw - This is the Central Limit Theorem in practice - It is surprisingly powerful. - Thus, the $t$-test is actually perfectly good here. ## Comments on the code 1/2 - You might have been wondering about this: ```{r bootstrap-R-13} tibble(sim = 1:4) %>% rowwise() %>% mutate(sample = list(sample(jays$attendance, replace = TRUE))) ``` ## Comments on the code 2/2 - how did we squeeze all 25 sample values into one cell? - sample is a so-called "list-column" that can contain anything. - why did we have to put `list()` around the `sample()`? - because `sample` produces a collection of numbers, not just a single one - the `list()` signals this: "make a list-column of samples". ## Two samples - Assumption: *both* samples are from a normal distribution. - In this case, each sample should be "normal enough" given its sample size, since Central Limit Theorem will help. - Use bootstrap on each group independently, as above. ## Kids learning to read ```{r bootstrap-R-14, echo=FALSE, message=FALSE} my_url <- "http://ritsokiguess.site/datafiles/drp.txt" kids <- read_delim(my_url," ") kids ``` ```{r bootstrap-R-15} #| fig-height: 5 ggplot(kids, aes(x=group, y=score)) + geom_boxplot() ``` ## or, a normal quantile plot ```{r} #| fig-height: 5 ggplot(kids, aes(sample = score)) + stat_qq() + stat_qq_line() + facet_wrap(~ group) ``` ## Getting just the control group - Use `filter` to select rows where something is true: \small ```{r bootstrap-R-16} kids %>% filter(group == "c") -> controls controls ``` \normalsize ## Bootstrap these ```{r bootstrap-R-17} #| fig-height: 4 tibble(sim = 1:10000) %>% rowwise() %>% mutate(sample = list(sample(controls$score, replace = TRUE))) %>% mutate(my_mean = mean(sample)) %>% ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line() ``` ## ... and the treatment group: ```{r bootstrap-R-19} #| fig-height: 4 kids %>% filter(group == "t") -> treats tibble(sim = 1:10000) %>% rowwise() %>% mutate(sample = list(sample(treats$score, replace = TRUE))) %>% mutate(my_mean = mean(sample)) %>% ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line() ``` ## Comments - sampling distributions of sample means both look pretty normal, though treatment group is a tiny bit left-skewed - as we thought, no problems with our two-sample $t$ at all.