Interactive SAS tutorials supporting the OpenIntro Introduction to Modern Statistics textbook.
In this lesson you will learn about how to use the t-distribution to perform inference in linear regression models. You will also learn about how to create prediction intervals for the response variable.
In this lesson, we’ll be using mathematical approximations to test and estimate the slope parameter. The approximations will build on the t-distribution which you may have seen in previous tutorials. The mathematical model is often correct and is usually easy to implement computationally.
starbucks %>%
ggplot(aes(x=Fat, y=Calories)) + geom_point() + ggtitle("Fat vs. Calories for Starbucks Food Items") +
geom_smooth(method="lm", se=FALSE)
Recall the Starbucks data. In this lesson, we will start with a continued investigation into the linear relationship between Fat and Calories.
ggplot(starFatCal, aes(x = statistic)) +
geom_histogram(aes(y = ..density..), bins = 50) +
stat_function(fun = dt,
color = "#F05133",
args = list(df = nrow(starbucks) - 2))
Using the bootstrap sampling procedure from the previous lesson, we can generate a histogram of bootstrapped slopes. The red line on top of the histogram gives the appropriate t-distribution with n-2 degrees of freedom. You can see that the mathematical model (that is, the t-distribution) is remarkably similar to the computational methods described previously.
The R code uses the dt
function to provide the density of the t-distribution. You won’t typically use the dt
function in your analyses, but you should know that R is using it in the background to calculate the p-value because the t-distribution fits the sampling distribution so well.
ggplot(starFatCal, aes(x = statistic)) +
geom_histogram(aes(y = ..density..), bins = 50) +
stat_function(fun = dt,
color = "#F05133",
args = list(df = nrow(starbucks) - 2)) +
ggtitle("Sampling distribution of slope: fat vs. calories")
We can look at the histogram more closely, because interest is truly in the tails of the distribution. Remember, decisions to reject or not reject the null hypothesis are made by looking at the tails of the distribution.
ggplot(starFatCal, aes(x = statistic)) +
geom_histogram(aes(y = ..density..), bins = 50) +
stat_function(fun = dt,
color = "#F05133",
args = list(df = nrow(starbucks) - 2)) +
coord_cartesian(xlim =c(-3.2, -2)) +
ggtitle("Sampling distribution of slope: fat vs. calories")
ggplot(starFatCal, aes(x = statistic)) +
geom_histogram(aes(y = ..density..), bins = 50) +
stat_function(fun = dt,
color = "#F05133",
args = list(df = nrow(starbucks) - 2)) +
coord_cartesian(xlim =c(2, 3.2)) +
ggtitle("Sampling distribution of slope: fat vs. calories")
Here, we see that the mathematical model fits the computational model quite well, even in the tails. The good fit is seen by looking at the red line which is slightly above the histogram in some places and slightly below the histogram in other places.
ggplot(starbucks, aes(x=Fiber, y=Protein)) + geom_point() +
ggtitle("Fiber vs. Protein for Starbucks Food Items") +
geom_smooth(method="lm", se=FALSE)
Unlike the relationship between fat and calories, fiber and protein does not seem to have an obvious linear model. although the least squares fit produces a line with a positive slope, the points don’t seem to indicate a strong positive relationship.
ggplot(starProFib, aes(x=statistic)) +
geom_histogram(aes(y=..density..), bins=50) +
stat_function(fun=dt, color="#F05133", args=list(df=nrow(starbucks) - 2)) +
ggtitle("Sampling distribution of slope: fiber vs. protein")
we can repeat the bootstrap analysis to identify whether the mathematical model fits the sampling distribution of the slopes when dealing with variables that do not seem to show a strong linear trend. we will argue that although the red line is not extremely different from the histogram, it is different enough to indicate concerns with the mathematical model used to describe fiber and protein.
Note that we again use the dt
function, but the fit is not as good as it was with fat and calorie. R won’t know when the fit is good and when it isn’t, so R will always use the t-distribution to calculate p-values, even when it isn’t a good idea.
ggplot(starProFib, aes(x=statistic)) +
geom_histogram(aes(y=..density..), bins=50) +
stat_function(fun=dt, color="#F05133", args=list(df=nrow(starbucks) - 2)) +
coord_cartesian(xlim =c(-3.2, -1.7)) +
ggtitle("Sampling distribution of slope: fiber vs. protein")
indeed, we are again most concerned with the fit of the t-distribution to the histogram in the tails of the distribution.
in looking at the left tail, notice that the red line is above the histogram values. that means, we are unlikely to see values as extreme if the null hypothesis is true. indeed, the p-value with the mathematical model will overestimate the true p-value.
ggplot(starProFib, aes(x=statistic)) +
geom_histogram(aes(y=..density..), bins=50) +
stat_function(fun=dt, color="#F05133", args=list(df=nrow(starbucks) - 2)) +
coord_cartesian(xlim =c(1.7, 3.2)) +
ggtitle("Sampling distribution of slope: fiber vs. protein")
Notice that the histogram values in the right tail are all above the red line. We are likely to call those value significant, even if the null hypothesis is true.
The p-value given by the mathematical model in the right tail will underestimate the true p-value.
The discrepancy from the mathematical model does not matter if the effect is extremely strong or if there is absolutely no effect. It matters when the data have low power either in terms of small sample size or minimal effect size.
So far we’ve talked about the mathematical model and how sometimes it can lead to inaccurate p-values. In the following lesson we’ll cover the technical conditions which insure the mathematical model does fit. But for now, it’s your turn to practice using the mathematical model.
Instead of simulating a null distribution (using permutations), the t-distribution can be used to calculate p-values and confidence intervals. The theoretical result provides a t-distribution fit for the sampling distribution of the standardized slope statistic. Why does it matter if the sampling distribution is accurate?
Hint: Lots of things can go wrong if the distribution doesn’t match the assumptions.
question("Why does it matter if the sampling distribution is accurate?",
answer("If the distribution is wrong, the software will not compute a p-value.", message="The software will always provide a p-value, even if the number is meaningless."),
answer("If the distribution is wrong, the p-value will not represent the probability of the data given the null hypothesis is true.", message="True... but there is more."),
answer("If the distribution is wrong, the CI procedure will not capture the true parameter in 95% of samples.", message="True... but there is more."),
answer("All of the above.", message="No. The p-value is always calculated, regardless of whether the distribution is appropriate."),
answer("Some of the above.", message="Right, the probabilities for BOTH the p-value and the CI are based on having an accurate sampling distribution. In this case, the sampling distribution used to calculate p-values and CIs is the t-distribution.", correct = TRUE),
allow_retry = TRUE
)
Using the permuted datasets (recall, the randomization forces the null hypothesis to be true), investigate the distribution of the standardized slope statistics (the slope, which has been divided by the standard error). Note that the distribution of the standardized slope is well described by a t-distribution.
set.seed(4747)
twins_perm <- twins %>% rep_sample_n(size=nrow(twins), reps=500) %>%
group_by(replicate) %>%
mutate(Biological_perm = sample(Biological)) %>%
do(tidy(lm(Foster ~ Biological_perm, data=.)))
twins_perm
for rows where term
equals "Biological_perm"
.twins
, as the number of its rows minus two.# Look at the data
twins_perm
# Filter for Biological_perm
biological_perm <- twins_perm %>%
___
# Calculate degrees of freedom of twins
degrees_of_freedom <- ___
- Call `filter()`, passing the condition `term` equal to `"Biological_perm"`.
- Use `nrow()` to calculate the number of rows of `twins`.
# Look at the data
twins_perm
# Filter for Biological_perm
biological_perm <- twins_perm %>%
filter(term == "Biological_perm")
# Calculate degrees of freedom of twins
degrees_of_freedom <- nrow(twins) - 2
biological_perm
, plot statistic
.y
to the special value ..density..
.fun
to dt
.args
to a list, with the element df
equal to degrees_of_freedom
.color
to "red"
.# From previous step
biological_perm <- twins_perm %>%
filter(term == "Biological_perm")
degrees_of_freedom <- nrow(twins) - 2
# Using biological_perm, plot statistic
___ +
# Add a histogram layer, with density on the y axis
___ +
# Add a t-distribution function stat, colored red
stat_function(fun = ___, args = list(df = ___), color = ___)
- Call `ggplot()` with `biological_term` as the data argument. Inside `aes()`, map `x` to `statistic`.
- Add `geom_histogram()`, and inside `aes()`, map `y` to `..density..`.
- Add `stat_function()`, setting `fn` to `dt` (without parentheses), `args` to a list with `df` set to `degrees_of_freedom`, and `color` set to `"red"`.
# From previous step
biological_perm <- twins_perm %>%
filter(term == "Biological_perm")
degrees_of_freedom <- nrow(twins) - 2
# Using biological_perm, plot statistic
ggplot(biological_perm, aes(x = statistic)) +
# Add a histogram layer, with density on the y axis
geom_histogram(aes(y = ..density..)) +
# Add a t-distribution function stat, colored red
stat_function(fun = dt, args = list(df = degrees_of_freedom), color = "red")
The p-value given by the lm
output is a two-sided p-value by default. In the twin study, it might seem more reasonable to follow along the one-sided scientific hypothesis that the IQ scores of the twins are positively associated. Because the p-value is the probability of the observed data or more extreme, the two-sided test p-value is twice as big as the one-sided result. That is, to get a one-sided p-value from the two-sided output in R, divide the p-value by two.
The linear regression model of Foster
vs. Biological
, model
, is provided in the script.
term
"Biological"
.one_sided_p_value
, of the one-sided p-value.model <- lm(Foster ~ Biological, data = twins)
# Get the Biological model coefficient
biological_term <- model %>%
# Tidy the model
___ %>%
# Filter for term equal to "Biological"
___
biological_term %>%
# Add a column of one-sided p-values
___(one_sided_p_value = ___)
- Call `tidy()`, without arguments.
- Call `filter()`, with the condition `term` equal to `"Biological"`.
- Call `mutate()`, setting `one_sided_p_value` equal to half of `p.value`.
model <- lm(Foster ~ Biological, data = twins)
# Get the Biological model coefficient
biological_term <- model %>%
# Tidy the model
tidy() %>%
# Filter for term equal to "Biological"
filter(term == "Biological")
biological_term %>%
# Add a column of one-sided p-values
mutate(one_sided_p_value = p.value / 2)
In thinking about the scientific research question, if IQ is caused only by genetics, then we would expect the slope of the line between the two sets of twins to be 1. Testing the hypothesized slope value of 1 can be done by making a new test statistic which evaluates how far the observed slope is from the hypothesized value of 1.
\[new_t = \frac{slope - 1}{SE}\]If the hypothesis that the slope equals one is true, then the new test statistic will have a t-distribution which we can use for calculating a p-value.
The biological term from the model is available as biological_term
.
twins
dataset.estimate
minus 1
, all divided by the standard error.pt()
, at the test_statistic
, with the degrees of freedom you just calculated.model <- lm(Foster ~ Biological, data = twins)
biological_term <- model %>%
tidy() %>%
filter(term == "Biological")
# Make sure all columns print
options(tibble.width = Inf)
# Calculate the degrees of freedom of twins
degrees_of_freedom <- nrow(twins) - 2
biological_term %>%
mutate(
# Calculate the test statistic
test_statistic = ___,
# Calculate its one-sided p-value
one_sided_p_value_of_test_statistic = ___,
# ... and its two-sided p-value
two_sided_p_value_of_test_statistic = ___
)
- The degrees of freedom are the number of rows of data minus two.
- The test statistic is the `estimate` minus `1` (in parentheses), divided by `std.error`.
- To calculate the one-sided p-value, call `pt()`, passing `test_statistic` and setting `df` to `degrees_of_freedom`.
- The two-sided p-value is `2` times `one_sided_p_value_of_test_statistic`.
# Calculate the degrees of freedom of twins
degrees_of_freedom <- nrow(twins) - 2
biological_term %>%
mutate(
# Calculate the test statistic
test_statistic = (estimate - 1) / std.error,
# Calculate its one-sided p-value
one_sided_p_value_of_test_statistic = pt(test_statistic, df = degrees_of_freedom),
# ... and its two-sided p-value
two_sided_p_value_of_test_statistic = 2 * one_sided_p_value_of_test_statistic
)
When technical conditions hold (see next lesson) , the inference from the randomization test and the t-distribution test should give equivalent conclusions. They will not provide the exact same answer because they are based on different methods. But they should give p-values and confidence intervals that are reasonably close.
obs_slope
.perm_slope
of the absolute value of the slope in each permuted replicate. The slope column is called stat
.summarize()
, calculate the p-value as the proportion of absolute estimates of slope that are at least as extreme as the absolute observed estimate of slope.set.seed(4747)
perm_slope <- twins %>%
specify(Foster ~ Biological) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "slope")
obs_slope <- lm(Foster ~ Biological, data = twins) %>%
tidy() %>%
filter(term == "Biological") %>%
select(estimate) %>%
pull()
# The slope in the observed data and each permutation replicate
obs_slope
perm_slope
# Calculate the absolute value of the observed slope
abs_obs_slope <- ___
# Find the p-value
perm_slope %>%
# Add a column for the absolute value of stat
___ %>%
summarize(
# Calculate prop'n permuted values at least as extreme as observed
p_value = ___
)
- `abs_obs_slope` is the `abs()`olute value of `obs_slope`.
- `abs_perm_slope` is the absolute value of `stat`.
- Recall that you can calculate a proportion using `mean(x >= y)`.
- "`x` is at least as extreme as `y`" means "the absolute value of `x` is greater than or equal to the absolute value of `y`".
# The slope in the observed data and each permutation replicate
obs_slope
perm_slope
# Calculate the absolute value of the observed slope
abs_obs_slope <- abs(obs_slope)
# Find the p-value
perm_slope %>%
# Add a column for the absolute value of stat
mutate(abs_perm_slope = abs(stat)) %>%
summarize(
# Calculate prop'n permuted values at least as extreme as observed
p_value = mean(abs_perm_slope >= abs_obs_slope)
)
Hypothesis testing of slopes is only one part of our inferential tool box. Often we are more interested in quantifying the slope in the population than we are in knowing if it is different from zero.
starbucks %>%
ggplot(aes(x=Fat, y=Calories)) + geom_point() +
ggtitle("Fat vs. Calories for Starbucks Food Items")
For the Starbucks data, it seems obvious that there is a linear model. But what is less obvious is the value for the true slope of the linear model that describes the population. We call the true population slope the parameter, and as part of estimating the parameter, we want to gauge our confidence in how closely the sample slope estimates the true population slope.
starbucks %>%
ggplot(aes(x=Fat, y=Calories)) + geom_point() +
ggtitle("Fat vs. Calories for Starbucks Food Items") +
geom_smooth(method="lm", se=FALSE)
Here, the hypothesis could be stated as a question on whether the linear model between fat and calories has a slope of zero. A slope of zero would indicate that regardless of the amount of fat in a particular food, the linear model predicts the same number of calories.
But what if you are interested in estimating the number of extra calories for foods with additional grams of fat? Instead of running a hypothesis test on the slope, your goal would be to find a confidence interval for the slope.
alpha = 0.05
crit_val <- qt((1-alpha/2), df = nrow(starbucks) - 2)
lm(Calories ~ Fat, data=starbucks) %>%
tidy(conf.int = TRUE, conf.level=1-alpha)
lm(Calories ~ Fat, data=starbucks) %>% tidy() %>%
mutate(lower = estimate - crit_val*std.error,
upper = estimate + crit_val*std.error)
As you’ve seen in previous lessons, a confidence interval for a given parameter can be created using the estimate of the parameter as well as the standard error of the estimate around the parameter. The estimate and the SE are both given in the tidy evaluation of the linear model.
tidy_mod <- lm(Calories ~ Fat, data = starbucks) %>%
tidy(conf.int = TRUE, conf.level=1-alpha)
tidy_mod
tidy_mod %>%
filter(term == "(Intercept)") %>%
select(conf.low, conf.high)
the confidence interval for the intercept parameter goes from 118.3kcal to 177.65 kcal. In this particular case, we can interpret the intercept in terms of the problem, because zero fat is a plausible value for the explanatory variable. Oftentimes, a zero value represents extrapolation outside of the plausible range, so it is important to be careful with the interpretation of the intercept.
Here we say that for the set of food items with zero fat, we are 95% confident that their average calories will be between 118.3 and 177.7.
tidy_mod <- lm(Calories ~ Fat, data = starbucks) %>%
tidy(conf.int = TRUE, conf.level=1-alpha)
tidy_mod
tidy_mod %>%
filter(term == "Fat") %>%
select(conf.low, conf.high)
The confidence interval for the slope parameter is 11.1 to 14.4. that means, for each set of foods with one additional gram of fat, we expect the average calorie content to be an additional 11.1 to 14.4 kcal.
note that a regression model cannot tell you anything about causation. that is, we can’t say that one additional gram of fat adds between 11.1 to 14.4 calories to a food item. Of course, in this case, nutritional science can tell us how many calories a gram of fat has, but that doesn’t mean that an additional gram of fat will add exactly that amount to a specific food item because the make-up of the food item includes many other components as well.
the correct interpretation of the confidence interval for the slope parameter is to indicate that you are 95% confident that foods with one higher gram of fat will have, on average, between 11.1 and 14.4 additional calories.
BS_slope <- starbucks %>%
specify(Calories ~ Fat) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "slope")
BS_slope %>%
summarize(low=quantile(stat, alpha / 2),
high=quantile(stat, 1 - alpha / 2))
In addition to using the mathematical theory to create a confidence interval for the slope, you can also use bootstrap sampling to estimate the variability of the slope associated with re-samples.
In the example above, you can see that the bootstrap confidence interval ranges from 11.2 to 14.3 (very close to the mathematical approximation which went from 11.1 to 14.4).
In previous tutorials, you have created confidence intervals with the formula of statistic plus/minus some number of standard errors. With bootstrapping, we typically use two standard errors. With t-based theory, we use the specific t-multiplier.
Create a CI for the slope parameter using both the default tidy()
call as well as mutate()
to calculate the confidence interval bounds explicitly. Note that the two methods should give exactly the same CI values because they are using the same computations.
alpha
has been set to 0.05
and the degrees of freedom of the twins
dataset is given to you.
alpha
.alpha
.qt()
. Pass it the upper percentile cutoff and the degrees of freedom.alpha <- 0.05
degrees_of_freedom <- nrow(twins) - 2
# Calculate the confidence level
confidence_level <- ___
# Calculate the upper percentile cutoff
p_upper <- ___
# Find the critical value from the t-distribution
critical_value <- ___
- Set `confidence_level` to `1` minus `alpha`.
- Set `p_upper` to `1` minus (`alpha` divided by `2`).
- Calculate `critical_value` as `qt()` of `p_upper` and `degrees_of_freedom`.
alpha <- 0.05
degrees_of_freedom <- nrow(twins) - 2
# Calculate the confidence level
confidence_level <- 1 - alpha
# Calculate the upper percentile cutoff
p_upper <- 1 - alpha / 2
# Find the critical value from the t-distribution
critical_value <- qt(p_upper, df = degrees_of_freedom)
conf.int
to TRUE
, and setting the confidence level, conf.level
, to confidence_level
.# From previous step
alpha <- 0.05
degrees_of_freedom <- nrow(twins) - 2
confidence_level <- 1 - alpha
p_upper <- 1 - alpha / 2
critical_value <- qt(p_upper, df = degrees_of_freedom)
tidied_model <- lm(Foster ~ Biological, data = twins) %>%
# Tidy the model, with a confidence interval
___
tidied_model %>%
# Manually calculate the same interval
mutate(
lower = ___,
upper = ___
)
- Call `tidy()`, setting `conf.int` to `TRUE` and `conf.level` to `confidence_level`.
- Inside `mutate()`, set `lower` to `estimate` minus `critical_value` times `std.error`.
- For `upper`, change the minus to a plus.
# From previous step
alpha <- 0.05
degrees_of_freedom <- nrow(twins) - 2
confidence_level <- 1 - alpha
p_upper <- 1 - alpha / 2
critical_value <- qt(p_upper, df = degrees_of_freedom)
tidied_model <- lm(Foster ~ Biological, data = twins) %>%
# Tidy the model, with a confidence interval
tidy(conf.int = TRUE, conf.level = confidence_level)
tidied_model %>%
# Manually calculate the same interval
mutate(
lower = estimate - critical_value * std.error,
upper = estimate + critical_value * std.error
)
As with hypothesis testing, if technical conditions hold (technical conditions are discussed more in the next lesson), the CI created for the slope parameter in the t-distribution setting should be in line with the CI created using bootstrapping. Create a CI for the slope parameter and compare it to the one created using the bootstrap percentile interval from the previous lesson.
Note that the bootstrap and t-intervals will not be exactly the same because they use different computational steps to arrive at the interval estimate.
set.seed(4747)
boot_slope <- twins %>%
specify(Foster ~ Biological) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "slope")
Calculate the confidence level and the lower and upper percentile cutoffs from alpha.
alpha <- 0.05
# Calculate the confidence level
confidence_level <- ___
# Calculate lower percentile cutoff
p_lower <- ___
# ... and the upper one
p_upper <- ___
- Calculate the confidence level as one minus `alpha`.
- Calculate the lower percentile cutoff as half of `alpha`.
- Calculate the upper percentile cutoff as one minus half of `alpha`.
alpha <- 0.05
# Calculate the confidence level
confidence_level <- 1 - alpha
# Calculate lower percentile cutoff
p_lower <- alpha / 2
# ... and the upper one
p_upper <- 1 - alpha / 2
confidence_level
.stat
. Use p_lower
and p_upper
for the quantiles.# From previous step
alpha <- 0.05
confidence_level <- 1 - alpha
p_lower <- alpha / 2
p_upper <- 1 - alpha / 2
# Tidy the model, including a confidence interval
tidied_model <- lm(Foster ~ Biological, data = twins) %>%
___
# Recall the t-confidence interval
tidied_model %>%
filter(term == "Biological") %>%
select(conf.low, conf.high)
# Create the bootstrap confidence interval
boot_slope %>%
summarize(
lower = ___,
upper = ___
)
- Call `tidy()`, setting `conf.int` to `TRUE` and `conf.level` to `confidence_level`.
- Inside `summarize()`, calculate `lower` by calling `quantile()` with `stat` and `p_lower`.
- For `upper`, change `p_lower` to `p_upper`.
# From previous step
alpha <- 0.05
confidence_level <- 1 - alpha
p_lower <- alpha / 2
p_upper <- 1 - alpha / 2
# Tidy the model, including a confidence interval
tidied_model <- lm(Foster ~ Biological, data = twins) %>%
tidy(conf.int = TRUE, conf.level = confidence_level)
# Recall the t-confidence interval
tidied_model %>%
filter(term == "Biological") %>%
select(conf.low, conf.high)
# Create the bootstrap confidence interval
boot_slope %>%
summarize(
lower = quantile(stat, p_lower),
upper = quantile(stat, p_upper)
)
Throughout the different tutorials, you have repeatedly seen confidence intervals which attempt to capture the true population parameter of interest. In the previous exercises, you created confidence intervals for the true slope and intercept parameters.
However, in regression, it is often of interest to put an interval estimate onto the average response or the predicted response variable.
starbucks %>%
ggplot(aes(x=Fat, y=Calories)) + geom_point() +
ggtitle("Fat vs. Calories for Starbucks Food Items") +
geom_smooth(method="lm", se=FALSE)
Recall that the variability of the linear model is based on both the slope and intercept changing from sample to sample.
ggplot(starbucks_many, aes(x=Fat, y=Calories, group=replicate)) +
geom_point() +
ggtitle("Fat vs. Calories for Starbucks Food Items") +
geom_smooth(method="lm", se=FALSE)
Indeed, the variability of the prediction is wider at the extreme values of fat. That is, it is harder to predict the average calories of food items that have zero fat than those that have 20g of fat.
alpha <- .05
crit_val <- qt((1-alpha/2), df = nrow(starbucks) - 2)
newfood <- data.frame(Fat = c(0,10,20,30))
augment(lm(Calories ~ Fat, data=starbucks),
se_fit = TRUE,
newdata = newfood) %>%
mutate(lowMean = .fitted - crit_val*.se.fit,
upMean = .fitted + crit_val*.se.fit)
The augment
function (in the broom package) calculates the standard error for the predicted average at each separate explanatory variable of interest (.se.fit
). the variability at zero and 30 grams of fat is much higher than the variability at 10 or 20 grams of fat.
As before, the CI for the average calories (as a parameter value) is the statistic plus or minus the standard error times the critical value. The critical value (crit_val
) here is calculated for a 95% interval with degrees of freedom of the total number of observations minus two.
Notice that the CI (118.3 to 177.7
) at zero grams of fat is the same as the CI for the intercept.
The 95% confidence interval (258.7 to 292.4
) representing the average calorie content for foods with 10g of fat is 258.7 calories to 292.4 calories.
Additionally, you can see the intervals which capture the average calorie content with 95% confidence for foods that are 20g of fat (388 and 418
) and 30g of fat (505 to 556.7
).
predMeans <- augment(lm(Calories ~ Fat, data = starbucks),
se_fit = TRUE) %>%
select(Calories, Fat, .fitted, .se.fit) %>%
mutate(lowMean = .fitted - crit_val*.se.fit,
upMean = .fitted + crit_val*.se.fit)
head(predMeans)
In order to produce a confidence bound for the entire linear model, the interval is calculated for every observation in the dataset.
ggplot(predMeans, aes(x = Fat, y = Calories)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE) +
geom_ribbon(aes(ymin = lowMean, ymax = upMean), alpha=.2)
In order to produce a confidence bound for the entire linear model, the interval is calculated for every observation in the dataset. We have plotted the interval using the geom_ribbon
in ggplot2, but the same plot could have been created using se = TRUE
in the stat smooth call.
alpha <- .05
crit_val <- qt((1-alpha/2), df = nrow(twins) - 2)
FatCal_lm <- lm(Calories ~ Fat, data = starbucks)
FatCal_gl <- glance(FatCal_lm)
FatCal_sig <- pull(FatCal_gl, sigma)
FatCal_pred <- augment(FatCal_lm,
se_fit = TRUE) %>%
mutate(.se.pred = sqrt(FatCal_sig^2 + .se.fit^2))
predResp <- FatCal_pred %>%
mutate(lowResp = .fitted - crit_val*.se.pred,
upResp = .fitted + crit_val*.se.pred)
predResp
Prediction intervals give a range of plausible values for the individual observations at a given level of the explanatory variable. For example, a prediction interval will tell the calorie counts for 95% of Starbucks foods with 10g of fat.
Note that the difference between prediction intervals for individual responses and confidence intervals for average responses comes in the value of the standard error, see ` mutate(.se.pred = sqrt(FatCal_sig^2 + .se.fit^2))`.
The standard error of the individual prediction is a combination of how variable the line is (.se.fit
) and how variable the individual points are (FatCal_sig
).
ggplot(predResp, aes(x = Fat, y = Calories)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE) +
geom_ribbon(aes(ymin = lowResp, ymax = upResp), alpha = .2)
We plot the prediction interval bounds using the geom_ribbon
function with the new interval upper and lower limits.
The previous two exercises gave CIs for the slope parameter. That is, based on the observed data, you provided an interval estimate of the plausible values for the true slope parameter in the population. Recall that the number 1 was in the CI for the slope, meaning 1 cannot be ruled out as a possible value for the true slope between Biological and Foster twins. There is no evidence to claim that there is a difference, on average, between the IQ scores of two twins in any given pair.
When working with a linear regression model, you might also want to know the plausible values for the expected value of the response at a particular explanatory location. That is, what would you expect the IQ of a Foster twin to be given a Biological twin’s IQ of 100?
alpha <- 0.05
degrees_of_freedom <- nrow(twins) - 2
confidence_level <- 1 - alpha
p_upper <- 1 - alpha / 2
critical_value <- qt(p_upper, df = degrees_of_freedom)
Call augment()
on model
to see the observation-level data in the model.
model <- lm(Foster ~ Biological, data = twins)
# Get observation-level values from the model
___
Call `augment()`, passing `model` as the only input.
model <- lm(Foster ~ Biological, data = twins)
# Get observation-level values from the model
augment(model, se_fit = TRUE)
data.frame
with a column named Biological
, taking values as a seq
uence from 70
to 130
in steps of 15
.augment()
’s newdata
argument to new_twins
.# From previous step
model <- lm(Foster ~ Biological, data = twins)
# Create a dataframe of new observations
new_twins <- ___
# Augment the model with the new dataset
augmented_model <- ___
# See the result
augmented_model
- Call `data.frame()`, setting `Biological` to `70`, `85`, ..., `130`.
- Call `augment()`, passing `model` and setting `newdata` to `new_twins`. Remember to set `se_fit = TRUE`.
# From previous step
model <- lm(Foster ~ Biological, data = twins)
# Create a dataframe of new observations
new_twins <- data.frame(Biological = seq(70, 130, 15))
# Augment the model with the new dataset
augmented_model <- model %>%
augment(newdata = new_twins, se_fit = TRUE)
# See the result
augmented_model
Calculate a confidence interval of the predicted IQs of the foster twin, based on the augmented model. This is the predicted value, .fitted
, plus or minus the critical value times the standard error of the prediction, .se.fit
.
# From previous steps
model <- lm(Foster ~ Biological, data = twins)
new_twins <- data.frame(Biological = seq(70, 130, 15))
augmented_model <- model %>% augment(newdata = new_twins, se_fit = TRUE)
augmented_model %>%
# Calculate a confidence interval on the predicted values
mutate(
lower_mean_prediction = ___,
upper_mean_prediction = ___
)
- Inside `mutate()`, calculate `lower_mean_prediction` as `.fitted` minus `critical_value` times `.se.fit`.
- For `upper_mean_prediction`, change the minus to a plus.
# From previous steps
model <- lm(Foster ~ Biological, data = twins)
new_twins <- data.frame(Biological = seq(70, 130, 15))
augmented_model <- model %>% augment(newdata = new_twins, se_fit = TRUE)
augmented_model %>%
# Calculate a confidence interval on the predicted values
mutate(
lower_mean_prediction = .fitted - critical_value * .se.fit,
upper_mean_prediction = .fitted + critical_value * .se.fit
)
The confidence interval for the average response can be computed for all observations in the dataset. Using augment()
directly on the twins
dataset gives predictions and standard errors for the Foster twin based on all the Biological observations.
Note that the calculation of the regression line is more stable at the center, so predictions for the extreme values are more variable than predictions in the middle of the range of explanatory IQs.
The foster twin IQ predictions that you calculated last time are provided as predictions
. These predictions are shown in a plot using geom_smooth()
.
Manually create what geom_smooth()
does, using predictions
. Provide the aesthetics and data to each geom.
Foster
vs. Biological
, using the data = twins
dataset..fitted
vs. Biological
, using the data = predictions
dataset. Color the line "blue"
.x
mapped to Biological
, ymin
mapped to lower_mean_prediction
and ymax
mapped to upper_mean_prediction
. Use the data = predictions
dataset and set the transparency, alpha
to 0.2
.alpha <- 0.05
degrees_of_freedom <- nrow(twins) - 2
confidence_level <- 1 - alpha
p_upper <- 1 - alpha / 2
critical_value <- qt(p_upper, df = degrees_of_freedom)
model <- lm(Foster ~ Biological, data = twins)
new_twins <- data.frame(Biological = seq(70, 130, 15))
augmented_model <- model %>% augment(newdata = new_twins, se_fit = TRUE)
predictions <- augmented_model %>%
# Calculate a confidence interval on the predicted values
mutate(
lower_mean_prediction = .fitted - critical_value * .se.fit,
upper_mean_prediction = .fitted + critical_value * .se.fit
)
ggplot(twins, aes(x = Biological, y = Foster)) +
geom_point() +
geom_smooth(method = "lm")
# This plot is shown
ggplot(twins, aes(x = Biological, y = Foster)) +
geom_point() +
geom_smooth(method = "lm")
ggplot() +
# Add a point layer of Foster vs. Biological, using twins
___(aes(___, ___), data = ___) +
# Add a line layer of .fitted vs Biological, using predictions, colored blue
___ +
# Add a ribbon layer of lower_mean_prediction to upper_mean_prediction vs Biological,
# using predictions, transparency of 0.2
___
- Add `geom_point()`. Inside `aes()`, map `x` to `Biological` and `y` to `Foster`. Set `data` to `twins`.
- Add `geom_line()`. Inside `aes()`, map `x` to `Biological` and `y` to `.fitted`. Set `data` to `predictions`.
- Add `geom_ribbon()`. Inside `aes()`, map `x` to `Biological`, `ymin` to `lower_mean_prediction`, and `ymax` to `upper_mean_prediction`. Set `data` to `predictions`, and `alpha` to `0.2`.
# This plot is shown
ggplot(twins, aes(x = Biological, y = Foster)) +
geom_point() +
geom_smooth(method = "lm")
ggplot() +
# Add a point layer of Foster vs. Biological, using twins
geom_point(aes(x = Biological, y = Foster), data = twins) +
# Add a line layer of .fitted vs Biological, using predictions, colored blue
geom_line(aes(x = Biological, y = .fitted), data = predictions, color = "blue") +
# Add a ribbon layer of lower_mean_prediction to upper_mean_prediction vs Biological,
# using predictions, transparency of 0.2
geom_ribbon(
aes(x = Biological, ymin = lower_mean_prediction, ymax = upper_mean_prediction),
data = predictions, alpha = 0.2
)
Along with an interval estimate for the expected value of the response, it is often desired to have an interval estimate for the actual individual responses. The formulation for the prediction is the same, but the predicted points are more variable around the line, so the standard error is calculated to be a larger value.
As with the interval around the expected average values, the interval for predicted individual values is smaller in the middle than on the extremes due to the calculation of the regression line being more stable at the center. Note that the intervals for the average responses are much smaller than the intervals for the individual responses.
You have already seen tidy()
, to pull out coefficient-level information from a model, and augment()
for observation-level information. glance()
completes the triumvirate, giving you model-level information.
The linear regression is provided as model
and the predictions from the previous exercise are given as predictions
.
alpha <- 0.05
degrees_of_freedom <- nrow(twins) - 2
confidence_level <- 1 - alpha
p_upper <- 1 - alpha / 2
critical_value <- qt(p_upper, df = degrees_of_freedom)
model <- lm(Foster ~ Biological, data = twins)
new_twins <- data.frame(Biological = seq(70, 130, 15))
augmented_model <- model %>% augment(newdata = new_twins, se_fit = TRUE)
predictions <- augmented_model %>%
# Calculate a confidence interval on the predicted values
mutate(
lower_mean_prediction = .fitted - critical_value * .se.fit,
upper_mean_prediction = .fitted + critical_value * .se.fit
)
model <- lm(Foster ~ Biological, data = twins)
glance()
to get the model-level information from model
.sigma
element.twins_sigma
squared and .se.fit
squared).twins_sigma <- model %>%
# Get model-level information
___ %>%
# Pull out sigma
___
predictions %>%
# Calculate the std err of the predictions
mutate(std_err_of_predictions = ___)
- From `model`, pipe to `glance()` then `pull()`, passing `sigma` as the only argument to the latter.
- Inside `mutate()`, `std_err_of_predictions` should be the `sqrt()` of (`twins_sigma` to the power `2`) plus (`.se.fit` to the power `2`). Add the two together versus using the `sum()` function!
twins_sigma <- model %>%
# Get model-level information
glance() %>%
# Pull out sigma
pull(sigma)
predictions %>%
# Calculate the std err of the predictions
mutate(std_err_of_predictions = sqrt(twins_sigma ^ 2 + .se.fit ^ 2))
critical_value
has already been loaded into your workspace.
.fitted
plus or minus the critical value times the standard error of the predictions.# From previous step
twins_sigma <- model %>% glance() %>% pull(sigma)
predictions2 <- predictions %>%
mutate(std_err_of_predictions = sqrt(twins_sigma ^ 2 + .se.fit ^ 2))
predictions3 <- predictions2 %>%
# Calculate the prediction intervals
mutate(
lower_response_prediction = ___,
upper_response_prediction = ___
)
# See the result
predictions3
- Calculate `lower_response_prediction` as `.fitted` minus `critical_value` times `std_err_of_predictions`.
- For `upper_response_prediction`, swap the minus for a plus.
# From previous step
twins_sigma <- model %>% glance() %>% pull(sigma)
predictions2 <- predictions %>%
mutate(std_err_of_predictions = sqrt(twins_sigma ^ 2 + .se.fit ^ 2))
predictions3 <- predictions2 %>%
# Calculate the prediction intervals
mutate(
lower_response_prediction = .fitted - critical_value * std_err_of_predictions,
upper_response_prediction = .fitted + critical_value * std_err_of_predictions
)
# See the result
predictions3
Update the plot to add a ribbon layer of prediction intervals.
Biological
on the x-axis, and make the y-axis go from lower_response_prediction
(ymin
) to upper_response_prediction
(ymax
).predictions3
dataset.alpha
, to 0.2
.fill
color of "red"
.# From previous step3
twins_sigma <- model %>% glance() %>% pull(sigma)
predictions3 <- predictions %>%
mutate(
std_err_of_predictions = sqrt(twins_sigma ^ 2 + .se.fit ^ 2),
lower_response_prediction = .fitted - critical_value * std_err_of_predictions,
upper_response_prediction = .fitted + critical_value * std_err_of_predictions
)
# Update the plot
ggplot() +
geom_point(aes(x = Biological, y = Foster), data = twins) +
geom_smooth(aes(x = Biological, y = Foster), data = twins, method = "lm") +
# Add a ribbon layer
___(
# ... of lower_response_prediction to upper_response_prediction vs. Biological
___,
# ... using the predictions3 dataset
___,
# ... with transparency set to 0.2
___,
# ... and fill color red
___
)
- Add `geom_ribbon()`.
- Inside `aes()`, map `x` to `Biological`, `ymin` to `lower_response_prediction`, and `ymax` to `upper_response_prediction`.
- Set `data` to `predictions3`, `alpha` to `0.2`, and `fill` to `"red"`.
# From previous step3
twins_sigma <- model %>% glance() %>% pull(sigma)
predictions3 <- predictions %>%
mutate(
std_err_of_predictions = sqrt(twins_sigma ^ 2 + .se.fit ^ 2),
lower_response_prediction = .fitted - critical_value * std_err_of_predictions,
upper_response_prediction = .fitted + critical_value * std_err_of_predictions
)
# Update the plot
ggplot() +
geom_point(aes(x = Biological, y = Foster), data = twins) +
geom_smooth(aes(x = Biological, y = Foster), data = twins, method = "lm") +
# Add a ribbon layer
geom_ribbon(
# ... of lower_response_prediction to upper_response_prediction vs. Biological
aes(x = Biological, ymin = lower_response_prediction, ymax = upper_response_prediction),
# ... using the predictions3 dataset
data = predictions3,
# ... with transparency set to 0.2
alpha = 0.2,
# ... and fill color red
fill = "red"
)
You have successfully completed Lesson 3 in Tutorial 6: Inferential Modeling.
What’s next?
Full list of tutorials supporting OpenIntro::Introduction to Modern Statistics