9 Linear Model Visualization
The purpose of this section is to visualize potentially complex linear models. We will begin with how to visualize a linear regression model with a single quantitative predictor in order to see how the package works. We will then expand our usage to include models with categorical predictors, interaction terms, and quadratic (or higher order) terms.
Throughout this section, we will use a data set on course evaluations at the University of Texas at Austin from the openintro package. Each row of the data set corresponds to a course at UT Austin. Variables that we will use include score (the course evaluation score, on a 1-5 scale), age (the age of the professor, in years), bty_avg (the average beauty rating of the professor), ethnicity (either minority or not minority), and gender (either male or female in this data set).
We will assume each observation is independent in our linear model. For those of you who have had STAT 313, it is more reasonable to fit a random intercepts model with prof_id as the random effect term, as each professor appears more than once in the data set (and it is likely that course scores for a professor are not independent).
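For the curious, a minimal sketch of such a fit using the lme4 package (an assumption on our part; nothing later in this section depends on it):

## not run: random intercepts for professors, with age as one possible
## fixed effect (STAT 313 material; requires the lme4 package)
library(lme4)
mod_ri <- lmer(score ~ age + (1 | prof_id), data = evals)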
9.1 Basic Strategy (Class Prep)
Our basic strategy for visualizing models is to:

1. Fit the model of interest with lm() and check model assumptions.
2. Either specify model terms that we want to make predictions for within ggpredict() from the ggeffects package or make a data grid explicitly with expand_grid().
3. Use ggpredict() to predict the response variable according to the model for each row in the grid.
4. Use ggplot2 to construct a meaningful plot with the model predictions from (3).
We will begin by fitting a linear regression model with score as the response and age (in years) as the predictor. Note that we can easily visualize this model because of how simple it is:
ggplot(data = evals, aes(x = age, y = score)) +
geom_point() +
geom_smooth(method = "lm")
Our first goal is to recreate the plot above “by hand” so that we can see what the different functions are doing in a relatively simple example.
Step 1: Fit the model.
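We fit the simple linear regression model and store it as mod_age, the object name used throughout the rest of this section:

mod_age <- lm(score ~ age, data = evals)
## remember to check the model assumptions (e.g., with residual plots)
## before using the model for predictions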
Steps 2 and 3: Create a grid of predictor values and use ggpredict() to obtain predictions.
In this simple example, we only have one predictor: age. So, we want to create a tibble that has a few values of age to plug into the fitted model, obtaining a predicted score for each value of age that we provide. We can do this with ggpredict() by providing the name of our fitted model object and the terms for which we would like to create predictions (here, terms = "age", since age is our only predictor).
ggpredict(mod_age, terms = "age")
#> # Predicted values of score
#>
#> age | Predicted | 95% CI
#> ----------------------------
#> 28 | 4.30 | 4.18, 4.41
#> 34 | 4.26 | 4.17, 4.35
#> 40 | 4.22 | 4.16, 4.29
#> 46 | 4.19 | 4.14, 4.24
#> 50 | 4.17 | 4.11, 4.22
#> 56 | 4.13 | 4.07, 4.19
#> 62 | 4.09 | 4.01, 4.18
#> 74 | 4.02 | 3.88, 4.16

What do we see in the output? We see that ggpredict() has constructed the predicted response (score) for eight different ages. Verify that you can obtain the first "prediction" of 4.30 with a by-hand calculation from the fitted model object, plugging age = 28 into the fitted regression equation.
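One way to carry out that check in R, sketched below, is to pull the coefficients out of mod_age and do the arithmetic directly:

## by-hand prediction at age = 28: intercept + slope * 28
coef(mod_age)
unname(coef(mod_age)[1] + coef(mod_age)[2] * 28)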
You should also notice a message in the ggpredict() output saying that there are actually more rows than the 8 we see here. We can convert the output to a tibble, using the term names as column names, with:
ggpredict(mod_age, terms = "age") |>
  as.data.frame(terms_to_colnames = TRUE) |>
  as_tibble()
#> # A tibble: 24 × 6
#> age predicted std.error conf.low conf.high group
#> <int> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 28 4.30 0.0581 4.18 4.41 1
#> 2 30 4.28 0.0535 4.18 4.39 1
#> 3 32 4.27 0.0490 4.18 4.37 1
#> 4 34 4.26 0.0447 4.17 4.35 1
#> 5 36 4.25 0.0405 4.17 4.33 1
#> 6 38 4.24 0.0366 4.16 4.31 1
#> # ℹ 18 more rows

So ggpredict() actually constructed 24 predictions. While ggpredict() chooses which values of age are in our grid through a default mechanism, we can also explicitly give the function the values of age that we want in our grid:
ggpredict(mod_age, terms = "age [28:74]") |>
  as.data.frame(terms_to_colnames = TRUE) |>
  as_tibble()
#> # A tibble: 47 × 6
#> age predicted std.error conf.low conf.high group
#> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 28 4.30 0.0581 4.18 4.41 1
#> 2 29 4.29 0.0558 4.18 4.40 1
#> 3 30 4.28 0.0535 4.18 4.39 1
#> 4 31 4.28 0.0512 4.18 4.38 1
#> 5 32 4.27 0.0490 4.18 4.37 1
#> 6 33 4.27 0.0468 4.17 4.36 1
#> # ℹ 41 more rows

This constructs a predicted score for each age in the sequence 28, 29, 30, …, 72, 73, 74.
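The brackets also accept a comma-separated list of specific values. For instance, the sketch below would predict score at just three handpicked ages:

ggpredict(mod_age, terms = "age [30, 50, 70]")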
There are also certain “helper” arguments that ggpredict() provides. A few of the more useful ones are:
- meansd: provides predictions at the mean age and at one standard deviation above and below the mean age.
- threenum: provides predictions at the median age, Q1, and Q3.
- fivenum: provides predictions at the median age, Q1, Q3, the minimum age, and the maximum age.
- all: provides predictions for all values of age in the original data set.
The syntax for overriding the default behavior of ggpredict() is to use [ ] following the predictor in the terms argument. Below, we obtain predictions for score at the minimum, Q1, median, Q3, and maximum values for age:
ggpredict(mod_age, terms = "age [fivenum]") |>
as.data.frame(terms_to_colnames = TRUE) |>
as_tibble()
#> # A tibble: 5 × 6
#> age predicted std.error conf.low conf.high group
#> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 29 4.29 0.0558 4.18 4.40 1
#> 2 42 4.21 0.0300 4.15 4.27 1
#> 3 48 4.18 0.0252 4.13 4.23 1
#> 4 57 4.12 0.0335 4.06 4.19 1
#> 5    73      4.03    0.0681     3.89      4.16 1

Note also that we obtain 95% confidence intervals for the mean response at each value of age in our grid: the lower bounds are in a column called conf.low, while the upper bounds are in a column called conf.high.
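For completeness, the expand_grid() route from step (2) of our strategy might look like the following sketch, here paired with base R's predict() to obtain the predictions and confidence intervals:

## build the grid explicitly (expand_grid() comes from tidyr) ...
grid <- expand_grid(age = c(30, 40, 50, 60, 70))
## ... then predict the mean score for each row of the grid
predict(mod_age, newdata = grid, interval = "confidence")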
Step 4: Use ggplot2.
The last step is to use ggplot2 to make a meaningful plot. In this case, we can construct a plot with score on the y-axis and age on the x-axis using the default predictions from ggpredict().
pred_evals <- ggpredict(mod_age, terms = "age") |>
  as.data.frame(terms_to_colnames = TRUE) |>
  as_tibble()

ggplot(data = evals, aes(x = age, y = score)) +
  geom_point() +
  geom_line(data = pred_evals, aes(x = age, y = predicted),
            colour = "blue", linewidth = 1.2)
Note that, because the original data and the grid used for the predictions are different data frames, we have to remember to adjust the data used for each geom appropriately, taking advantage of the fact that a local data argument and local aes() aesthetics will override the global arguments in the ggplot() function.
We can then use the geom_ribbon() function to add a measure of uncertainty to our plot and generate the 95% confidence band:
ggplot(data = evals, aes(x = age, y = score)) +
geom_point() +
geom_line(data = pred_evals, aes(x = age, y = predicted),
colour = "blue", linewidth = 1.2) +
geom_ribbon(data = pred_evals, aes(y = predicted,
ymin = conf.low,
ymax = conf.high),
alpha = 0.4)
So, we finally get a plot of the fitted model that matches the default plot from geom_smooth(method = "lm")!
This exercise is only meant to help us understand the steps behind the plot: for such a simple model, we would not go through the trouble of this workflow when the plot can easily be created without it. We will, however, take advantage of this workflow to visualize more complex models.
Exercise 1. As we saw above, the grey “band” around the fitted regression line represents 95% confidence intervals for the mean response (score) for particular values of the predictor (age). In STAT 213, you also discussed 95% prediction intervals for a new observation’s response (score) for particular values of the predictor (age). What is the difference between a 95% confidence interval and a 95% prediction interval?
Exercise 2. Modify the code so that the grey band reflects 95% prediction intervals instead of 95% confidence intervals for the mean. Hint: there is an interval argument in ggpredict().
Exercise 3. By "hand", verify that the predicted value in the first row of pred_evals can be calculated simply by plugging 28 into the fitted regression equation obtained from mod_age.
Exercise 4. Fit the following model, which includes an age^2 term. Then, run the rest of the code in the chunk to obtain predictions for the age values in the grid with both the mod_age model and the mod_age_sq model.
pred_evals
#> # A tibble: 24 × 6
#> age predicted std.error conf.low conf.high group
#> <int> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 28 4.30 0.0581 4.18 4.41 1
#> 2 30 4.28 0.0535 4.18 4.39 1
#> 3 32 4.27 0.0490 4.18 4.37 1
#> 4 34 4.26 0.0447 4.17 4.35 1
#> 5 36 4.25 0.0405 4.17 4.33 1
#> 6 38 4.24 0.0366 4.16 4.31 1
#> # ℹ 18 more rows
mod_age_sq <- lm(score ~ age + I(age ^ 2), data = evals)
pred_evals_sq <- ggpredict(mod_age_sq, terms = "age") |>
as.data.frame(terms_to_colnames = TRUE) |>
as_tibble()
pred_evals_sq
#> # A tibble: 35 × 6
#> age predicted std.error conf.low conf.high group
#> <int> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 29 4.29 0.0887 4.12 4.47 1
#> 2 31 4.28 0.0724 4.14 4.42 1
#> 3 32 4.27 0.0652 4.15 4.40 1
#> 4 33 4.27 0.0587 4.15 4.38 1
#> 5 34 4.26 0.0528 4.16 4.36 1
#> 6 35 4.25 0.0476 4.16 4.35 1
#> # ℹ 29 more rows

Use ggplot2 to make a plot that (1) has the fitted line from mod_age and the fitted curve from mod_age_sq, with the line and curve coloured by model type, and (2) has the data points in the background of the plot. The code below stacks the two predicted data frames on top of each other and creates a new column called model that takes the names of the data frames as its levels.
plot_df <- bind_rows(lst(pred_evals, pred_evals_sq), .id = "model")
plot_df
#> # A tibble: 59 × 7
#> model age predicted std.error conf.low conf.high group
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 pred_evals 28 4.30 0.0581 4.18 4.41 1
#> 2 pred_evals 30 4.28 0.0535 4.18 4.39 1
#> 3 pred_evals 32 4.27 0.0490 4.18 4.37 1
#> 4 pred_evals 34 4.26 0.0447 4.17 4.35 1
#> 5 pred_evals 36 4.25 0.0405 4.17 4.33 1
#> 6 pred_evals 38 4.24 0.0366 4.16 4.31 1
#> # ℹ 53 more rows

Is there evidence that a squared term is useful to include in this model?
9.2 Visualizing More Complex Models
The power of this model visualization strategy can really be seen in models where the coefficients are more challenging to interpret. For example, suppose that we fit the following model to the evals data:
mod_comp <- lm(score ~ age + bty_avg + age:bty_avg + gender,
data = evals)
mod_comp |> tidy()
#> # A tibble: 5 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 5.24 0.362 14.5 2.08e-39
#> 2 age -0.0308 0.00730 -4.22 2.91e- 5
#> 3 bty_avg -0.204 0.0745 -2.74 6.48e- 3
#> 4 gendermale 0.213 0.0512 4.16 3.75e- 5
#> 5 age:bty_avg 0.00574 0.00156 3.69 2.53e- 4

The model contains an interaction between age and bty_avg, so the coefficients involving these two terms are very tough to interpret. Our goal is to create a plot that helps interpret this model.
We will use the same strategy outlined in the previous section to create a data frame with predictions for various values of age, bty_avg, and gender. In the terms argument, we now need to give values not only for age, but also for bty_avg and gender. Note that we could request specific values of gender with "gender [male]", but it is much more common to want predictions for each level of a categorical predictor.
pred_comp <- ggpredict(mod_comp, terms = c("bty_avg",
                                           "age [threenum]",
                                           "gender")) |>
  as.data.frame(terms_to_colnames = TRUE) |>
  as_tibble()

We obtain 90 predictions: there are 3 unique values for age, 2 unique values for gender, and 15 unique values for beauty average (3 * 2 * 15 = 90). Note that ggpredict() "assumes" that the first term provided in terms is the one we plan to put on the x-axis of our plot, so it is given a relatively large number of unique values to predict for (15 here). Any subsequent terms are assumed to be grouping variables and are given fewer unique values (we specified 3 here for age, and gender has 2 levels).
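A quick sanity check on the grid (a sketch, assuming dplyr is loaded): each of the 3 × 2 = 6 age-gender combinations should appear once per bty_avg value.

nrow(pred_comp)                 ## 90
pred_comp |> count(age, gender) ## 6 combinations, 15 rows each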
And the final step is to create a plot of the resulting model predictions. This is the step that requires the most critical thinking, as the plot will change depending on (1) how many models we fit (just 1 in this example) and (2) how many predictor variables we have.
Exercise 1. By hand, sketch a plot that shows the predictions from the mod_comp model in a meaningful way.
Exercise 2. Make the plot that you sketched in the previous exercise.
Exercise 3. We've discussed in this class the importance of showing uncertainty in our visualizations when possible. How could you modify the plot so that uncertainty is shown?
Exercise 4. Adjust the number of unique age values for which predictions are computed, and reconstruct the plot. Does it still make sense to use geom_ribbon() in the resulting plot?
9.3 Your Turn
Exercise 1. Fit a model of your choice with two categorical predictors, one quantitative predictor, and an interaction between the quantitative predictor and one of the categorical predictors. Construct a plot that helps interpret the coefficients from the fitted model. You do not need to show confidence bands on your plot. You should make a sketch of the plot you intend to create first!
Exercise 2. Modify the model from the previous exercise by getting rid of the interaction term. Using the workflow we have been using, construct a plot that compares the model with the interaction and the model without the interaction. Again, it might be helpful to sketch the plot first.
Exercise 3. The purpose of this exercise is to visualize a model that has more predictors than we can reasonably plot.
Consider the RailsTrails data set from the Stat2Data package. This data set contains house prices on a sample of homes in the Northampton, Massachusetts area. Suppose that we are interested in constructing a model to predict home price (Price2014, in thousands of dollars) with the following predictors:
- Distance: the distance to the nearest entry point of the rail trail network, in miles,
- GarageSpaces: the number of garage spaces the home has,
- NumRooms: the number of rooms the home has,
- BikeScore: the bike friendliness of the home, on a scale from 0 to 100,
- SquareFeet: the square footage of the house (in hundreds of square feet),
- Zip: the zip code (either 1060 for Northampton or 1062 for Florence).
Suppose that all model assumptions hold. How would we be able to visualize this model?
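As a starting point, the fit itself might look like the sketch below; treating Zip as a factor is our assumption, since 1060 and 1062 are labels rather than meaningful numbers.

library(Stat2Data)
data(RailsTrails)
mod_price <- lm(Price2014 ~ Distance + GarageSpaces + NumRooms +
                  BikeScore + SquareFeet + factor(Zip),
                data = RailsTrails)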
We will make a sketch, talk through limitations, and construct a visualization in class.
Exercise 4. In all of our models above, we have assumed that there is not a lot of multicollinearity among the predictors. The purpose of this exercise is to explore why a visualization of model predictions is problematic when there is a lot of multicollinearity among the predictor variables. Recall that multicollinearity means that predictor variables are highly correlated with one another. The example below uses a data set on body measurements: the response variable, brozek, is a metric of body fat percentage, while the predictor variables, weight, thigh, and adipos, all correspond to other body measurements that are much easier to obtain than brozek body fat percentage.
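Alongside the scatterplot matrix below, a quick numeric check is to compute the pairwise correlations directly (a sketch; the fat data set comes from the faraway package loaded in the next chunk):

library(faraway)
cor(fat[, c("brozek", "weight", "thigh", "adipos")])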
- Examine the following scatterplot matrix. Explain why the model has a large amount of multicollinearity.
## install.packages("faraway")
library(faraway)
library(broom)
library(GGally)
ggpairs(fat, columns = c("brozek", "weight", "thigh", "adipos"))
- Examine the output from the following fitted model. What looks odd about the fitted model coefficients and about the p-values in the model?
mod <- lm(brozek ~ weight + thigh + adipos, data = fat)
mod |> tidy()
#> # A tibble: 4 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -19.1 4.59 -4.16 4.41e- 5
#> 2 weight -0.0349 0.0296 -1.18 2.40e- 1
#> 3 thigh -0.0472 0.131 -0.359 7.20e- 1
#> 4 adipos 1.85 0.203 9.12 2.59e-17

- Before completing the next part, explain what you think a plot will look like that has weight on the x-axis and predictions for brozek, at specific values of thigh and adipos, on the y-axis.
- Run the code below to construct the plot. Why does the model look like it fits the points very poorly?
fat_preds <- ggpredict(mod, terms = c("weight")) |>
as.data.frame(terms_to_colnames = TRUE) |>
as_tibble()
ggplot(data = fat_preds, aes(x = weight, y = predicted)) +
geom_line(linewidth = 1.7, colour = "blue") +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
fill = "lightblue", alpha = 0.3) +
theme_minimal() +
geom_point(data = fat, aes(y = brozek), alpha = 0.7)