8  STAT 213 Review

8.1 STAT 213 with broom (Class Prep)

In this section, we will focus more on a tidy approach to the models that you fit and interpreted in STAT 213.

Note

You may or may not have completed STAT 213 using the functions from the broom package (tidy(), augment(), and glance()), but this section will assume that you have not used these functions.

The functions in the broom package return tibbles with model summary information that we can then use for further analysis, plotting, or presentation. The broom package consists of three primary functions: tidy(), glance(), and augment(), which are described below.

We will use the coffee_ratings data set, which contains observations on ratings of various coffees throughout the world. The data were obtained from the TidyTuesday repository on GitHub (https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-07-07/readme.md).

A description of each variable in the data set is given below.

  • total_cup_points, the score of the coffee by a panel of experts (our response variable for this section)
  • species, the species of the coffee bean (Arabica or Robusta)
  • aroma, aroma (smell) grade
  • flavor, flavor grade
  • aftertaste, aftertaste grade
  • acidity, acidity grade
  • body, body grade
  • balance, balance grade
  • uniformity, uniformity grade
  • clean_cup, clean cup grade
  • sweetness, sweetness grade
  • moisture, moisture grade
  • category_one_defects, count of category one defects
  • quakers, the number of quakers (immature beans that fail to darken when roasted)
  • category_two_defects, the number of category two defects

In the examples below, we will consider the multiple linear regression model

\[ Y = \beta_0 + \beta_1 \text{species} + \beta_2 \text{aroma} + \beta_3 \text{flavor} + \beta_4 \text{sweetness} + \beta_5 \text{moisture} + \epsilon, \]

where \(Y\) is the rating of the coffee (total_cup_points), species is an indicator variable equal to 1 if the species is Robusta and 0 if the species is Arabica, and the \(\epsilon\) are the random errors, which follow the usual assumptions that they are independent and normally distributed with constant variance.

8.1.1 tidy()

tidy() is analogous to summary() for a linear model object. Let's start by fitting a linear model with lm() with total_cup_points as the response and species, aroma, flavor, sweetness, and moisture as predictors.

Read in the data, load the broom package (installing it first with install.packages("broom") if needed), and fit the model:

library(tidyverse)
library(broom)
library(here)
theme_set(theme_minimal())

coffee_df <- read_csv(here("data/coffee_ratings.csv"))
coffee_mod <- lm(total_cup_points ~ species + aroma + flavor +
                   sweetness + moisture,
                 data = coffee_df)

In STAT 213, you likely used summary() to look at the model output:

summary(coffee_mod)
#> 
#> Call:
#> lm(formula = total_cup_points ~ species + aroma + flavor + sweetness + 
#>     moisture, data = coffee_df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -9.5132 -0.3705  0.0726  0.5610  5.5844 
#> 
#> Coefficients:
#>                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)     7.04039    0.77377   9.099  < 2e-16 ***
#> speciesRobusta  2.85365    0.26861  10.624  < 2e-16 ***
#> aroma           1.95188    0.14575  13.392  < 2e-16 ***
#> flavor          5.09440    0.14042  36.281  < 2e-16 ***
#> sweetness       2.23956    0.06553  34.173  < 2e-16 ***
#> moisture       -1.88033    0.67368  -2.791  0.00533 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.168 on 1333 degrees of freedom
#> Multiple R-squared:  0.8891, Adjusted R-squared:  0.8887 
#> F-statistic:  2137 on 5 and 1333 DF,  p-value: < 2.2e-16

However, summary() has a few inconveniences. First, the output is printed to the console and is not formatted in a way that is easy to read or reuse. Second, it can be challenging to pull items from the output with code. For example, to pull the p-value for moisture, you would need to write something like:

summary(coffee_mod)$coefficients["moisture", 4]
#> [1] 0.005327594

tidy() is an alternative that puts the model coefficients, standard errors, t-stats, and p-values in a tidy tibble:

tidy(coffee_mod)
#> # A tibble: 6 × 5
#>   term           estimate std.error statistic   p.value
#>   <chr>             <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)        7.04    0.774       9.10 3.23e- 19
#> 2 speciesRobusta     2.85    0.269      10.6  2.31e- 25
#> 3 aroma              1.95    0.146      13.4  1.82e- 38
#> 4 flavor             5.09    0.140      36.3  4.73e-201
#> 5 sweetness          2.24    0.0655     34.2  2.41e-184
#> 6 moisture          -1.88    0.674      -2.79 5.33e-  3
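Because tidy() returns a tibble, cleaning the coefficient table up for a report takes a single pipe. As a sketch (assuming the knitr package is installed, for its kable() function), we can round the numeric columns and format the result:

# A sketch: round the numeric columns, then format the coefficient
# table with knitr::kable() (knitr is assumed to be installed).
tidy(coffee_mod) |>
  mutate(across(where(is.numeric), ~ round(.x, 3))) |>
  knitr::kable()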

Beyond presentation, the advantage of this format is that we can use any other tidyverse function on the output. To pull the p-values:

tidy(coffee_mod) |> select(p.value)
#> # A tibble: 6 × 1
#>     p.value
#>       <dbl>
#> 1 3.23e- 19
#> 2 2.31e- 25
#> 3 1.82e- 38
#> 4 4.73e-201
#> 5 2.41e-184
#> 6 5.33e-  3

or, to grab the output for a particular variable of interest:

tidy(coffee_mod) |> filter(term == "aroma")
#> # A tibble: 1 × 5
#>   term  estimate std.error statistic  p.value
#>   <chr>    <dbl>     <dbl>     <dbl>    <dbl>
#> 1 aroma     1.95     0.146      13.4 1.82e-38
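Combining filter() with pull() extracts a single value, replicating the summary() indexing from earlier with much more readable code:

tidy(coffee_mod) |>
  filter(term == "moisture") |>
  pull(p.value)
#> [1] 0.005327594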

8.1.2 glance()

glance() puts some model summary statistics into a tidy tibble. For example, if we run

glance(coffee_mod)
#> # A tibble: 1 × 12
#>   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#>       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
#> 1     0.889         0.889  1.17     2137.       0     5 -2105. 4224. 4260.
#> # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

you should notice a lot of statistics that you are familiar with from STAT 213, including r.squared, adj.r.squared, sigma (the residual standard error), statistic (the overall F-statistic), AIC, and BIC. We are going to focus mainly on coefficient interpretation in this course, so there is no need to worry if you do not remember exactly what, for example, BIC is.
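Because glance() returns a one-row tibble per model, stacking those rows gives a quick model comparison table. Here is a sketch, where coffee_small is a reduced model fit only for illustration:

# A sketch: fit a smaller model and stack the one-row glance()
# summaries to compare fit statistics side by side.
coffee_small <- lm(total_cup_points ~ flavor + sweetness,
                   data = coffee_df)
bind_rows(glance(coffee_mod), glance(coffee_small)) |>
  select(adj.r.squared, AIC, BIC)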

8.1.3 augment()

augment() is my personal favourite of the three. The function returns a tibble that contains all of the variables used to fit the model, appended with commonly used diagnostic statistics like the fitted values (.fitted), Cook's distance (.cooksd), leverage (.hat), and residuals (.resid).

augment(coffee_mod)
#> # A tibble: 1,339 × 12
#>   total_cup_points species aroma flavor sweetness moisture .fitted  .resid
#>              <dbl> <chr>   <dbl>  <dbl>     <dbl>    <dbl>   <dbl>   <dbl>
#> 1             90.6 Arabica  8.67   8.83        10     0.12    91.1 -0.537 
#> 2             89.9 Arabica  8.75   8.67        10     0.12    90.5 -0.538 
#> 3             89.8 Arabica  8.42   8.5         10     0       89.2  0.577 
#> 4             89   Arabica  8.17   8.58        10     0.11    88.9  0.114 
#> 5             88.8 Arabica  8.25   8.5         10     0.12    88.6  0.214 
#> 6             88.8 Arabica  8.58   8.42        10     0.11    88.9 -0.0411
#> # ℹ 1,333 more rows
#> # ℹ 4 more variables: .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
#> #   .std.resid <dbl>

Augmenting the data set makes it really easy to do things like:

  • filter() the data set to examine values with high Cook's distance that might be influential
augment_df <- augment(coffee_mod)
augment_df |> filter(.cooksd > 1)
#> # A tibble: 1 × 12
#>   total_cup_points species aroma flavor sweetness moisture .fitted .resid
#>              <dbl> <chr>   <dbl>  <dbl>     <dbl>    <dbl>   <dbl>  <dbl>
#> 1                0 Arabica     0      0         0     0.12    6.81  -6.81
#> # ℹ 4 more variables: .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
#> #   .std.resid <dbl>

We see right away that there is a potentially influential observation with 0 total_cup_points. A histogram of this variable suggests that the 0 is probably a data entry error that can be removed from the data:

ggplot(data = coffee_df, aes(x = total_cup_points)) +
  geom_histogram(bins = 15, fill = "white", colour = "black")

To save time, we won't refit the model here after removing the data entry error.
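If you did want to, though, the removal and refit only takes a couple of lines (a sketch; coffee_clean and coffee_mod_clean are new names):

# A sketch: drop the 0-point rating, then fit the same model
# to the cleaned data.
coffee_clean <- coffee_df |> filter(total_cup_points > 0)
coffee_mod_clean <- lm(total_cup_points ~ species + aroma + flavor +
                         sweetness + moisture,
                       data = coffee_clean)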

We could also find observations with high leverage:

augment_df |> filter(.hat > 0.2)
#> # A tibble: 2 × 12
#>   total_cup_points species aroma flavor sweetness moisture .fitted .resid
#>              <dbl> <chr>   <dbl>  <dbl>     <dbl>    <dbl>   <dbl>  <dbl>
#> 1             59.8 Arabica   7.5   6.67      1.33     0.1    58.4    1.38
#> 2              0   Arabica   0     0         0        0.12    6.81  -6.81
#> # ℹ 4 more variables: .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
#> #   .std.resid <dbl>

or observations that are outliers:

augment_df |> filter(.std.resid > 3 | .std.resid < -3)
#> # A tibble: 25 × 12
#>   total_cup_points species aroma flavor sweetness moisture .fitted .resid
#>              <dbl> <chr>   <dbl>  <dbl>     <dbl>    <dbl>   <dbl>  <dbl>
#> 1             82.8 Arabica  8.08   8.17     10        0.12    86.6  -3.85
#> 2             82.4 Arabica  5.08   7.75     10        0.11    78.6   3.79
#> 3             82.3 Arabica  7.75   8.08      6.67     0.11    78.1   4.27
#> 4             80.7 Arabica  7.67   7.5       6.67     0       75.2   5.51
#> 5             80   Arabica  7.58   7.75     10        0       83.7  -3.71
#> 6             79.9 Arabica  7.83   7.67     10        0       83.8  -3.87
#> # ℹ 19 more rows
#> # ℹ 4 more variables: .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
#> #   .std.resid <dbl>
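Rather than committing to a hard cut-off, we can also arrange() the augmented data to inspect the most extreme observations directly:

# Sort by Cook's distance and keep the five most extreme observations.
augment_df |>
  arrange(desc(.cooksd)) |>
  slice(1:5)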

Finally, we can use our ggplot2 skills to construct plots like a residuals versus fitted values plot (filtering out the outlying observation first):

ggplot(data = augment_df |> filter(.fitted > 25),
       aes(x = .fitted, y = .resid)) +
  geom_point() 
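Adding a horizontal reference line at 0 can make it easier to judge whether the residuals are centred with roughly constant spread:

ggplot(data = augment_df |> filter(.fitted > 25),
       aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2)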

8.1.4 Exercises

Exercise 1. Examine the penguins data set in the palmerpenguins package:

library(palmerpenguins)
penguins
#> # A tibble: 344 × 8
#>   species island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>             <dbl>         <dbl>             <int>       <int>
#> 1 Adelie  Torgers…           39.1          18.7               181        3750
#> 2 Adelie  Torgers…           39.5          17.4               186        3800
#> 3 Adelie  Torgers…           40.3          18                 195        3250
#> 4 Adelie  Torgers…           NA            NA                  NA          NA
#> 5 Adelie  Torgers…           36.7          19.3               193        3450
#> 6 Adelie  Torgers…           39.3          20.6               190        3650
#> # ℹ 338 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>

Fit a linear regression model with body_mass_g as the response variable and species and bill_length_mm as the predictors. Note that penguins with missing values for any of these three variables will be dropped from the analysis.

Exercise 2. Create a table of summary output, including coefficient estimates, standard errors, test statistics, and p-values, using one of the broom functions.

Exercise 3. Use glance() to glance at some of the relevant model statistics.

Exercise 4. Using augment(), create a plot of the residuals vs. the fitted values and evaluate the constant variance assumption.

Exercise 5. Using augment(), check to see if there are any penguins that are influential. Use 0.75 as your cut-off value for Cook’s distance.

8.2 Model Coefficient Interpretation

The goal of this section is to review intercept and slope coefficient interpretations from STAT 213. Our ultimate goal, in the coming weeks, is to use visualization to make our model coefficient interpretation more clear. We will have a particular focus on interpreting model coefficients from relatively complex models (with interaction terms, for example) for an audience with little statistical background.

We will use an accompanying handout and pen-and-paper to review model coefficient interpretation from STAT 213.