14 Predictive Modeling with knn
Goals
- explain why it’s necessary to use training data and test data when building a predictive model.
- describe the k-nearest neighbors (knn) procedure.
- interpret a classification table.
- use knn to predict the levels of a categorical response variable for test data.
The structure of this section will be a bit different from the structure of the previous sections. We will complete some of this material on handouts that we will fill in by hand.
14.1 Introduction to Classification
We will introduce both the knn algorithm and classification in general using a handwritten handout.
14.2 Choosing Predictors and k
We will continue to use the scaled version of the pokemon data set for this handout. This time, we will have 75 pokemon in our training data set and we are only looking at the Steel, Dark, Fire, and Ice types. As discussed in the handout, it is important to scale all of our numeric predictors so that the unit of measurement does not influence the classification results. We can scale all of the numeric variables in a data frame using a combination of across(), where(), and mutate() before we split the data into a training sample and a test sample.
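The scaling step might look like the following (this mirrors the min-max scaling used with the full data set later in the section; it assumes the pokemon data frame has already been read in):

```r
library(tidyverse)

## rescale every numeric column to the [0, 1] range so that no
## predictor dominates the distance calculation purely because
## of its units
pokemon <- pokemon |>
  mutate(across(where(is.numeric), ~ (.x - min(.x)) /
                  (max(.x) - min(.x))))
```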
Next, we split the data into a training sample of 75 pokemon and a test sample.
train_sample <- pokemon |>
slice_sample(n = 75)
test_sample <- anti_join(pokemon, train_sample)
There are many candidate predictors in this data set: HP, Attack, Defense, …, all the way up to base_experience. How should we determine which predictors to include in our model?
Much of this will be trial and error by evaluating different models with a criterion that we will talk about in the next section. However, it is always helpful to explore the data set with graphics to get us to a good starting point. A scatterplot matrix is a useful exploratory tool. The following is a scatterplot matrix with the response variable, Type, and just three candidate predictors, HP, Attack, and Defense, created with the GGally (“g-g-ally”) package.
The columns argument is important: it allows you to specify which columns you want to look at. I prefer putting the response, Type (column 3), in the last slot.
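A sketch of that call with ggpairs() from the GGally package (selecting the columns by name rather than position, which is a minor assumption about the data frame):

```r
library(GGally)

## scatterplot matrix of three candidate predictors plus the
## response; Type goes last so the final row and column show
## each predictor broken down by type
train_sample |>
  ggpairs(columns = c("HP", "Attack", "Defense", "Type"))
```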
We can examine this to see which variables seem to have a relationship with Type. Where would we want to look for this?
What’s given on the diagonal of the scatterplot matrix?
Which variables might we want to include as predictors in a knn model?
Exercise 1. Construct another scatterplot matrix with Type and a different set of predictors. Which predictors look like they might be useful to include in a knn model to predict Type?
After we decide on an initial set of predictors to include, we’ll use the class package to fit a knn model in R. For our first model, let’s just use HP, Attack, Defense, and Speed as predictors. The class library can fit knn models with a knn() function but requires the training and test data sets to have only the predictors that we want to use to fit the model. The knn() function also requires that the response variable, Type, be given as a vector to the cl argument.
## install.packages("class")
library(class)
## create a data frame that only has the predictors
## that we will use
train_small <- train_sample |> select(HP, Attack, Defense, Speed)
test_small <- test_sample |> select(HP, Attack, Defense, Speed)
## put our response variable into a vector
train_cat <- train_sample$Type
test_cat <- test_sample$Type
Now that the data has been prepared for the knn() function in the class library, we fit the model with 9 nearest neighbors. The arguments to knn() are

- train, a data set with the training data that contains only the predictors we want to use (and not other predictors or the response).
- test, a data set with the test data that contains only the predictors we want to use (and not other predictors or the response).
- cl, a vector of the response variable for the training data.
- k, the number of nearest neighbors.
## fit the knn model with 9 nearest neighbors
knn_mod <- knn(train = train_small, test = test_small,
cl = train_cat, k = 9)
knn_mod
#> [1] Fire Fire Fire Dark Ice Fire Steel Ice Fire Fire Fire Fire
#> [13] Steel Dark Steel Dark Ice Dark Steel Fire Dark Ice Fire Fire
#> [25] Fire Fire Fire Fire Dark Fire Fire Fire Fire Fire Fire Fire
#> [37] Fire Fire Fire Dark Fire Ice Dark Steel Fire
#> Levels: Dark Fire Ice Steel
The output of knn_mod gives the predicted categories for the test sample. So, the first pokemon in the test sample is predicted to be Fire type, the second is predicted to be Fire type, etc.
14.3 Evaluating a Predictive Model
But, how well did our model classify pokemon into Types? We still need a metric to evaluate models with different predictors. One definition of a “good” model in the classification context is a model that has a high proportion of correct predictions in the test data set. This should make some intuitive sense, as we would hope that a “good” model correctly classifies most Dark pokemon as Dark, most Fire pokemon as Fire, etc.
In order to examine the performance of a particular model, we’ll create a classification table that shows the results of the model’s classification on observations in the test data set. An equivalent name for the classification table is a confusion matrix.
We can compare the predictions from the knn model with the actual pokemon Types in the test sample with table(), which makes the classification table:
table(knn_mod, test_cat)
#> test_cat
#> knn_mod Dark Fire Ice Steel
#> Dark 1 3 1 3
#> Fire 9 11 4 3
#> Ice 2 1 1 1
#> Steel 0 1 1 3
The columns of the classification table give the actual Pokemon types in the test data while the rows give the predicted types from our knn model.
Exercise 2. Interpret the value of 11 in the classification table above.
Exercise 3. Interpret the value of 3 in the column with Fire and the row with Dark.
Exercise 4. Interpret the value of 0 in the bottom-left of the classification table above.
One common metric used to assess overall model performance is the model’s classification rate, which is computed as the number of correct classifications divided by the total number of observations in the test data set.
Exercise 5. Compute the classification rate “by hand” (that is, by using R as a calculator).
Code to automatically obtain the classification rate from a confusion matrix is
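This mirrors the computation used later in the section with the knn_mod predictions and the test_cat vector:

```r
## build the classification table, then divide the number of
## correct predictions (the diagonal) by the total number of
## observations in the test data
tab <- table(knn_mod, test_cat)
sum(diag(tab)) / sum(tab)
```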
What does diag() seem to do in the code above?
Exercise 6. Change the predictors used or change k to improve the classification rate of the model with k = 9 and Attack, Defense, HP, and Speed as predictors.
Exercise 7. A baseline classification rate to compare to is a model that just classifies everything in the test data set as the most common Type in the training data set. In this case, what would the “baseline” classification rate be?
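After working the exercise by hand, one way to sketch that baseline calculation in code (a hedged illustration using the train_cat and test_cat vectors defined earlier, not the only approach):

```r
## find the most common Type in the training data, then compute
## the proportion of test pokemon that would be classified
## correctly if we predicted that single Type for everything
most_common <- names(which.max(table(train_cat)))
mean(test_cat == most_common)
```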
Exercise 8. We will choose \(k\), the number of neighbors considered, using a bit of trial and error, but we will also automate the process by writing a for loop to loop through different values of \(k\). However, we should discuss the relative advantages of smaller and larger k values. Which value is “best” is entirely dependent on the data at hand! What are some advantages for making k smaller? What are some advantages for making k larger?
14.4 Practice
14.4.1 Class Exercises
Examine the following code that fits a knn model using the pokemon data set with \(k\) set to \(9\). For this example, we are using the full pokemon data set (with all Types), so we might expect our classification rate to be a bit lower.
library(tidyverse)
pokemon <- read_csv(here::here("data/pokemon_full.csv"))
set.seed(1119)
## scale the quantitative predictors
pokemon_scaled <- pokemon |>
mutate(across(where(is.numeric), ~ (.x - min(.x)) /
(max(.x) - min(.x))))
train_sample <- pokemon_scaled |>
slice_sample(n = 550)
test_sample <- anti_join(pokemon_scaled, train_sample)
library(class)
train_pokemon <- train_sample |> select(HP, Attack, Defense, Speed,
SpAtk, SpDef, height, weight)
test_pokemon <- test_sample |> select(HP, Attack, Defense, Speed,
SpAtk, SpDef, height, weight)
## put our response variable into a vector
train_cat <- train_sample$Type
test_cat <- test_sample$Type
knn_mod <- knn(train = train_pokemon, test = test_pokemon,
cl = train_cat, k = 9)
knn_mod
tab <- table(knn_mod, test_cat)
sum(diag(tab)) / sum(tab)
If we want to automate generating a classification rate for a knn model with particular predictors, we have two major choices: write a function and map() across different values of \(k\) or loop through different values of \(k\) with a for loop.
Class Exercise 1. First, we will take a functional programming approach, by writing a function and “mapping” different values of \(k\) through that function. The following code writes a function called get_class_rate that has just a single argument: k_val.

- Run the code and then test the function by running get_class_rate(k_val = 10), which should return the classification rate using 10 nearest neighbors.
- Together, we will define a vector of k values that we want to map through the function and then write code to perform the mapping using the map() function from the purrr package.
- Put the classification rates, along with the vector of k values, into a tibble().
- Make a line plot that shows how the classification rate changes for different values of k.
The code below gives an equivalent way to map or loop through values of \(k\) using a for loop. If you have taken CS 140, you should be able to see a lot of similarities between how loops are defined in R and how they are defined in Python.
## define an empty vector to store results
class_rate <- double()
## define values of k that we want to loop through
k_vec <- 1:70
for (i in seq_along(k_vec)) {
  knn_mod <- knn(train = train_pokemon, test = test_pokemon,
                 cl = train_cat, k = k_vec[i])
  tab <- table(knn_mod, test_cat)
  ## for the ith value of k_vec, store the classification rate as
  ## the ith value of class_rate
  class_rate[i] <- sum(diag(tab)) / sum(tab)
}
class_rate
14.4.2 Your Turn
There will be no your turn exercises for this section. Instead, you will apply some of these concepts in your third project.