14 Predictive Modeling with knn
Goals
- explain why it’s necessary to use training data and test data when building a predictive model.
- describe the k-nearest neighbors (knn) procedure.
- interpret a classification table.
- use knn to predict the levels of a categorical response variable for test data.
The structure of this section will be a bit different from the structure of the previous sections. We will complete some of this material on handouts that we will fill in by hand.
14.1 Introduction to Classification
We will introduce both the knn algorithm and classification in general using a handwritten handout.
14.2 Choosing Predictors and k
We will continue to use the scaled version of the pokemon data set for this handout. This time, we will have 75 pokemon in our training data set and we are only looking at the Steel, Dark, Fire, and Ice types. As discussed in the handout, it is important to scale all of our numeric predictors so that the unit of measurement does not influence the classification results. We can scale all of the numeric variables in a data frame using a combination of across(), where(), and mutate() before we split the data into a training sample and a test sample.
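The scaling step might look like the following (this mirrors the min-max scaling used with the full data set later in the section; it assumes the pokemon data frame has already been read in):

```r
library(tidyverse)

## rescale every numeric column to the [0, 1] range so that no
## predictor dominates the distance calculation purely because
## of its units
pokemon <- pokemon |>
  mutate(across(where(is.numeric), ~ (.x - min(.x)) /
                  (max(.x) - min(.x))))
```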
Next, we split the data into a training sample of 75 pokemon and a test sample.
train_sample <- pokemon |>
slice_sample(n = 75)
test_sample <- anti_join(pokemon, train_sample)
There are many candidate predictors in this data set: HP, Attack, Defense, …, all the way up to base_experience. How should we determine which predictors to include in our model?
Much of this will be trial and error by evaluating different models with a criterion that we will talk about in the next section. However, it is always helpful to explore the data set with graphics to get us to a good starting point. A scatterplot matrix is a useful exploratory tool. The following is a scatterplot matrix with the response variable, Type, and just three candidate predictors, HP, Attack, and Defense, created with the GGally (“g-g-ally”) package.
The columns argument is important: it allows you to specify which columns you want to look at. I prefer putting the response, Type (column 3), in the last slot.
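A sketch of that call with ggpairs() from the GGally package (selecting the columns by name rather than position, which is a minor assumption about the data frame):

```r
library(GGally)

## scatterplot matrix of three candidate predictors plus the
## response; Type goes last so the final row and column show
## each predictor broken down by type
train_sample |>
  ggpairs(columns = c("HP", "Attack", "Defense", "Type"))
```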
We can examine this to see which variables seem to have a relationship with Type. Where would we want to look for this?
What’s given on the diagonal of the scatterplot matrix?
Which variables might we want to include as predictors in a knn model?
Exercise 1. Construct another scatterplot matrix with Type and a different set of predictors. Which predictors look like they might be useful to include in a knn model to predict Type?
After we decide on an initial set of predictors to include, we’ll use the class package to fit a knn model in R. For our first model, let’s just use HP, Attack, Defense, and Speed as predictors. The class library can fit knn models with a knn() function but requires the training and test data sets to have only the predictors that we want to use to fit the model. The knn() function also requires that the response variable, Type, be given as a vector to the cl argument.
## install.packages("class")
library(class)
## create a data frame that only has the predictors
## that we will use
train_small <- train_sample |> select(HP, Attack, Defense, Speed)
test_small <- test_sample |> select(HP, Attack, Defense, Speed)
## put our response variable into a vector
train_cat <- train_sample$Type
test_cat <- test_sample$Type
Now that the data has been prepared for the knn() function in the class library, we fit the model with 9 nearest neighbors. The arguments to knn() are

- train, a data set with the training data that contains only the predictors we want to use (and not other predictors or the response).
- test, a data set with the test data that contains only the predictors we want to use (and not other predictors or the response).
- cl, a vector of the response variable for the training data.
- k, the number of nearest neighbors.
## fit the knn model with 9 nearest neighbors
knn_mod <- knn(train = train_small, test = test_small,
cl = train_cat, k = 9)
knn_mod
#> [1] Fire Fire Fire Dark Ice Fire Steel Ice Fire Fire Fire Fire
#> [13] Steel Dark Steel Dark Ice Dark Steel Fire Dark Ice Fire Fire
#> [25] Fire Fire Fire Fire Dark Fire Fire Fire Fire Fire Fire Fire
#> [37] Fire Fire Fire Dark Fire Ice Dark Steel Fire
#> Levels: Dark Fire Ice Steel
The output of knn_mod gives the predicted categories for the test sample. So, the first pokemon in the test sample is predicted to be Fire type, the second is predicted to be Fire type, etc.
14.3 Evaluating a Predictive Model
But, how well did our model classify pokemon into Types? We still need a metric to evaluate models with different predictors. One definition of a “good” model in the classification context is a model that has a high proportion of correct predictions in the test data set. This should make some intuitive sense, as we would hope that a “good” model correctly classifies most Dark pokemon as Dark, most Fire pokemon as Fire, etc.
In order to examine the performance of a particular model, we’ll create a classification table that shows the results of the model’s classification on observations in the test data set. An equivalent name for the classification table is a confusion matrix.
We can compare the predictions from the knn model with the actual pokemon Types in the test sample with table(), which makes the classification table:
table(knn_mod, test_cat)
#> test_cat
#> knn_mod Dark Fire Ice Steel
#> Dark 1 3 1 3
#> Fire 9 11 4 3
#> Ice 2 1 1 1
#> Steel 0 1 1 3
The columns of the classification table give the actual Pokemon types in the test data while the rows give the predicted types from our knn model.
Exercise 2. Interpret the value of 11 in the classification table above.
Exercise 3. Interpret the value of 3 in the column with Fire and the row with Dark.
Exercise 4. Interpret the value of 0 in the bottom-left of the classification table above.
One common metric used to assess overall model performance is the model’s classification rate, which is computed as the number of correct classifications divided by the total number of observations in the test data set.
Exercise 5. Compute the classification rate “by hand” (that is, by using R as a calculator).
Code to automatically obtain the classification rate from a confusion matrix is
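This mirrors the computation used later in the section with the knn_mod predictions and the test_cat vector:

```r
## build the classification table, then divide the number of
## correct predictions (the diagonal) by the total number of
## observations in the test data
tab <- table(knn_mod, test_cat)
sum(diag(tab)) / sum(tab)
```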
What does diag() seem to do in the code above?
Exercise 6. Change the predictors used or change k to improve the classification rate of the model with k = 9 and Attack, Defense, HP, and Speed as predictors.
Exercise 7. A baseline classification rate to compare to is a model that just classifies everything in the test data set as the most common Type in the training data set. In this case, what would the “baseline” classification rate be?
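After working the exercise by hand, one way to sketch that baseline calculation in code (a hedged illustration using the train_cat and test_cat vectors defined earlier, not the only approach):

```r
## find the most common Type in the training data, then compute
## the proportion of test pokemon that would be classified
## correctly if we predicted that single Type for everything
most_common <- names(which.max(table(train_cat)))
mean(test_cat == most_common)
```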
Exercise 8. We will choose \(k\), the number of neighbors considered, using a bit of trial and error, but we will also automate the process by writing a for loop to loop through different values of \(k\). However, we should discuss the relative advantages of smaller and larger k values. Which value is “best” is entirely dependent on the data at hand! What are some advantages for making k smaller? What are some advantages for making k larger?
14.4 Practice
14.4.1 Class Exercises
Examine the following code that fits a knn model using the pokemon data set with \(k\) set to \(9\). For this example, we are using the full pokemon data set (with all Types), so we might expect our classification rate to be a bit lower.
library(tidyverse)
pokemon <- read_csv(here::here("data/pokemon_full.csv"))
set.seed(1119)
## scale the quantitative predictors
pokemon_scaled <- pokemon |>
mutate(across(where(is.numeric), ~ (.x - min(.x)) /
(max(.x) - min(.x))))
train_sample <- pokemon_scaled |>
slice_sample(n = 550)
test_sample <- anti_join(pokemon_scaled, train_sample)
library(class)
train_pokemon <- train_sample |> select(HP, Attack, Defense, Speed,
SpAtk, SpDef, height, weight)
test_pokemon <- test_sample |> select(HP, Attack, Defense, Speed,
SpAtk, SpDef, height, weight)
## put our response variable into a vector
train_cat <- train_sample$Type
test_cat <- test_sample$Type
knn_mod <- knn(train = train_pokemon, test = test_pokemon,
cl = train_cat, k = 9)
knn_mod
tab <- table(knn_mod, test_cat)
sum(diag(tab)) / sum(tab)
If we want to automate generating a classification rate for a knn model with particular predictors, we have two major choices: write a function and map() across different values of \(k\) or loop through different values of \(k\) with a for loop.
Class Exercise 1. First, we will take a functional programming approach, by writing a function and “mapping” different values of \(k\) through that function. The following code writes a function called get_class_rate that has just a single argument: k_val.

- Run the code and then test the function by running get_class_rate(k_val = 10), which should return the classification rate using 10 nearest neighbors.
- Together, we will define a vector of k values that we want to map through the function and then write code to perform the mapping using the map() function from the purrr package.
- Put the classification rates, along with the vector of k values, into a tibble().
- Make a line plot that shows how the classification rate changes for different values of k.
The code below gives an equivalent way to map or loop through values of \(k\) using a for loop. If you have taken CS 140, you should be able to see a lot of similarities between how loops are defined in R and how they are defined in Python.
## define an empty vector to store results
class_rate <- double()
## define values of k that we want to loop through
k_vec <- 1:70
for (i in seq_along(k_vec)) {
  knn_mod <- knn(train = train_pokemon, test = test_pokemon,
                 cl = train_cat, k = k_vec[i])
  tab <- table(knn_mod, test_cat)
  ## for the ith value of k_vec, store the classification rate as
  ## the ith value of class_rate
  class_rate[i] <- sum(diag(tab)) / sum(tab)
}
class_rate
14.4.2 Your Turn
There will be no your turn exercises for this section. Instead, you will apply some of these concepts in your third project.