3  Wrangling with dplyr

Goals:

3.1 filter(),slice(), select(), arrange(), and Piping

Throughout this chapter, we will use the babynames data set in the babynames R package. To begin, install the babynames package by typing install.packages("babynames") in your bottom-left console window, and then load the babynames package in with

Read about the data set by typing ?babynames in your bottom-left window of R Studio. We see that this data set contains baby name data provided by the SSA in the United States dating back to 1880:

babynames
#> # A tibble: 1,924,665 × 5
#>    year sex   name          n   prop
#>   <dbl> <chr> <chr>     <int>  <dbl>
#> 1  1880 F     Mary       7065 0.0724
#> 2  1880 F     Anna       2604 0.0267
#> 3  1880 F     Emma       2003 0.0205
#> 4  1880 F     Elizabeth  1939 0.0199
#> 5  1880 F     Minnie     1746 0.0179
#> 6  1880 F     Margaret   1578 0.0162
#> # ℹ 1,924,659 more rows

There are many interesting and informative plots that we could make with this data set, but most require some data wrangling first. This chapter will provide the foundation for such wrangling skills.

Before we begin, we also need to load in the tidyverse package so that we can use the dplyr functions.

3.1.1 filter() and slice(): Choosing Rows

There are a lot of commonly used dplyr wrangling functions, but we need to start somewhere! We will begin with two functions used to choose rows to keep in the data frame: filter(), which keeps rows based on a condition that is provided, and slice(), which keeps rows based on row numbers that are provided.

Important

filter() is a way to keep rows by specifying a condition related to one or more of the variables in the data set.

We can keep rows based on a categorical variable or a quantitative variable or a combination of any number of categorical and quantitative variables. R uses the following symbols to make comparisons:

  • < and <= for less than and less than or equal to, respectively
  • > and >= for greater than and greater than or equal to, respectively
  • == for equal to (careful: equal to is a double equal sign ==)
  • != for not equal to (in general, ! denotes “not”)

The easiest way to see how filter works is with some examples. The following line of code filter()s the babynames data set based on a condition. See if you can guess what the following statement does before running the code.

babynames |> filter(name == "Matthew")

We’re seeing our first of many, many, many, many, many, many, many instances of using |> to “pipe.” Piping is a really convenient, easy-to-read way to build a sequence of commands. How you can read the above code is:

  1. Take the babynames data frame object and with babynames, and then

  2. perform a filter() step to keep only the rows where name is Matthew.

Note

The “and then” phrase is the most commonly used phrase to explain how to read the pipe |> in a line of code.

Examine the following four filter() statements and see if you can guess what each one is doing before running the code.

babynames |> filter(year >= 2000)
babynames |> filter(sex != "M")
babynames |> filter(prop > 0.05)
babynames |> filter(year == max(year))

Note that some things put in quotes, like "M" while some things aren’t, like 2000. Generally, if we are filtering by a numeric variable, like prop, any value used in the filtering statement does not go in quotes, while if we are filtering by a categorical variable, like sex or name, the value used in the filtering statement does go in quotes.

We can also combine conditions on multiple variables in filter() using Boolean operators. Look at the Venn diagrams in R for Data Science to learn about the various Boolean operators you can use in R: https://r4ds.had.co.nz/transform.html#logical-operators. The Boolean operators can be used in other functions in R as well, as we will see in the next section with mutate().

The following gives some examples. See if you can figure out what each line of code is doing before running it.

babynames |> filter(n > 20000 | prop > 0.05)
babynames |> filter(sex == "F" & name == "Mary")
babynames |> filter(sex == "F" & name == "Mary" & prop > 0.05)
Important

While filter() lets us choose which rows in a data set to keep based on a condition, slice() allows us to specify the row numbers corresponding to rows that we want to keep.

For example, suppose that we only want to keep the first 5 rows of the babynames data:

babynames |> slice(c(1, 2, 3, 4, 5))
#> # A tibble: 5 × 5
#>    year sex   name          n   prop
#>   <dbl> <chr> <chr>     <int>  <dbl>
#> 1  1880 F     Mary       7065 0.0724
#> 2  1880 F     Anna       2604 0.0267
#> 3  1880 F     Emma       2003 0.0205
#> 4  1880 F     Elizabeth  1939 0.0199
#> 5  1880 F     Minnie     1746 0.0179

The c() function is used here to bind the numbers 1, 2, 3, 4, 5 into a vector. c() actually stands for “concatenate.” We can alternatively use slice(1:5), which is shorthand for slice(c(1, 2, 3, 4, 5)). The : is useful for larger numbers of rows that we might want to keep. For example, slice(1:1000) is much easier than slice(c(1, 2, 3, 4, ...., 999, 1000)). While slice() is useful, it is relatively simple. We’ll come back to it again in a few weeks as well when we discuss subsetting in base R.

Exercise 1. Use filter() on the babynames data set to only keep rows with your name. (Note that, the data set has a large variety of names, but, if there were fewer than 5 people born in the United States with your name every year, then your name might not appear in the data set. In that case, just use a different name that interests you!).

Exercise 2. Use filter() to only keep rows in the babynames data set from the year 2000 onward.

Exercise 3. Use filter() to only keep rows in babynames with your name from Exercise 1 and that have a year from 2000 onward.

You will need to use the & Boolean operator to combine your conditions from the previous two exercises.

Exercise 4. Use slice() to keep the first 20 rows of babynames.

3.1.2 arrange(): Ordering Rows

The arrange() function allows us to order rows in the data set using one or more variables. The function is very straightforward: we only need to provide the name of the variable that we want to order by. For example,

babynames |> arrange(n)
#> # A tibble: 1,924,665 × 5
#>    year sex   name          n      prop
#>   <dbl> <chr> <chr>     <int>     <dbl>
#> 1  1880 F     Adelle        5 0.0000512
#> 2  1880 F     Adina         5 0.0000512
#> 3  1880 F     Adrienne      5 0.0000512
#> 4  1880 F     Albertine     5 0.0000512
#> 5  1880 F     Alys          5 0.0000512
#> 6  1880 F     Ana           5 0.0000512
#> # ℹ 1,924,659 more rows

orders the babynames data so that n, the number of babies, goes from smallest to largest. If we want to arrange() by descending order of a variable, we just wrap the name of the variable we are ordering by with desc() for “descending:”

babynames |> arrange(desc(n))
#> # A tibble: 1,924,665 × 5
#>    year sex   name        n   prop
#>   <dbl> <chr> <chr>   <int>  <dbl>
#> 1  1947 F     Linda   99686 0.0548
#> 2  1948 F     Linda   96209 0.0552
#> 3  1947 M     James   94756 0.0510
#> 4  1957 M     Michael 92695 0.0424
#> 5  1947 M     Robert  91642 0.0493
#> 6  1949 F     Linda   91016 0.0518
#> # ℹ 1,924,659 more rows

Exercise 5. Use arrange() to sort the data frame so that it is in descending order by the prop variable. This allows us to examine the names with the highest proportion of babies given that name for that given year.

Exercise 6. Use arrange() on the categorical variable name. What does arrange() seem to do on a categorical variable?

3.1.3 select(): Choosing Columns

We might also be interested in getting rid of some of the columns in a data set. One reason to do this is if there are an overwhelming (30+) columns in a data set, but we know that we just need a few of them. The easiest way to use select() is to just input the names of the columns that you want to keep. For example, if we were only interested in the name and n variables, we could make a data frame with only those variables with:

babynames |> select(name, n)
#> # A tibble: 1,924,665 × 2
#>   name          n
#>   <chr>     <int>
#> 1 Mary       7065
#> 2 Anna       2604
#> 3 Emma       2003
#> 4 Elizabeth  1939
#> 5 Minnie     1746
#> 6 Margaret   1578
#> # ℹ 1,924,659 more rows

We might also want to use select() to get rid of one or more columns. If this is the case, we denote any column you want to get rid of with -. For example, we might want to get rid of the sex column and the prop column with

babynames |> select(-sex, -prop)
#> # A tibble: 1,924,665 × 3
#>    year name          n
#>   <dbl> <chr>     <int>
#> 1  1880 Mary       7065
#> 2  1880 Anna       2604
#> 3  1880 Emma       2003
#> 4  1880 Elizabeth  1939
#> 5  1880 Minnie     1746
#> 6  1880 Margaret   1578
#> # ℹ 1,924,659 more rows

select() comes with many useful helper functions, but these are oftentimes not needed. One of the helper functions that is actually often useful is everything(), which selects all columns in a data frame. This can be used to help re-order columns if you have a particular column that you want to appear first:

babynames |> select(name, everything())
#> # A tibble: 1,924,665 × 5
#>   name       year sex       n   prop
#>   <chr>     <dbl> <chr> <int>  <dbl>
#> 1 Mary       1880 F      7065 0.0724
#> 2 Anna       1880 F      2604 0.0267
#> 3 Emma       1880 F      2003 0.0205
#> 4 Elizabeth  1880 F      1939 0.0199
#> 5 Minnie     1880 F      1746 0.0179
#> 6 Margaret   1880 F      1578 0.0162
#> # ℹ 1,924,659 more rows

The previous line of code puts name as the first variable in the data frame.

Note

For data frames with only a few columns, re-ordering the columns is generally not that useful, but, for data frames with hundreds or thousands of columns, moving variables of interest to the beginning of the data frame helps us view these variables more readily.

Exercise 7. Use select() to keep only the year, name and prop variables in two ways: (1) keep year, name, and prop directly by specifying these variable names in select() and (2) keep only year, name, and prop indirectly by specifying that sex, and n should be dropped from the data frame.

3.1.4 Assigning a Data Frame a Name

Up until now, we have not assigned any of the new data frames we are creating a name. For example, with

babynames |> filter(name == "Matthew")
#> # A tibble: 212 × 5
#>    year sex   name        n     prop
#>   <dbl> <chr> <chr>   <int>    <dbl>
#> 1  1880 M     Matthew   113 0.000954
#> 2  1881 M     Matthew    80 0.000739
#> 3  1882 M     Matthew   109 0.000893
#> 4  1883 M     Matthew    86 0.000765
#> 5  1884 M     Matthew   117 0.000953
#> 6  1885 M     Matthew   111 0.000957
#> # ℹ 206 more rows

we get a printout of the data with only the name Matthew, but, the data frame object is not assigned a name at all. So, we cannot use ggplot() to make a plot of n only for the name Matthew because we do not have a named data set that only has the name Matthew. In other words, R performs the filter(), but the filtered data frame doesn’t get saved to any particular named object. If we want to “save” the new data set for future use (like in a plot command), then we can use the assignment operator, <-:

babynames_matthew <- babynames |> filter(name == "Matthew")

In the previous chunk, we assign the data frame with only Matthew names to the name babynames_matthew:

babynames_matthew
#> # A tibble: 212 × 5
#>    year sex   name        n     prop
#>   <dbl> <chr> <chr>   <int>    <dbl>
#> 1  1880 M     Matthew   113 0.000954
#> 2  1881 M     Matthew    80 0.000739
#> 3  1882 M     Matthew   109 0.000893
#> 4  1883 M     Matthew    86 0.000765
#> 5  1884 M     Matthew   117 0.000953
#> 6  1885 M     Matthew   111 0.000957
#> # ℹ 206 more rows
Note

We can really use almost any name we would like but it’s more helpful to use a name that makes sense (babynames_matthew) than a name that is meaningless.

An example of giving a data frame a meaningless name would be:

take_it_kronk_feel_the_power <- babynames |> filter(name == "Matthew")

We can use the data frame take_it_kronk_feel_the_power in the same way as babynames_matthew, but take_it_kronk_feel_the_power is not as easy to remember and doesn’t make any sense in this context.

Exercise 8. Name your data frame that you created from Exercise 1. Then, type in the name you assigned to the data frame to verify that you can see some printed out data.

3.1.5 More about the Pipe

We are jumping straight into using piping, but we do want to have an appreciation on how terrible life would be without it. What piping does is make whatever is given before the |> pipe the first argument of whatever function follows the |>. So, for a toy data frame named df with variable xvar,

df |> filter(xvar <= 3)

is equivalent to

filter(df, xvar <= 3)

It might also help to use an analogy when thinking about piping. Consider the Ke$ha’s morning routine in the opening of the song Tik Tok. If we were to write her morning routine in terms of piping,

kesha |> wake_up(time = "morning", feels_like = "P-Diddy") |>
  grab(glasses) |>
  brush(teeth, item = "jack", unit = "bottle") |> ....

Kesha first wakes up in the morning, and then the Kesha that has woken up grabs her glasses, and then the Kesha who has woken up and has grabbed her glasses proceeds to brush her teeth, etc.

The pipe operator |> is loaded automatically with R. We will heavily use the pipe throughout the entire semester so it’s worth it to delve a little deeper into what it is here. We will use the fitness data to further explore the pipe. Read in the data with

fitness_df <- read_csv("https://raw.githubusercontent.com/highamm/ds234_quarto/main/data_online/higham_fitness_clean.csv",
                       col_types = list(stepgoal = col_factor()))
fitness_df
#> # A tibble: 1,722 × 9
#>   Start      active_cals distance flights  steps month weekday dayofyear
#>   <date>           <dbl>    <dbl>   <dbl>  <dbl> <chr> <chr>       <dbl>
#> 1 2018-11-28        57.8    0.930       0  1885. Nov   Wed           332
#> 2 2018-11-29       509.     4.64       18  8953. Nov   Thu           333
#> 3 2018-11-30       599.     6.05       12 11665  Nov   Fri           334
#> 4 2018-12-01       661.     6.80        6 12117  Dec   Sat           335
#> 5 2018-12-02       527.     4.61        1  8925. Dec   Sun           336
#> 6 2018-12-03       550.     3.96        2  7205  Dec   Mon           337
#> # ℹ 1,716 more rows
#> # ℹ 1 more variable: stepgoal <fct>

Thus far, we have used a single pipe |> to do one task, but, we can string together multiple dplyr functions with multiple |> statements. For example, suppose that we wanted to keep only the weekdays that were Sat and Sun, and then, order the data set so that it was in descending order of steps taken:

fitness_df |> filter(weekday == "Sat" | weekday == "Sun") |>
  arrange(desc(steps))
#> # A tibble: 492 × 9
#>   Start      active_cals distance flights  steps month weekday dayofyear
#>   <date>           <dbl>    <dbl>   <dbl>  <dbl> <chr> <chr>       <dbl>
#> 1 2023-07-15       2291.     18.0     193 32589. Jul   Sat           196
#> 2 2019-08-11       1385.     15.8     173 29480. Aug   Sun           223
#> 3 2019-04-14       1767.     17.9       1 25578  Apr   Sun           104
#> 4 2018-12-15       1091.     11.2      27 21224. Dec   Sat           349
#> 5 2022-05-29       1291.     11.8     103 20853  May   Sun           149
#> 6 2019-02-10       1392.     12.8       5 20700. Feb   Sun            41
#> # ℹ 486 more rows
#> # ℹ 1 more variable: stepgoal <fct>

This reads: “take fitness_df and then filter() it to keep only Saturdays and Sundays and then sort it in descending order of steps.”

Important

Consecutive pipes build on each other: we can slowly build out what the pipe is doing step-by-step. Running code “pipe by pipe” can be a very helpful way to understand what each consecutive piping statement is doing.

The code

fitness_df |> filter(weekday == "Sat" | weekday == "Sun")
#> # A tibble: 492 × 9
#>   Start      active_cals distance flights  steps month weekday dayofyear
#>   <date>           <dbl>    <dbl>   <dbl>  <dbl> <chr> <chr>       <dbl>
#> 1 2018-12-01        661.     6.80       6 12117  Dec   Sat           335
#> 2 2018-12-02        527.     4.61       1  8925. Dec   Sun           336
#> 3 2018-12-08        775.     6.46       0  8834. Dec   Sat           342
#> 4 2018-12-09        341.     2.81       0  5265  Dec   Sun           343
#> 5 2018-12-15       1091.    11.2       27 21224. Dec   Sat           349
#> 6 2018-12-16       1029.     8.84      79 16376  Dec   Sun           350
#> # ℹ 486 more rows
#> # ℹ 1 more variable: stepgoal <fct>

is equivalent to:

filter(fitness_df, weekday == "Sat" | weekday == "Sun")
#> # A tibble: 492 × 9
#>   Start      active_cals distance flights  steps month weekday dayofyear
#>   <date>           <dbl>    <dbl>   <dbl>  <dbl> <chr> <chr>       <dbl>
#> 1 2018-12-01        661.     6.80       6 12117  Dec   Sat           335
#> 2 2018-12-02        527.     4.61       1  8925. Dec   Sun           336
#> 3 2018-12-08        775.     6.46       0  8834. Dec   Sat           342
#> 4 2018-12-09        341.     2.81       0  5265  Dec   Sun           343
#> 5 2018-12-15       1091.    11.2       27 21224. Dec   Sat           349
#> 6 2018-12-16       1029.     8.84      79 16376  Dec   Sun           350
#> # ℹ 486 more rows
#> # ℹ 1 more variable: stepgoal <fct>

Then,

filter(fitness_df, weekday == "Sat" | weekday == "Sun") |>
  arrange(desc(steps))
#> # A tibble: 492 × 9
#>   Start      active_cals distance flights  steps month weekday dayofyear
#>   <date>           <dbl>    <dbl>   <dbl>  <dbl> <chr> <chr>       <dbl>
#> 1 2023-07-15       2291.     18.0     193 32589. Jul   Sat           196
#> 2 2019-08-11       1385.     15.8     173 29480. Aug   Sun           223
#> 3 2019-04-14       1767.     17.9       1 25578  Apr   Sun           104
#> 4 2018-12-15       1091.     11.2      27 21224. Dec   Sat           349
#> 5 2022-05-29       1291.     11.8     103 20853  May   Sun           149
#> 6 2019-02-10       1392.     12.8       5 20700. Feb   Sun            41
#> # ℹ 486 more rows
#> # ℹ 1 more variable: stepgoal <fct>

is equivalent to:

arrange(filter(fitness_df, weekday == "Sat" | weekday == "Sun"),
        desc(steps))
#> # A tibble: 492 × 9
#>   Start      active_cals distance flights  steps month weekday dayofyear
#>   <date>           <dbl>    <dbl>   <dbl>  <dbl> <chr> <chr>       <dbl>
#> 1 2023-07-15       2291.     18.0     193 32589. Jul   Sat           196
#> 2 2019-08-11       1385.     15.8     173 29480. Aug   Sun           223
#> 3 2019-04-14       1767.     17.9       1 25578  Apr   Sun           104
#> 4 2018-12-15       1091.     11.2      27 21224. Dec   Sat           349
#> 5 2022-05-29       1291.     11.8     103 20853  May   Sun           149
#> 6 2019-02-10       1392.     12.8       5 20700. Feb   Sun            41
#> # ℹ 486 more rows
#> # ℹ 1 more variable: stepgoal <fct>

So, why use the pipe? Compare the code the uses the pipe operator to the code that doesn’t. Which is easier to read? Which do you think is easier to write? The example shows that, for our purposes, the pipe is most useful in aiding the readability of our code. It’s a lot easier to see what’s happening in the code chunk with the pipes than it is in the previous code chunk without the pipe because, with the pipe, we can read the code from left to right and top to bottom. Without the pipe, we need to read the code from the “inside to the outside”, which is much more challenging.

There are many situations in which using the |> would not be appropriate. To use the pipe, what precedes the |> must be the first argument in the function following the |>. For tidyverse functions that we use throughout the course, this condition is almost always true because these functions were designed with piping in mind. But, for non-tidyverse functions, this may not be the case.

For example, if you have taken STAT 213, you’ve used lm() to fit many different types of linear models. If you haven’t taken STAT 213, lm(response ~ explanatory, data = name_of_data_set) stands for “linear model” and can be used to fit the simple linear regression model that you learned about in STAT 113. You might expect something like this to fit a linear model using fitness_df with active_cals as the response and steps as the predictor:

fitness_df |> lm(active_cals ~ steps)
#> Error in as.data.frame.default(data): cannot coerce class '"formula"' to a data.frame

But it throws us an error. Typing in ?lm reveals that its first argument is a formula to fit the model, not a data set. So the function is trying to run

lm(fitness_df, active_cals ~ steps)
#> Error in as.data.frame.default(data): cannot coerce class '"formula"' to a data.frame

which doesn’t work because the arguments to the function are mixed up (the formula should appear first and the data set should appear second).

Note

The pipe operator |> is relatively new. Previously, the primary pipe operator used was %>% and came from the magrittr package. For almost all cases, the two operators are equivalent. However, when scanning the Internet for help with code, you will probably see %>% used in many of people’s responses on sites like StackOverflow.

Exercise 9. This “More About the Pipe” section will make more sense as we proceed through the course. For now, we just want to be able to understand that something like df |> slice(1:6) is equivalent to slice(df, 1:6). Practice by converting the following code to use the pipe:

filter(fitness_df, active_cals > 50)
#> # A tibble: 1,690 × 9
#>   Start      active_cals distance flights  steps month weekday dayofyear
#>   <date>           <dbl>    <dbl>   <dbl>  <dbl> <chr> <chr>       <dbl>
#> 1 2018-11-28        57.8    0.930       0  1885. Nov   Wed           332
#> 2 2018-11-29       509.     4.64       18  8953. Nov   Thu           333
#> 3 2018-11-30       599.     6.05       12 11665  Nov   Fri           334
#> 4 2018-12-01       661.     6.80        6 12117  Dec   Sat           335
#> 5 2018-12-02       527.     4.61        1  8925. Dec   Sun           336
#> 6 2018-12-03       550.     3.96        2  7205  Dec   Mon           337
#> # ℹ 1,684 more rows
#> # ℹ 1 more variable: stepgoal <fct>

3.2 Practice

In general, practice exercises will be split between exercises that will be done together as a class and exercises that you will do in groups or on your own. The purpose of the class exercises is to give some guidance on how we might think logically through some of the code and the results. The purpose of the group exercises and the on your own exercises is so that you have a chance to practice what you’ve learned with your table or on your own.

3.2.1 Class Exercises

For many of these class exercises, we will start practicing stringing together multiple piping statements. We will also practice more with ggplot() by making plots of some of the data sets we are creating.

Class Exercise 1. In the babynames data set, use filter(), arrange(), and slice() to print the 10 most popular Male babynames in 2017.

Class Exercise 2. Name the data frame you created in the previous exercise and use the data frame to make a bar plot of the 10 most popular Male babynames in 2017, along with the number of babies with each name in that year.

Class Exercise 3. In the babynames data set, keep only the rows with your name (or, another name that interests you) and one sex (either "M" or "F").

Class Exercise 4. Name the new data set you made in the previous exercise and then construct a line plot that looks at either the n or prop of your chosen name through year.

Class Exercise 5. Explain why the following code gives a warning message and returns NA. Use the order of Arguments in ?mean in your explanation.

fitness_df |> mean(distance)
#> [1] NA

3.2.2 Your Turn

Your Turn 1. Get a little more practice with using conditions in filter() by completing the following:

  1. Create a data set with all years except the year 1945.

  2. Create a data set that only has rows where n is between 400 and 5000.

  3. Create a data set with rows that are either F (female) Monica names or F (female) Hilary names.

The syntax filter(x_var > 20 & < 30) is not correct. The variable that we are filtering with needs to go in each conditional statement: filter(x_var > 20 & x_var < 30)

Your Turn 2. Name your data set is part (c) of the previous exercise. Then, use the data set to make a line plot comparing the popularity of the Monica and Hilary names through time.

Your Turn 3. Choose 5 names paired with 5 sexes that interest you and create a new data set that only has data on those 5 name/sex combinations. Then, make a line plot showing the popularity of these 5 names over time.

Your Turn 4. Choose a year and a sex that interests you and create a data set that only contain observations from that year and sex. Then, make a bar plot that shows the top 10 names from that year and sex, showing the prop of babies born with each of those 10 names in that year.

3.3 mutate(), group_by(), summarise(), and Missing Values

Goals:

  • Use the mutate(), if_else(), and case_when() functions to create new variables.
  • Use group_by() and summarise() to create useful summaries of a data set. * Explain what dplyr functions do with missing values.
  • Combine the above goals with plotting to explore a data set on SLU majors.

Throughout this section, we will use a data set that contains observations for each of SLU’s majors, with three variables per major:

  • Major, the name of the major.
  • nfemales, the number of female graduates in that major from 2017 - 2021.
  • nmales, the number of male graduates in that major from 2017 - 2021.

Begin by reading in the data set with

library(tidyverse)
slumajors_df <- read_csv("https://raw.githubusercontent.com/highamm/ds234_quarto/main/data_online/SLU_Majors_17_21.csv")
slumajors_df
#> # A tibble: 30 × 3
#>   Major                        nfemales nmales
#>   <chr>                           <dbl>  <dbl>
#> 1 Anthropology                       35     13
#> 2 Art & Art History                  62     11
#> 3 Biochemistry                       15      6
#> 4 Biology                           152     58
#> 5 Business in the Liberal Arts      192    301
#> 6 Chemistry                          28     20
#> # ℹ 24 more rows

3.3.1 mutate(): Create Variables

Sometimes, we will want to create a new variable that’s not in the data set, oftentimes using if_else(), case_when(), or basic algebraic operations on one or more of the columns already present in the data set.

R understands the following symbols:

  • + for addition, - for subtraction
  • * for multiplication, / for division
  • ^ for raising something to a power (3 ^ 2 is equal to 9)

R also does the same order of operations as usual: parentheses, then exponents, then multiplication and division, then addition and subtraction.

For example, suppose that we want to create a variable in slumajors_df that has the total number of students graduating in each major. We can do this with mutate():

slumajors_df |> mutate(ntotal = nfemales + nmales)
#> # A tibble: 30 × 4
#>   Major                        nfemales nmales ntotal
#>   <chr>                           <dbl>  <dbl>  <dbl>
#> 1 Anthropology                       35     13     48
#> 2 Art & Art History                  62     11     73
#> 3 Biochemistry                       15      6     21
#> 4 Biology                           152     58    210
#> 5 Business in the Liberal Arts      192    301    493
#> 6 Chemistry                          28     20     48
#> # ℹ 24 more rows

The |> operator reads as “and then”, as in “take slumajors_df and then mutate() it.”

Piping is a really convenient, easy-to-read way to build a sequence of commands. How you can read the above code is:

  1. Take slumajors_df and with slumajors_df,

  2. perform a mutate() step to create the new variable called ntotal, which is nfemales plus nmales.

Since this is our first time using mutate(), let’s also delve into what the function is doing. In general, mutate() reads:

mutate(name_of_new_variable = operations_on_old_variables).

R just automatically assumes that you want to do the operation for every single row in the data set, which is often quite convenient!

We might also want to create a variable that is the percentage of students identifying as female for each major:

slumajors_df |>
  mutate(percfemale = 100 * nfemales / (nfemales + nmales))
#> # A tibble: 30 × 4
#>   Major                        nfemales nmales percfemale
#>   <chr>                           <dbl>  <dbl>      <dbl>
#> 1 Anthropology                       35     13       72.9
#> 2 Art & Art History                  62     11       84.9
#> 3 Biochemistry                       15      6       71.4
#> 4 Biology                           152     58       72.4
#> 5 Business in the Liberal Arts      192    301       38.9
#> 6 Chemistry                          28     20       58.3
#> # ℹ 24 more rows
Important

What happened to ntotal? Is it still in the printout? It’s not: when we created the variable ntotal, we didn’t actually save (or assign a name to) the new data set as anything. So R makes and prints the new variable, but it doesn’t get saved to any data set.

If we want to save the new data set for later use, then we can use the <- operator. Here, we’re naming the new data set with the same name as the old data set: slumajors_df. Then, we’re doing the same thing for the percfemale variable. We won’t always want to give the new data set the same name as the old one: we’ll talk about this in more detail in the class exercises.

slumajors_df <- slumajors_df |>
  mutate(percfemale = 100 * nfemales / (nfemales + nmales))
slumajors_df <- slumajors_df |> mutate(ntotal = nfemales + nmales)

But, you can pipe as many things together as you want to, so it’s probably easier to just create both variables in one go. The following chunk says to “Take slumajors_df and create a new variable ntotal. With that new data set, create a new variable called percfemale.” Finally, the slumajors_df <- at the beginning says to “save this new data set as a data set with the same name, slumajors_df, thus overwriting the original slumajors_df.”

slumajors_df <- slumajors_df |>
  mutate(ntotal = nfemales + nmales) |>
  mutate(percfemale = 100 * nfemales / (nfemales + nmales))

Now suppose that you want to make a new variable that is conditional on another variable (or more than one variable) in the data set. Then we would typically use mutate() coupled with

  • if_else() if your new variable is created on only one condition
  • case_when() if your new variable is created on more than one condition
Important

The conditions that we supply use the same syntax as the conditions we used with filter(). For example, we still use the same Boolean operators (& for “and” and | for “or”, etc.).

Suppose we want to create a new variable that tells us whether or not the Major has a majority of Women. That is, we want this new variable, morewomen to be "Yes" if the Major has more than 50% women and "No" if it has 50% or less.

slumajors_df |> mutate(morewomen = if_else(percfemale > 50,
                                            true = "Yes",
                                            false = "No"))
#> # A tibble: 30 × 6
#>   Major                        nfemales nmales percfemale ntotal morewomen
#>   <chr>                           <dbl>  <dbl>      <dbl>  <dbl> <chr>    
#> 1 Anthropology                       35     13       72.9     48 Yes      
#> 2 Art & Art History                  62     11       84.9     73 Yes      
#> 3 Biochemistry                       15      6       71.4     21 Yes      
#> 4 Biology                           152     58       72.4    210 Yes      
#> 5 Business in the Liberal Arts      192    301       38.9    493 No       
#> 6 Chemistry                          28     20       58.3     48 Yes      
#> # ℹ 24 more rows

The mutate() statement reads: create a new variable called morewomen that is equal to "Yes" if percfemale > 50 is true and is equal to "No" if perfemale is not > 0.5. The first argument is the condition, the second is what to name the new variable when the condition holds, and the third is what to name the variable if the condition does not hold.

We use conditions all of the time in every day life. For example, New York had a quarantine order stating that people coming from 22 states in July 2020 would need to quarantine. In terms of a condition, this would read “if you are traveling to New York from one of the 22 states, then you need to quarantine for 2 weeks. Else, if not, then you don’t need to quarantine.” The trick in using these conditions in R is getting used to the syntax of the code.

We can see from the above set up that if we had more than one condition, then we’d need to use a different function (or use nested if_else() statements, which can be a nightmare to read). If we have more than one condition for creating the new variable, we will use case_when().

For example, when looking at the output, we see that Biochemistry has 56% female graduates. That’s “about” a 50/50 split, so suppose we want a variable called large_majority that is "female" when the percent women is 70 or more, "male" when the percent women is 30 or less, and "none" when the percent female is between 30 and 70.

slumajors_df |> mutate(large_majority =
                          case_when(percfemale >= 70 ~ "female",
                                    percfemale <= 30 ~ "male",
                                    percfemale > 30 & percfemale < 70 ~ "none")) 
#> # A tibble: 30 × 6
#>   Major                      nfemales nmales percfemale ntotal large_majority
#>   <chr>                         <dbl>  <dbl>      <dbl>  <dbl> <chr>         
#> 1 Anthropology                     35     13       72.9     48 female        
#> 2 Art & Art History                62     11       84.9     73 female        
#> 3 Biochemistry                     15      6       71.4     21 female        
#> 4 Biology                         152     58       72.4    210 female        
#> 5 Business in the Liberal A…      192    301       38.9    493 none          
#> 6 Chemistry                        28     20       58.3     48 none          
#> # ℹ 24 more rows

The case_when() function reads “When the percent female is more than or equal to 70, assign the new variable large_majority the value of "female", when it’s less or equal to 30, assign the new variable large_majority the value of "male", and when it’s more than 30 and less than 70, assign the large_majority variable the value of "none".

Let’s save these two new variables to the slumajors_df:

slumajors_df <- slumajors_df |>
  mutate(morewomen = if_else(percfemale > 50,
                             true = "Yes",
                             false = "No")) |>
  mutate(large_majority =
           case_when(percfemale >= 70 ~ "female",
                     percfemale <= 30 ~ "male",
                     percfemale > 30 & percfemale < 70 ~ "none")) 
Note

Numeric values typically do not go in quotation marks (like 70 and 30) while strings like "female" and "none" do need quotes.

Exercise 1. Use mutate() with if_else() to create a new variable that is called major_size and is equal to "large" when the total number of majors is 100 or more and "small" when the total number of majors is less than 100.

Exercise 2. Use mutate() with case_when() to create a new variable that is called major_size2 and is "large" when the total number of majors is 150 or more, "medium" when the total number of majors is between 41 and 149, and "small" when the total number of majors is 40 or fewer.

Exercise 3. Create a new variable that is the nfemales to nmales ratio (so your new variable should be nfemales divided by nmales).

3.3.2 summarise(): Create Summaries

The summarise() function is useful to get various summaries from the data. For example, suppose that we want to know the average major size at SLU across the five year span or the total number of majors across those five years. Then we can use summarise() and a summary function, like mean(), sum(), median(), max(), min(), n(), etc. You’ll notice that the format of summarise() is extremely similar to the format of mutate(). Using the slumajors_df data again just for one quick example,

slumajors_df |>
  summarise(meantotalmajor = mean(ntotal),
            totalgrad = sum(ntotal))
#> # A tibble: 1 × 2
#>   meantotalmajor totalgrad
#>            <dbl>     <dbl>
#> 1           117.      3502

In the code chunk above, we obtain two summaries (meantotalmajor and totalgrad), separated by a comma. In general, this syntax also works with mutate() to create more than one variable in the same mutate() function.

Note

Most summary functions are intuitive if you’ve had intro stat. But, if you’re not sure whether the summary for getting the maximum is maximum() or max(), just try both or do a quick google search.

Another useful summary function is n(), the counting function, which counts up the total number of rows.

slumajors_df |> summarise(totalobs = n())
#> # A tibble: 1 × 1
#>   totalobs
#>      <int>
#> 1       30
Note

The counting function n() typically doesn’t have any arguments. It’s typically more useful when paired with group_by(), which is discussed next.

Exercise 4. Find the total number of nfemales in the data set and the total number of nmales in the data set.

The summary function you will need to use within summarise() is sum(), which will add up the numbers in the nfemales column. You’ll then need to add a , and then sum() the number of nmales.

3.3.3 group_by(): Groups

summarise() is often most useful when paired with a group_by() statement. Doing so allows us to get summaries across different groups. Conceptually, we can think of toy_df |> group_by(categorical_var) as “grouping” toy_df into a bunch of different groups, with one group for each level of categorical_var. Then, operations from any other functions after the group_by() function are carried out for each group.

For example, suppose that you wanted the total number of registered births per year in the babynames data set that we used earlier. We can then use group_by() to “split” the babynames data frame into 138 separate data frames (one for each year in the data set). After we group_by(year), then the summarise() statement below adds up the n variable for each year:

babynames |> group_by(year) |>
  summarise(totalbirths = sum(n))
#> # A tibble: 138 × 2
#>    year totalbirths
#>   <dbl>       <int>
#> 1  1880      201484
#> 2  1881      192696
#> 3  1882      221533
#> 4  1883      216946
#> 5  1884      243462
#> 6  1885      240854
#> # ℹ 132 more rows

As another example, we can also use the n() function to count up the number of rows for each year:

babynames |> group_by(year) |>
  summarise(n_name = n())
#> # A tibble: 138 × 2
#>    year n_name
#>   <dbl>  <int>
#> 1  1880   2000
#> 2  1881   1935
#> 3  1882   2127
#> 4  1883   2084
#> 5  1884   2297
#> 6  1885   2294
#> # ℹ 132 more rows
Note

group_by() can be used in combination with the other dplyr functions we have learned about (mutate(), filter(), etc.). We will explore this functionality in class.

Exercise 5. With the babynames data set, use group_by() and summarise() to create a data set that has the total number of births for each name.

Exercise 6. With the babynames data set, use group_by() and summarise() to create a data set that has how many times each name appears in babynames.

For one of the previous exercises, you will need to use the sum() function on n, the column with the total number of births while for the other exercise, you will need to use the n() function.

3.3.4 Missing Values

Both of the data sets that we’ve worked with are nice in that they do not have any missing values. We’ll see plenty of examples of data sets with missing values later, so we should examine how the various functions that we’ve talked about so far tackle missing values.

Missing values in R are denoted with NA for “Not Available.” Run the following code to create a toy data set with some missing values so that we can see how the various functions we’ve used so far deal with NA values.

toy_df <- tibble(x = c(NA, 3, 4, 7),
                 y = c(1, 4, 3, 2),
                 z = c("A", "A", "B", NA))
toy_df
#> # A tibble: 4 × 3
#>       x     y z    
#>   <dbl> <dbl> <chr>
#> 1    NA     1 A    
#> 2     3     4 A    
#> 3     4     3 B    
#> 4     7     2 <NA>

We will first discuss how to remove missing values when computing summaries for a variable that contains some NA values.

Important

Missing values should not be removed without carefully examination and a note of what the consequences might be (i.e. why are these values missing?).

We have a toy data set that is meaningless, so we aren’t asking those questions now, but we will for any future data set that does have missing values!

If we have investigated the missing values and are comfortable with removing them, many functions that we would use in summarise() have an na.rm argument that we can set to TRUE to tell summarise() to remove any NAs before taking the mean(), median(), max(), etc.

toy_df |> summarise(meanx = mean(x, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   meanx
#>   <dbl>
#> 1  4.67

If we want to remove the missing values more directly, we can use the is.na() function in combination with filter(). If the variable is NA (Not Available) for an observation, is.na() evaluates to TRUE; if not, is.na() evaluates to FALSE. Test this out using mutate() to create a new variable for whether x is missing:

toy_df |> mutate(missingx = is.na(x))
#> # A tibble: 4 × 4
#>       x     y z     missingx
#>   <dbl> <dbl> <chr> <lgl>   
#> 1    NA     1 A     TRUE    
#> 2     3     4 A     FALSE   
#> 3     4     3 B     FALSE   
#> 4     7     2 <NA>  FALSE

missingx is TRUE only for the the first observation. We can use this to our advantage with filter() to filter it out of the data set, without going through the extra step of actually making a new variable missingx:

toy_df |> filter(is.na(x) != TRUE)
#> # A tibble: 3 × 3
#>       x     y z    
#>   <dbl> <dbl> <chr>
#> 1     3     4 A    
#> 2     4     3 B    
#> 3     7     2 <NA>

You’ll commonly see this written as short-hand in people’s code you may come across as:

toy_df |> filter(!is.na(x))
#> # A tibble: 3 × 3
#>       x     y z    
#>   <dbl> <dbl> <chr>
#> 1     3     4 A    
#> 2     4     3 B    
#> 3     7     2 <NA>

which says to “keep anything that does not have a missing x value” (recall that the ! means “not” so that we are filtering to keep any value of x that is “not” missing).

Exercise 7. filter() toy_df to keep only the rows where z is not missing.

3.4 Practice

In general, practice exercises will be split between exercises that will be done together as a class and exercises that you will do in groups or on your own. The purpose of the class exercises is to give some guidance on how we might think logically through some of the code and the results. The purpose of the group exercises and the on your own exercises is so that you have a chance to practice what you’ve learned with your table or on your own.

3.4.1 Class Exercises

Class Exercise 1. About 55% of SLU students identify as female. So, in the definition of the morewomen variable, does it make more sense to use 55% as the cutoff or 50%?

Class Exercise 2. Do you think it is ethical to exclude non-binary genders from analyses and graphs in the slumajors data set? What about in the stat113_df data that we looked at earlier? Why or why not? Try to construct arguments for both sides: you might also consider data privacy in your arguments.

Class Exercise 3. Investigate what happens with case_when() when you give overlapping conditions and when you give conditions that don’t cover all observations. For overlapping conditions, create a variable testcase that is "Yes" when percfemale is greater than or equal to 40 and "No" when percfemale is greater than 60 For conditions that don’t cover all observations, create a variable testcase2 that is "Yes" when percfemale is greater than or equal to 55 and "No" when percfemale is less than 35. Put the new variables at the beginning of the data frame so that you can see what they contain more easily.

Class Exercise 4. The rank() function can be used within mutate() to “rank” a variable from lowest values to highest values (so that the case with the lowest value receives a 1, the second lowest value receives a 2, etc.). Use group_by() and mutate() to rank the names from most to least popular in each year-sex combination. We will make some notes about what group_by() does with more than 1 variable.

Class Exercise 5. From the data set in the previous exercise, filter() the data to keep only the most popular name in each year-sex combination and then construct a summary table showing how many times each name appears as the most popular name. With this exercise, we will make some notes about the ungroup() function, and, if we have time investigate a way to achieve the objective of this question without explicitly making a rank variable.

Class Exercise 6. Run the following code. Intuitively, a slice(1, 2, 3, 4, 5) should grab the first five rows of the data set, but, when we try to run that, we get 1380 rows. What is the issue?

babynames_test <- babynames |>
  group_by(year, sex) |> mutate(ntest = n / prop)
babynames_test |> slice(1, 2, 3, 4, 5)
#> # A tibble: 1,380 × 6
#> # Groups:   year, sex [276]
#>    year sex   name          n   prop   ntest
#>   <dbl> <chr> <chr>     <int>  <dbl>   <dbl>
#> 1  1880 F     Mary       7065 0.0724  97605.
#> 2  1880 F     Anna       2604 0.0267  97605.
#> 3  1880 F     Emma       2003 0.0205  97605.
#> 4  1880 F     Elizabeth  1939 0.0199  97605.
#> 5  1880 F     Minnie     1746 0.0179  97605.
#> 6  1880 M     John       9655 0.0815 118400.
#> # ℹ 1,374 more rows

3.4.2 Your Turn

Your Turn 1. Investigate how group_by() behaves with the five other main dplyr functions we have used. You can use toy_df to perform your investigation.

  1. mutate()

  2. slice()

  3. filter()

  4. arrange()

  5. select()

Your Turn 2. Use toy_df again to investigate how the dplyr functions treat missing values by default.

  1. mutate(). Try to create a new variable with mutate() involving x. What does R do with the missing value?

  2. arrange(). Try arranging the data set by x. What does R do with the missing value?

  3. filter(). Try filtering so that only observations where x is less than 5 are kept. What does R do with the missing value?

  4. summarise(). Try using summarise() with a function involving x. What does R return?

  5. group_by() and summarise(). To your statement in 4, add a group_by(z) statement before your summarise(). What does R return now?

Your Turn 3. In some cases throughout this chapter, we’ve renamed data sets using <- with the same name like

toy_df <- toy_df |> mutate(newvar = x / y)

In other cases, we’ve given the data set a new name, like

toy_small <- toy_df |> filter(!is.na(x))

For which of the functions below is a generally “safe” to name the data set using the same name after using the function. Give a one sentence reason for each part.

  1. mutate()

  2. arrange()

  3. filter()

  4. summarise()

  5. select()

Your Turn 4. Compare summarise() with mutate() using the following code. What’s the difference between the two functions?

slumajors_df |>
  summarise(meantotalmajor = mean(ntotal),
            totalgrad = sum(ntotal)) 
slumajors_df |>
  mutate(meantotalmajor = mean(ntotal),
            totalgrad = sum(ntotal)) |>
  select(meantotalmajor, totalgrad, everything())

Your Turn 5. With the babynames data, create a data set that has the number of unique names for each year. Then, make a line plot of the data.

Your Turn 6. With the babynames data, create a data set that has the number of unique names for each year-sex combination. Then, make a line plot of the data.

Your Turn 7. Create a data set that has a column for name so that each row shows the total number of births for each unique name across all years and both sexes.