3 Wrangling with dplyr
Goals:
- Use the
filter()
andslice()
functions to keep certain rows of a data frame. - Use the
select()
function to keep certain columns of a data frame. - Use the
arrange()
function to sort a data frame. - Explain what the pipe operator
|>
does and explain when you can and cannot use the pipe operator. - Combine the above goals with plotting to explore the
babynames
data set.
3.1 filter()
,slice()
, select()
, arrange()
, and Piping
Throughout this chapter, we will use the babynames
data set in the babynames
R
package. To begin, install the babynames
package by typing install.packages("babynames")
in your bottom-left console window, and then load the babynames
package in with
Read about the data set by typing ?babynames
in your bottom-left window of R Studio
. We see that this data set contains baby name data provided by the SSA in the United States dating back to 1880:
babynames
#> # A tibble: 1,924,665 × 5
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 1880 F Mary 7065 0.0724
#> 2 1880 F Anna 2604 0.0267
#> 3 1880 F Emma 2003 0.0205
#> 4 1880 F Elizabeth 1939 0.0199
#> 5 1880 F Minnie 1746 0.0179
#> 6 1880 F Margaret 1578 0.0162
#> # ℹ 1,924,659 more rows
There are many interesting and informative plots that we could make with this data set, but most require some data wrangling first. This chapter will provide the foundation for such wrangling skills.
Before we begin, we also need to load in the tidyverse
package so that we can use the dplyr
functions.
3.1.1 filter()
and slice()
: Choosing Rows
There are a lot of commonly used dplyr
wrangling functions, but we need to start somewhere! We will begin with two functions used to choose rows to keep in the data frame: filter()
, which keeps rows based on a condition that is provided, and slice()
, which keeps rows based on row numbers that are provided.
filter()
is a way to keep rows by specifying a condition related to one or more of the variables in the data set.
We can keep rows based on a categorical variable or a quantitative variable or a combination of any number of categorical and quantitative variables. R
uses the following symbols to make comparisons:
-
<
and<=
for less than and less than or equal to, respectively -
>
and>=
for greater than and greater than or equal to, respectively -
==
for equal to (careful: equal to is a double equal sign==
) -
!=
for not equal to (in general,!
denotes “not”)
The easiest way to see how filter works is with some examples. The following line of code filter()
s the babynames
data set based on a condition. See if you can guess what the following statement does before running the code.
babynames |> filter(name == "Matthew")
We’re seeing our first of many, many, many, many, many, many, many instances of using |>
to “pipe.” Piping is a really convenient, easy-to-read way to build a sequence of commands. How you can read the above code is:
Take the
babynames
data frame object and withbabynames
, and thenperform a
filter()
step to keep only the rows wherename
isMatthew
.
The “and then” phrase is the most commonly used phrase to explain how to read the pipe |>
in a line of code.
Examine the following four filter()
statements and see if you can guess what each one is doing before running the code.
Note that some things put in quotes, like "M"
while some things aren’t, like 2000
. Generally, if we are filtering by a numeric variable, like prop
, any value used in the filtering statement does not go in quotes, while if we are filtering by a categorical variable, like sex
or name
, the value used in the filtering statement does go in quotes.
We can also combine conditions on multiple variables in filter()
using Boolean operators. Look at the Venn diagrams in R
for Data Science to learn about the various Boolean operators you can use in R
: https://r4ds.had.co.nz/transform.html#logical-operators. The Boolean operators can be used in other functions in R
as well, as we will see in the next section with mutate()
.
The following gives some examples. See if you can figure out what each line of code is doing before running it.
For example, suppose that we only want to keep the first 5 rows of the babynames
data:
The c()
function is used here to bind the numbers 1, 2, 3, 4, 5
into a vector. c()
actually stands for “concatenate.” We can alternatively use slice(1:5)
, which is shorthand for slice(c(1, 2, 3, 4, 5))
. The :
is useful for larger numbers of rows that we might want to keep. For example, slice(1:1000)
is much easier than slice(c(1, 2, 3, 4, ...., 999, 1000))
. While slice()
is useful, it is relatively simple. We’ll come back to it again in a few weeks as well when we discuss subsetting in base R
.
Exercise 1. Use filter()
on the babynames
data set to only keep rows with your name. (Note that, the data set has a large variety of names, but, if there were fewer than 5 people born in the United States with your name every year, then your name might not appear in the data set. In that case, just use a different name that interests you!).
Exercise 2. Use filter()
to only keep rows in the babynames
data set from the year
2000
onward.
Exercise 3. Use filter()
to only keep rows in babynames
with your name from Exercise 1 and that have a year
from 2000
onward.
You will need to use the &
Boolean operator to combine your conditions from the previous two exercises.
Exercise 4. Use slice()
to keep the first 20
rows of babynames
.
3.1.2 arrange()
: Ordering Rows
The arrange()
function allows us to order rows in the data set using one or more variables. The function is very straightforward: we only need to provide the name of the variable that we want to order by. For example,
babynames |> arrange(n)
#> # A tibble: 1,924,665 × 5
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 1880 F Adelle 5 0.0000512
#> 2 1880 F Adina 5 0.0000512
#> 3 1880 F Adrienne 5 0.0000512
#> 4 1880 F Albertine 5 0.0000512
#> 5 1880 F Alys 5 0.0000512
#> 6 1880 F Ana 5 0.0000512
#> # ℹ 1,924,659 more rows
orders the babynames
data so that n
, the number of babies, goes from smallest to largest. If we want to arrange()
by descending order of a variable, we just wrap the name of the variable we are ordering by with desc()
for “descending:”
babynames |> arrange(desc(n))
#> # A tibble: 1,924,665 × 5
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 1947 F Linda 99686 0.0548
#> 2 1948 F Linda 96209 0.0552
#> 3 1947 M James 94756 0.0510
#> 4 1957 M Michael 92695 0.0424
#> 5 1947 M Robert 91642 0.0493
#> 6 1949 F Linda 91016 0.0518
#> # ℹ 1,924,659 more rows
Exercise 5. Use arrange()
to sort the data frame so that it is in descending order by the prop
variable. This allows us to examine the names with the highest proportion of babies given that name for that given year
.
Exercise 6. Use arrange()
on the categorical variable name
. What does arrange()
seem to do on a categorical variable?
3.1.3 select()
: Choosing Columns
We might also be interested in getting rid of some of the columns in a data set. One reason to do this is if there are an overwhelming (30+) columns in a data set, but we know that we just need a few of them. The easiest way to use select()
is to just input the names of the columns that you want to keep. For example, if we were only interested in the name
and n
variables, we could make a data frame with only those variables with:
babynames |> select(name, n)
#> # A tibble: 1,924,665 × 2
#> name n
#> <chr> <int>
#> 1 Mary 7065
#> 2 Anna 2604
#> 3 Emma 2003
#> 4 Elizabeth 1939
#> 5 Minnie 1746
#> 6 Margaret 1578
#> # ℹ 1,924,659 more rows
We might also want to use select()
to get rid of one or more columns. If this is the case, we denote any column you want to get rid of with -
. For example, we might want to get rid of the sex
column and the prop
column with
babynames |> select(-sex, -prop)
#> # A tibble: 1,924,665 × 3
#> year name n
#> <dbl> <chr> <int>
#> 1 1880 Mary 7065
#> 2 1880 Anna 2604
#> 3 1880 Emma 2003
#> 4 1880 Elizabeth 1939
#> 5 1880 Minnie 1746
#> 6 1880 Margaret 1578
#> # ℹ 1,924,659 more rows
select()
comes with many useful helper functions, but these are oftentimes not needed. One of the helper functions that is actually often useful is everything()
, which selects all columns in a data frame. This can be used to help re-order columns if you have a particular column that you want to appear first:
babynames |> select(name, everything())
#> # A tibble: 1,924,665 × 5
#> name year sex n prop
#> <chr> <dbl> <chr> <int> <dbl>
#> 1 Mary 1880 F 7065 0.0724
#> 2 Anna 1880 F 2604 0.0267
#> 3 Emma 1880 F 2003 0.0205
#> 4 Elizabeth 1880 F 1939 0.0199
#> 5 Minnie 1880 F 1746 0.0179
#> 6 Margaret 1880 F 1578 0.0162
#> # ℹ 1,924,659 more rows
The previous line of code puts name
as the first variable in the data frame.
For data frames with only a few columns, re-ordering the columns is generally not that useful, but, for data frames with hundreds or thousands of columns, moving variables of interest to the beginning of the data frame helps us view these variables more readily.
Exercise 7. Use select()
to keep only the year
, name
and prop
variables in two ways: (1) keep year
, name
, and prop
directly by specifying these variable names in select()
and (2) keep only year
, name
, and prop
indirectly by specifying that sex
, and n
should be dropped from the data frame.
3.1.4 Assigning a Data Frame a Name
Up until now, we have not assigned any of the new data frames we are creating a name. For example, with
babynames |> filter(name == "Matthew")
#> # A tibble: 212 × 5
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 1880 M Matthew 113 0.000954
#> 2 1881 M Matthew 80 0.000739
#> 3 1882 M Matthew 109 0.000893
#> 4 1883 M Matthew 86 0.000765
#> 5 1884 M Matthew 117 0.000953
#> 6 1885 M Matthew 111 0.000957
#> # ℹ 206 more rows
we get a printout of the data with only the name
Matthew
, but, the data frame object is not assigned a name at all. So, we cannot use ggplot()
to make a plot of n
only for the name Matthew
because we do not have a named data set that only has the name Matthew
. In other words, R
performs the filter()
, but the filtered data frame doesn’t get saved to any particular named object. If we want to “save” the new data set for future use (like in a plot command), then we can use the assignment operator, <-
:
babynames_matthew <- babynames |> filter(name == "Matthew")
In the previous chunk, we assign the data frame with only Matthew
names to the name babynames_matthew
:
babynames_matthew
#> # A tibble: 212 × 5
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 1880 M Matthew 113 0.000954
#> 2 1881 M Matthew 80 0.000739
#> 3 1882 M Matthew 109 0.000893
#> 4 1883 M Matthew 86 0.000765
#> 5 1884 M Matthew 117 0.000953
#> 6 1885 M Matthew 111 0.000957
#> # ℹ 206 more rows
We can really use almost any name we would like but it’s more helpful to use a name that makes sense (babynames_matthew
) than a name that is meaningless.
An example of giving a data frame a meaningless name would be:
take_it_kronk_feel_the_power <- babynames |> filter(name == "Matthew")
We can use the data frame take_it_kronk_feel_the_power
in the same way as babynames_matthew
, but take_it_kronk_feel_the_power
is not as easy to remember and doesn’t make any sense in this context.
Exercise 8. Name your data frame that you created from Exercise 1. Then, type in the name you assigned to the data frame to verify that you can see some printed out data.
3.1.5 More about the Pipe
We are jumping straight into using piping, but we do want to have an appreciation on how terrible life would be without it. What piping does is make whatever is given before the |>
pipe the first argument of whatever function follows the |>
. So, for a toy data frame named df
with variable xvar
,
df |> filter(xvar <= 3)
is equivalent to
filter(df, xvar <= 3)
It might also help to use an analogy when thinking about piping. Consider the Ke$ha’s morning routine in the opening of the song Tik Tok. If we were to write her morning routine in terms of piping,
|> wake_up(time = "morning", feels_like = "P-Diddy") |>
kesha grab(glasses) |>
brush(teeth, item = "jack", unit = "bottle") |> ....
Kesha first wakes up in the morning, and then the Kesha that has woken up grabs her glasses, and then the Kesha who has woken up and has grabbed her glasses proceeds to brush her teeth, etc.
The pipe operator |>
is loaded automatically with R
. We will heavily use the pipe throughout the entire semester so it’s worth it to delve a little deeper into what it is here. We will use the fitness data to further explore the pipe. Read in the data with
fitness_df <- read_csv("https://raw.githubusercontent.com/highamm/ds234_quarto/main/data_online/higham_fitness_clean.csv",
col_types = list(stepgoal = col_factor()))
fitness_df
#> # A tibble: 1,722 × 9
#> Start active_cals distance flights steps month weekday dayofyear
#> <date> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 2018-11-28 57.8 0.930 0 1885. Nov Wed 332
#> 2 2018-11-29 509. 4.64 18 8953. Nov Thu 333
#> 3 2018-11-30 599. 6.05 12 11665 Nov Fri 334
#> 4 2018-12-01 661. 6.80 6 12117 Dec Sat 335
#> 5 2018-12-02 527. 4.61 1 8925. Dec Sun 336
#> 6 2018-12-03 550. 3.96 2 7205 Dec Mon 337
#> # ℹ 1,716 more rows
#> # ℹ 1 more variable: stepgoal <fct>
Thus far, we have used a single pipe |>
to do one task, but, we can string together multiple dplyr
functions with multiple |>
statements. For example, suppose that we wanted to keep only the weekday
s that were Sat
and Sun
, and then, order the data set so that it was in descending order of steps
taken:
fitness_df |> filter(weekday == "Sat" | weekday == "Sun") |>
arrange(desc(steps))
#> # A tibble: 492 × 9
#> Start active_cals distance flights steps month weekday dayofyear
#> <date> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 2023-07-15 2291. 18.0 193 32589. Jul Sat 196
#> 2 2019-08-11 1385. 15.8 173 29480. Aug Sun 223
#> 3 2019-04-14 1767. 17.9 1 25578 Apr Sun 104
#> 4 2018-12-15 1091. 11.2 27 21224. Dec Sat 349
#> 5 2022-05-29 1291. 11.8 103 20853 May Sun 149
#> 6 2019-02-10 1392. 12.8 5 20700. Feb Sun 41
#> # ℹ 486 more rows
#> # ℹ 1 more variable: stepgoal <fct>
This reads: “take fitness_df
and then filter()
it to keep only Saturdays and Sundays and then sort it in descending order of steps
.”
Consecutive pipes build on each other: we can slowly build out what the pipe is doing step-by-step. Running code “pipe by pipe” can be a very helpful way to understand what each consecutive piping statement is doing.
The code
fitness_df |> filter(weekday == "Sat" | weekday == "Sun")
#> # A tibble: 492 × 9
#> Start active_cals distance flights steps month weekday dayofyear
#> <date> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 2018-12-01 661. 6.80 6 12117 Dec Sat 335
#> 2 2018-12-02 527. 4.61 1 8925. Dec Sun 336
#> 3 2018-12-08 775. 6.46 0 8834. Dec Sat 342
#> 4 2018-12-09 341. 2.81 0 5265 Dec Sun 343
#> 5 2018-12-15 1091. 11.2 27 21224. Dec Sat 349
#> 6 2018-12-16 1029. 8.84 79 16376 Dec Sun 350
#> # ℹ 486 more rows
#> # ℹ 1 more variable: stepgoal <fct>
is equivalent to:
filter(fitness_df, weekday == "Sat" | weekday == "Sun")
#> # A tibble: 492 × 9
#> Start active_cals distance flights steps month weekday dayofyear
#> <date> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 2018-12-01 661. 6.80 6 12117 Dec Sat 335
#> 2 2018-12-02 527. 4.61 1 8925. Dec Sun 336
#> 3 2018-12-08 775. 6.46 0 8834. Dec Sat 342
#> 4 2018-12-09 341. 2.81 0 5265 Dec Sun 343
#> 5 2018-12-15 1091. 11.2 27 21224. Dec Sat 349
#> 6 2018-12-16 1029. 8.84 79 16376 Dec Sun 350
#> # ℹ 486 more rows
#> # ℹ 1 more variable: stepgoal <fct>
Then,
filter(fitness_df, weekday == "Sat" | weekday == "Sun") |>
arrange(desc(steps))
#> # A tibble: 492 × 9
#> Start active_cals distance flights steps month weekday dayofyear
#> <date> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 2023-07-15 2291. 18.0 193 32589. Jul Sat 196
#> 2 2019-08-11 1385. 15.8 173 29480. Aug Sun 223
#> 3 2019-04-14 1767. 17.9 1 25578 Apr Sun 104
#> 4 2018-12-15 1091. 11.2 27 21224. Dec Sat 349
#> 5 2022-05-29 1291. 11.8 103 20853 May Sun 149
#> 6 2019-02-10 1392. 12.8 5 20700. Feb Sun 41
#> # ℹ 486 more rows
#> # ℹ 1 more variable: stepgoal <fct>
is equivalent to:
arrange(filter(fitness_df, weekday == "Sat" | weekday == "Sun"),
desc(steps))
#> # A tibble: 492 × 9
#> Start active_cals distance flights steps month weekday dayofyear
#> <date> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 2023-07-15 2291. 18.0 193 32589. Jul Sat 196
#> 2 2019-08-11 1385. 15.8 173 29480. Aug Sun 223
#> 3 2019-04-14 1767. 17.9 1 25578 Apr Sun 104
#> 4 2018-12-15 1091. 11.2 27 21224. Dec Sat 349
#> 5 2022-05-29 1291. 11.8 103 20853 May Sun 149
#> 6 2019-02-10 1392. 12.8 5 20700. Feb Sun 41
#> # ℹ 486 more rows
#> # ℹ 1 more variable: stepgoal <fct>
So, why use the pipe? Compare the code the uses the pipe operator to the code that doesn’t. Which is easier to read? Which do you think is easier to write? The example shows that, for our purposes, the pipe is most useful in aiding the readability of our code. It’s a lot easier to see what’s happening in the code chunk with the pipes than it is in the previous code chunk without the pipe because, with the pipe, we can read the code from left to right and top to bottom. Without the pipe, we need to read the code from the “inside to the outside”, which is much more challenging.
There are many situations in which using the |>
would not be appropriate. To use the pipe, what precedes the |>
must be the first argument in the function following the |>
. For tidyverse
functions that we use throughout the course, this condition is almost always true because these functions were designed with piping in mind. But, for non-tidyverse
functions, this may not be the case.
For example, if you have taken STAT 213, you’ve used lm()
to fit many different types of linear models. If you haven’t taken STAT 213, lm(response ~ explanatory, data = name_of_data_set)
stands for “linear model” and can be used to fit the simple linear regression model that you learned about in STAT 113. You might expect something like this to fit a linear model using fitness_df
with active_cals
as the response and steps
as the predictor:
fitness_df |> lm(active_cals ~ steps)
#> Error in as.data.frame.default(data): cannot coerce class '"formula"' to a data.frame
But it throws us an error. Typing in ?lm
reveals that its first argument is a formula
to fit the model, not a data set. So the function is trying to run
lm(fitness_df, active_cals ~ steps)
#> Error in as.data.frame.default(data): cannot coerce class '"formula"' to a data.frame
which doesn’t work because the arguments to the function are mixed up (the formula should appear first and the data set should appear second).
The pipe operator |>
is relatively new. Previously, the primary pipe operator used was %>%
and came from the magrittr
package. For almost all cases, the two operators are equivalent. However, when scanning the Internet for help with code, you will probably see %>%
used in many of people’s responses on sites like StackOverflow.
Exercise 9. This “More About the Pipe” section will make more sense as we proceed through the course. For now, we just want to be able to understand that something like df |> slice(1:6)
is equivalent to slice(df, 1:6)
. Practice by converting the following code to use the pipe:
filter(fitness_df, active_cals > 50)
#> # A tibble: 1,690 × 9
#> Start active_cals distance flights steps month weekday dayofyear
#> <date> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 2018-11-28 57.8 0.930 0 1885. Nov Wed 332
#> 2 2018-11-29 509. 4.64 18 8953. Nov Thu 333
#> 3 2018-11-30 599. 6.05 12 11665 Nov Fri 334
#> 4 2018-12-01 661. 6.80 6 12117 Dec Sat 335
#> 5 2018-12-02 527. 4.61 1 8925. Dec Sun 336
#> 6 2018-12-03 550. 3.96 2 7205 Dec Mon 337
#> # ℹ 1,684 more rows
#> # ℹ 1 more variable: stepgoal <fct>
3.2 Practice
In general, practice exercises will be split between exercises that will be done together as a class and exercises that you will do in groups or on your own. The purpose of the class exercises is to give some guidance on how we might think logically through some of the code and the results. The purpose of the group exercises and the on your own exercises is so that you have a chance to practice what you’ve learned with your table or on your own.
3.2.1 Class Exercises
For many of these class exercises, we will start practicing stringing together multiple piping statements. We will also practice more with ggplot()
by making plots of some of the data sets we are creating.
Class Exercise 1. In the babynames
data set, use filter()
, arrange()
, and slice()
to print the 10 most popular Male babynames in 2017.
Class Exercise 2. Name the data frame you created in the previous exercise and use the data frame to make a bar plot of the 10 most popular Male babynames in 2017, along with the number of babies with each name in that year.
Class Exercise 3. In the babynames
data set, keep only the rows with your name (or, another name that interests you) and one sex (either "M"
or "F"
).
Class Exercise 4. Name the new data set you made in the previous exercise and then construct a line plot that looks at either the n
or prop
of your chosen name through year
.
Class Exercise 5. Explain why the following code gives a warning message and returns NA
. Use the order of Arguments in ?mean
in your explanation.
fitness_df |> mean(distance)
#> [1] NA
3.2.2 Your Turn
Your Turn 1. Get a little more practice with using conditions in filter()
by completing the following:
Create a data set with all years except the year 1945.
Create a data set that only has rows where
n
is between 400 and 5000.Create a data set with rows that are either
F
(female)Monica
names orF
(female)Hilary
names.
The syntax filter(x_var > 20 & < 30)
is not correct. The variable that we are filtering with needs to go in each conditional statement: filter(x_var > 20 & x_var < 30)
Your Turn 2. Name your data set is part (c) of the previous exercise. Then, use the data set to make a line plot comparing the popularity of the Monica
and Hilary
names through time.
Your Turn 3. Choose 5 names paired with 5 sexes that interest you and create a new data set that only has data on those 5 name/sex combinations. Then, make a line plot showing the popularity of these 5 names over time.
Your Turn 4. Choose a year and a sex that interests you and create a data set that only contain observations from that year and sex. Then, make a bar plot that shows the top 10 names from that year and sex, showing the prop
of babies born with each of those 10 names in that year.
3.3 mutate()
, group_by()
, summarise()
, and Missing Values
Goals:
- Use the
mutate()
,if_else()
, andcase_when()
functions to create new variables. - Use
group_by()
andsummarise()
to create useful summaries of a data set. * Explain whatdplyr
functions do with missing values. - Combine the above goals with plotting to explore a data set on SLU majors.
Throughout this section, we will use a data set that contains observations for each of SLU’s majors, with three variables per major:
-
Major
, the name of the major. -
nfemales
, the number of female graduates in that major from 2017 - 2021. -
nmales
, the number of male graduates in that major from 2017 - 2021.
Begin by reading in the data set with
library(tidyverse)
slumajors_df <- read_csv("https://raw.githubusercontent.com/highamm/ds234_quarto/main/data_online/SLU_Majors_17_21.csv")
slumajors_df
#> # A tibble: 30 × 3
#> Major nfemales nmales
#> <chr> <dbl> <dbl>
#> 1 Anthropology 35 13
#> 2 Art & Art History 62 11
#> 3 Biochemistry 15 6
#> 4 Biology 152 58
#> 5 Business in the Liberal Arts 192 301
#> 6 Chemistry 28 20
#> # ℹ 24 more rows
3.3.1 mutate()
: Create Variables
Sometimes, we will want to create a new variable that’s not in the data set, oftentimes using if_else()
, case_when()
, or basic algebraic operations on one or more of the columns already present in the data set.
R
understands the following symbols:
-
+
for addition,-
for subtraction -
*
for multiplication,/
for division -
^
for raising something to a power (3 ^ 2
is equal to9
)
R
also does the same order of operations as usual: parentheses, then exponents, then multiplication and division, then addition and subtraction.
For example, suppose that we want to create a variable in slumajors_df
that has the total number of students graduating in each major. We can do this with mutate()
:
slumajors_df |> mutate(ntotal = nfemales + nmales)
#> # A tibble: 30 × 4
#> Major nfemales nmales ntotal
#> <chr> <dbl> <dbl> <dbl>
#> 1 Anthropology 35 13 48
#> 2 Art & Art History 62 11 73
#> 3 Biochemistry 15 6 21
#> 4 Biology 152 58 210
#> 5 Business in the Liberal Arts 192 301 493
#> 6 Chemistry 28 20 48
#> # ℹ 24 more rows
The |>
operator reads as “and then”, as in “take slumajors_df
and then mutate()
it.”
Piping is a really convenient, easy-to-read way to build a sequence of commands. How you can read the above code is:
Take
slumajors_df
and withslumajors_df
,perform a
mutate()
step to create the new variable calledntotal
, which isnfemales
plusnmales
.
Since this is our first time using mutate()
, let’s also delve into what the function is doing. In general, mutate()
reads:
mutate(name_of_new_variable = operations_on_old_variables)
.
R
just automatically assumes that you want to do the operation for every single row in the data set, which is often quite convenient!
We might also want to create a variable that is the percentage of students identifying as female for each major:
slumajors_df |>
mutate(percfemale = 100 * nfemales / (nfemales + nmales))
#> # A tibble: 30 × 4
#> Major nfemales nmales percfemale
#> <chr> <dbl> <dbl> <dbl>
#> 1 Anthropology 35 13 72.9
#> 2 Art & Art History 62 11 84.9
#> 3 Biochemistry 15 6 71.4
#> 4 Biology 152 58 72.4
#> 5 Business in the Liberal Arts 192 301 38.9
#> 6 Chemistry 28 20 58.3
#> # ℹ 24 more rows
What happened to ntotal
? Is it still in the printout? It’s not: when we created the variable ntotal
, we didn’t actually save (or assign a name to) the new data set as anything. So R
makes and prints the new variable, but it doesn’t get saved to any data set.
If we want to save the new data set for later use, then we can use the <-
operator. Here, we’re naming the new data set with the same name as the old data set: slumajors_df
. Then, we’re doing the same thing for the percfemale
variable. We won’t always want to give the new data set the same name as the old one: we’ll talk about this in more detail in the class exercises.
slumajors_df <- slumajors_df |>
mutate(percfemale = 100 * nfemales / (nfemales + nmales))
slumajors_df <- slumajors_df |> mutate(ntotal = nfemales + nmales)
But, you can pipe as many things together as you want to, so it’s probably easier to just create both variables in one go. The following chunk says to “Take slumajors_df
and create a new variable ntotal
. With that new data set, create a new variable called percfemale
.” Finally, the slumajors_df <-
at the beginning says to “save this new data set as a data set with the same name, slumajors_df
, thus overwriting the original slumajors_df
.”
Now suppose that you want to make a new variable that is conditional on another variable (or more than one variable) in the data set. Then we would typically use mutate()
coupled with
-
if_else()
if your new variable is created on only one condition -
case_when()
if your new variable is created on more than one condition
The conditions that we supply use the same syntax as the conditions we used with filter()
. For example, we still use the same Boolean operators (&
for “and” and |
for “or”, etc.).
Suppose we want to create a new variable that tells us whether or not the Major
has a majority of Women. That is, we want this new variable, morewomen
to be "Yes"
if the Major
has more than 50% women and "No"
if it has 50% or less.
slumajors_df |> mutate(morewomen = if_else(percfemale > 50,
true = "Yes",
false = "No"))
#> # A tibble: 30 × 6
#> Major nfemales nmales percfemale ntotal morewomen
#> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Anthropology 35 13 72.9 48 Yes
#> 2 Art & Art History 62 11 84.9 73 Yes
#> 3 Biochemistry 15 6 71.4 21 Yes
#> 4 Biology 152 58 72.4 210 Yes
#> 5 Business in the Liberal Arts 192 301 38.9 493 No
#> 6 Chemistry 28 20 58.3 48 Yes
#> # ℹ 24 more rows
The mutate()
statement reads: create a new variable called morewomen
that is equal to "Yes"
if percfemale
> 50
is true and is equal to "No"
if perfemale
is not > 0.5
. The first argument is the condition, the second is what to name the new variable when the condition holds, and the third is what to name the variable if the condition does not hold.
We use conditions all of the time in every day life. For example, New York had a quarantine order stating that people coming from 22 states in July 2020 would need to quarantine. In terms of a condition, this would read “if you are traveling to New York from one of the 22 states, then you need to quarantine for 2 weeks. Else, if not, then you don’t need to quarantine.” The trick in using these conditions in R
is getting used to the syntax of the code.
We can see from the above set up that if we had more than one condition, then we’d need to use a different function (or use nested if_else()
statements, which can be a nightmare to read). If we have more than one condition for creating the new variable, we will use case_when()
.
For example, when looking at the output, we see that Biochemistry
has 56% female graduates. That’s “about” a 50/50 split, so suppose we want a variable called large_majority
that is "female"
when the percent women is 70 or more, "male"
when the percent women is 30 or less, and "none"
when the percent female is between 30 and 70.
slumajors_df |> mutate(large_majority =
case_when(percfemale >= 70 ~ "female",
percfemale <= 30 ~ "male",
percfemale > 30 & percfemale < 70 ~ "none"))
#> # A tibble: 30 × 6
#> Major nfemales nmales percfemale ntotal large_majority
#> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Anthropology 35 13 72.9 48 female
#> 2 Art & Art History 62 11 84.9 73 female
#> 3 Biochemistry 15 6 71.4 21 female
#> 4 Biology 152 58 72.4 210 female
#> 5 Business in the Liberal A… 192 301 38.9 493 none
#> 6 Chemistry 28 20 58.3 48 none
#> # ℹ 24 more rows
The case_when()
function reads “When the percent female is more than or equal to 70, assign the new variable large_majority
the value of "female"
, when it’s less or equal to 30, assign the new variable large_majority
the value of "male"
, and when it’s more than 30 and less than 70, assign the large_majority
variable the value of "none"
.
Let’s save these two new variables to the slumajors_df
:
Numeric values typically do not go in quotation marks (like 70
and 30
) while strings like "female"
and "none"
do need quotes.
Exercise 1. Use mutate()
with if_else()
to create a new variable that is called major_size
and is equal to "large"
when the total number of majors is 100 or more and "small"
when the total number of majors is less than 100.
Exercise 2. Use mutate()
with case_when()
to create a new variable that is called major_size2
and is "large"
when the total number of majors is 150 or more, "medium"
when the total number of majors is between 41 and 149, and "small"
when the total number of majors is 40 or fewer.
Exercise 3. Create a new variable that is the nfemales
to nmales
ratio (so your new variable should be nfemales
divided by nmales
).
3.3.2 summarise()
: Create Summaries
The summarise()
function is useful to get various summaries from the data. For example, suppose that we want to know the average major size at SLU across the five year span or the total number of majors across those five years. Then we can use summarise()
and a summary function, like mean()
, sum()
, median()
, max()
, min()
, n()
, etc. You’ll notice that the format of summarise()
is extremely similar to the format of mutate()
. Using the slumajors_df
data again just for one quick example,
In the code chunk above, we obtain two summaries (meantotalmajor
and totalgrad
), separated by a comma. In general, this syntax also works with mutate()
to create more than one variable in the same mutate()
function.
Most summary functions are intuitive if you’ve had intro stat. But, if you’re not sure whether the summary for getting the maximum is maximum()
or max()
, just try both or do a quick google search.
Another useful summary function is n()
, the counting function, which counts up the total number of rows.
The counting function n()
typically doesn’t have any arguments. It’s typically more useful when paired with group_by()
, which is discussed next.
Exercise 4. Find the total number of nfemales
in the data set and the total number of nmales
in the data set.
The summary function you will need to use within summarise()
is sum()
, which will add up the numbers in the nfemales
column. You’ll then need to add a ,
and then sum()
the number of nmales
.
3.3.3 group_by()
: Groups
summarise()
is often most useful when paired with a group_by()
statement. Doing so allows us to get summaries across different groups. Conceptually, we can think of toy_df |> group_by(categorical_var)
as “grouping” toy_df
into a bunch of different groups, with one group for each level of categorical_var
. Then, operations from any other functions after the group_by()
function are carried out for each group.
For example, suppose that you wanted the total number of registered births per year in the babynames
data set that we used earlier. We can then use group_by()
to “split” the babynames
data frame into 138 separate data frames (one for each year
in the data set). After we group_by(year)
, then the summarise()
statement below adds up the n
variable for each year
:
As another example, we can also use the n()
function to count up the number of rows for each year
:
group_by()
can be used in combination with the other dplyr
functions we have learned about (mutate()
, filter()
, etc.). We will explore this functionality in class.
Exercise 5. With the babynames
data set, use group_by()
and summarise()
to create a data set that has the total number of births for each name
.
Exercise 6. With the babynames
data set, use group_by()
and summarise()
to create a data set that has how many times each name
appears in babynames
.
3.3.4 Missing Values
Both of the data sets that we’ve worked with are nice in that they do not have any missing values. We’ll see plenty of examples of data sets with missing values later, so we should examine how the various functions that we’ve talked about so far tackle missing values.
Missing values in R
are denoted with NA
for “Not Available.” Run the following code to create a toy data set with some missing values so that we can see how the various functions we’ve used so far deal with NA
values.
We will first discuss how to remove missing values when computing summaries for a variable that contains some NA
values.
Missing values should not be removed without carefully examination and a note of what the consequences might be (i.e. why are these values missing?).
We have a toy data set that is meaningless, so we aren’t asking those questions now, but we will for any future data set that does have missing values!
If we have investigated the missing values and are comfortable with removing them, many functions that we would use in summarise()
have an na.rm
argument that we can set to TRUE
to tell summarise()
to remove any NA
s before taking the mean()
, median()
, max()
, etc.
If we want to remove the missing values more directly, we can use the is.na()
function in combination with filter()
. If the variable is NA
(Not Available) for an observation, is.na()
evaluates to TRUE
; if not, is.na()
evaluates to FALSE
. Test this out using mutate()
to create a new variable for whether x
is missing:
missingx
is TRUE
only for the the first observation. We can use this to our advantage with filter()
to filter it out of the data set, without going through the extra step of actually making a new variable missingx
:
You’ll commonly see this written as short-hand in people’s code you may come across as:
which says to “keep anything that does not have a missing x
value” (recall that the !
means “not” so that we are filtering to keep any value of x
that is “not” missing).
Exercise 7. filter()
toy_df
to keep only the rows where z
is not missing.
3.4 Practice
In general, practice exercises will be split between exercises that will be done together as a class and exercises that you will do in groups or on your own. The purpose of the class exercises is to give some guidance on how we might think logically through some of the code and the results. The purpose of the group exercises and the on your own exercises is so that you have a chance to practice what you’ve learned with your table or on your own.
3.4.1 Class Exercises
Class Exercise 1. About 55% of SLU students identify as female. So, in the definition of the morewomen
variable, does it make more sense to use 55% as the cutoff or 50%?
Class Exercise 2. Do you think it is ethical to exclude non-binary genders from analyses and graphs in the slumajors
data set? What about in the stat113_df
data that we looked at earlier? Why or why not? Try to construct arguments for both sides: you might also consider data privacy in your arguments.
Class Exercise 3. Investigate what happens with case_when()
when you give overlapping conditions and when you give conditions that don’t cover all observations. For overlapping conditions, create a variable testcase
that is "Yes"
when percfemale
is greater than or equal to 40 and "No"
when percfemale
is greater than 60 For conditions that don’t cover all observations, create a variable testcase2
that is "Yes"
when percfemale
is greater than or equal to 55 and "No"
when percfemale
is less than 35. Put the new variables at the beginning of the data frame so that you can see what they contain more easily.
Class Exercise 4. The rank()
function can be used within mutate()
to “rank” a variable from lowest values to highest values (so that the case with the lowest value receives a 1
, the second lowest value receives a 2
, etc.). Use group_by()
and mutate()
to rank the names from most to least popular in each year-sex combination. We will make some notes about what group_by()
does with more than 1 variable.
Class Exercise 5. From the data set in the previous exercise, filter()
the data to keep only the most popular name in each year-sex combination and then construct a summary table showing how many times each name appears as the most popular name. With this exercise, we will make some notes about the ungroup()
function, and, if we have time investigate a way to achieve the objective of this question without explicitly making a rank
variable.
Class Exercise 6. Run the following code. Intuitively, a slice(1, 2, 3, 4, 5)
should grab the first five rows of the data set, but, when we try to run that, we get 1380 rows. What is the issue?
babynames_test <- babynames |>
group_by(year, sex) |> mutate(ntest = n / prop)
babynames_test |> slice(1, 2, 3, 4, 5)
#> # A tibble: 1,380 × 6
#> # Groups: year, sex [276]
#> year sex name n prop ntest
#> <dbl> <chr> <chr> <int> <dbl> <dbl>
#> 1 1880 F Mary 7065 0.0724 97605.
#> 2 1880 F Anna 2604 0.0267 97605.
#> 3 1880 F Emma 2003 0.0205 97605.
#> 4 1880 F Elizabeth 1939 0.0199 97605.
#> 5 1880 F Minnie 1746 0.0179 97605.
#> 6 1880 M John 9655 0.0815 118400.
#> # ℹ 1,374 more rows
3.4.2 Your Turn
Your Turn 1. Investigate how group_by()
behaves with the five other main dplyr
functions we have used. You can use toy_df
to perform your investigation.
Your Turn 2. Use toy_df
again to investigate how the dplyr
functions treat missing values by default.
mutate()
. Try to create a new variable withmutate()
involvingx
. What doesR
do with the missing value?arrange()
. Try arranging the data set byx
. What doesR
do with the missing value?filter()
. Try filtering so that only observations where x is less than 5 are kept. What doesR
do with the missing value?summarise()
. Try usingsummarise()
with a function involving x. What doesR
return?group_by()
andsummarise()
. To your statement in 4, add agroup_by(z)
statement before yoursummarise()
. What doesR
return now?
Your Turn 3. In some cases throughout this chapter, we’ve renamed data sets using <-
with the same name like
toy_df <- toy_df |> mutate(newvar = x / y)
In other cases, we’ve given the data set a new name, like
For which of the functions below is a generally “safe” to name the data set using the same name after using the function. Give a one sentence reason for each part.
Your Turn 4. Compare summarise()
with mutate()
using the following code. What’s the difference between the two functions?
Your Turn 5. With the babynames
data, create a data set that has the number of unique names for each year. Then, make a line plot of the data.
Your Turn 6. With the babynames
data, create a data set that has the number of unique names for each year-sex combination. Then, make a line plot of the data.
Your Turn 7. Create a data set that has a column for name
so that each row shows the total number of births for each unique name across all years and both sexes.