11  Merging with dplyr

Goals:

  • Stack two data sets on top of each other with bind_rows() and append columns with bind_cols().
  • Merge two data sets that share a key with the mutating join functions left_join(), right_join(), full_join(), and inner_join().
  • Filter the observations in one data set using a second data set with semi_join() and anti_join().

11.1 Stacking Rows and Appending Columns

11.1.1 Stacking with bind_rows()

First, we will talk about combining two data sets by “stacking” them on top of each other to form one new data set. The bind_rows() function can be used for this purpose if the two data sets have identical column names.

A common instance where this is useful is if two data sets come from the same source and have different locations or years, but the same exact column names.

For example, examine the following GitHub repository and notice that there is a .csv file for each year of matches on the ATP (the Association of Tennis Professionals, the men’s professional tennis tour): https://github.com/JeffSackmann/tennis_atp.

Then, read in the data sets, and look at how many columns each has.

library(tidyverse)
library(here)
atp_2019 <- read_csv(here("data/atp_matches_2019.csv"))
atp_2018 <- read_csv(here("data/atp_matches_2018.csv"))
head(atp_2019) 
head(atp_2018)

To combine results from both data sets,

atp_df <- bind_rows(atp_2018, atp_2019)
#> Error in `bind_rows()`:
#> ! Can't combine `..1$winner_seed` <double> and `..2$winner_seed` <character>.

The issue is that the winner_seed and loser_seed variables are read in as different classes in the atp_2018 data (numeric) than in the atp_2019 data (character). The following code forces the two seed variables in the 2018 data set to be characters so that they match the 2019 data.

atp_2018 <- read_csv(here("data/atp_matches_2018.csv"),
                     col_types = cols(winner_seed = col_character(),
                                      loser_seed = col_character()))
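
An alternative sketch, if you prefer to fix the classes after reading instead of specifying them in read_csv(): read the file with the default column guesses and then convert the two seed columns with mutate().

atp_2018 <- read_csv(here("data/atp_matches_2018.csv")) |>
  mutate(winner_seed = as.character(winner_seed),
         loser_seed = as.character(loser_seed))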

We can try combining the data sets now.

atp_df <- bind_rows(atp_2018, atp_2019)
atp_df

Do a quick check to make sure the number of rows in atp_2018 plus the number of rows in atp_2019 equals the number of rows in atp_df.
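
A minimal check (this should print TRUE):

nrow(atp_2018) + nrow(atp_2019) == nrow(atp_df)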

It might seem a little strict, but, as we saw in the previous example, bind_rows() will refuse to stack two data sets if a column that appears in both has a different class in each data set. The column names themselves do not all need to match, though.

Now run the following and look at the output.

df_test2a <- tibble(xvar = c(1, 2))
df_test2b <- tibble(xvar = c(1, 2), y = c(5, 1))
bind_rows(df_test2a, df_test2b)
#> # A tibble: 4 × 2
#>    xvar     y
#>   <dbl> <dbl>
#> 1     1    NA
#> 2     2    NA
#> 3     1     5
#> 4     2     1

We can see from this toy example that, if a variable is missing entirely in one of the data sets we are binding together, then NA values are filled in for that variable.

11.1.2 Binding Columns with bind_cols()

bind_cols() combines two data sets by placing their columns side by side in a single data set. We won’t spend much time on binding columns because it is generally a little dangerous.

We will use a couple of test data sets, df_test1a and df_test1b, to see it in action:

df_test1a <- tibble(xvar = c(1, 2), yvar = c(5, 1))
df_test1b <- tibble(x = c(1, 2), y = c(5, 1))
bind_cols(df_test1a, df_test1b)
#> # A tibble: 2 × 4
#>    xvar  yvar     x     y
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     5     1     5
#> 2     2     1     2     1

Exercise 1. Run the following and explain why R does not simply stack the rows. Then, fix the issue with the rename() function.

df_test1a <- tibble(xvar = c(1, 2), yvar = c(5, 1))
df_test1b <- tibble(x = c(1, 2), y = c(5, 1))
bind_rows(df_test1a, df_test1b)
#> # A tibble: 4 × 4
#>    xvar  yvar     x     y
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     5    NA    NA
#> 2     2     1    NA    NA
#> 3    NA    NA     1     5
#> 4    NA    NA     2     1

11.2 Mutating Joins

If the goal is to combine two data sets using some common variable(s) that both data sets have, we need different tools than simply stacking rows or appending columns. When merging two or more data sets, we need a matching identification variable in each data set. This variable is commonly called a key. A key can be an identification number, a name, a date, etc., but it must be present in both data sets.

As a simple first example, consider

library(tidyverse)
df1 <- tibble(name = c("Emily", "Miguel", "Tonya"), fav_sport = c("Swimming", "Football", "Tennis"))
df2 <- tibble(name = c("Tonya", "Miguel", "Emily"),
              fav_colour = c("Robin's Egg Blue", "Tickle Me Pink", "Goldenrod"))

Our goal is to combine the two data sets so that the people’s favorite sports and favorite colours are in one data set.

Identify the key in the example above.
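
Once you have identified the key, here is a preview of the kind of merge we are after (left_join() and the other join functions are introduced below):

left_join(df1, df2, by = join_by(name == name))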

11.2.1 Keep All Rows of Data Set 1 with left_join()

Consider the babynames R package, which has the following data sets:

  • lifetables: cohort life tables by sex and birth year, starting with the 1900 cohort.
  • births: the number of births in the United States in each year since 1909.
  • babynames: the popularity of baby names by year and sex since 1880.

## install.packages("babynames")
library(babynames)
life_df <- babynames::lifetables
birth_df <- babynames::births
babynames_df <- babynames::babynames

head(babynames)
head(births)
head(lifetables)

Read about each data set with ?babynames, ?births and ?lifetables.

Suppose that you want to combine the births data set with the babynames data set, so that each row of babynames now has the total number of births for that year. We first need to identify the key that the two data sets share: in this case, each data set has a year variable. We can use left_join() to keep all observations in babynames_df, even for years that do not appear in the birth_df data set.

combined_left <- left_join(babynames_df, birth_df, by = join_by(year == year))
head(combined_left)
#> # A tibble: 6 × 6
#>    year sex   name          n   prop births
#>   <dbl> <chr> <chr>     <int>  <dbl>  <int>
#> 1  1880 F     Mary       7065 0.0724     NA
#> 2  1880 F     Anna       2604 0.0267     NA
#> 3  1880 F     Emma       2003 0.0205     NA
#> 4  1880 F     Elizabeth  1939 0.0199     NA
#> 5  1880 F     Minnie     1746 0.0179     NA
#> 6  1880 F     Margaret   1578 0.0162     NA
tail(combined_left)
#> # A tibble: 6 × 6
#>    year sex   name       n       prop  births
#>   <dbl> <chr> <chr>  <int>      <dbl>   <int>
#> 1  2017 M     Zyhier     5 0.00000255 3855500
#> 2  2017 M     Zykai      5 0.00000255 3855500
#> 3  2017 M     Zykeem     5 0.00000255 3855500
#> 4  2017 M     Zylin      5 0.00000255 3855500
#> 5  2017 M     Zylis      5 0.00000255 3855500
#> 6  2017 M     Zyrie      5 0.00000255 3855500

Why are births missing in head(combined_left) but not in tail(combined_left)?

11.2.2 Keep All Rows of Data Set 2 with right_join()

Recall from the accompanying handout that there is never a need to use right_join(), because it returns the same information as a left_join() with the two data set arguments switched:

## these always contain the same rows; only the column order differs
right_join(babynames_df, birth_df, by = join_by(year == year))
#> # A tibble: 1,839,952 × 6
#>    year sex   name         n   prop  births
#>   <dbl> <chr> <chr>    <int>  <dbl>   <int>
#> 1  1909 F     Mary     19259 0.0523 2718000
#> 2  1909 F     Helen     9250 0.0251 2718000
#> 3  1909 F     Margaret  7359 0.0200 2718000
#> 4  1909 F     Ruth      6509 0.0177 2718000
#> 5  1909 F     Dorothy   6253 0.0170 2718000
#> 6  1909 F     Anna      5804 0.0158 2718000
#> # ℹ 1,839,946 more rows
left_join(birth_df, babynames_df, by = join_by(year == year))
#> # A tibble: 1,839,952 × 6
#>    year  births sex   name         n   prop
#>   <dbl>   <int> <chr> <chr>    <int>  <dbl>
#> 1  1909 2718000 F     Mary     19259 0.0523
#> 2  1909 2718000 F     Helen     9250 0.0251
#> 3  1909 2718000 F     Margaret  7359 0.0200
#> 4  1909 2718000 F     Ruth      6509 0.0177
#> 5  1909 2718000 F     Dorothy   6253 0.0170
#> 6  1909 2718000 F     Anna      5804 0.0158
#> # ℹ 1,839,946 more rows

Therefore, it’s usually easier to always use left_join() and ignore right_join() entirely.
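
If you want to convince yourself that the two calls really do contain the same information, one sanity check is sketched below. It assumes the only differences are the column order (and possibly the row order); setequal() from dplyr returns TRUE when two data frames contain the same rows, ignoring row order.

rj <- right_join(babynames_df, birth_df, by = join_by(year == year))
lj <- left_join(birth_df, babynames_df, by = join_by(year == year))
## put lj's columns in the same order as rj's, then compare the rows
setequal(rj, lj |> select(all_of(names(rj))))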

11.2.3 Keep All Rows of Both Data Sets with full_join()

In addition to keeping any rows with a matching key in the other data frame, a full_join() will keep all rows in data set 1 that don’t have a matching key in data set 2, and will also keep all rows in data set 2 that don’t have a matching key in data set 1, filling in NA for missing values when necessary. For our example of merging babynames_df with birth_df,

full_join(babynames_df, birth_df, by = join_by(year == year))
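
Because every year in birth_df (1909 onward) also appears in babynames_df, this full join should contain the same rows as the left join from earlier; a quick check of that claim:

n_full <- nrow(full_join(babynames_df, birth_df, by = join_by(year == year)))
n_left <- nrow(left_join(babynames_df, birth_df, by = join_by(year == year)))
n_full == n_left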

11.2.4 Keep Only Rows with Matching Keys with inner_join()

We can also keep only rows with matching keys with inner_join(). For this join, any row in data set 1 without a matching key in data set 2 is dropped, and any row in data set 2 without a matching key in data set 1 is also dropped.

inner_join(babynames_df, birth_df, by = join_by(year == year))
#> # A tibble: 1,839,952 × 6
#>    year sex   name         n   prop  births
#>   <dbl> <chr> <chr>    <int>  <dbl>   <int>
#> 1  1909 F     Mary     19259 0.0523 2718000
#> 2  1909 F     Helen     9250 0.0251 2718000
#> 3  1909 F     Margaret  7359 0.0200 2718000
#> 4  1909 F     Ruth      6509 0.0177 2718000
#> 5  1909 F     Dorothy   6253 0.0170 2718000
#> 6  1909 F     Anna      5804 0.0158 2718000
#> # ℹ 1,839,946 more rows


11.2.5 Which xxxx_join()?

Which join function to use depends on the context of the data and on the questions you are answering in your analysis.

Important

If you’re using a left_join(), right_join() or inner_join(), you’re potentially cutting out some data. It’s important to be aware of what data you’re omitting. For example, with the babynames and births data, we would want to note that joining with birth_df listed first removed all observations from before 1909 from the joined data set.
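
One way to keep track of what a join might be hiding is to inspect the rows where no match was found. For the left join we did earlier, a quick sketch that tallies the unmatched years:

combined_left |> filter(is.na(births)) |> count(year)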

Exercise 2. Examine the following two joins that we’ve done, and explain why one resulting data set has fewer observations (rows) than the other.

left_join(babynames_df, birth_df, by = join_by(year == year))
#> # A tibble: 1,924,665 × 6
#>    year sex   name          n   prop births
#>   <dbl> <chr> <chr>     <int>  <dbl>  <int>
#> 1  1880 F     Mary       7065 0.0724     NA
#> 2  1880 F     Anna       2604 0.0267     NA
#> 3  1880 F     Emma       2003 0.0205     NA
#> 4  1880 F     Elizabeth  1939 0.0199     NA
#> 5  1880 F     Minnie     1746 0.0179     NA
#> 6  1880 F     Margaret   1578 0.0162     NA
#> # ℹ 1,924,659 more rows
left_join(birth_df, babynames_df, by = join_by(year == year))
#> # A tibble: 1,839,952 × 6
#>    year  births sex   name         n   prop
#>   <dbl>   <int> <chr> <chr>    <int>  <dbl>
#> 1  1909 2718000 F     Mary     19259 0.0523
#> 2  1909 2718000 F     Helen     9250 0.0251
#> 3  1909 2718000 F     Margaret  7359 0.0200
#> 4  1909 2718000 F     Ruth      6509 0.0177
#> 5  1909 2718000 F     Dorothy   6253 0.0170
#> 6  1909 2718000 F     Anna      5804 0.0158
#> # ℹ 1,839,946 more rows

11.3 Filtering Joins

Filtering joins (semi_join() and anti_join()) are useful when we want to keep only the variables in one data set, but we want to filter its observations using a variable in a second data set.

Consider again the two data sets on men’s tennis matches in 2018 and in 2019.

atp_2019 <- read_csv(here("data/atp_matches_2019.csv"))
atp_2018 <- read_csv(here("data/atp_matches_2018.csv"))
atp_2019
#> # A tibble: 2,781 × 49
#>   tourney_id tourney_name surface draw_size tourney_level tourney_date
#>   <chr>      <chr>        <chr>       <dbl> <chr>                <dbl>
#> 1 2019-M020  Brisbane     Hard           32 A                 20181231
#> 2 2019-M020  Brisbane     Hard           32 A                 20181231
#> 3 2019-M020  Brisbane     Hard           32 A                 20181231
#> 4 2019-M020  Brisbane     Hard           32 A                 20181231
#> 5 2019-M020  Brisbane     Hard           32 A                 20181231
#> 6 2019-M020  Brisbane     Hard           32 A                 20181231
#> # ℹ 2,775 more rows
#> # ℹ 43 more variables: match_num <dbl>, winner_id <dbl>, winner_seed <chr>,
#> #   winner_entry <chr>, winner_name <chr>, winner_hand <chr>,
#> #   winner_ht <dbl>, winner_ioc <chr>, winner_age <dbl>, loser_id <dbl>,
#> #   loser_seed <chr>, loser_entry <chr>, loser_name <chr>, loser_hand <chr>,
#> #   loser_ht <dbl>, loser_ioc <chr>, loser_age <dbl>, score <chr>,
#> #   best_of <dbl>, round <chr>, minutes <dbl>, w_ace <dbl>, w_df <dbl>, …
atp_2018
#> # A tibble: 2,889 × 49
#>   tourney_id tourney_name surface draw_size tourney_level tourney_date
#>   <chr>      <chr>        <chr>       <dbl> <chr>                <dbl>
#> 1 2018-M020  Brisbane     Hard           32 A                 20180101
#> 2 2018-M020  Brisbane     Hard           32 A                 20180101
#> 3 2018-M020  Brisbane     Hard           32 A                 20180101
#> 4 2018-M020  Brisbane     Hard           32 A                 20180101
#> 5 2018-M020  Brisbane     Hard           32 A                 20180101
#> 6 2018-M020  Brisbane     Hard           32 A                 20180101
#> # ℹ 2,883 more rows
#> # ℹ 43 more variables: match_num <dbl>, winner_id <dbl>, winner_seed <dbl>,
#> #   winner_entry <chr>, winner_name <chr>, winner_hand <chr>,
#> #   winner_ht <dbl>, winner_ioc <chr>, winner_age <dbl>, loser_id <dbl>,
#> #   loser_seed <dbl>, loser_entry <chr>, loser_name <chr>, loser_hand <chr>,
#> #   loser_ht <dbl>, loser_ioc <chr>, loser_age <dbl>, score <chr>,
#> #   best_of <dbl>, round <chr>, minutes <dbl>, w_ace <dbl>, w_df <dbl>, …

11.3.1 Filtering with semi_join()

Suppose that we only want to keep the 2019 matches in which the winning player had 10 or more wins in 2018. This might be useful if we do not want to consider players who played only a couple of matches in 2018, perhaps because they were injured or because they received a special wildcard into the draw of a single event.

To accomplish this, we can first create a data set that has the names of all of the players that won 10 or more matches in 2018, using functions that we learned from dplyr earlier in the semester:

win10 <- atp_2018 |> group_by(winner_name) |>
  summarise(nwin = n()) |> 
  filter(nwin >= 10)
win10
#> # A tibble: 93 × 2
#>   winner_name       nwin
#>   <chr>            <int>
#> 1 Adrian Mannarino    26
#> 2 Albert Ramos        21
#> 3 Alex De Minaur      29
#> 4 Alexander Zverev    58
#> 5 Aljaz Bedene        19
#> 6 Andreas Seppi       24
#> # ℹ 87 more rows
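
As an aside, count() is a shortcut for the group_by() plus summarise(n()) pattern used above, so an equivalent way to build win10 is:

win10 <- atp_2018 |>
  count(winner_name, name = "nwin") |>
  filter(nwin >= 10)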

Next, we apply semi_join(), which takes the names of two data sets: the first is the data set to be filtered, and the second contains the information about how the first should be “filtered”. The by argument gives the key (winner_name, in this case).

tennis_2019_10 <- semi_join(atp_2019, win10,
                            by = join_by(winner_name == winner_name))
tennis_2019_10$winner_name

Note

This strategy keeps only the 2019 matches whose winner had 10 or more match wins in 2018. A match is dropped whenever its winner did not have 10 or more wins in 2018, even if the loser did. So this isn’t yet perfect and would take a little more thought about which matches we actually want to keep for a particular analysis.
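
For instance, if what we actually want are the 2019 matches in which either player had 10 or more wins in 2018, one possible sketch is to stack two semi_join() results and then drop the duplicated matches:

either_10 <- bind_rows(
  ## matches whose winner had 10 or more wins in 2018
  semi_join(atp_2019, win10, by = join_by(winner_name == winner_name)),
  ## matches whose loser had 10 or more wins in 2018
  semi_join(atp_2019, win10, by = join_by(loser_name == winner_name))
) |>
  distinct()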

11.3.2 Filtering with anti_join()

Now suppose that we want to keep only the 2019 matches in which the winning player did not have any wins in 2018. We might think of these players as “emerging players” in 2019, players coming back from an injury, etc. To do this, we can use anti_join(), which keeps only the rows in the first data set that do not have a match in the second data set.

new_winners <- anti_join(atp_2019, atp_2018,
                         by = join_by(winner_name == winner_name)) 
new_winners$winner_name

We can then examine how many wins each of these “new” (or perhaps previously injured) players had in 2019:

new_winners |> group_by(winner_name) |>
  summarise(nwin = n()) |>
  arrange(desc(nwin))
#> # A tibble: 59 × 2
#>   winner_name           nwin
#>   <chr>                <int>
#> 1 Christian Garin         32
#> 2 Juan Ignacio Londero    22
#> 3 Miomir Kecmanovic       22
#> 4 Hugo Dellien            12
#> 5 Attila Balazs            7
#> 6 Cedrik Marcel Stebe      7
#> # ℹ 53 more rows

The filtering join functions are useful if you want to filter out observations by some criterion in a different data set.
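
As a closing aside: for a single key, semi_join() behaves much like filter() combined with %in%. A rough sketch of the equivalence (it glosses over some details of NA handling):

## these two produce the same rows
semi_join(atp_2019, win10, by = join_by(winner_name == winner_name))
atp_2019 |> filter(winner_name %in% win10$winner_name)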

11.4 Practice

11.4.1 Class Exercises

Class Exercise 0. Examine the following data sets (the first is df1 and the second is df2) and then, without running any code, answer the following questions.

df1:

id xvar
A     1
B     2
C     3
E     1
F     2

df2:

id yvar
A     2
C     1
D     2
E     1
G     1
H     4
  1. How many rows would be in the data set from left_join(df1, df2, by = join_by(id == id))?

  2. How many rows would be in the data set from left_join(df2, df1, by = join_by(id == id))?

  3. How many rows would be in the data set from full_join(df1, df2, by = join_by(id == id))?

  4. How many rows would be in the data set from inner_join(df1, df2, by = join_by(id == id))?

  5. How many rows would be in the data set from semi_join(df1, df2, by = join_by(id == id))?

  6. How many rows would be in the data set from anti_join(df1, df2, by = join_by(id == id))?

Class Exercise 1. Why might bind_cols() be a dangerous way to combine data? What must you be sure of about the way the data was collected in order to combine data in this way?

Class Exercise 2. The key variable is very important for joining and is not always available in a “perfect” form. Recall the college majors data sets we have: slumajors_df, which has information on majors at SLU, and collegemajors_df, which has different statistics on college majors nationwide. There are lots of interesting variables in these data sets, but we’ll focus on the Major variable here. Read in and examine the two data sets with:

slumajors_df <- read_csv(here("data/SLU_Majors_17_21.csv"))
collegemajors_df <- read_csv(here("data/college-majors.csv"))
head(slumajors_df)
#> # A tibble: 6 × 3
#>   Major                        nfemales nmales
#>   <chr>                           <dbl>  <dbl>
#> 1 Anthropology                       35     13
#> 2 Art & Art History                  62     11
#> 3 Biochemistry                       15      6
#> 4 Biology                           152     58
#> 5 Business in the Liberal Arts      192    301
#> 6 Chemistry                          28     20
head(collegemajors_df)
#> # A tibble: 6 × 12
#>   Major         Total   Men Women Major_category Employed Full_time Part_time
#>   <chr>         <dbl> <dbl> <dbl> <chr>             <dbl>     <dbl>     <dbl>
#> 1 PETROLEUM EN…  2339  2057   282 Engineering        1976      1849       270
#> 2 MINING AND M…   756   679    77 Engineering         640       556       170
#> 3 METALLURGICA…   856   725   131 Engineering         648       558       133
#> 4 NAVAL ARCHIT…  1258  1123   135 Engineering         758      1069       150
#> 5 CHEMICAL ENG… 32260 21239 11021 Engineering       25694     23170      5180
#> 6 NUCLEAR ENGI…  2573  2200   373 Engineering        1857      2038       264
#> # ℹ 4 more variables: Unemployed <dbl>, Median <dbl>, P25th <dbl>,
#> #   P75th <dbl>
  1. Attempt to join the two data sets together with a key.

  2. Why did the variables from collegemajors_df contain only NA values when we tried to join by major?

Note

This example underscores the importance of having a key that matches exactly. Some, but not all, of the issues involved in joining these two data sets can be solved with functions in the stringr package (discussed in a few weeks). But, the underlying problem remains and, if we wanted to combine the two data sets fully, we would need to recode some of the major names by hand.

Class Exercise 3. Read in the gun violence data set, and suppose that you want to add a row to this data set that has the statistics on gun ownership and mortality rate in the District of Columbia (Washington D.C., which is in the NE region, has 16.7 deaths per 100,000 people, and a gun ownership rate of 8.7%). To do so, create a tibble() that has a single row representing D.C. and then combine your new tibble with the overall gun violence data set. Name this new data set all_df.

library(tidyverse)
mortality_df <- read_csv(here("data/gun_violence_us.csv"))

Class Exercise 4. Examine the following data sets, built in to base R, on demographic statistics about the U.S. states and on state abbreviations:

df1 <- as_tibble(state.x77)
df2 <- as_tibble(state.abb)
df1
#> # A tibble: 50 × 8
#>   Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost   Area
#>        <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl>  <dbl>
#> 1       3615   3624        2.1       69.0   15.1      41.3    20  50708
#> 2        365   6315        1.5       69.3   11.3      66.7   152 566432
#> 3       2212   4530        1.8       70.6    7.8      58.1    15 113417
#> 4       2110   3378        1.9       70.7   10.1      39.9    65  51945
#> 5      21198   5114        1.1       71.7   10.3      62.6    20 156361
#> 6       2541   4884        0.7       72.1    6.8      63.9   166 103766
#> # ℹ 44 more rows
df2
#> # A tibble: 50 × 1
#>   value
#>   <chr>
#> 1 AL   
#> 2 AK   
#> 3 AZ   
#> 4 AR   
#> 5 CA   
#> 6 CO   
#> # ℹ 44 more rows

Combine the two data sets with bind_cols() and name the new data set states_df. What are you assuming about the data sets in order to use this function?

Class Exercise 5. Use a join function to combine the mortality data set with D.C. (which we named all_df) with the states data set from the previous exercise (states_df). For this exercise, keep the row with Washington D.C., having it take on NA values for any variable not observed in the states data.

Class Exercise 6. Repeat the previous exercise, but now drop Washington D.C. in your merging process. Practice doing this with a join function (as opposed to slice()-ing it out explicitly).

Class Exercise 7. Use semi_join() to create a subset of states_df that are in the NE region. Hint: You will need to filter all_df from Class Exercise 3 first so that it contains only states in the NE region.

Class Exercise 8. Do the same thing as in the previous exercise (that is, create a subset of states_df that are in the NE region), but this time, use anti_join().

11.4.2 Your Turn

Your Turn 1. In your group, evaluate whether each of the following statements is true or false:

  1. An inner_join() will always result in a data set with the same or fewer rows than a full_join().

  2. An inner_join() will always result in a data set with the same or fewer rows than a left_join().

  3. A left_join() will always result in a data set with the same number of rows as a semi_join() on the same two data sets.

Your Turn 2. Examine the following data sets (the first is df3 and the second is df4) and then, without running any code, answer the following questions in your group. Keep in mind that there are some duplicate keys!

df3:

id xvar
A     1
A     2
C     3
C     1
F     2
F     6

df4:

id yvar
A     2
B     1
C     2
D     1
D     1
D     4
  1. How many rows would be in the data set from left_join(df3, df4, by = join_by(id == id))?

  2. How many rows would be in the data set from left_join(df4, df3, by = join_by(id == id))?

  3. How many rows would be in the data set from full_join(df3, df4, by = join_by(id == id))?

  4. How many rows would be in the data set from inner_join(df3, df4, by = join_by(id == id))?

  5. How many rows would be in the data set from semi_join(df3, df4, by = join_by(id == id))?

  6. How many rows would be in the data set from anti_join(df3, df4, by = join_by(id == id))?

Your Turn 3. Examine again the gun violence data set, and explain why each attempt at combining the D.C. data with the overall data doesn’t work or is incorrect.

test1 <- tibble(state = "Washington D.C.", mortality_rate = 16.7,
                ownership_rate = 8.7, region = "NE")
bind_rows(mortality_df, test1)

test2 <- tibble(state = "Washington D.C.", mortality_rate = 16.7,
       ownership_rate = 0.087, region = NE)
#> Error: object 'NE' not found
bind_rows(mortality_df, test2)
#> Error in eval(expr, envir, enclos): object 'test2' not found

test3 <- tibble(state = "Washington D.C.", mortality_rate = "16.7",
       ownership_rate = "0.087", region = "NE")
bind_rows(mortality_df, test3)
#> Error in `bind_rows()`:
#> ! Can't combine `..1$mortality_rate` <double> and `..2$mortality_rate` <character>.

Your Turn 4. We will do a short activity with joining using slips of paper. After we complete the activity, consider the following. Suppose that you have a data frame called courses_df that contains the course schedule for the fall semester for 100 different professors at SLU, each with 3 different courses (so that there are 300 total rows in the data set: 3 rows per professor). Of the 100 professors, 5 are in the Math-CS-Stat department.

You have another data frame called info_df that contains information about the 15 Math-CS-Stat professors at SLU (with just one row per professor). Again, 5 of these professors also appear in courses_df while the other 10 do not.

How many rows would result in the following joins, assuming that you use professor as the key in the join_by() function?

  1. left_join(courses_df, info_df)

  2. semi_join(info_df, courses_df)

  3. anti_join(courses_df, info_df)

  4. inner_join(courses_df, info_df)

  5. full_join(courses_df, info_df)