11 Merging with dplyr
Goals:
use
bind_rows()
to stack two data sets andbind_cols()
to merge two data sets.identify keys in two related data sets.
use the mutating join functions in
dplyr
to merge two data sets by a key.use the filtering join functions in
dplyr
to filter one data set by values in another data set.apply the appropriate
join()
function for a given problem and context.
11.1 Stacking Rows and Appending Columns
11.1.1 Stacking with bind_rows()
First, we will talk about combining two data sets by “stacking” them on top of each other to form one new data set. The bind_rows()
function can be used for this purpose if the two data sets have identical column names.
A common instance where this is useful is if two data sets come from the same source and have different locations or years, but the same exact column names.
For example, examine the following website and notice how there are .csv files given for each year of matches in the ATP (Association of (men’s) Tennis Professionals). https://github.com/JeffSackmann/tennis_atp.
Then, read in the data sets, and look at how many columns each has.
To combine results from both data sets,
atp_df <- bind_rows(atp_2018, atp_2019)
#> Error in `bind_rows()`:
#> ! Can't combine `..1$winner_seed` <double> and `..2$winner_seed` <character>.
The issue is that the winner_seed
and loser_seed
variables are of different variable classes in the atp_2018
data vs. the atp_2019
data. The following code forces the seed variables in the 2018 data set to be characters.
atp_2018 <- read_csv(here("data/atp_matches_2018.csv"),
col_types = cols(winner_seed = col_character(),
loser_seed = col_character()))
We can try combining the data sets now.
atp_df <- bind_rows(atp_2018, atp_2019)
atp_df
Do a quick check to make sure the number of rows in atp_2018
plus the number of rows in atp_2019
equals the number of rows in atp_df
.
It might seem a little annoying, but, by default bind_rows()
will only combine two data sets by stacking rows if the data sets have identical column names and identical column classes, as we saw in the previous example.
Now run the following and look at the output.
We can see from this toy example that, if a variable is missing entirely in one of the data sets we are binding together, then NA
values are filled in for that variable.
11.1.2 Binding Columns with bind_cols()
bind_cols()
combines two data sets by combining their columns into one data set. We won’t spend much time talking about how to bind together columns because it’s generally a little dangerous.
We will use a couple of test data sets, df_test1a
and df_test1b
, to see it in action:
Exercise 1. Run the following and explain why R
does not simply stack the rows. Then, fix the issue with the rename()
function.
11.2 Mutating Joins
If the goal is to combine two data sets using some common variable(s) that both data sets have, we need different tools than simply stacking rows or appending columns. When merging together two or more data sets, we need to have a matching identification variable in each data set. This variable is commonly called a key. A key can be an identification number, a name, a date, etc, but must be present in both data sets.
As a simple first example, consider
Our goal is to combine the two data sets so that the people’s favorite sports and favorite colours are in one data set.
Identify the key
in the example above.
11.2.1 Keep All Rows of Data Set 1 with left_join()
Consider the babynames
R
package, which has the following data sets:
-
lifetables
: cohort life tables for differentsex
and differentyear
variables, starting at the year 1900. -
births
: the number of births in the United States in each year, since 1909 -
babynames
: popularity of different baby names per year and sex since the year 1880.
Read about each data set with ?babynames
, ?births
and ?lifetables
.
Suppose that you want to combine the births
data set with the babynames
data set, so that each row of babynames
now has the total number of births for that year. We first need to identify the key in each data set that we will use for the joining. In this case, each data set has a year
variable, and we can use left_join()
to keep all observations in babynames_df
, even for years that are not in the births_df
data set.
combined_left <- left_join(babynames_df, birth_df, by = join_by(year == year))
head(combined_left)
#> # A tibble: 6 × 6
#> year sex name n prop births
#> <dbl> <chr> <chr> <int> <dbl> <int>
#> 1 1880 F Mary 7065 0.0724 NA
#> 2 1880 F Anna 2604 0.0267 NA
#> 3 1880 F Emma 2003 0.0205 NA
#> 4 1880 F Elizabeth 1939 0.0199 NA
#> 5 1880 F Minnie 1746 0.0179 NA
#> 6 1880 F Margaret 1578 0.0162 NA
tail(combined_left)
#> # A tibble: 6 × 6
#> year sex name n prop births
#> <dbl> <chr> <chr> <int> <dbl> <int>
#> 1 2017 M Zyhier 5 0.00000255 3855500
#> 2 2017 M Zykai 5 0.00000255 3855500
#> 3 2017 M Zykeem 5 0.00000255 3855500
#> 4 2017 M Zylin 5 0.00000255 3855500
#> 5 2017 M Zylis 5 0.00000255 3855500
#> 6 2017 M Zyrie 5 0.00000255 3855500
Why are births
missing in head(combined_left)
but not in tail(combined_left)
?
11.2.2 Keep All Rows of Data Set 2 with right_join()
Recall from the accompanying handout that there is no need to ever use right_join()
because it is the same as using a left_join()
with the first two data set arguments switched:
## these will always do the same exact thing
right_join(babynames_df, birth_df, by = join_by(year == year))
#> # A tibble: 1,839,952 × 6
#> year sex name n prop births
#> <dbl> <chr> <chr> <int> <dbl> <int>
#> 1 1909 F Mary 19259 0.0523 2718000
#> 2 1909 F Helen 9250 0.0251 2718000
#> 3 1909 F Margaret 7359 0.0200 2718000
#> 4 1909 F Ruth 6509 0.0177 2718000
#> 5 1909 F Dorothy 6253 0.0170 2718000
#> 6 1909 F Anna 5804 0.0158 2718000
#> # ℹ 1,839,946 more rows
left_join(birth_df, babynames_df, by = join_by(year == year))
#> # A tibble: 1,839,952 × 6
#> year births sex name n prop
#> <dbl> <int> <chr> <chr> <int> <dbl>
#> 1 1909 2718000 F Mary 19259 0.0523
#> 2 1909 2718000 F Helen 9250 0.0251
#> 3 1909 2718000 F Margaret 7359 0.0200
#> 4 1909 2718000 F Ruth 6509 0.0177
#> 5 1909 2718000 F Dorothy 6253 0.0170
#> 6 1909 2718000 F Anna 5804 0.0158
#> # ℹ 1,839,946 more rows
Therefore, it’s usually easier to just always use left_join()
and ignore right_join()
completely.
11.2.3 Keep All Rows of Both Data Sets with full_join()
In addition to keeping any rows with a matching key in the other data frame, a full_join()
will keep all rows in data set 1 that don’t have a matching key in data set 2, and will also keep all rows in data set 2 that don’t have a matching key in data set 1, filling in NA
for missing values when necessary. For our example of merging babynames_df
with birth_df
,
11.2.4 Keep Only Rows with Matching Keys with inner_join()
We can also keep only rows with matching keys with inner_join()
. For this join, any row in data set 1 without a matching key in data set 2 is dropped, and any row in data set 2 without a matching key in data set 1 is also dropped.
inner_join(babynames_df, birth_df, by = join_by(year == year))
#> # A tibble: 1,839,952 × 6
#> year sex name n prop births
#> <dbl> <chr> <chr> <int> <dbl> <int>
#> 1 1909 F Mary 19259 0.0523 2718000
#> 2 1909 F Helen 9250 0.0251 2718000
#> 3 1909 F Margaret 7359 0.0200 2718000
#> 4 1909 F Ruth 6509 0.0177 2718000
#> 5 1909 F Dorothy 6253 0.0170 2718000
#> 6 1909 F Anna 5804 0.0158 2718000
#> # ℹ 1,839,946 more rows
11.2.5 Which xxxx_join()
?
Which join function we use will depend on the context of the data and what questions you will be answering in your analysis.
If you’re using a left_join()
, right_join()
or inner_join()
, you’re potentially cutting out some data. It’s important to be aware of what data you’re omitting. For example, with the babynames
and births
data, we would want to keep a note that a left_join()
removed all observations before 1909 from joined data set.
Exercise 2. Examine the following two joins that we’ve done, and explain why one resulting data set has fewer observations (rows) than the other.
left_join(babynames_df, birth_df, by = join_by(year == year))
#> # A tibble: 1,924,665 × 6
#> year sex name n prop births
#> <dbl> <chr> <chr> <int> <dbl> <int>
#> 1 1880 F Mary 7065 0.0724 NA
#> 2 1880 F Anna 2604 0.0267 NA
#> 3 1880 F Emma 2003 0.0205 NA
#> 4 1880 F Elizabeth 1939 0.0199 NA
#> 5 1880 F Minnie 1746 0.0179 NA
#> 6 1880 F Margaret 1578 0.0162 NA
#> # ℹ 1,924,659 more rows
left_join(birth_df, babynames_df, by = join_by(year == year))
#> # A tibble: 1,839,952 × 6
#> year births sex name n prop
#> <dbl> <int> <chr> <chr> <int> <dbl>
#> 1 1909 2718000 F Mary 19259 0.0523
#> 2 1909 2718000 F Helen 9250 0.0251
#> 3 1909 2718000 F Margaret 7359 0.0200
#> 4 1909 2718000 F Ruth 6509 0.0177
#> 5 1909 2718000 F Dorothy 6253 0.0170
#> 6 1909 2718000 F Anna 5804 0.0158
#> # ℹ 1,839,946 more rows
11.3 Filtering Joins
Filtering joins (semi_join()
and anti_join()
) are useful if we would only like to keep the variables in one data set, but we want to filter out observations by a variable in the second data set.
Consider again the two data sets on men’s tennis matches in 2018 and in 2019.
atp_2019 <- read_csv(here("data/atp_matches_2019.csv"))
atp_2018 <- read_csv(here("data/atp_matches_2018.csv"))
atp_2019
#> # A tibble: 2,781 × 49
#> tourney_id tourney_name surface draw_size tourney_level tourney_date
#> <chr> <chr> <chr> <dbl> <chr> <dbl>
#> 1 2019-M020 Brisbane Hard 32 A 20181231
#> 2 2019-M020 Brisbane Hard 32 A 20181231
#> 3 2019-M020 Brisbane Hard 32 A 20181231
#> 4 2019-M020 Brisbane Hard 32 A 20181231
#> 5 2019-M020 Brisbane Hard 32 A 20181231
#> 6 2019-M020 Brisbane Hard 32 A 20181231
#> # ℹ 2,775 more rows
#> # ℹ 43 more variables: match_num <dbl>, winner_id <dbl>, winner_seed <chr>,
#> # winner_entry <chr>, winner_name <chr>, winner_hand <chr>,
#> # winner_ht <dbl>, winner_ioc <chr>, winner_age <dbl>, loser_id <dbl>,
#> # loser_seed <chr>, loser_entry <chr>, loser_name <chr>, loser_hand <chr>,
#> # loser_ht <dbl>, loser_ioc <chr>, loser_age <dbl>, score <chr>,
#> # best_of <dbl>, round <chr>, minutes <dbl>, w_ace <dbl>, w_df <dbl>, …
atp_2018
#> # A tibble: 2,889 × 49
#> tourney_id tourney_name surface draw_size tourney_level tourney_date
#> <chr> <chr> <chr> <dbl> <chr> <dbl>
#> 1 2018-M020 Brisbane Hard 32 A 20180101
#> 2 2018-M020 Brisbane Hard 32 A 20180101
#> 3 2018-M020 Brisbane Hard 32 A 20180101
#> 4 2018-M020 Brisbane Hard 32 A 20180101
#> 5 2018-M020 Brisbane Hard 32 A 20180101
#> 6 2018-M020 Brisbane Hard 32 A 20180101
#> # ℹ 2,883 more rows
#> # ℹ 43 more variables: match_num <dbl>, winner_id <dbl>, winner_seed <dbl>,
#> # winner_entry <chr>, winner_name <chr>, winner_hand <chr>,
#> # winner_ht <dbl>, winner_ioc <chr>, winner_age <dbl>, loser_id <dbl>,
#> # loser_seed <dbl>, loser_entry <chr>, loser_name <chr>, loser_hand <chr>,
#> # loser_ht <dbl>, loser_ioc <chr>, loser_age <dbl>, score <chr>,
#> # best_of <dbl>, round <chr>, minutes <dbl>, w_ace <dbl>, w_df <dbl>, …
11.3.1 Filtering with semi_join()
Suppose that we only want to keep matches in 2019 where the winning player had 10 or more wins in 2018. This might be useful if we want to not consider players in 2018 that only played in a couple of matches, perhaps because they got injured or perhaps because they received a special wildcard into the draw of only one event.
To accomplish this, we can first create a data set that has the names of all of the players that won 10 or more matches in 2018, using functions that we learned from dplyr
earlier in the semester:
win10 <- atp_2018 |> group_by(winner_name) |>
summarise(nwin = n()) |>
filter(nwin >= 10)
win10
#> # A tibble: 93 × 2
#> winner_name nwin
#> <chr> <int>
#> 1 Adrian Mannarino 26
#> 2 Albert Ramos 21
#> 3 Alex De Minaur 29
#> 4 Alexander Zverev 58
#> 5 Aljaz Bedene 19
#> 6 Andreas Seppi 24
#> # ℹ 87 more rows
Next, we apply semi_join()
, which takes the names of two data sets (the second is the one that contains information about how the first should be “filtered”). The third argument gives the name of the key (winner_name
) in this case.
This strategy only keeps the matches in 2019 where the winner had 10 or more match wins in 2018. It drops any matches where the loser lost against someone who did not have 10 or more match wins in 2018. So this isn’t yet perfect and would take a little more thought into which matches we actually want to keep for a particular analysis.
11.3.2 Filtering with anti_join()
Now suppose that we want to only keep the matches in 2019 where the winning player did not have any wins in 2018. We might think of these players as “emerging players” in 2019, players who are coming back from an injury, etc.. To do this, we can use anti_join()
, which only keeps the rows in the first data set that do not have a match in the second data set.
We can then examine how many wins each of these “new” (or perhaps previously injured) players had in 2019:
new_winners |> group_by(winner_name) |>
summarise(nwin = n()) |>
arrange(desc(nwin))
#> # A tibble: 59 × 2
#> winner_name nwin
#> <chr> <int>
#> 1 Christian Garin 32
#> 2 Juan Ignacio Londero 22
#> 3 Miomir Kecmanovic 22
#> 4 Hugo Dellien 12
#> 5 Attila Balazs 7
#> 6 Cedrik Marcel Stebe 7
#> # ℹ 53 more rows
The filtering join functions are useful if you want to filter out observations by some criterion in a different data set.
11.4 Practice
11.4.1 Class Exercises
Class Exercise 0. Examine the following data sets (the first is df1
and the second is df2
) and then, without running any code, answer the following questions.
id | xvar |
---|---|
A | 1 |
B | 2 |
C | 3 |
E | 1 |
F | 2 |
id | yvar |
---|---|
A | 2 |
C | 1 |
D | 2 |
E | 1 |
G | 1 |
H | 4 |
How many rows would be in the data set from
left_join(df1, df2, by = join_by(id == id))
?How many rows would be in the data set from
left_join(df2, df1, by = join_by(id == id))
?How many rows would be in the data set from
full_join(df1, df2, by = join_by(id == id))
?How many rows would be in the data set from
inner_join(df1, df2, by = join_by(id == id))
?How many rows would be in the data set from
semi_join(df1, df2, by = join_by(id == id))
?How many rows would be in the data set from
anti_join(df1, df2, by = join_by(id == id))
?
Class Exercise 1. Why might bind_cols()
be a dangerous way to combine data? What must you be sure of about the way the data was collected in order to combine data in this way?
Class Exercise 2. The key variable is very important for joining and is not always available in a “perfect” form. Recall the college majors data sets we have, called slumajors_df
, which information on majors at SLU. Another data set, collegemajors_df
, has different statistics on college majors nationwide. There’s lots of interesting variables in these data sets, but we’ll focus on the Major
variable here. Read in and examine the two data sets with:
slumajors_df <- read_csv(here("data/SLU_Majors_17_21.csv"))
collegemajors_df <- read_csv(here("data/college-majors.csv"))
head(slumajors_df)
#> # A tibble: 6 × 3
#> Major nfemales nmales
#> <chr> <dbl> <dbl>
#> 1 Anthropology 35 13
#> 2 Art & Art History 62 11
#> 3 Biochemistry 15 6
#> 4 Biology 152 58
#> 5 Business in the Liberal Arts 192 301
#> 6 Chemistry 28 20
head(collegemajors_df)
#> # A tibble: 6 × 12
#> Major Total Men Women Major_category Employed Full_time Part_time
#> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 PETROLEUM EN… 2339 2057 282 Engineering 1976 1849 270
#> 2 MINING AND M… 756 679 77 Engineering 640 556 170
#> 3 METALLURGICA… 856 725 131 Engineering 648 558 133
#> 4 NAVAL ARCHIT… 1258 1123 135 Engineering 758 1069 150
#> 5 CHEMICAL ENG… 32260 21239 11021 Engineering 25694 23170 5180
#> 6 NUCLEAR ENGI… 2573 2200 373 Engineering 1857 2038 264
#> # ℹ 4 more variables: Unemployed <dbl>, Median <dbl>, P25th <dbl>,
#> # P75th <dbl>
Attempt to join the two data sets together with a key.
Why did the
collegemajors_df
give onlyNA
values when we tried to join by major?
This example underscores the importance of having a key that matches exactly. Some, but not all, of the issues involved in joining these two data sets can be solved with functions in the stringr
package (discussed in a few weeks). But, the underlying problem remains and, if we wanted to combine the two data sets fully, we would need to recode some of the major names by hand.
Class Exercise 3. Read in the gun violence data set, and suppose that you want to add a row to this data set that has the statistics on gun ownership and mortality rate in the District of Columbia (Washington D.C., which is in the NE region, has 16.7 deaths per 100,000 people, and a gun ownership rate of 8.7%). To do so, create a tibble()
that has a single row representing D.C. and then combine your new tibble
with the overall gun violence data set. Name this new data set all_df
.
Class Exercise 4. Examine the following data sets that are in R
’s base library on demographic statistics about the U.S. states and state abbreviations:
df1 <- as_tibble(state.x77)
df2 <- as_tibble(state.abb)
df1
#> # A tibble: 50 × 8
#> Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3615 3624 2.1 69.0 15.1 41.3 20 50708
#> 2 365 6315 1.5 69.3 11.3 66.7 152 566432
#> 3 2212 4530 1.8 70.6 7.8 58.1 15 113417
#> 4 2110 3378 1.9 70.7 10.1 39.9 65 51945
#> 5 21198 5114 1.1 71.7 10.3 62.6 20 156361
#> 6 2541 4884 0.7 72.1 6.8 63.9 166 103766
#> # ℹ 44 more rows
df2
#> # A tibble: 50 × 1
#> value
#> <chr>
#> 1 AL
#> 2 AK
#> 3 AZ
#> 4 AR
#> 5 CA
#> 6 CO
#> # ℹ 44 more rows
Combine the two data sets with bind_cols()
and name the new data set states_df
. What are you assuming about the data sets in order to use this function?
Class Exercise 5. Use a join function to combine the mortality data set with D.C. (which we named all_df
) with the states data set from the previous exercise (states_df
). For this exercise, keep the row with Washington D.C., having it take on NA
values for any variable not observed in the states data.
Class Exercise 6. Repeat the previous exercise, but now drop Washington D.C. in your merging process. Practice doing this with a join function (as opposed to slice()
-ing it out explicitly).
Class Exercise 7. Use semi_join()
to create a subset of states_df
that are in the NE
region. Hint: You will need to filter all_df
from Exercise 1 first to contain only states in the NE
region.
Class Exercise 8. Do the same thing as in the previous exercise (that is, create a subset of states_df
that are in the NE
region), but this time, use anti_join()
.
11.4.2 Your Turn
Your Turn 1. In your group, evaluate whether each of the following statements are true or false:
An
inner_join()
will always result in a data set with the same or fewer rows than afull_join()
.An
inner_join()
will always result in a data set with the same or fewer rows than aleft_join()
.A
left_join()
will always result in a data set with the same number of rows as asemi_join()
on the same two data sets.
Your Turn 2. Examine the following data sets (the first is df3
and the second is df4
) and then, without running any code, answer the following questions in your group. Keep in mind that there are some duplicate keys!
id | xvar |
---|---|
A | 1 |
A | 2 |
C | 3 |
C | 1 |
F | 2 |
F | 6 |
id | yvar |
---|---|
A | 2 |
B | 1 |
C | 2 |
D | 1 |
D | 1 |
D | 4 |
How many rows would be in the data set from
left_join(df3, df4, by = join_by(id == id))
?How many rows would be in the data set from
left_join(df4, df3, by = join_by(id == id))
?How many rows would be in the data set from
full_join(df3, df4, by = join_by(id == id))
?How many rows would be in the data set from
inner_join(df3, df4, by = join_by(id == id))
?How many rows would be in the data set from
semi_join(df3, df4, by = join_by(id == id))
?How many rows would be in the data set from
anti_join(df3, df4, by = join_by(id == id))
?
Your Turn 3. Examine again the gun violence data set, and explain why each attempt at combining the D.C. data with the overall data doesn’t work or is incorrect.
test1 <- tibble(state = "Washington D.C.", mortality_rate = 16.7,
ownership_rate = 8.7, region = "NE")
bind_rows(mortality_df, test1)
test2 <- tibble(state = "Washington D.C.", mortality_rate = 16.7,
ownership_rate = 0.087, region = NE)
#> Error: object 'NE' not found
bind_rows(mortality_df, test2)
#> Error in eval(expr, envir, enclos): object 'test2' not found
test3 <- tibble(state = "Washington D.C.", mortality_rate = "16.7",
ownership_rate = "0.087", region = "NE")
bind_rows(mortality_df, test3)
#> Error in `bind_rows()`:
#> ! Can't combine `..1$mortality_rate` <double> and `..2$mortality_rate` <character>.
Your Turn 4. We will do a short activity with joining using slips of paper. After we complete the activity, consider the following. Suppose that you have a data frame called courses_df
that contains the course schedule for the fall semester for 100 different professors at SLU, each with 3 different courses (so that there are 300 total rows in the data set: 3 rows per professor). Of the 100 professors, 5 are in the Math-CS-Stat department.
You have another data frame called info_df
that contains information about the 15 Math-CS-Stat professors at SLU (with just one row per professor). Again, 5 of these professors also appear in courses_df
while the other 10 do not.
How many rows would result in the following joins, assuming that you use professor as the key in the join_by()
function?
left_join(courses_df, info_df)
semi_join(info_df, courses_df)
anti_join(courses_df, info_df)
inner_join(courses_df, info_df)
full_join(courses_df, info_df)