16 Introduction to SQL with dbplyr
Goals:
- explain what a database is, how it is different from a data set, and why you might use a database.
- use the
dbplyr
to translateR
code withdplyr
toSQL
queries on database tables. - draw parallels between
dplyr
functions and the syntax used inSQL
.
All of the dplyr
functions we’ve used (both the ones from early in the semester and from the xxxx_join()
family more recently) have corresponding components in SQL
. SQL
stands for Structured Query Language and is a very common language used with databases.
Compared to dplyr
, in general, SQL
code is much harder to read, as SQL isn’t designed specifically for data analysis like dplyr
is.
In this section, we will introduce databases and give a brief introduction to SQL
for analyzing data from a database.
16.1 What is a Database
The R for Data Science textbook defines a database as “a collection of data frames,” each called a database table. There a few key differences between a data frame (what we’ve been using the entire semester) and a database table. They are summarised from R for Data Science here as:
- a database table can be larger and is stored on disk while a data frame is stored in memory so their size is more limited.
- many, but not all, data base tables are “row-oriented” while tidy data frames are “column-oriented.” However, more often, databases are column-oriented (similar to all of the data frames we have worked with in this class).
Databases are run through Database Management Systems. The R for Data Science textbook divides Database Management Systems into 3 types:
- client-server like PostgreSQL and SQL Server
- Cloud-based like Amazon’s Redshift
- In-process like SQLite
We won’t really discuss these any further, but an advanced course in database systems through the CS department would give more information about Database Management Systems (and databases in general).
How to connect to a database from R
depends on the type of database management system. For our purposes, because how to connect to a Database management system depends so heavily on the type, we will focus on using a local (not remote) data base that has been put into your /data
folder called imdb_tv.sqlite
.
To start, we connect to the database using the dbConnect()
function from the DBI
package. Note that this connection process would be different if your database was on a remote server or was cloud-based.
This section on connecting to a database management systems may be confusing, particularly if you do not have a computer science background. But don’t let that derail your learning for this rest of this chapter, which will consist of primarily of R
code from here on! The take-home message is that we need a way to connect to the system within R
. It’s challenging to give specific directions because the connection depends on the type of system.
What’s in our newly connected database, which we have named imdb_db
? We can find out with:
dbListTables(conn = imdb_db)
#> [1] "episode_rating" "show_info" "show_ratings"
The imdb_db
database consists of three table. We can think of a table as an individual data frame. Each table has its own set of variables that we can check with:
dbListFields(conn = imdb_db, name = "episode_rating")
#> [1] "tconst" "parentTconst" "seasonNumber" "episodeNumber"
#> [5] "averageRating" "numVotes" "Episode_Name"
dbListFields(conn = imdb_db, name = "show_info")
#> [1] "parentTconst" "Show_Name" "startYear"
dbListFields(conn = imdb_db, name = "show_ratings")
#> [1] "parentTconst" "averageRating" "numVotes"
The episode_rating
table (data frame) has 7 variables (or, fields) in it, the show_info
table has three variables, and the show_ratings
table has three variables. Moreover, these three tables all share one common variable called parentTconst
.
To use raw SQL code and query the database that we just created, we can create a string of SQL code, name it sql
, and pass it to the dbGetQuery()
function. We also load in the tidyverse
package here to use the as_tibble()
function to convert the data.frame
to a tibble
.
library(tidyverse)
sql <- "
SELECT parentTconst, averageRating, numVotes
FROM show_ratings
WHERE numVotes > 100000
"
dbGetQuery(imdb_db, sql) |>
as_tibble()
#> # A tibble: 25 × 3
#> parentTconst averageRating numVotes
#> <chr> <dbl> <dbl>
#> 1 tt0475784 8.5 513884
#> 2 tt8740790 7.4 144503
#> 3 tt9140554 8.2 335306
#> 4 tt7221388 8.5 192307
#> 5 tt5753856 8.7 405433
#> 6 tt4052886 8.1 333969
#> # ℹ 19 more rows
Exercise 1. Though we do not know SQL code, we can probably figure out what the code (the string that is being assigned to sql
) above is doing. Which matches are being returned from our query?
Exercise 2. What is the dplyr
equivalent function to WHERE
in the SQL code above? What is the dplyr
equivalent function to SELECT
in the SQL code above?
16.2 dbplyr
: A Database Version of dplyr
dbplyr
is a package that will allow us to continue to write dplyr
-style code to query databases instead of writing native SQL
, as in the code-chunk above.
We begin by loading in the package and creating a database table object with the tbl()
function. In this case, we create a database table with the episode_rating
data and name it episode_rating_tbl
:
library(dbplyr)
episode_rating_tbl <- tbl(imdb_db, "episode_rating")
episode_rating_tbl
#> # Source: table<`episode_rating`> [?? x 7]
#> # Database: sqlite 3.50.3 [/Users/highamm/git_repos/courses/ds234_quarto/data/imdb_tv.sqlite]
#> tconst parentTconst seasonNumber episodeNumber averageRating numVotes
#> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 tt10011150 tt0475784 3 3 8.1 8770
#> 2 tt10011152 tt0475784 3 4 8.7 10071
#> 3 tt10011154 tt0475784 3 5 7.9 9511
#> 4 tt10011156 tt0475784 3 6 8.2 8323
#> 5 tt10011158 tt0475784 3 7 7.8 7853
#> 6 tt10011160 tt0475784 3 8 7.2 9652
#> # ℹ more rows
#> # ℹ 1 more variable: Episode_Name <chr>
Examine the print for episode_rating_tbl
, which should look similar to the print for a tibble
or data.frame
. Let’s use some dplyr
code to obtain only the ratings that are higher than a 9.0
and keep only a few of the columns. We will name the result episode_query1
:
episode_query1 <- episode_rating_tbl |>
filter(averageRating > 9.0) |>
select(parentTconst, Episode_Name, numVotes, averageRating)
episode_query1
#> # Source: SQL [?? x 4]
#> # Database: sqlite 3.50.3 [/Users/highamm/git_repos/courses/ds234_quarto/data/imdb_tv.sqlite]
#> parentTconst Episode_Name numVotes averageRating
#> <chr> <chr> <dbl> <dbl>
#> 1 tt7221388 December 19 6245 9.4
#> 2 tt5753856 Die Reisenden 13000 9.4
#> 3 tt5753856 Vom Suchen und Finden 11905 9.2
#> 4 tt5753856 Ein unendlicher kreis 14775 9.5
#> 5 tt5753856 Der weiße Teufel 11809 9.1
#> 6 tt5753856 Enden und Anfänge 14651 9.4
#> # ℹ more rows
We should note that the result is still a database object: it’s not our “usual” tibble
. One major difference between the database object and the usual tibble
is that our episode_query1
does not tell us how many rows are in the data (see the ??
and the specification with more rows
). The code that we wrote is not actually looking in the entire data set for ratings that are higher than 9.0: it is saving time by only performing our query on part of the database table. This is very useful behaviour for database tables that are very, very large, where code might take a long time to run.
If we want to obtain the result of our query as a tibble
, we can use the collect()
function:
episode_query1 |>
collect()
#> # A tibble: 173 × 4
#> parentTconst Episode_Name numVotes averageRating
#> <chr> <chr> <dbl> <dbl>
#> 1 tt7221388 December 19 6245 9.4
#> 2 tt5753856 Die Reisenden 13000 9.4
#> 3 tt5753856 Vom Suchen und Finden 11905 9.2
#> 4 tt5753856 Ein unendlicher kreis 14775 9.5
#> 5 tt5753856 Der weiße Teufel 11809 9.1
#> 6 tt5753856 Enden und Anfänge 14651 9.4
#> # ℹ 167 more rows
The result is a tibble
that we can now use any R
functions on (not just functions from dplyr
and a few other packages).
The show_query()
function can be used on our episode_query1
to give the SQL code that was executed:
episode_query1 |>
show_query()
#> <SQL>
#> SELECT `parentTconst`, `Episode_Name`, `numVotes`, `averageRating`
#> FROM `episode_rating`
#> WHERE (`averageRating` > 9.0)
We’ll do one more query. Note that the parentTconst
variable is a show identification variable.
mean_query1 <- episode_rating_tbl |>
group_by(parentTconst) |>
summarise(mean_rating = mean(averageRating)) |>
arrange(desc(mean_rating))
mean_query1
#> # Source: SQL [?? x 2]
#> # Database: sqlite 3.50.3 [/Users/highamm/git_repos/courses/ds234_quarto/data/imdb_tv.sqlite]
#> # Ordered by: desc(mean_rating)
#> parentTconst mean_rating
#> <chr> <dbl>
#> 1 tt27350436 10
#> 2 tt24248284 10
#> 3 tt27178946 9.8
#> 4 tt8560994 9.46
#> 5 tt14392248 9.24
#> 6 tt15202102 9.2
#> # ℹ more rows
mean_query1 |> show_query()
#> <SQL>
#> SELECT `parentTconst`, AVG(`averageRating`) AS `mean_rating`
#> FROM `episode_rating`
#> GROUP BY `parentTconst`
#> ORDER BY `mean_rating` DESC
Can you match some of the SQL
code with the corresponding dbplyr
functions used? Note that the parentTconst
variable is a show identification variable.
Exercise 3. Figure out how many episodes there are for each television show using dbplyr
functions on episode_rating_tbl
. Then, use show_query()
to show the corresponding SQL
code.
16.3 SQL
The purpose of this section is to explore SQL
syntax a little more, focusing on its connections to dplyr
.
Knowing dplyr
is quite helpful in learning this SQL
syntax because, while the syntax differs, the concepts are quite similar.
Much of the text in this section is paraphrased from the R for Data Science textbook. There are five core components of an SQL
query. The two most basic are a SELECT
statement (similar to select()
, and, as discussed below, mutate()
and summarise()
) and a FROM
statement (similar to the data
argument). Using the show_query()
function directly on episode_rating_tbl
shows an SQL
query that SELECT
s all columns (denoted by the *
), FROM
the episode_rating
database.
episode_rating_tbl |> show_query()
#> <SQL>
#> SELECT *
#> FROM `episode_rating`
The WHERE
and ORDER BY
statements control which rows are returned (similar to filter()
) and in what order those rows get returned (similar to arrange()
):
episode_rating_tbl |> filter(seasonNumber == 1) |>
arrange(desc(numVotes)) |>
show_query()
#> <SQL>
#> SELECT `episode_rating`.*
#> FROM `episode_rating`
#> WHERE (`seasonNumber` = 1.0)
#> ORDER BY `numVotes` DESC
Finally, GROUP BY
is used for aggregation (similar to the dplyr
group_by()
and summarise()
combination).
episode_rating_tbl |>
group_by(parentTconst) |>
summarise(total_votes = sum(numVotes, na.rm = TRUE)) |>
show_query()
#> <SQL>
#> SELECT `parentTconst`, SUM(`numVotes`) AS `total_votes`
#> FROM `episode_rating`
#> GROUP BY `parentTconst`
In the above code chunk, remove the na.rm = TRUE
argument and run the query. What do you learn?
The SQL
syntax must always follow the order SELECT, FROM, WHERE, GROUP BY, ORDER BY
, even though the operations can be performed in a different order than what is specified. This is one aspect that makes SQL
harder to pick up than something like dplyr
, where we specify what we want done in the order that we want.
Below we give a little more detail about the 5 operations.
SELECT
: SELECT
covers a lot of dplyr
functions. In the code below, we explore how it is used in SQL
to choose which columns get returned, rename columns, and create new variables:
-
SELECT
to choose which columns to return:
episode_rating_tbl |> select(1:4) |> show_query()
#> <SQL>
#> SELECT `tconst`, `parentTconst`, `seasonNumber`, `episodeNumber`
#> FROM `episode_rating`
-
SELECT
to rename columns:
episode_rating_tbl |> rename(show_id = parentTconst) |>
show_query()
#> <SQL>
#> SELECT
#> `tconst`,
#> `parentTconst` AS `show_id`,
#> `seasonNumber`,
#> `episodeNumber`,
#> `averageRating`,
#> `numVotes`,
#> `Episode_Name`
#> FROM `episode_rating`
-
SELECT
to create a new variable
episode_rating_tbl |> mutate(total_rating_points = averageRating * numVotes) |>
select(total_rating_points, parentTconst) |>
show_query()
#> <SQL>
#> SELECT `averageRating` * `numVotes` AS `total_rating_points`, `parentTconst`
#> FROM `episode_rating`
-
SELECT
to create a new variable that is a summary:
episode_rating_tbl |> summarise(median_rating = median(averageRating)) |>
show_query()
#> <SQL>
#> SELECT MEDIAN(`averageRating`) AS `median_rating`
#> FROM `episode_rating`
GROUP BY
: GROUP BY
covers aggregation in a similar way as dplyr
’s group_by()
function:
episode_rating_tbl |> group_by(parentTconst) |>
summarise(median_rating = median(averageRating)) |>
show_query()
#> <SQL>
#> SELECT `parentTconst`, MEDIAN(`averageRating`) AS `median_rating`
#> FROM `episode_rating`
#> GROUP BY `parentTconst`
WHERE
: WHERE
is used for filter()
, though SQL
uses different Boolean operators than R
(for example, &
becomes AND
, |
becomes or
).
episode_rating_tbl |> filter(averageRating > 8 | averageRating < 5) |>
show_query()
#> <SQL>
#> SELECT `episode_rating`.*
#> FROM `episode_rating`
#> WHERE (`averageRating` > 8.0 OR `averageRating` < 5.0)
ORDER BY
: ORDER BY
is used for arrange()
. This one is quite straightforward:
episode_rating_tbl |> arrange(desc(averageRating)) |>
show_query()
#> <SQL>
#> SELECT `episode_rating`.*
#> FROM `episode_rating`
#> ORDER BY `averageRating` DESC
Finally, you may have noticed that the episode_rating
data table does not actually contain the name of the show as a variable. But, it does have a show identification variable (parentTconst
), and, we do have another data table called show_info
that has both the id (parentTconst
) and the show name (Show_Name
). What if we wanted to use the show name information? We can combine the two data sets using joins, just as we did in dplyr
.
Let’s first grab the show_info
table and store it as an object:
show_info_tbl <- tbl(imdb_db, "show_info")
Next, we join the two tables together with dbplyr
and use the show_query()
function to obtain the corresponding SQL
code:
left_join(episode_rating_tbl, show_info_tbl,
join_by(parentTconst == parentTconst)) |>
show_query()
#> <SQL>
#> SELECT `episode_rating`.*, `Show_Name`, `startYear`
#> FROM `episode_rating`
#> LEFT JOIN `show_info`
#> ON (`episode_rating`.`parentTconst` = `show_info`.`parentTconst`)
Note that we have really just scratched the surface of SQL
. There are entire courses devoted to learning SQL
syntax and more about databases in general. If you ever do find yourself in a situation where you need to learn SQL
, either for a course or for a job, you should have a major head-start with your dplyr
knowledge!
In much of this section, we have created code with dbplyr
and seen how that code translates to SQL
. In this exercise, you will instead be given SQL
code and asked to write dbplyr
code that achieves the same thing.
Exercise 4. Examine the SQL
code below and write equivalent dbplyr
code using show_info_tbl
.
*
SELECT "show_info_tbl"
FROM WHERE ("Show_name" = 'Ozark')
16.4 Practice
16.4.1 Class Exercises
Class Exercise 1. Examine the SQL
code below and write equivalent dbplyr
code using show_info_tbl
.
"startYear", "parentTconst", "Show_Name" AS "show_name"
SELECT "show_info_tbl"
FROM "startYear" DESC ORDER BY
16.4.2 Your Turn
Your Turn 1. Examine the SQL
code below and write equivalent dbplyr
code on episode_rating_tbl
.
<SQL>
"parentTconst", MAX("averageRating") AS "best_episode"
SELECT "episode_rating_tbl"
FROM "parentTconst" GROUP BY
Your Turn 2. Try to run a function from forcats
on episode_rating_tbl
with mutate()
. Does the function work?
Your Turn 3. Run the following code and write how the !
is translated to SQL
, how the %in%
symbol is translated to SQL
, and how distinct()
is translated to SQL
.
episode_rating_tbl |> filter(seasonNumber != "1") |>
show_query()
episode_rating_tbl |>
filter(seasonNumber %in% c("1", "2")) |>
show_query()
episode_rating_tbl |> distinct(parentTconst) |>
show_query()
Your Turn 4. Examine the following SQL
code and write equivalent dbplyr
code on episode_rating_tbl
.
<SQL>
"parentTconst", COUNT(*) AS "n_episodes"
SELECT "episode_rating_tbl"
FROM WHERE ("seasonNumber" = '1')
"parentTconst"
GROUP BY "n_episodes" DESC ORDER BY
Your Turn 5. Use the joined data set (with the show names) to choose a show that interests you. Use collect()
to convert the data frame to a tibble()
. Then, construct a graph that lets you explore that shows average imdb rating through time. Finally, add a label with the episode name (use a search engine to figure out the name) of the highest rated episode for that show according to imdb.