10 Data Import

Goals:

Use readr to read in data to R from .csv, .txt, and .tsv files.
Use rvest to scrape data from public websites.
Use jsonlite to read in data in JSON (JavaScript Object Notation) files.

10.1 readr to Read in Data

Up to now, we have mostly worked with data that was “R Ready,” meaning that it was in a nice .csv file that could be read into R easily with read_csv() from the readr package. We will begin by looking at some options in the read_csv() function and then move into formats other than .csv that data are commonly stored in.

10.1.1 read_csv() Options

The mtcarsex.csv data set has observations on different car models with variables that include things like gas mileage, number of cylinders, etc. Read in the mtcarsex.csv data set with the following code. Then, examine the data set with head().

library(tidyverse)
library(here)
cars_df <- read_csv(here("data/mtcarsex.csv"))
head(cars_df)
#> # A tibble: 6 × 11
#> This is a data set about …¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 "I'm a na\x95ve data input… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 "mpg" cyl disp hp drat wt qsec vs am
#> 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 "-999" 6 160 110 3.9 2.62 16.46 0 1
#> 6 "21" 6 160 110 3.9 2.875 17.02 0 1
#> # ℹ abbreviated name: ¹`This is a data set about cars.`
#> # ℹ 2 more variables: ...10 <chr>, ...11 <chr>

What do you notice about the data set that seems odd? Open the .csv file with Excel or some other program to examine the data set outside of R.
Type ?read_csv in the bottom-left window and look at some of the options in read_csv(). In particular, we will use the na and the skip arguments to fix up our reading.

Let’s start with skip so that we aren’t reading in the first two rows of the data set:
cars_df <- read_csv(here("data/mtcarsex.csv"), skip = 2)
## first two lines will be skipped
cars_df
#> # A tibble: 34 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 NA NA NA NA NA NA NA NA NA NA NA
#> 2 NA NA NA NA NA NA NA NA NA NA NA
#> 3 -999 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 4 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 5 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 6 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> # ℹ 28 more rows
That looks better, but there are still a couple of problems. What do you notice?

Go to the help and read about the na argument. Let’s add that as an option to fix the missing value issue.
cars_df <- read_csv(here("data/mtcarsex.csv"), na = c("NA", "-999"), skip = 2)
head(cars_df)
#> # A tibble: 6 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 NA NA NA NA NA NA NA NA NA NA NA
#> 2 NA NA NA NA NA NA NA NA NA NA NA
#> 3 NA 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 4 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 5 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 6 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
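As a quick self-contained illustration of how the na argument works (this toy chunk uses an inline string rather than a file; wrapping the string in I() tells read_csv() to treat it as literal data instead of a file path):

```r
library(readr)
## any cell matching a string in `na` is converted to NA on read
demo <- read_csv(I("mpg,cyl\n-999,6\n21,6"),
                 na = c("NA", "-999"), show_col_types = FALSE)
demo
## the -999 in the first row of mpg becomes NA, and, because the
## remaining values are all numeric, mpg is parsed as a double
```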
Now look at the classes of each variable. Which classes look like they are incorrect?

We’ve talked about how to re-specify classes of variables using mutate() and the as.factor(), as.Date(), or as.numeric() functions, but sometimes it’s easier just to respecify the class when we are reading in the data. Notice how, when we use read_csv(), R gives us a message about each of the column types. This is actually an argument in read_csv() called col_types. We can add a |> spec() piping statement after a read_csv() statement to tell R to print the col_types so that it’s easy for us to copy and paste it into read_csv() and change any classes.
read_csv(here("data/mtcarsex.csv"), na = c("NA", "-999"), skip = 2) |>
spec()
#> cols(
#> mpg = col_double(),
#> cyl = col_double(),
#> disp = col_double(),
#> hp = col_double(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_double(),
#> am = col_double(),
#> gear = col_double(),
#> carb = col_double()
#> )
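spec() works on the result of any read_csv() call, so you can also see it in action on a tiny inline example (the two-column data set below is made up for illustration; I() marks the string as literal data):

```r
library(readr)
## spec() returns the column specification that read_csv() guessed,
## printed in a form you can paste into the col_types argument
small_spec <- read_csv(I("a,b\n1,x"), show_col_types = FALSE) |>
  spec()
small_spec
## a is guessed as a double and b as a character
```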
For example, notice how cyl = col_double() is changed to cyl = col_factor() (and, similarly, vs = col_double() to vs = col_factor()) in the code chunk below:
cars_df <- read_csv(here("data/mtcarsex.csv"), na = c("NA", "-999"), skip = 2,
col_types = cols(
mpg = col_double(),
cyl = col_factor(),
disp = col_double(),
hp = col_double(),
drat = col_double(),
wt = col_double(),
qsec = col_double(),
vs = col_factor(),
am = col_double(),
gear = col_double(),
carb = col_double()
))
Finally, there are two rows with all missing values. These aren’t providing anything useful, so we can slice() them out:
cars_df <- read_csv(here("data/mtcarsex.csv"), na = c("NA", "-999"), skip = 2,
col_types = cols(
mpg = col_double(),
cyl = col_factor(),
disp = col_double(),
hp = col_double(),
drat = col_double(),
wt = col_double(),
qsec = col_double(),
vs = col_factor(),
am = col_double(),
gear = col_double(),
carb = col_double()
)) |>
slice(-(1:2))
head(cars_df)
#> # A tibble: 6 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
#> 1 NA 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 NA 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
There are many other possible file formats for data storage. For example, there is a data set called oscars.tsv, which is a tab-separated file. You can read it in with read_tsv() instead of read_csv().
oscars_df <- read_tsv(here("data/oscars.tsv"))
head(oscars_df)
#> # A tibble: 6 × 51
#> FilmName OscarYear Duration Rating DirectorName DirectorGender OscarWinner
#> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 Crash 2006 113 4 Haggis 0 1
#> 2 Brokebac… 2006 134 4 Lee 0 0
#> 3 Capote 2006 114 4 Miller 0 0
#> 4 Good Nig… 2006 93 2 Clooney 0 0
#> 5 Munich 2006 164 4 Spielberg 0 0
#> 6 The Depa… 2007 151 4 Scorsese 0 1
#> # ℹ 44 more variables: GenreName <chr>, Genre_Drama <dbl>, Genre_Bio <dbl>,
#> # CountryName <chr>, ForeignandUSA <dbl>, ProductionName <chr>,
#> # ProductionCompany <dbl>, BudgetRevised <chr>, Budget <chr>,
#> # DomesticBoxOffice <dbl>, WorldwideRevised <dbl>,
#> # WorldwideBoxOffice <dbl>, DomesticPercent <dbl>,
#> # LimitedOpeningWnd <dbl>, LimitedTheaters <dbl>,
#> # LimitedAveragePThtr <dbl>, WideOpeningWkd <dbl>, WideTheaters <dbl>, …
We’ll be able to work with .txt files and Excel files in the Exercises. Check out https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-import.pdf for a data import cheatsheet.
The final issue that we will discuss in this section occurs when a data set has units within its cells. Consider the earlier example that we used in the reprex section:
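The chunk that creates test_df is not shown in this section; based on the output printed below, it can be reconstructed as something like:

```r
library(tibble)
## a small data set where the x column mixes numbers with
## units ("dollars") and grouping commas
test_df <- tibble(x = c("20,000 dollars", "40 dollars", "only 13 dollars"),
                  y = c(1, 2, 3))
test_df
```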
The parse_number() function is really useful if we just want the number (no commas, no units, etc.). The function is often paired with mutate(), since we are creating a new variable:

test_df |> mutate(x2 = parse_number(x))
#> # A tibble: 3 × 3
#> x y x2
#> <chr> <dbl> <dbl>
#> 1 20,000 dollars 1 20000
#> 2 40 dollars 2 40
#> 3 only 13 dollars 3 13
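parse_number() can also be applied directly to a character vector, which is handy for quickly checking what it will extract (the "$4.50" value below is a made-up extra example):

```r
library(readr)
## parse_number() drops non-numeric text before and after the first
## number and ignores grouping commas
nums <- parse_number(c("20,000 dollars", "only 13 dollars", "$4.50"))
nums
## returns 20000, 13, and 4.5
```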
Exercise 1. Recall the fitness data set. Use the col_types argument in read_csv() so that stepgoal is read in as a logical variable with col_logical() and so that month and weekday are both read in as factors.
10.2 Data Scraping with rvest

Sometimes, we might want data from a public website that isn’t provided in a file format. To obtain this data, we’ll need to use web scraping, a term which just means “getting data from a website.” The easiest way to do this in R is with the rvest package. Note that we could spend an entire semester talking about web scraping, but we will focus only on websites where the scraping of data is “easy” and won’t give us any major errors.

Go to the following website and suppose that you wanted the table of gun violence statistics in R: https://en.wikipedia.org/wiki/Gun_violence_in_the_United_States_by_state. We could try copy-pasting the table into Excel and reading the data set in with read_excel(). Depending on the format of the table, that strategy may or may not work. Another way is to scrape it directly with rvest. Additionally, if the website continually updates (standings for a sports league, enrollment data for a school, best-selling products for a company, etc.), then scraping is much more convenient, as we don’t need to continually copy-paste for updated data.
In the following code chunk, read_html() reads in the entire html file from the provided url, while html_nodes() extracts only the tables on the website. (In newer versions of rvest, html_nodes() has been superseded by html_elements(), but both work.)
library(tidyverse)
library(rvest)
## provide the URL and name it something (in this case, url).
url <- "https://en.wikipedia.org/wiki/Gun_violence_in_the_United_States_by_state"
## read_html() converts the html code from the URL into something R can read
tab <- read_html(url) |>
html_nodes("table") ## html_nodes can grab only the tables
We see that, for this example, there are 3 tables provided. The tables are stored in a list, and we can reference the first table using [[1]], the second table using [[2]], etc.

For the purposes of this class, we will figure out which of the 3 tables is the one we actually want using trial and error.

The html_table() function converts the table into a data.frame object.
test1 <- tab[[1]] |> html_table()
test1
test2 <- tab[[2]] |> html_table()
test2
test3 <- tab[[3]] |> html_table()
test3
Which of the 3 tables is the one that we would want to use for an analysis on gun violence in the United States?
As another example, consider scraping data from SLU’s athletics page. In particular, suppose we want to do an analysis on SLU’s baseball team.
Go to the following website to look at the table of data that we want to scrape: https://saintsathletics.com/sports/baseball/stats/2023.
After looking at the website, use the following code to scrape the data set.
url <- "https://saintsathletics.com/sports/baseball/stats/2023"
tab <- read_html(url) |> html_nodes("table")
tab
obj <- tab[[1]] |> html_table()
obj
obj2 <- tab[[2]] |> html_table()
obj2
There are now 72 different tables! See if you can figure out where the first few tables are coming from on the website.
Exercise 2. SLU keeps track of diversity of students and makes this data public on the following website: https://www.stlawu.edu/offices/institutional-research/student-diversity-2021. Use rvest to scrape one of the data tables into R.
10.3 JSON Files with jsonlite

A final common data format that we will discuss is JSON (JavaScript Object Notation). We will only cover the very basics of JSON data and use the jsonlite package in R to read in some .json files. JSON files are read in to R as a list object.
10.3.1 Everything Working Well

First, consider data from the mobile game Clash Royale. Install the jsonlite package and then use it to read in the .json file with the function fromJSON():
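The chunk that reads the actual Clash Royale file is not shown in this section. As a stand-in, the sketch below writes a tiny .json file with the same top-level structure (a "cards" table) to a temporary directory and reads it with fromJSON(); with the real data, you would point fromJSON() at the course’s Clash Royale .json file instead:

```r
library(jsonlite)
## write a tiny stand-in file mimicking the real file's structure
mini_path <- file.path(tempdir(), "cr_mini.json")
write('{"cards":[{"key":"knight","elixir":3},{"key":"archers","elixir":3}]}',
      mini_path)
cr_cards <- fromJSON(mini_path)
## fromJSON() returns a list; by default, the "cards" element
## (an array of objects) is simplified to a data.frame
class(cr_cards)
cr_cards[["cards"]]
```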
Next, type View(cr_cards) in your console (bottom-left window) to look at the data. See if you can pull out the data set by clicking on some things in the View() window.

The following gives a couple of ways to grab the data using code. The as_tibble() function converts a rectangular object into our familiar tibble.

The first option specifies the name of the table that’s in the JSON file (in this case, the name is "cards"):
library(tidyverse)
cr_cards_flat <- cr_cards[["cards"]]
cr_cards_df <- as_tibble(cr_cards_flat)
cr_cards_df
#> # A tibble: 93 × 8
#> key name elixir type rarity arena description id
#> <chr> <chr> <int> <chr> <chr> <int> <chr> <int>
#> 1 knight Knight 3 Troop Common 0 A tough melee fighter. … 2.6 e7
#> 2 archers Archers 3 Troop Common 0 A pair of lightly armor… 2.60e7
#> 3 goblins Goblins 2 Troop Common 1 Three fast, unarmored m… 2.60e7
#> 4 giant Giant 5 Troop Rare 0 Slow but durable, only … 2.60e7
#> 5 pekka P.E.K.K.A 7 Troop Epic 4 A heavily armored, slow… 2.60e7
#> 6 minions Minions 3 Troop Common 0 Three fast, unarmored f… 2.60e7
#> # ℹ 87 more rows
The second method uses the flatten() function from the purrr package, the only package in the core tidyverse that we do not talk about in detail in this class. There is also a different flatten() function in the jsonlite package. In the code below, we specify that we want to use flatten() from purrr with purrr::flatten(). If we wanted to use flatten() from jsonlite, we’d use jsonlite::flatten().
cr_cards_flat2 <- purrr::flatten(cr_cards)
cr_cards_df2 <- as_tibble(cr_cards_flat2)
cr_cards_df2
#> # A tibble: 93 × 8
#> key name elixir type rarity arena description id
#> <chr> <chr> <int> <chr> <chr> <int> <chr> <int>
#> 1 knight Knight 3 Troop Common 0 A tough melee fighter. … 2.6 e7
#> 2 archers Archers 3 Troop Common 0 A pair of lightly armor… 2.60e7
#> 3 goblins Goblins 2 Troop Common 1 Three fast, unarmored m… 2.60e7
#> 4 giant Giant 5 Troop Rare 0 Slow but durable, only … 2.60e7
#> 5 pekka P.E.K.K.A 7 Troop Epic 4 A heavily armored, slow… 2.60e7
#> 6 minions Minions 3 Troop Common 0 Three fast, unarmored f… 2.60e7
#> # ℹ 87 more rows
Both methods give a tibble that we can then use our usual tidyverse tools (ggplot2, dplyr, tidyr, etc.) on.
10.3.2 Things Aren’t Always So Easy

Now let’s try to look at some Animal Crossing data that were obtained from https://github.com/jefflomacy/villagerdb. We first just want to look at the data from one individual villager (ace) in the file ace.json.
acedata <- fromJSON(here("data/ace.json"))
aceflat <- purrr::flatten(acedata)
aceflat
#> $gender
#> [1] "male"
#>
#> $species
#> [1] "bird"
#>
#> $birthday
#> [1] "3-13"
#>
#> $ac
#> $ac$personality
#> [1] "jock"
#>
#> $ac$clothes
#> [1] "spade-shirt"
#>
#> $ac$song
#> [1] "K.K. Parade"
#>
#> $ac$phrase
#> [1] "ace"
#>
#>
#> $`afe+`
#> $`afe+`$personality
#> [1] "jock"
#>
#> $`afe+`$clothes
#> [1] "spade-shirt"
#>
#> $`afe+`$song
#> [1] "K.K. Parade"
#>
#>
#> $name
#> [1] "Ace"
#>
#> $id
#> [1] "ace"
Things are now... more complicated. This example is just to show that it’s not always easy to work with JSON data. Lists can be nested, and that creates problems when trying to convert a deeply nested list into our “rectangular” format that’s easy to work with.
There’s also the added problem of reading in the .json files from all villagers at the same time. We could do this with a for loop or a mapping function from purrr to download and read in the JSON files for all villagers. We won’t delve any more deeply into this, but there’s a lot more to all of the file formats that we’ve discussed this week, particularly web scraping and .json files.
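A sketch of the mapping approach, using two tiny stand-in villager files written to a temporary directory (with the real data, the paths would instead come from something like list.files() on the folder of villager .json files):

```r
library(jsonlite)
library(purrr)
## create two small stand-in villager files
paths <- file.path(tempdir(), c("ace.json", "admiral.json"))
write('{"name": "Ace", "species": "bird"}', paths[1])
write('{"name": "Admiral", "species": "bird"}', paths[2])
## map() applies fromJSON() to each path, returning a list
## with one element per villager
villagers <- map(paths, fromJSON)
length(villagers)
villagers[[1]]$name
```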
10.4 Practice
10.4.1 Class Exercises
Class Exercise 1. The birthdays.txt file has information on the birthdays of various animals on my Animal Crossing island. There are also columns for the animal’s name, animal type, and how long the animal has lived on the island (in weeks). Click on the file to open it and look at the format of the data.

Start with the following code chunk and use the options of read_delim() to read in the data (?read_delim). The delim argument that’s already provided specifies that the delimiter (separator) is a -, as opposed to, for example, a , in a .csv file. Arguments that you may need to change include

skip
col_names
na
trim_ws
col_types
library(tidyverse)
df <- read_delim(here("data/birthdays.txt"), delim = " - ")
head(df)
Class Exercise 2. Another common format for data to be stored in is an Excel file. Often, it’s easiest just to save the Excel file as a .csv file and read it in using read_csv(). But sometimes this route can be difficult (for example, if your Excel file has thousands of sheets). To read in data directly from Excel, you’ll need to install the readxl package with install.packages("readxl"). Once installed, load the package with library(readxl), and read in the first sheet of the evals_prof.xlsx data set, a similar data set as the one that will be used for Project 2, with the read_excel() function.
Class Exercise 3. Now, read in the second sheet in the Excel file, using the help file for ?read_excel to change one of the arguments.
Class Exercise 4. A common issue when scraping tables with rvest is that there are tables that have two rows of headers, often with duplicate names in one or both of the rows. Examine this issue at https://en.wikipedia.org/wiki/Josh_Allen_(quarterback), scrolling down to the NFL career statistics subsection.

Try to scrape Josh Allen’s stats using the “usual” data scraping method with rvest:
url1 <- "https://en.wikipedia.org/wiki/Josh_Allen_(quarterback)"
tab_allen_stats <- read_html(url1) |> html_nodes("table")
allen_df <- tab_allen_stats[[6]] |> html_table()
allen_df
Try to perform any operation on allen_df (a filter(), mutate(), arrange(), or anything else). What error do you get?

To fix this issue, we can use the header = FALSE argument to html_table() and then “manually” construct the column names:
allen_stats <- tab_allen_stats[[6]] |> html_table(header = FALSE)
allen_stats
newnames <- paste(allen_stats[1, ], allen_stats[2, ])
allen_stats |> set_names(newnames) |>
slice(-1, -2)
Explain what the code above is doing to fix the duplicate name issue.
10.4.2 Your Turn
Your Turn 1. Choose a topic/person/place/etc. that interests you that has tables on Wikipedia and scrape the table that is related to that topic.
Your Turn 2. Choose a sports team at SLU, and go to that team’s website (by simply googling SLU name_of_sport). Scrape the data tables from the “Results” or “Statistics” section for this sport. After you scrape the data, tidy the data set. Then, choose one of the following options (different options might make more or less sense for different sports):
Summarise different team statistics, either numerically or graphically. Perhaps make some graphs showing different statistics through time.
Summarise different individual statistics, either numerically or graphically.
A few sports (men’s and women’s golf, for example) give results in PDF format. PDF is generally a horrible way to record and share data, as it’s very difficult to read into almost any program. Therefore, avoid sports with PDF results for the purposes of this exercise.
Your Turn 3. For either your topic of choice data set or your sports team data set, ask and answer any other questions that make sense for the particular topic or sport that you are looking at!