10  Data Import

Goals:

10.1 readr to Read in Data

Up to now, we have mostly worked with data that was “R Ready”: meaning that it was in a nice .csv file that could be read into R easily with read_csv() from the readr package. We will begin by looking at some options in the read_csv() function and then move into formats other than .csv that data are commonly stored in.

10.1.1 read_csv() Options

The mtcarsex.csv has observations on different car models with variables that include things like gas mileage, number of cylinders, etc. Read in the mtcarsex.csv data set with the following code. Then, examine the data set with head().

library(tidyverse)
library(here)
cars_df <- read_csv(here("data/mtcarsex.csv"))
head(cars_df)
#> # A tibble: 6 × 11
#>   This is a data set about …¹ ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9 
#>   <chr>                       <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 "I'm a na\x95ve data input… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 2 "mpg"                       cyl   disp  hp    drat  wt    qsec  vs    am   
#> 3  <NA>                       <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 4  <NA>                       <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 5 "-999"                      6     160   110   3.9   2.62  16.46 0     1    
#> 6 "21"                        6     160   110   3.9   2.875 17.02 0     1    
#> # ℹ abbreviated name: ¹​`This is a data set about cars.`
#> # ℹ 2 more variables: ...10 <chr>, ...11 <chr>

What do you notice about the data set that seems odd? Open the .csv file with Excel or some other program to examine the data set outside of R.

Type in ?read_csv in the bottom-left window and look at some of the options in read_csv(). In particular, we will use the na and the skip arguments to fix up our reading.

Let’s start with skip so that we aren’t reading in the first two rows of the data set:

cars_df <- read_csv(here("data/mtcarsex.csv"), skip = 2)
## first two lines will be skipped
cars_df
#> # A tibble: 34 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1   NA      NA    NA    NA NA    NA     NA      NA    NA    NA    NA
#> 2   NA      NA    NA    NA NA    NA     NA      NA    NA    NA    NA
#> 3 -999       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 4   21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 5   22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 6   21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> # ℹ 28 more rows

That looks better, but there are still a couple of problems. What do you notice?

Go to the help and read about the na argument. Let’s add that as an option to fix the missing value issue.

cars_df <- read_csv(here("data/mtcarsex.csv"), na = c("NA", "-999"), skip = 2)
head(cars_df)
#> # A tibble: 6 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  NA      NA    NA    NA NA    NA     NA      NA    NA    NA    NA
#> 2  NA      NA    NA    NA NA    NA     NA      NA    NA    NA    NA
#> 3  NA       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 4  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 5  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 6  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1

Now look at the classes of each variable. Which classes look like they are incorrect?

We’ve talked about how to re-specify classes of variables using mutate() and the as.factor() or as.Date() or as.numeric() functions, but sometimes it’s easier just to respecify the class when we are reading in the data. Notice how, when we use read_csv(), R gives us a message about each of the column types. This is actually an argument in read_csv() called col_types. We can add a |> spec() piping statement after a read_csv() statement to tell R to print the col_types so that it’s easy for us to copy and paste it into read_csv() and change any classes.

read_csv(here("data/mtcarsex.csv"), na = c("NA", "-999"), skip = 2) |>
  spec()
#> cols(
#>   mpg = col_double(),
#>   cyl = col_double(),
#>   disp = col_double(),
#>   hp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_double(),
#>   am = col_double(),
#>   gear = col_double(),
#>   carb = col_double()
#> )

For example, notice how cyl = col_double() is changed to cyl = col_factor() in the code chunk below:

cars_df <- read_csv(here("data/mtcarsex.csv"), na = c(NA, "-999"), skip = 2,
  col_types = cols(
  mpg = col_double(),
  cyl = col_factor(),
  disp = col_double(),
  hp = col_double(),
  drat = col_double(),
  wt = col_double(),
  qsec = col_double(),
  vs = col_factor(),
  am = col_double(),
  gear = col_double(),
  carb = col_double()
))

Finally, there are two rows with all missing values. These aren’t providing anything useful so we can slice() them out:

cars_df <- read_csv(here("data/mtcarsex.csv"), na = c("NA", "-999"), skip = 2,
  col_types = cols(
  mpg = col_double(),
  cyl = col_factor(),
  disp = col_double(),
  hp = col_double(),
  drat = col_double(),
  wt = col_double(),
  qsec = col_double(),
  vs = col_factor(),
  am = col_double(),
  gear = col_double(),
  carb = col_double()
)) |>
  slice(-(1:2))
head(cars_df)
#> # A tibble: 6 × 11
#>     mpg cyl    disp    hp  drat    wt  qsec vs       am  gear  carb
#>   <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
#> 1  NA   6       160   110  3.9   2.62  16.5 0         1     4     4
#> 2  21   6       160   110  3.9   2.88  17.0 0         1     4     4
#> 3  22.8 4       108    93  3.85  2.32  18.6 1         1     4     1
#> 4  21.4 6       258   110  3.08  3.22  19.4 1         0     3     1
#> 5  NA   8       360   175  3.15  3.44  17.0 0         0     3     2
#> 6  18.1 6       225   105  2.76  3.46  20.2 1         0     3     1

There are many other possible file formats for data storage. For example, there is a data set called oscars.tsv, which is a tab-separated file. You can read it in with read_tsv() instead of read_csv().

oscars_df <- read_tsv(here("data/oscars.tsv"))
head(oscars_df)
#> # A tibble: 6 × 51
#>   FilmName  OscarYear Duration Rating DirectorName DirectorGender OscarWinner
#>   <chr>         <dbl>    <dbl>  <dbl> <chr>                 <dbl>       <dbl>
#> 1 Crash          2006      113      4 Haggis                    0           1
#> 2 Brokebac…      2006      134      4 Lee                       0           0
#> 3 Capote         2006      114      4 Miller                    0           0
#> 4 Good Nig…      2006       93      2 Clooney                   0           0
#> 5 Munich         2006      164      4 Spielberg                 0           0
#> 6 The Depa…      2007      151      4 Scorsese                  0           1
#> # ℹ 44 more variables: GenreName <chr>, Genre_Drama <dbl>, Genre_Bio <dbl>,
#> #   CountryName <chr>, ForeignandUSA <dbl>, ProductionName <chr>,
#> #   ProductionCompany <dbl>, BudgetRevised <chr>, Budget <chr>,
#> #   DomesticBoxOffice <dbl>, WorldwideRevised <dbl>,
#> #   WorldwideBoxOffice <dbl>, DomesticPercent <dbl>,
#> #   LimitedOpeningWnd <dbl>, LimitedTheaters <dbl>,
#> #   LimitedAveragePThtr <dbl>, WideOpeningWkd <dbl>, WideTheaters <dbl>, …
Note

We’ll be able to work with .txt files and Excel files in the Exercises. Check out https://rawgit.com/rstudio/cheatsheets/master/data-import.pdf for a data import cheatsheet.

The final issue that we will discuss in this section occurs when a data set has units within its cells. Consider the earlier example that we used in the reprex section:

test_df <- read_csv(here("data/parsedf.csv"))
head(test_df)
#> # A tibble: 3 × 2
#>   x                   y
#>   <chr>           <dbl>
#> 1 20,000 dollars      1
#> 2 40 dollars          2
#> 3 only 13 dollars     3

The parse_number() function is really useful if we just want the number (no commas, no units, etc.). The function is often paired with mutate() since we are creating a new variable:

test_df |> mutate(x2 = parse_number(x))
#> # A tibble: 3 × 3
#>   x                   y    x2
#>   <chr>           <dbl> <dbl>
#> 1 20,000 dollars      1 20000
#> 2 40 dollars          2    40
#> 3 only 13 dollars     3    13

Exercise 1. Recall the fitness data set.

fitness_df <- read_csv(here::here("data/higham_fitness_notclean.csv"))

Use the col_types argument in read_csv() so that stepgoal is read in as a logical variable with col_logical() and so that month and weekday are both read in as factors.

10.2 Data Scraping with rvest

Sometimes, we might want data from a public website that isn’t provided in a file format. To obtain this data, we’ll need to use web scraping, a term which just means “getting data from a website.” The easiest way to do this in R is with the rvest package. Note that we could spend an entire semester talking about web scraping, but we will focus only on websites where the scraping of data is “easy” and won’t give us any major errors.

Go to the following website and suppose that you wanted the table of gun violence statistics in R: https://en.wikipedia.org/wiki/Gun_violence_in_the_United_States_by_state. We could try copy-pasting the table into Excel and reading the data set in with read_excel(). Depending on the format of the table, that strategy may work but it may not. Another way is to scrape it directly with rvest. Additionally, if the website continually updates (standings for a sports league, enrollment data for a school, best-selling products for a company, etc.), then scraping is much more convenient, as we don’t need to continually copy-paste for updated data.

In the following code chunk, read_html() reads in the entire html file from the url provided while html_nodes() extracts only the tables on the website.

library(tidyverse)
library(rvest)

## provide the URL and name it something (in this case, url).
url <- "https://en.wikipedia.org/wiki/Gun_violence_in_the_United_States_by_state"

## read_html() convert the html code from the URL into something R can read
tab <- read_html(url) |> 
  html_nodes("table") ## html_nodes can grab only the tables 

We see that, for this example, there are 3 tables provided. The tables are stored in a list and we can reference the first table using [[1]], the second table using [[2]], etc.

Important

For the purposes of this class, we will figure out which of the 3 tables is the one we actually want using trial and error.

The html_table() function converts the table into a data.frame object.

test1 <- tab[[1]] |> html_table()
test1
test2 <- tab[[2]] |> html_table()
test2
test3 <- tab[[3]] |> html_table()
test3

Which of the 3 tables is the one that we would want to use for an analysis on gun violence in the United States?

As another example, consider scraping data from SLU’s athletics page. In particular, suppose we want to do an analysis on SLU’s baseball team.

Go to the following website to look at the table of data that we want to scrape: https://saintsathletics.com/sports/baseball/stats/2023.

After looking at the website, use the following code to scrape the data set.

url <- "https://saintsathletics.com/sports/baseball/stats/2023"

tab <- read_html(url) |> html_nodes("table")
tab
obj <- tab[[1]] |> html_table()
obj
obj2 <- tab[[2]] |> html_table()
obj2

There’s now 72 different tables! See if you can figure out where the first few tables are coming from on the website.

Exercise 2. SLU keeps track of diversity of students and makes this data public on the following website: https://www.stlawu.edu/offices/institutional-research/student-diversity-2021. Use rvest to scrape one of the data tables into R.

10.3 JSON Files with jsonlite

A final common data format that we will discuss is JSON (JavaScript Object Notation). We will only cover the very basics of JSON data and use the jsonlite package in R to read in some .json files. JSON files are read in to R as a list object.

10.3.1 Everything Working Well

First, consider data from the mobile game Clash Royale. Install the jsonlite package and then use it to read in the json file with the function fromJSON():

## install.packages("jsonlite")
library(jsonlite)
cr_cards <- fromJSON(here("data/clash_royale_card_info.json"))

Next, type View(cr_cards) in your console (bottom-left) window to look at the data. See if you can pull out the data set by clicking on some things in the View() window.

The following give a couple of ways to grab the data using code. The as_tibble() function converts a rectangular object into our familiar tibble.

The first option specifies the name of the table that’s in the JSON file (in this case, the name is "cards"):

library(tidyverse)
cr_cards_flat <- cr_cards[["cards"]]
cr_cards_df <- as_tibble(cr_cards_flat)
cr_cards_df
#> # A tibble: 93 × 8
#>   key     name      elixir type  rarity arena description                  id
#>   <chr>   <chr>      <int> <chr> <chr>  <int> <chr>                     <int>
#> 1 knight  Knight         3 Troop Common     0 A tough melee fighter. … 2.6 e7
#> 2 archers Archers        3 Troop Common     0 A pair of lightly armor… 2.60e7
#> 3 goblins Goblins        2 Troop Common     1 Three fast, unarmored m… 2.60e7
#> 4 giant   Giant          5 Troop Rare       0 Slow but durable, only … 2.60e7
#> 5 pekka   P.E.K.K.A      7 Troop Epic       4 A heavily armored, slow… 2.60e7
#> 6 minions Minions        3 Troop Common     0 Three fast, unarmored f… 2.60e7
#> # ℹ 87 more rows

The second method uses the flatten() function from the purrr package, the only package in the core tidyverse that we do not talk about in detail in this class. There is also a different flatten() function in the jsonlite package. In the code below, we specify that we want to use flatten() from purrr with purrr::flatten(). If we wanted to use flatten() from jsonlite, we’d use jsonlite::flatten()

cr_cards_flat2 <- purrr::flatten(cr_cards)
cr_cards_df2 <- as_tibble(cr_cards_flat2)
cr_cards_df2
#> # A tibble: 93 × 8
#>   key     name      elixir type  rarity arena description                  id
#>   <chr>   <chr>      <int> <chr> <chr>  <int> <chr>                     <int>
#> 1 knight  Knight         3 Troop Common     0 A tough melee fighter. … 2.6 e7
#> 2 archers Archers        3 Troop Common     0 A pair of lightly armor… 2.60e7
#> 3 goblins Goblins        2 Troop Common     1 Three fast, unarmored m… 2.60e7
#> 4 giant   Giant          5 Troop Rare       0 Slow but durable, only … 2.60e7
#> 5 pekka   P.E.K.K.A      7 Troop Epic       4 A heavily armored, slow… 2.60e7
#> 6 minions Minions        3 Troop Common     0 Three fast, unarmored f… 2.60e7
#> # ℹ 87 more rows

Both methods give a tibble that we can then use our usual tidyverse tools ggplot2, dplyr, tidyr, etc. on.

10.3.2 Things Aren’t Always So Easy

Now let’s try to look at some animal crossing data that were obtained from https://github.com/jefflomacy/villagerdb. We first just want to look at the data from one individual villager (ace) in the file ace.json.

acedata <- fromJSON(here("data/ace.json"))
aceflat <- purrr::flatten(acedata)
aceflat
#> $gender
#> [1] "male"
#> 
#> $species
#> [1] "bird"
#> 
#> $birthday
#> [1] "3-13"
#> 
#> $ac
#> $ac$personality
#> [1] "jock"
#> 
#> $ac$clothes
#> [1] "spade-shirt"
#> 
#> $ac$song
#> [1] "K.K. Parade"
#> 
#> $ac$phrase
#> [1] "ace"
#> 
#> 
#> $`afe+`
#> $`afe+`$personality
#> [1] "jock"
#> 
#> $`afe+`$clothes
#> [1] "spade-shirt"
#> 
#> $`afe+`$song
#> [1] "K.K. Parade"
#> 
#> 
#> $name
#> [1] "Ace"
#> 
#> $id
#> [1] "ace"

Things are now….more complicated. This example is just to show that it’s not always easy working with JSON data. Lists can be nested and that creates problems when trying to convert a deeply nested list into our “rectangular” format that’s easy to work with.

There’s also the added problem of reading in the .json files from all villagers at the same time We could do this with a for loop or a mapping function from purrr to download and read in the JSON files for all villagers. We won’t delve any more deeply into this, but there’s a lot more to all of the file formats that we’ve discussed this week, particularly web scraping and .json files.

10.4 Practice

10.4.1 Class Exercises

Class Exercise 1. The birthdays.txt file has information on the birthdays of various animals on my Animal Crossing island. There are also columns for the Animal’s Name, Animal Type, and how long the animal has lived on the island (in weeks). Click on the file to open it to look at the format of the data.

Start with the following code chunk and use the options of read_delim() to read in the data (?read_delim). The delim argument that’s already provided specifies that the delimiter (separator) that you’ll use is a -, as opposed to, for example, a , in a .csv file. Arguments that you may need to change include

  • skip
  • col_names
  • na
  • trim_ws
  • col_types
library(tidyverse)
df <- read_delim(here("data/birthdays.txt"), delim = " - ")
head(df)

Class Exercise 2. Another common format for data to be stored in is an Excel file. Often, it’s easiest just to save the Excel file as a .csv file and read it in using read_csv(). But, sometimes this route can be difficult (for example, if your Excel file has thousands of sheets). To read in directly from Excel, you’ll need to install the readxl with install.packages("readxl"). Once installed, load the package with library(readxl), and read in the first sheet evals_prof.xlsx data set, a similar data set as the one that will be used for Project 2, with the read_excel() function.

Class Exercise 3. Now, read in the second sheet in the Excel file, using the help file for ?read_excel to change one of the arguments.

Class Exercise 4. A common issue when scraping tables with rvest is that there are tables that have two rows of headers, often with duplicate names in one or both of the rows. Examine this issue at https://en.wikipedia.org/wiki/Josh_Allen_(quarterback), scrolling down to the NFL career statistics subsection.

Try to scrape Josh Allen’s stats using the “usual” data scraping method with rvest:

url1 <- "https://en.wikipedia.org/wiki/Josh_Allen_(quarterback)"
tab_allen_stats <- read_html(url1) |> html_nodes("table")
allen_df <- tab_allen_stats[[6]] |> html_table()
allen_df

Try to perform any operation on allen_df (a filter(), mutate(), arrange(), or anything else). What error do you get?

To fix this issue, we can use the header = FALSE argument to html_table() and then “manually” construct the column names:

allen_stats <- tab_allen_stats[[6]] |> html_table(header = FALSE) 
allen_stats

newnames <- paste(allen_stats[1, ], allen_stats[2, ])
allen_stats |> set_names(newnames) |>
  slice(-1, -2)

Explain what the code above is doing to fix the duplicate name issue.

10.4.2 Your Turn

Your Turn 1. Choose a topic/person/place/etc. that interests you that has tables on Wikipedia and scrape the table that is related to that topic.

Your Turn 2. Choose a sports team at SLU, and go to that team’s website (by simply googling SLU name_of_sport). Scrape the data tables from the “Results” or “Statistics” section of this sport. After you scrape the data, tidy the data set. Then, choose one of the following options (different options might make more/less sense for different sports)

  1. Summarise different team statistics, either numerically or graphically. Perhaps make some graphs showing different statistics through time.

  2. Summarise different individual statistics, either numerically or graphically.

Note

A few sports (men’s and women’s golf, for example), give results in PDF format. PDF format is generally a horrible way to record and share data, as it’s very difficult to read in to almost any program. Therefore, avoid sports with PDF results for the purposes of this exercise.

Your Turn 3. For either your topic of choice data set or your sports team data set, ask and answer any other questions that make sense for the particular topic or sport that you are looking at!