13 Text Data with tidytext and stringr
Goals:

- use functions in the stringr package and in the tidytext package to analyze text data.
- introduce some of the issues with manipulating strings that don’t pertain to numeric or factor data.
- perform a basic sentiment analysis.
13.1 Text Analysis
Beyonce is a legend. For this example, we will work through a text analysis on lyrics from songs from Beyonce’s albums, utilizing functions from both stringr to parse strings and tidytext to convert text data into a tidy format. To begin, read in a data set of Beyonce’s lyrics:
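A minimal sketch, assuming the lyrics are saved as beyonce_lyrics.csv in a data folder within your project (the actual file name and location may differ):

library(tidyverse)
library(here)
# the file path here is an assumption: adjust it to wherever you saved the lyrics
beyonce <- read_csv(here("data/beyonce_lyrics.csv"))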
We will be most focused on the line variable, as each value for this variable contains a line from a Beyonce song. There are other variables present as well, such as the song_name and the artist_name (the data set originally came from a data set with artists other than Beyonce). You can look at the first 4 values of line with
beyonce$line[1:4]
#> [1] "If I ain't got nothing, I got you"
#> [2] "If I ain't got something, I don't give a damn"
#> [3] "'Cause I got it with you"
#> [4] "I don't know much about algebra, but I know 1+1 equals 2"
Our end goal is to construct a plot that shows the most popular words in Beyonce’s albums. This is much more challenging than it sounds because we will have to deal with the nuances of working with text data.
The tidytext package makes it a lot easier to work with text data in many regards. Let’s use the unnest_tokens() function from tidytext to separate the lines into individual words. We’ll name this new data set beyonce_unnest:
library(tidytext)
beyonce_unnest <- beyonce |> unnest_tokens(output = "word", input = "line")
beyonce_unnest
#> # A tibble: 164,740 × 6
#> song_id song_name artist_id artist_name song_line word
#> <dbl> <chr> <dbl> <chr> <dbl> <chr>
#> 1 50396 1+1 498 Beyoncé 1 if
#> 2 50396 1+1 498 Beyoncé 1 i
#> 3 50396 1+1 498 Beyoncé 1 ain't
#> 4 50396 1+1 498 Beyoncé 1 got
#> 5 50396 1+1 498 Beyoncé 1 nothing
#> 6 50396 1+1 498 Beyoncé 1 i
#> # ℹ 164,734 more rows
We’ll want to make sure that either all words are capitalized or no words are capitalized, for consistency (remember that R is case-sensitive). To that end, we’ll modify the word variable and use stringr’s str_to_lower() to change all letters to lower-case:
beyonce_unnest <- beyonce_unnest |> mutate(word = str_to_lower(word))
Let’s try counting up Beyonce’s most popular words from the data set we just made:
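One way to do this count (a minimal sketch, using the same group_by() and summarise() pattern we use again below):

# count how often each word appears and sort from most to least common
beyonce_unnest |> group_by(word) |>
  summarise(n = n()) |>
  arrange(desc(n))

If you run this, you will likely see that the most “popular” words are extremely common words like i, you, and the.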
What’s the issue here?
To remedy this, we can use what are called stop words: words that are very common and carry little to no meaningful information. For example, the, it, and are are all stop words. We need to eliminate these from the data set before we continue. Luckily, the tidytext package also provides a data set of common stop words in a data set named stop_words:
head(stop_words)
#> # A tibble: 6 × 2
#> word lexicon
#> <chr> <chr>
#> 1 a SMART
#> 2 a's SMART
#> 3 able SMART
#> 4 about SMART
#> 5 above SMART
#> 6 according SMART
Let’s join the Beyonce lyrics data set to the stop words data set and eliminate any stop words:
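A sketch of one way to do this, assuming the lower-cased beyonce_unnest data set from above: anti_join() keeps only the rows of beyonce_unnest whose word does not have a match in stop_words.

# drop any word that appears in the stop_words data set
beyonce_stop <- beyonce_unnest |>
  anti_join(stop_words, by = "word")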
Then, we can re-make the table with the stop words removed:
beyonce_sum <- beyonce_stop |> group_by(word) |>
  summarise(n = n()) |>
  arrange(desc(n)) |>
  print(n = 25)
#> # A tibble: 5,937 × 2
#> word n
#> <chr> <int>
#> 1 love 1362
#> 2 baby 1024
#> 3 girl 592
#> 4 wanna 564
#> 5 hey 499
#> 6 boy 494
#> 7 yeah 491
#> 8 feel 488
#> 9 time 452
#> 10 uh 408
#> 11 halo 383
#> 12 check 366
#> 13 tonight 342
#> 14 girls 341
#> 15 ya 327
#> 16 run 325
#> 17 crazy 308
#> 18 world 301
#> 19 body 287
#> 20 ooh 281
#> 21 ladies 269
#> 22 top 241
#> 23 gotta 240
#> 24 beyoncé 238
#> 25 night 213
#> # ℹ 5,912 more rows
beyonce_sum
#> # A tibble: 5,937 × 2
#> word n
#> <chr> <int>
#> 1 love 1362
#> 2 baby 1024
#> 3 girl 592
#> 4 wanna 564
#> 5 hey 499
#> 6 boy 494
#> # ℹ 5,931 more rows
Looking through the list, there are still some stop words present that were not picked up by the stop_words data set.
Exercise 1. Look at the remaining words. Do any of them look like stop words that were missed by the stop words from the tidytext package? Create a tibble with a few of the remaining stop words (like ooh, gotta, ya, uh, and yeah) not picked up by the tidytext package, and use a join function to drop these words from the data set.

The join function you will need to use here is anti_join().
Exercise 2. With the new data set, construct a lollipop plot or a bar plot that shows the 20 most common words Beyonce uses, as well as the number of times each word is used.
Exercise 3. Use the wordcloud() function in the wordcloud library and the code below to make a wordcloud of Beyonce’s words. There is not anything else you need to do for this exercise: just make the word cloud!
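A minimal sketch of that code (the exact chunk is not reproduced here; max.words = 100 is an arbitrary choice, not from the original exercise):

library(wordcloud)
# words are sized by their counts in beyonce_sum (or your further-cleaned data set)
wordcloud(words = beyonce_sum$word, freq = beyonce_sum$n, max.words = 100)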
If you want to delve into text data more, you’ll need to learn about regular expressions, or regexes. If interested, you can read more in the R4DS textbook. Starting out is not too bad, but learning about escaping special characters in R can be much more challenging!
We analyzed a short text data set, but you can imagine extending this type of analysis to things like:
- song lyrics, if you have the lyrics to all of the songs from an artist (see, for example, https://rpubs.com/RosieB/taylorswiftlyricanalysis)
- book analysis, if you have the text of an entire book or series of books
- tv analysis, if you have the scripts to all episodes of a tv show
If you were doing one of these analyses, there are lots of cool functions in tidytext to help you out!
13.2 Introduction to stringr
In the previous examples, the string data that we had consisted primarily of words. The tools in tidytext make working with data consisting of words not too painful. However, some data exists as strings that are not words. For a non-trivial example, consider data sets obtained from https://github.com/JeffSackmann/tennis_MatchChartingProject, a repository for professional tennis match charting put together by Jeff Sackmann. Some of the following code was modified from a project completed by a now-graduated student, James Wolpe.
From this repository, I have put together a data set on one particular tennis match to make it a bit easier for us to get started. The match I have chosen is the 2021 U.S. Open Final between Daniil Medvedev and Novak Djokovic. Why this match? This was arguably the most important match of Djokovic’s career: if he won, he would win all four grand slams in a calendar year. I don’t like Djokovic, and he lost, so looking back at the match brings me joy. Read in the data set with:
library(here)
library(tidyverse)
med_djok_df <- read_csv(here("data/med_djok.csv"))
med_djok_df
#> # A tibble: 182 × 46
#> point Serving match_id Pt Set1 Set2 Gm1 Gm2 Pts `Gm#` TbSet
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 4f2d@ ND 2021091… 1 0 0 0 0 0-0 1 (1) 1
#> 2 6d ND 2021091… 2 0 0 0 0 15-0 1 (2) 1
#> 3 6b29f3b3b… ND 2021091… 3 0 0 0 0 15-15 1 (3) 1
#> 4 4b28f1f2f… ND 2021091… 4 0 0 0 0 30-15 1 (4) 1
#> 5 5b37b3b3b… ND 2021091… 5 0 0 0 0 40-15 1 (5) 1
#> 6 6f28f1f1f… ND 2021091… 6 0 0 0 0 40-30 1 (6) 1
#> # ℹ 176 more rows
#> # ℹ 35 more variables: `TB?` <dbl>, TBpt <lgl>, Svr <dbl>, Ret <dbl>,
#> # `1st` <chr>, `2nd` <chr>, Notes <chr>, `1stSV` <dbl>, `2ndSV` <dbl>,
#> # `1stIn` <dbl>, `2ndIn` <dbl>, isAce <lgl>, isUnret <lgl>,
#> # isRallyWinner <lgl>, isForced <lgl>, isUnforced <lgl>, isDouble <lgl>,
#> # PtWinner <dbl>, isSvrWinner <dbl>, rallyCount <dbl>, `Player 1` <chr>,
#> # `Player 2` <chr>, `Pl 1 hand` <chr>, `Pl 2 hand` <chr>, Gender <chr>, …
The observations of the data set correspond to points played (so there is one row per point). There are a ton of variables in this data set, but the most important variable is the first variable, point, which contains a string with information about the types of shots that were played during the point. The coding of the point variable includes:
- 4 for serve out wide, 5 for serve into the body, and 6 for a serve “down the t (center)”.
- f for forehand stroke, b for backhand stroke.
- 1 to a right-hander’s forehand side, 2 for down the middle of the court, and 3 to a right-hander’s backhand side.
- d for a ball hit deep, w for a ball hit wide, and n for a ball hit into the net.
- @ symbol at the end if the point ended in an unforced error.
- and there’s lots of other numbers and symbols that correspond to other things (volleys, return depths, hitting the top of the net, etc.).
For example, Djokovic served the 7th point of the match, which has a point value of 4f18f1f2b3b2f1w@. This reads as:
- 4: Djokovic served out wide,
- f18: Medvedev hit a forehand cross-court to Djokovic’s forehand side,
- f1: Djokovic hit a forehand cross-court to Medvedev’s forehand side,
- f2: Medvedev hit a forehand to the center of the court,
- b3: Djokovic hit a backhand to Medvedev’s backhand side,
- b2: Medvedev hit a backhand to the center of the court,
- f1w@: Djokovic hit a forehand to Medvedev’s forehand side, but the shot landed wide and was recorded as an unforced error.
Clearly, there is a lot of data encoded in the point variable. We are going to introduce stringr by answering a relatively simple question: what are the serving patterns of Medvedev and Djokovic during this match?
13.2.1 Regular Expressions
A regex, or regular expression, is a string used to identify particular patterns in string data. Regular expressions are used in many languages (so, if you google something about a regular expression, you do not need to limit yourself to just looking at resources pertaining to R). Regexes can be used with the functions in the stringr package. The functions in the stringr package begin with the prefix str_, much like the functions in forcats begin with the prefix fct_.
We will first focus on the str_detect() function, which detects whether a particular regex is present in a string variable. str_detect() takes the string vector as its first argument and the regex as its second argument. For example,
str_detect(med_djok_df$point, pattern = "f")
#> [1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
#> [13] TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
#> [25] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
#> [37] TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
#> [49] FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
#> [61] TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
#> [73] FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
#> [85] TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
#> [97] TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
#> [109] FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE
#> [121] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
#> [133] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE
#> [145] FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
#> [157] TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
#> [169] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [181] FALSE FALSE
returns a TRUE if the letter f appears anywhere in the string and a FALSE if it does not. So, we can examine in how many points a forehand was hit in the Medvedev-Djokovic match.
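For example, since R treats TRUE as 1 and FALSE as 0 when adding, summing the logical vector counts the points in which a forehand appears (and mean() would give the proportion):

# number of points whose point string contains an "f"
sum(str_detect(med_djok_df$point, pattern = "f"))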
13.2.2 stringr Functions with dplyr
We can combine the stringr functions with dplyr functions that we already know and love. For example, if we are only interested in points that end in an unforced error (so points that have the @ symbol in them), we can filter out the points that don’t have an @:
med_djok_df |> filter(str_detect(point, pattern = "@") == TRUE)
#> # A tibble: 63 × 46
#> point Serving match_id Pt Set1 Set2 Gm1 Gm2 Pts `Gm#` TbSet
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 4f2d@ ND 2021091… 1 0 0 0 0 0-0 1 (1) 1
#> 2 6b29f3b3b… ND 2021091… 3 0 0 0 0 15-15 1 (3) 1
#> 3 5b37b3b3b… ND 2021091… 5 0 0 0 0 40-15 1 (5) 1
#> 4 4f18f1f2b… ND 2021091… 7 0 0 0 0 40-40 1 (7) 1
#> 5 5b28f1f2f… ND 2021091… 8 0 0 0 0 40-AD 1 (8) 1
#> 6 6b27f2f2b… DM 2021091… 13 0 0 0 1 40-15 2 (5) 1
#> # ℹ 57 more rows
#> # ℹ 35 more variables: `TB?` <dbl>, TBpt <lgl>, Svr <dbl>, Ret <dbl>,
#> # `1st` <chr>, `2nd` <chr>, Notes <chr>, `1stSV` <dbl>, `2ndSV` <dbl>,
#> # `1stIn` <dbl>, `2ndIn` <dbl>, isAce <lgl>, isUnret <lgl>,
#> # isRallyWinner <lgl>, isForced <lgl>, isUnforced <lgl>, isDouble <lgl>,
#> # PtWinner <dbl>, isSvrWinner <dbl>, rallyCount <dbl>, `Player 1` <chr>,
#> # `Player 2` <chr>, `Pl 1 hand` <chr>, `Pl 2 hand` <chr>, Gender <chr>, …
We can then use mutate() with case_when() to create a variable corresponding to error type and then summarise() the error types made by the two players.
med_djok_df |> filter(str_detect(point, pattern = "@") == TRUE) |>
  mutate(error_type = case_when(str_detect(point, pattern = "d@") ~ "deep error",
                                str_detect(point, pattern = "w@") ~ "wide error",
                                str_detect(point, pattern = "n@") ~ "net error")) |>
  group_by(PtWinner, error_type) |>
  summarise(n_errors = n())
#> # A tibble: 7 × 3
#> # Groups: PtWinner [2]
#> PtWinner error_type n_errors
#> <dbl> <chr> <int>
#> 1 1 deep error 9
#> 2 1 net error 6
#> 3 1 wide error 10
#> 4 2 deep error 12
#> 5 2 net error 16
#> 6 2 wide error 9
#> # ℹ 1 more row
In the output above, a PtWinner of 1 corresponds to points that Djokovic won (and therefore points where Medvedev made the unforced error), while a PtWinner of 2 corresponds to points that Medvedev won (and therefore points where Djokovic made the unforced error). So we see that, in this match, Djokovic had more unforced errors overall. Medvedev’s unforced errors tended to be deep or wide, while the highest proportion of Djokovic’s unforced errors were balls that went into the net.
We will explore our original “service patterns” question in the exercises. To close out this section, we will just emphasize that we have done a very simple introduction into regexes. These can get very cumbersome, especially as the patterns you want to extract get more complicated. Consider the examples below.
- Detect which points are aces, which are coded in the variable as *. Regexes have “special characters,” like \, *, and ., which, if present in the variable, need to be “escaped” with a backslash. But the backslash is itself a special character, so it needs to be escaped too: we need two backslashes (\\) in front of the * to pull out the points with a *.
str_detect(med_djok_df$point, pattern = "\\*")
- Detect which points start with a 4 using ^ to denote “at the beginning”:
str_detect(med_djok_df$point, pattern = "^4")
- Detect which points end with an @ using $ to denote “at the end” (this is safer than what we did in the code above, where we just assumed that @ did not appear anywhere else in the string except at the end).
str_detect(med_djok_df$point, pattern = "@$")
- Extract all of the forehand shots that were hit with str_extract_all(). The regex here says to extract anything with an f followed by any number of digits before another non-digit symbol.
str_extract_all(med_djok_df$point, pattern = "f[:digit:]+")
The purpose of these examples is just to show that things can get more complicated with strings.
Exercise 4. Use str_detect() and the dplyr functions mutate() and case_when() to create a variable for serve_location that is "wide" if the point starts with a 4, "body" if the point starts with a 5, or "down the center" if the point starts with a 6.
13.3 Practice
13.3.1 Class Exercises
Class Exercise 1. You can often find scripts for television shows. Two of the more popular television shows, The Office and Friends, have R packages containing cleaned-up scripts for the entire series. Construct a plot that shows the 20 most common words Michael says that are not classified as “stop” words.
13.3.2 Your Turn
Your Turn 1. Use dplyr functions and the Serving variable to count the number of serve locations for each player (i.e., for how many points did Medvedev hit a serve out wide?).
Your Turn 2. Use dplyr functions, the Serving variable, and the isSvrWinner variable to find the proportion of points each player won for each of their serving locations (i.e., if Medvedev won 5 points while serving out wide and lost 3 points, the proportion would be 5 / 8 = 0.625).

The isSvrWinner variable is coded as a 1 if the serving player won the point and 0 if the serving player lost the point.
Your Turn 3. The letters v, z, i, and k denote volleys (of different types). Use str_detect() and dplyr functions to figure out the proportion of points where a volley was hit.