13 Text Data with tidytext and stringr
Goals:

- use functions in the stringr package and in the tidytext package to analyze text data.
- introduce some of the issues with manipulating strings that don’t pertain to numeric or factor data.
- perform a basic sentiment analysis.
13.1 Text Analysis
Beyonce is a legend. For this example, we will work through a text analysis on lyrics from songs from Beyonce’s albums, utilizing functions from both stringr to parse strings and tidytext to convert text data into a tidy format. To begin, read in a data set of Beyonce’s lyrics:
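A minimal sketch, assuming the lyrics are saved as beyonce_lyrics.csv in a data folder within your project (the actual file name and location may differ):

library(tidyverse)
library(here)
# the file path here is an assumption: adjust it to wherever you saved the lyrics
beyonce <- read_csv(here("data/beyonce_lyrics.csv"))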
We will be most focused on the line variable, as each value for this variable contains a line from a Beyonce song. There are other variables present as well, such as the song_name and the artist_name (the data set originally came from a data set with artists other than Beyonce). You can look at the first 4 values of line with
beyonce$line[1:4]
#> [1] "If I ain't got nothing, I got you"
#> [2] "If I ain't got something, I don't give a damn"
#> [3] "'Cause I got it with you"
#> [4] "I don't know much about algebra, but I know 1+1 equals 2"
Our end goal is to construct a plot that shows the most popular words in Beyonce’s albums. This is much more challenging than it sounds because we will have to deal with the nuances of working with text data.
The tidytext package makes it a lot easier to work with text data in many regards. Let’s use the unnest_tokens() function from tidytext to separate the lines into individual words. We’ll name this new data set beyonce_unnest:
library(tidytext)
beyonce_unnest <- beyonce |> unnest_tokens(output = "word", input = "line")
beyonce_unnest
#> # A tibble: 164,740 × 6
#> song_id song_name artist_id artist_name song_line word
#> <dbl> <chr> <dbl> <chr> <dbl> <chr>
#> 1 50396 1+1 498 Beyoncé 1 if
#> 2 50396 1+1 498 Beyoncé 1 i
#> 3 50396 1+1 498 Beyoncé 1 ain't
#> 4 50396 1+1 498 Beyoncé 1 got
#> 5 50396 1+1 498 Beyoncé 1 nothing
#> 6 50396 1+1 498 Beyoncé 1 i
#> # ℹ 164,734 more rows
We’ll want to make sure that either all words are capitalized or no words are capitalized, for consistency (remember that R is case-sensitive). To that end, we’ll modify the word variable and use stringr’s str_to_lower() to change all letters to lower-case:
beyonce_unnest <- beyonce_unnest |> mutate(word = str_to_lower(word))
Let’s try counting up Beyonce’s most popular words from the data set we just made:
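One way to do this count (a minimal sketch, using the same group_by() and summarise() pattern we use again below):

# count how often each word appears and sort from most to least common
beyonce_unnest |> group_by(word) |>
  summarise(n = n()) |>
  arrange(desc(n))

If you run this, you will likely see that the most “popular” words are extremely common words like i, you, and the.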
What’s the issue here?
To remedy this, we can use what are called stop words: words that are very common and carry little to no meaningful information. For example, the, it, and are are all stop words. We need to eliminate these from the data set before we continue. Luckily, the tidytext package also provides a data set of common stop words in a data set named stop_words:
head(stop_words)
#> # A tibble: 6 × 2
#> word lexicon
#> <chr> <chr>
#> 1 a SMART
#> 2 a's SMART
#> 3 able SMART
#> 4 about SMART
#> 5 above SMART
#> 6 according SMART
Let’s join the Beyonce lyrics data set to the stop words data set and eliminate any stop words:
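A sketch of one way to do this, assuming the lower-cased beyonce_unnest data set from above: anti_join() keeps only the rows of beyonce_unnest whose word does not have a match in stop_words.

# drop any word that appears in the stop_words data set
beyonce_stop <- beyonce_unnest |>
  anti_join(stop_words, by = "word")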
Then, we can re-make the table with the stop words removed:
beyonce_sum <- beyonce_stop |> group_by(word) |>
  summarise(n = n()) |>
  arrange(desc(n)) |>
  print(n = 25)
#> # A tibble: 5,937 × 2
#> word n
#> <chr> <int>
#> 1 love 1362
#> 2 baby 1024
#> 3 girl 592
#> 4 wanna 564
#> 5 hey 499
#> 6 boy 494
#> 7 yeah 491
#> 8 feel 488
#> 9 time 452
#> 10 uh 408
#> 11 halo 383
#> 12 check 366
#> 13 tonight 342
#> 14 girls 341
#> 15 ya 327
#> 16 run 325
#> 17 crazy 308
#> 18 world 301
#> 19 body 287
#> 20 ooh 281
#> 21 ladies 269
#> 22 top 241
#> 23 gotta 240
#> 24 beyoncé 238
#> 25 night 213
#> # ℹ 5,912 more rows
beyonce_sum
#> # A tibble: 5,937 × 2
#> word n
#> <chr> <int>
#> 1 love 1362
#> 2 baby 1024
#> 3 girl 592
#> 4 wanna 564
#> 5 hey 499
#> 6 boy 494
#> # ℹ 5,931 more rows
Looking through the list, there are still some stop words present that were not picked up by the stop_words data set.
Exercise 1. Look at the remaining words. Do any of them look like stop words that were missed by the stop words from the tidytext package? Create a tibble with a few of the remaining stop words (like ooh, gotta, ya, uh, and yeah) not picked up by the tidytext package, and use a join function to drop these words from the data set.

The join function you will need to use here is anti_join().
Exercise 2. With the new data set, construct a lollipop plot or a bar plot that shows the 20 most common words Beyonce uses, as well as the number of times each word is used.
Exercise 3. Use the wordcloud() function in the wordcloud library and the code below to make a wordcloud of Beyonce’s words. There is not anything else you need to do for this exercise: just make the word cloud!
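A minimal sketch of that code (the exact chunk is not reproduced here; max.words = 100 is an arbitrary choice, not from the original exercise):

library(wordcloud)
# words are sized by their counts in beyonce_sum (or your further-cleaned data set)
wordcloud(words = beyonce_sum$word, freq = beyonce_sum$n, max.words = 100)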
If you want to delve into text data more, you’ll need to learn about regular expressions, or regexes. If interested, you can read more in the R4DS textbook. Starting out is not too bad, but learning about escaping special characters in R can be much more challenging!
We analyzed a short text data set, but you can imagine extending this type of analysis to things like:
- song lyrics, if you have the lyrics to all of the songs from an artist (see, for example, https://rpubs.com/RosieB/taylorswiftlyricanalysis)
- book analysis, if you have the text of an entire book or series of books
- tv analysis, if you have the scripts to all episodes of a tv show
If you were doing one of these analyses, there are lots of cool functions in tidytext to help you out!
13.2 Introduction to stringr
In the previous examples, the string data that we had consisted primarily of words. The tools in tidytext make working with data consisting of words not too painful. However, some data exists as strings that are not words. For a non-trivial example, consider data sets obtained from https://github.com/JeffSackmann/tennis_MatchChartingProject, a repository for professional tennis match charting put together by Jeff Sackmann. Some of the following code was modified from a project completed by a now-graduated student, James Wolpe.
From this repository, I have put together a data set on one particular tennis match to make it a bit easier for us to get started. The match I have chosen is the 2021 U.S. Open Final between Daniil Medvedev and Novak Djokovic. Why this match? This was arguably the most important match of Djokovic’s career: if he won, he would win all four grand slams in a calendar year. I don’t like Djokovic, and he lost, so looking back at the match brings me joy. Read in the data set with:
library(here)
library(tidyverse)
med_djok_df <- read_csv(here("data/med_djok.csv"))
med_djok_df
#> # A tibble: 182 × 46
#> point Serving match_id Pt Set1 Set2 Gm1 Gm2 Pts `Gm#` TbSet
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 4f2d@ ND 2021091… 1 0 0 0 0 0-0 1 (1) 1
#> 2 6d ND 2021091… 2 0 0 0 0 15-0 1 (2) 1
#> 3 6b29f3b3b… ND 2021091… 3 0 0 0 0 15-15 1 (3) 1
#> 4 4b28f1f2f… ND 2021091… 4 0 0 0 0 30-15 1 (4) 1
#> 5 5b37b3b3b… ND 2021091… 5 0 0 0 0 40-15 1 (5) 1
#> 6 6f28f1f1f… ND 2021091… 6 0 0 0 0 40-30 1 (6) 1
#> # ℹ 176 more rows
#> # ℹ 35 more variables: `TB?` <dbl>, TBpt <lgl>, Svr <dbl>, Ret <dbl>,
#> # `1st` <chr>, `2nd` <chr>, Notes <chr>, `1stSV` <dbl>, `2ndSV` <dbl>,
#> # `1stIn` <dbl>, `2ndIn` <dbl>, isAce <lgl>, isUnret <lgl>,
#> # isRallyWinner <lgl>, isForced <lgl>, isUnforced <lgl>, isDouble <lgl>,
#> # PtWinner <dbl>, isSvrWinner <dbl>, rallyCount <dbl>, `Player 1` <chr>,
#> # `Player 2` <chr>, `Pl 1 hand` <chr>, `Pl 2 hand` <chr>, Gender <chr>, …
The observations of the data set correspond to points played (so there is one row per point). There are a ton of variables in this data set, but the most important variable is the first variable, point, which contains a string with information about the types of shots that were played during the point. The coding of the point variable includes:
- 4 for serve out wide, 5 for serve into the body, and 6 for a serve “down the t (center)”.
- f for forehand stroke, b for backhand stroke.
- 1 to a right-hander’s forehand side, 2 for down the middle of the court, and 3 to a right-hander’s backhand side.
- d for a ball hit deep, w for a ball hit wide, and n for a ball hit into the net.
- @ symbol at the end if the point ended in an unforced error.
- and there’s lots of other numbers and symbols that correspond to other things (volleys, return depths, hitting the top of the net, etc.).
For example, Djokovic served the 7th point of the match, which has a point value of 4f18f1f2b3b2f1w@. This reads as:
- 4: Djokovic served out wide,
- f18: Medvedev hit a forehand cross-court to Djokovic’s forehand side,
- f1: Djokovic hit a forehand cross-court to Medvedev’s forehand side,
- f2: Medvedev hit a forehand to the center of the court,
- b3: Djokovic hit a backhand to Medvedev’s backhand side,
- b2: Medvedev hit a backhand to the center of the court,
- f1w@: Djokovic hit a forehand to Medvedev’s forehand side, but the shot landed wide and was recorded as an unforced error.
Clearly, there is a lot of data encoded in the point variable. We are going to introduce stringr by answering a relatively simple question: what are the serving patterns of Medvedev and Djokovic during this match?
13.2.1 Regular Expressions
A regex, or regular expression, is a string used to identify particular patterns in string data. Regular expressions are used in many languages (so, if you google something about a regular expression, you do not need to limit yourself to just looking at resources pertaining to R). Regexes can be used with the functions in the stringr package. The functions in the stringr package begin with the prefix str_, much like the functions in forcats begin with the prefix fct_.
We will first focus on the str_detect() function, which detects whether a particular regex is present in a string variable. str_detect() takes the string vector as its first argument and the regex as its second argument. For example,
str_detect(med_djok_df$point, pattern = "f")
#> [1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
#> [13] TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
#> [25] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
#> [37] TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
#> [49] FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
#> [61] TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
#> [73] FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
#> [85] TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
#> [97] TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
#> [109] FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE
#> [121] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
#> [133] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE
#> [145] FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
#> [157] TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
#> [169] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [181] FALSE FALSE
returns a TRUE if the letter f appears anywhere in the string and a FALSE if it does not. So, we can examine in how many points a forehand was hit in the Medvedev-Djokovic match.
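For example, since R treats TRUE as 1 and FALSE as 0 when adding, summing the logical vector counts the points in which a forehand appears (and mean() would give the proportion):

# number of points whose point string contains an "f"
sum(str_detect(med_djok_df$point, pattern = "f"))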
13.2.2 stringr Functions with dplyr
We can combine the stringr functions with dplyr functions that we already know and love. For example, if we are only interested in points that end in an unforced error (so points that have the @ symbol in them), we can filter out the points that don’t have an @:
med_djok_df |> filter(str_detect(point, pattern = "@") == TRUE)
#> # A tibble: 63 × 46
#> point Serving match_id Pt Set1 Set2 Gm1 Gm2 Pts `Gm#` TbSet
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 4f2d@ ND 2021091… 1 0 0 0 0 0-0 1 (1) 1
#> 2 6b29f3b3b… ND 2021091… 3 0 0 0 0 15-15 1 (3) 1
#> 3 5b37b3b3b… ND 2021091… 5 0 0 0 0 40-15 1 (5) 1
#> 4 4f18f1f2b… ND 2021091… 7 0 0 0 0 40-40 1 (7) 1
#> 5 5b28f1f2f… ND 2021091… 8 0 0 0 0 40-AD 1 (8) 1
#> 6 6b27f2f2b… DM 2021091… 13 0 0 0 1 40-15 2 (5) 1
#> # ℹ 57 more rows
#> # ℹ 35 more variables: `TB?` <dbl>, TBpt <lgl>, Svr <dbl>, Ret <dbl>,
#> # `1st` <chr>, `2nd` <chr>, Notes <chr>, `1stSV` <dbl>, `2ndSV` <dbl>,
#> # `1stIn` <dbl>, `2ndIn` <dbl>, isAce <lgl>, isUnret <lgl>,
#> # isRallyWinner <lgl>, isForced <lgl>, isUnforced <lgl>, isDouble <lgl>,
#> # PtWinner <dbl>, isSvrWinner <dbl>, rallyCount <dbl>, `Player 1` <chr>,
#> # `Player 2` <chr>, `Pl 1 hand` <chr>, `Pl 2 hand` <chr>, Gender <chr>, …
We can then use mutate() with case_when() to create a variable corresponding to error type and then summarise() the error types made by the two players.
med_djok_df |> filter(str_detect(point, pattern = "@") == TRUE) |>
  mutate(error_type = case_when(str_detect(point, pattern = "d@") ~ "deep error",
                                str_detect(point, pattern = "w@") ~ "wide error",
                                str_detect(point, pattern = "n@") ~ "net error")) |>
  group_by(PtWinner, error_type) |>
  summarise(n_errors = n())
#> # A tibble: 7 × 3
#> # Groups: PtWinner [2]
#> PtWinner error_type n_errors
#> <dbl> <chr> <int>
#> 1 1 deep error 9
#> 2 1 net error 6
#> 3 1 wide error 10
#> 4 2 deep error 12
#> 5 2 net error 16
#> 6 2 wide error 9
#> # ℹ 1 more row
In the output above, a PtWinner of 1 corresponds to points that Djokovic won (and therefore points where Medvedev made the unforced error), while a PtWinner of 2 corresponds to points that Medvedev won (and therefore points where Djokovic made the unforced error). So we see that, in this match, Djokovic had more unforced errors overall. Medvedev’s unforced errors tended to be deep or wide, while the highest proportion of Djokovic’s unforced errors were balls that went into the net.
We will explore our original “service patterns” question in the exercises. To close out this section, we will just emphasize that we have done a very simple introduction into regexes. These can get very cumbersome, especially as the patterns you want to extract get more complicated. Consider the examples below.
- Detect which points are aces, which are coded in the variable as *. Regexes have “special characters,” like \, *, and ., which, if present in the variable, need to be “escaped” with a backslash. But the backslash is itself a special character, so it needs to be escaped too: we need two backslashes (\\) in front of the * to pull out the points with a *.
str_detect(med_djok_df$point, pattern = "\\*")
- Detect which points start with a 4 using ^ to denote “at the beginning”:
str_detect(med_djok_df$point, pattern = "^4")
- Detect which points end with an @ using $ to denote “at the end” (this is safer than what we did in the code above, where we just assumed that @ did not appear anywhere else in the string except at the end).
str_detect(med_djok_df$point, pattern = "@$")
- Extract all of the forehand shots that were hit with str_extract_all(). The regex here says to extract anything with an f followed by any number of digits before another non-digit symbol.
str_extract_all(med_djok_df$point, pattern = "f[:digit:]+")
The purpose of these examples is just to show that things can get more complicated with strings.
Exercise 4. Use str_detect() and the dplyr functions mutate() and case_when() to create a variable for serve_location that is "wide" if the point starts with a 4, "body" if the point starts with a 5, or "down the center" if the point starts with a 6.
13.3 Practice
13.3.1 Class Exercises
Class Exercise 1. You can often find scripts for television shows. Two of the more popular television shows, The Office and Friends, have R packages containing cleaned-up scripts for the entire series. Construct a plot that shows the 20 most common words Michael says that are not classified as “stop” words.
13.3.2 Your Turn
Your Turn 1. Use dplyr functions and the Serving variable to count the number of serve locations for each player (i.e., for how many points did Medvedev hit a serve out wide?).
Your Turn 2. Use dplyr functions, the Serving variable, and the isSvrWinner variable to find the proportion of points each player won for each of their serving locations (i.e., if Medvedev won 5 points while serving out wide and lost 3 points, the proportion would be 5 / 8 = 0.625).

The isSvrWinner variable is coded as a 1 if the serving player won the point and 0 if the serving player lost the point.
Your Turn 3. The letters v, z, i, and k denote volleys (of different types). Use str_detect() and dplyr functions to figure out the proportion of points where a volley was hit.