4  Communication with Quarto

Note

Quarto and .qmd files and R Markdown and .qmd files are extremely similar. In the previous version of this section, I used R Markdown. Quarto has an advantage in that .qmd files also work with Python and Julia, so they are generally better for an all-purpose data scientist. If you spot any references to R Markdown or a .qmd file, just mentally convert R Markdown to Quarto and .rmd to qmd.

Goals:

Overall, if we are making some quick plots just for us, some of the things on communication won’t apply. But, if we’re planning on sharing results (usually we are, eventually), then communication tools become much more important.

4.1 Reproducibility

We’ve been using Quarto for a while now, but have not yet talked about any of its features or how to do anything except insert a new code chunk. By the end of this section, we want to be able to use some of the Quarto options to make a nice-looking document (so that you can implement some of these options in your first mini-project).

Reproducibility is a concept that has recently gained popularity in the sciences for describing analyses that another researcher is able to repeat. That is, an analysis is reproducible if you provide enough information that the person sitting next to you can obtain identical results as long as they follow your procedures. An analysis is not reproducible if this isn’t the case.

Quarto makes it easy for you to make your analysis reproducible for a couple of reasons:

  • a Quarto file will not render unless all of your code runs, meaning that you won’t accidentally give someone code that doesn’t work.

  • Quarto combines the “coding” steps with the “write-up” steps into one coherent document that contains the code, all figures and tables, and any explanations.

We will “demo” a reproducible analysis in class.

Note

If using Quarto for communication, you may want to utilize its spell-check feature. Go to Edit -> Check Spelling, and you’ll be presented with a spell-checker that lets you change the spelling of any words you may have misspelled.

4.1.1 R Scripts vs. Quarto

We’ve been using Quarto for the entirety of this course. But, you may have noticed that when you go to File -> New File to open a new Quarto Document file, there are a ton of other options. The first option is R Script. Go ahead and open a new R Script file now.

The file you open should be completely blank.

Important

An R Script is a file that contains only R code.

It cannot have any text in it at all, unless that text is commented out with a #. For example, you could copy and paste all of the code that is inside a code chunk in a .qmd file to the .R file and run it line by line.

So, what are the advantages and disadvantages of using an R Script file compared to using an Quarto file? Let’s start with the advantages of Quarto. Quarto allows you to fully integrate text explanations of the code and results, the actual tables and figures themselves, and the code to make those tables and figures in one cohesive document. As we will see, if using R Scripts to write-up an analysis in Word, there is a lot of copy-pasting involved of results. For this reason, using Quarto often results in more reproducible analyses.

The advantage of an R Script would be in a situation where you really aren’t presenting results to anyone and you also don’t need any text explanations. This often occurs in two situations. (1) There are a lot of data preparation steps. In this case, you would typically complete all of these data prep steps in an R script and then write the resulting clean data to a .csv that you’d import in an Quarto file. (2) What you’re doing is complicated statistically. If this is the case, then the code is much more of a focus than the text or creating figures so you’d use an R Script.

Exercise 1. What’s the difference between R and Quarto?

Exercise 2. Why is a Quarto analysis more reproducible than a base R script analysis or a standard Excel analysis?

4.2 Quarto Files

Let’s talk a bit more about the components of the Quarto file used to make the reproducible analysis shown in class.

First, open a new Quarto file by clicking File -> New File -> Quarto Document and keep the new file so that it renders to HTML for now.

The first four to five lines at the top of the file make up the YAML (Yet Another Markup Language) header. We’ll come back to this at the end, as it’s the more frustrating part to learn.

Delete the code below the YAML header and then paste the following code chunks to your clean .qmd file in two separate chunks:

library(tidyverse)
head(cars)
ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point()
summary(cars)
Note

The cars data set is built into R so there’s no need to do anything to read it in (it already exists in R itself).

4.2.1 Code Chunk Options

First, render your new file (and give it a name, when prompted). You should see some code, a couple of results tables, and a scatterplot.

Chunk options allow you to have some control over what gets printed to the file that you render. For example, you may or may not want: the code to be printed, the figure to be printed, the tables to be printed, the tidyverse message to be printed, etc. There are a ton of chunk options to give us control over the code and output that is shown! We are going to just focus on a few that are more commonly used.

The options below are common execute options in Quarto.

  • echo. This is set to either true to print the code or false to not print the code. In a blank line after ```{r}, insert the following, which tells Quarto not to print the code in that chunk: #| echo: false. Then, re-render your document to make sure that the code making the plot is actually hidden in the .html output.

You can keep adding other options on new lines. Some other options include:

  • warning. This is set to either true to print warnings and messages or false to not print warnings and messages. For example, when we load in the tidyverse, a message automatically prints out. In that same code chunk, add a new line with #| warning: false to get rid of the message. Re-render to make sure the message is actually gone.

  • output. By default, this is set to true and shows output of tables and figures. Change this to false to not print any output from running code. Practice adding a #| output: false to the code chunk in your Quarto file with summary(cars) and re-render to make sure the output from summary(cars) is gone.

  • eval. eval is set to true if the code is to be evaluated and false if not.

Besides execute options, there are also options pertaining to the size of figures and figure captions. Some common examples include

  • fig-height and fig-width control the height and width of figures. By default, these are both 7, but we can change the fig-height and fig-width to make figures take up less space in the rendered .html document (fig-height: 5, for example).

  • fig-cap adds a figure caption to your figure. Try inserting #| fig-cap: "Figure 1: caption text blah blah blah" to your chunk options for the chunk with the plot.

4.2.2 Global Options

What we have discussed so far is how to change code and output options for individual chunks of code. But, it can be a pain to add a certain option to every single chunk of code that we want the option to apply to. What we can do instead is change the global option for the code and/or output options so that they apply to all code chunks, unless specifically overwritten in that chunk.

We can change the execute options (echo, warning, eval, output, and more) globally by adding a line to the YAML header at the top of your Quarto file. Try adding

execute: 
  echo: false

as two new lines in between the title and format lines of the YAML header. This tells Quarto to not echo any code in any code chunk. However, note that you can change a local code chunk to have #| echo: true to override the global setting for that chunk.

Additional global execute options go on new lines:

execute: 
  echo: false
  warning: false

The non-execute options pertaining to figure size can be changed globally by specifying the option after the html part of the YAML header. The following changes all figure heights to be 2, unless a chunk overrides the global setting:

---
title: "Quarto Test"
execute: 
  echo: false
format: 
  html:
    fig-height: 2
---
Important

We need to pay particular attention to the spacing of things in the YAML header. Notice, for example, that echo: false is indented by exactly two spaces. Try adding a space or deleting a space, and you’ll get an error!

4.2.3 Figures and Tables

We’ve already seen that Figures will pop up automatically (unless we set output: false), which is quite convenient. Making tables that look nice requires one extra step.

Delete the output: false option that you added earlier to the chunk with summary(cars). When you render your .qmd file now, results tables from head(cars) and summary(cars) look kind of ugly. We will focus on using the kable() function from the knitr package to make these tables much more aesthetically pleasing. Another option is to use the pander() function in the pander package. Both pander() and kable() are very simple functions to generate tables but will be more than sufficient for our purposes. To generate more complicated tables, see the xtable package or the kableExtra package.

To use these functions, simply add a |> pipe with the name of the table function you want to use. head(cars) |> kable() will make a nice-looking table with kable and head(cars) |> pander() will use pander(). Before using kable() from the knitr package, you’ll need to install the knitr package with install.packages("knitr") and load its library by adding the line library(knitr) above head(cars) |> kable(). Before using pander(), you’ll need to install the pander package with install.packages("pander") and then load its library by adding the line library(pander) above head(cars) |> pander(). Try these out in your Quarto file.

Which table do you like better in this case?

There are plenty of options for making tables look presentable, which we will discuss in the Exercises. Keep in mind that youmay not use these when making tables for yourself. They’re much more useful when you’re writing a report that you want to share with others.

4.2.4 Non-Code Options

Quarto combines R (in the code chunks, which we’ve already discussed) with the Markdown syntax, which comprises the stuff outside the code chunks, like what you’re reading right now!

Note

There are so many Markdown options, but most of the time, if you want to do something specific, you can just Google it. The purpose of what follows is just to get us familiar with the very basics and things you will probably use most often.

Bullet Points and Sub-bullet Points: Denoted with a * and -, respectively. The sub bullets should be indented by 4 spaces. Note that bullet points are not code and should not appear in a code chunk.

* Bullet 1
* Bullet 2
    - Sub bullet 1
    - Sub bullet 2
    - Sub bullet 3

Everything in Markdown is very particular with spacing. Things often have to be very precise. I personally just love it, but it can be frustrating sometimes. For example, indenting a sub-bullet by 3 spaces instead of 4 spaces will not make a sub-bullet.

* Bullet 1
   - Sub bullet 1

Numbered Lists are the same as bulleted ones, except * is replaced with numbers 1., 2., etc.

Bold, Italics, Code. Surround text with __bold text__ to make text bold, _italic text_ to make text Italics, and backticks to make text look like Code.

Links: The simplest way to create a link to something on the web is to surround it with < > as in <https://www.youtube.com/watch?v=gJf_DDAfDXs>

If you want to name you link something other than the web address, use [name of link](https://www.youtube.com/watch?v=gJf_DDAfDXs), which should show up in your rendered document as “name of link” and, when clicked on, take you to the youtube video.

Headers: Headers are created with ## with fewer hashtags resulting in a bigger Header. Typing in #Big Header at the beginning of a line would make a big header, ### Medium Header would make a medium header, and ##### Small Header would make a small header. Headers are important because they get mapped to a table of contents.

There’s a lot of other stuff to explore: <a href=“https://rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf” target=“blank> https://rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf .

But, if you want to do something other than the basics, Google will definitely help.

4.2.5 YAML

We have briefly discussed what’s given at the top of every .qmd file: the YAML header. The YAML header is the most frustrating part to change because it’s the most particular with spacing.

In addition to controlling global chunk options, we can also use the YAML header to specify a theme from Bootswatch.

There are 25 themes in the Bootswatch project. The YAML header below uses the darkly theme. Try pasting in this YAML header and rendering the document to see the outputted theme.

---
title: "Quarto Test"
execute: 
  echo: false
  warning: false
format: 
  html:
    fig-height: 2
    theme: darkly
    embed-resources: true
---

We can also add a table of contents, which will create a table of contents based on headers you have created with ##.

---
title: "Quarto Test"
execute: 
  echo: false
  warning: false
format: 
  html:
    fig-height: 2
    theme: darkly
    toc: true
    embed-resources: true
---

There are so many other options available for theming, and, if you know any css, you can provide your own .css file to customize the theme.

For the following exercises, we will use the built-in R data set mtcars, which has observations on makes and models of cars. The variables we will be using are:

  • cyl, the number of cylinders a car has
  • mpg, the mileage of the car, in miles per gallon

Because the data set is loaded every time R is started up, there is no need to have a line that reads in the data set. We can examine the first few observations with

head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Exercise 3. Create a table showing the mean mpg for each cyl group (cyl stands for cylinder and can be 4-cylinder, 6-cylinder, or 8-cylinder) with both kable() and pander().

Remember to call the knitr library if using kable() and the pander library if using pander().

Exercise 4. Type ?kable into your console window and scroll through the Help file. Change the rounding of the mean so that it only displays one number after the decimal. Then, add a caption to the table that says “My First Table Caption!!”

The arguments ou will need to change are digits and caption.

Exercise 5. Create a new R chunk and copy and paste the following into your new R chunk. Don’t worry about what factor() is doing: we will cover that soon!

library(tidyverse)
head(mtcars)
ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

Modify the R chunk so that: (a) the figure height is 3, (b) the code from the R chunk shows in the .html file, (c) the table from running head(cars) is hidden in the .html file. Make (b) and (c) a local chunk option, but set (a) as a global option that applies to all of your R chunks. Hint: Instead of setting #| output: false, which would cause both the table and figure to not render in the .html output, add #| results: false as a local chunk option, which says to include figures in the .html output, but not tables or other printed output.

4.3 Communication Options for ggplot2 Plots

When we first introduced plotting, we used histograms, boxplots, frequency plots, bar plots, scatterplots, line plots, and more to help us explore our data set. You will probably make many different plots in a single analysis, and, when exploring, it’s fine to keep these plots unlabeled and untitled with the default colour scheme and theme. They’re just for you, and you typically understand the data and what each variable means.

However, when you’ve finished exploring and you’d like to communicate your results, both graphically and numerically, you’ll likely want to tweak your plots to look more aesthetically pleasing. You certainly wouldn’t be presenting every exploratory plot you made so this tweaking needs to be done on only a few plots. You might consider:

  • changing the x-axis and y-axis labels, changing the legend title, adding a title, adding a subtitle, and adding a caption with + labs()

  • changing the limits of the x-axis and y-axis with + xlim() and + ylim()

  • changing the colour scheme to be more visually appealing and easy to see for people with colour-vision-deficiency (CVD)

  • labeling certain points or lines with + geom_text() or + geom_label()

  • changing from the default theme with + theme_<name_of_theme>()

The bullet about labeling only certain points is the data set is one reason why we are doing this second ggplot2 section now, as opposed to immediately after the first ggplot2 section. As we will see, we’ll make use of combining what we’ve learned in dplyr to help us label interesting observations in our plots.

The Data

The Happy Planet Index (HPI) is a measure of how efficiently a country uses its ecological resources to give its citizens long “happy” lives. You can read more about this data here: here.

But, the basic idea is that the HPI is a metric that computes how happy and healthy a country’s citizens are, but adjusts that by that country’s ecological footprint (how much “damage” the country does to planet Earth). The data set was obtained from https://github.com/aepoetry/happy_planet_index_2016. Variables in the data set are:

  • HPIRank, the rank of the country’s Happy Planet Index (lower is better)
  • Country, the name of the country
  • LifeExpectancy, the average life expectancy of a citizen (in years)
  • Wellbeing, the average well being score (on a scale from 1 - 10). See the ladder question in the documentation for how this was calculated.
  • HappyLifeYears, a combination of LifeExpectancy and Wellbeing
  • Footprint, the ecological footprint per person (higher footprint means the average person in the country is less ecologically friendly)

Read in the data set with

library(tidyverse)
hpi_df <- read_csv("https://raw.githubusercontent.com/highamm/ds234_quarto/main/data_online/hpi-tidy.csv")
head(hpi_df)
#> # A tibble: 6 × 11
#>   HPIRank Country     LifeExpectancy Wellbeing HappyLifeYears Footprint
#>     <dbl> <chr>                <dbl>     <dbl>          <dbl>     <dbl>
#> 1     109 Afghanistan           48.7      4.76           29.0     0.540
#> 2      18 Albania               76.9      5.27           48.8     1.81 
#> 3      26 Algeria               73.1      5.24           46.2     1.65 
#> 4     127 Angola                51.1      4.21           28.2     0.891
#> 5      17 Argentina             75.9      6.44           55.0     2.71 
#> 6      53 Armenia               74.2      4.37           41.9     1.73 
#> # ℹ 5 more variables: HappyPlanetIndex <dbl>, Population <dbl>,
#> #   GDPcapita <dbl>, GovernanceRank <chr>, Region <chr>

Let’s look at the relationship between HappyLifeYears and Footprint for countries of different Regions of the world.

ggplot(data = hpi_df, aes(x = Footprint, y = HappyLifeYears,
                          colour = Region)) +
  geom_point()

Which region seems to have the most variability in their Ecological Footprint?

4.3.0.1 Change Labels and Titles

We can add + labs() to change various labels and titles throughout the plot:

ggplot(data = hpi_df, aes(x = Footprint, y = HappyLifeYears,
                          colour = Region)) +
  geom_point() +
  labs(title = "Countries with a Higher Ecological Footprint Tend to Have Citizens with Longer, Happier Lives", 
       ## add title
       subtitle = "HappyLifeYears is a Combination of Life Expectancy and Citizen Well-Being", 
       ## add subtitle (smaller text size than the title)
       caption = "Data Source: http://happyplanetindex.org/countries", 
       ## add caption to the bottom of the figure
       x = "Ecological Footprint", ## change x axis label
       y = "Happy Life Years", ## change y axis label
       colour = "World Region") ## change label of colour legend

Any aes() that you use in your plot gets its own label and can be changed by name_of_aethetic = "Your Label". In the example above, we changed all three aes() labels: x, y, and colour.

What is the only text on the plot that we aren’t able to change with labs()?

4.3.0.2 Changing Axis Limits

We can also change the x-axis limits and the y-axis limits to, for example, start at 0 for the y-axis:

ggplot(data = hpi_df, aes(x = Footprint, y = HappyLifeYears,
                          colour = Region)) +
  geom_point() +
  ylim(c(0, 70))

In this case, it makes the points on the plot a bit harder to see. You can also change where and how often tick marks appear on the x and y-axes. For special things like this, it is often best to just resort to Google (“ggplot how to change x-axis breaks tick marks” should help).

4.3.0.3 Changing A Colour Scale

We want to use our graphics to communicate with others as clearly as possible. We also want to be as inclusive as possible in our communications. This means that, if we choose to use colour, our graphics should be made so that a colour-vision-deficient (CVD) person can read our graphs.

Note

About 4.5% of people are colour vision deficient, so it’s actually quite likely that a CVD person will view the graphics that you make (depending on how many people you share it with) More Information on CVD.

The colour scales from R Colour Brewer are readable for common types of CVD. A list of scales can be found here.

We would typically use the top scales if the variable you are colouring by is ordered sequentially (called seq for sequential, like grades in a course: A, B, C, D, F), the bottom scales if the variable is diverging (called div for diverging, like Republican / Democrat lean so that the middle is colourless), and the middle set of scales if the variable is not unordered and is categorical (called qual for qualitative like the names of different treatment drugs for a medical experiment).

In which of those 3 situations are we in for the World Region graph?

If we want to use one of these colour scales, we just need to add scale_colour_brewer() with the name of the scale we want to use.

ggplot(data = hpi_df, aes(x = Footprint, y = HappyLifeYears,
                          colour = Region)) +
  geom_point() +
  scale_colour_brewer(palette = "Accent")

Try changing the palette to something else besides "Accent". Do you like the new palette better or worse?

One more option to easily change the colour scale is to use the viridis package. The base viridis functions automatically load with ggplot2 so there’s no need to call the package with library(viridis). The viridis colour scales were made to be both aesthetically pleasing and CVD-friendly.

ggplot(data = hpi_df, aes(x = Footprint, y = HappyLifeYears,
                          colour = Region)) +
  geom_point() +
  scale_colour_viridis_d(option = "plasma")

A drawback of the viridis package is that the yellow can be really hard to see (at least for me).

Read the examples section of the Help file for ?scale_colour_viridis_d. What’s the difference between scale_colour_viridis_d(), ?scale_colour_viridis_c(), and scale_colour_viridis_b()?

4.3.0.4 Labeling Points or Lines of Interest

One goal we might have with communication is highlighting particular points in a data set that show something interesting. For example, we might want to label the points on the graph corresponding to the countries with the highest HPI in each region: these countries are doing the best in terms of using resources efficiently to maximize citizen happiness. Or, we might want to highlight some “bad” example of countries that are the least efficient in each region. Or, we might want to label the country that we are from on the graph.

All of this can be done with geom_text(). Let’s start by labeling all of the points. geom_text() needs one aesthetic called label which is the name of the column in the data set with the labels you want to use.

ggplot(data = hpi_df, aes(x = Footprint, y = HappyLifeYears,
                          colour = Region)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2") +
  geom_text(aes(label = Country))

Yikes! It’s quite uncommon to want to label all of the points. Let’s see if we can instead label each country with the best HPI in that country’s region. To do so, we first need to use our dplyr skills to create a new data set that has these 7 “best” countries. When we used group_by(), we typically used summarise() afterward. But, group_by() works with filter() as well!

plot_df <- hpi_df |> group_by(Region) |>
  filter(HPIRank == min(HPIRank))

plot_df is a data frame that only has the rows with the smallest HPIRank in each Region.

Now that we have this new data set, we can use it within geom_text(). Recall that the data = argument in ggplot() carries on through all geoms unless we specify otherwise. Now is our chance to “specify otherwise” by including another data = argument within geom_text():

ggplot(data = hpi_df, aes(x = Footprint, y = HappyLifeYears,
                          colour = Region)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2") +
  geom_text(data = plot_df, aes(label = Country))

The show.legend = FALSE argument below tells geom_text() that we do not want its aesthetics to appear in the legend.

ggplot(data = hpi_df, aes(x = Footprint, y = HappyLifeYears,
                          colour = Region)) +
  geom_point(aes(colour = Region)) +
  scale_colour_brewer(palette = "Dark2") +
  geom_text(data = plot_df, aes(label = Country), show.legend = FALSE)

A common issue, even with few labels, is that some of the labels could overlap. The ggrepel package solves this problem by including a geom_text_repel() geom that automatically repels any overlapping labels:

library(ggrepel)
ggplot(data = hpi_df, aes(x = Footprint, y = HappyLifeYears,
                          colour = Region)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2") +
  geom_text_repel(data = plot_df, aes(label = Country),
                   show.legend = FALSE) 

And a final issue with the plot is that it’s not always very clear which point on the plot is being labeled. A trick used in the R for Data Science book is to surround the points that are being labeled with an open circle using an extra geom_point() function:

ggplot(data = hpi_df, aes(x = Footprint, y = HappyLifeYears, colour = Region)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2") +
  geom_text_repel(data = plot_df, aes(label = Country), show.legend = FALSE) +
  geom_point(data = plot_df, size = 3, shape = 1, show.legend = FALSE) 

In the code above, shape = 1 says that the new point should be an open circle and size = 3 makes the point bigger, ensuring that it goes around the original point. show.legend = FALSE ensures that the larger open circles don’t become part of the legend.

We can use this same strategy to label specific countries. I’m interested in where the United States of America falls on this graph because I’m from the U.S. I’m also interested in where Denmark falls because that’s the country I’m most interested in visiting. Feel free to replace those countries with any that you’re interested in!

plot_df_us <- hpi_df |>
  filter(Country == "United States of America" | Country == "Denmark")

ggplot(data = hpi_df, aes(x = Footprint, y = HappyLifeYears,
                          colour = Region)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2") +
  geom_point(data = plot_df_us, size = 3, shape = 1,
             show.legend = FALSE) +
  geom_text_repel(data = plot_df_us, aes(label = Country),
                   show.legend = FALSE)

4.3.0.5 Plot Themes

Plot themes are an easy way to change many aspects of your plot with an overall theme that someone developed. The default theme for ggplot2 graphs is theme_grey(), which is the graph with the grey background that we’ve been using in the entirety of this class. The 7 other themes are given in R for Data Science in Figure 28.3.

Note

My favorite theme is definitely theme_minimal(). This theme gives a really clean look, so we will likely start using this theme in plots much more often now.

However, there are many more choices in the ggthemes package. Load the package with library(ggthemes) and check out https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/ for a few of the themes in the package. My personal favorites, all given below, are theme_solarized(), theme_fivethirtyeight(), and theme_economist(), but choosing a theme is mostly a matter of personal taste.

library(ggthemes)
ggplot(data = hpi_df, aes(x = Footprint, y = HappyLifeYears,
                          colour = Region)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2") +
  geom_point(data = plot_df_us, size = 3, shape = 1, show.legend = FALSE) +
  geom_text_repel(data = plot_df_us, aes(label = Country), show.legend = FALSE) +
  theme_solarized()

ggplot(data = hpi_df, aes(x = Footprint, y = HappyLifeYears,
                          colour = Region)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2") +
  geom_point(data = plot_df_us, size = 3, shape = 1, show.legend = FALSE) +
  geom_text_repel(data = plot_df_us, aes(label = Country), show.legend = FALSE) +
  theme_fivethirtyeight()

ggplot(data = hpi_df, aes(x = Footprint, y = HappyLifeYears,
                          colour = Region)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2") +
  geom_point(data = plot_df_us, size = 3, shape = 1, show.legend = FALSE) +
  geom_text_repel(data = plot_df_us, aes(label = Country), show.legend = FALSE) +
  theme_economist()

There’s still much more we can do with ggplot2. In fact, there are entire books on it. But, for most other specializations, we can usually use Google to help us!

Exercise 6. Change the following aspects of the plot given below.

library(palmerpenguins) ## will need to install.packages("palmerpenguins")
## if we have not used this package in class yet.
ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm,
                            colour = species)) +
  geom_point() +
  geom_smooth(se = FALSE)

  1. Change the x axis label to be "Bill Length (mm)", the y axis label to be "Bill Depth (mm)", and the colour scale to be "Species".

  2. Change the default colour scale with scale_colour_viridis_d().

  3. Change the default theme to a theme of your choice.

  4. Add a title to the plot.

4.4 Practice

In general, practice exercises will be split between exercises that will be done together as a class and exercises that you will do in groups or on your own. The purpose of the class exercises is to give some guidance on how we might think logically through some of the code and the results. The purpose of the group exercises and the on your own exercises is so that you have a chance to practice what you’ve learned with your table or on your own.

4.4.1 Class Exercises

Some data sets exist within specific R packages. For example, Jenny Bryan, who is quite famous in the stats/data science community, has put together the gapminder package so that users in R have access to a specific data set on countries throughout the world. https://github.com/jennybc/gapminder.

To load a data set within a specific R package, you first need to install the package with install.packages("gapminder") and then load the package itself:

Then, name the data set something. In this case, the name of the data set is gapminder, but it’s not always the same name as the package itself. We will name the data set country_df.

country_df <- gapminder

Explore the data set with head(country_df) and ?gapminder before proceeding to the following exercises.

Class Exercise 1. Make a line graph that shows the relationship between lifeExp and year for each of the countries in the data set, faceting the graph by continent and also colouring by continent (though this is redundant). Add an x-axis label, a y-axis label, a legend label, and a title to the graph.

Class Exercise 2. Change the colour palette to be CVD-friendly using either scale_colour_brewer() or scale_colour_viridis_d().

Class Exercise 3. We can see a couple of interesting trends in life expectancy. There is one country in Africa and one country in Asia that sees a sharp decline in life expectancy at one point. In Europe, there is one country that has a substantially lower life expectancy than the rest in the 1950s but catches up to other European countries by the 2000s. Use filter() to create a data set that only has these 3 countries. Then, use geom_text() to label all three countries on your plot.

Class Exercise 4. Google the history of the countries in Africa and Asia that you just labeled. Add a very short description of why each country experienced a dip in life expectancy as a caption in your graph.

Class Exercise 5. Suppose that we want the legend to appear on the bottom of the graph. Without using an entirely different theme, use Google to figure out how to move the legend from the right-hand side to the bottom.

Class Exercise 6. If there are a lot of overlapping points or overlapping lines, we can use alpha to control the transparency of the lines. Google “change transparency of lines in ggplot” and change the alpha so that the lines are more transparent.

Class Exercise 7. Change the theme of your plot to a theme in the ggthemes package. Then, change the order of your two commands to change the legend position and to change the overall theme. What happens?

Class Exercise 8. Modify your .qmd file so that:

  • only the figure that you made in the previous exercise renders on your .html file. (Hint: use global options to help with this).

  • none of the code gets printed.

  • the warnings/messages that R prints by default are hidden in all code chunks.

  • the figure height is 4 instead of the default 7.

4.4.2 Your Turn

Your Turn 1. Your friend Chaz is doing a data analysis project in Excel to compare the average GPA of student athletes with the average GPA of non-student athletes. He has two variables: whether or not a student is a student athlete and GPA. He decides that a two-sample t-test is an appropriate procedure for this data (recall from Intro Stat that this procedure is appropriate for comparing a quantitative response (GPA) across two groups). Here are the steps of his analysis.

  1. He writes the null and alternative hypotheses in words and in statistical notation.

  2. He uses Excel to make a set of side-by-side boxplots. He changes the labels and the limits on the y-axis using Point-and-Click Excel operations.

  3. From his boxplots, he see that there are 3 outliers in the non-athlete group. These three students have GPAs of 0 because they were suspended for repeatedly refusing to wear masks indoors. Chaz decides that these 3 students should be removed from the analysis because, if they had stayed enrolled, their GPAs would have been different than 0. He deletes these 3 rows in Excel.

  4. Chaz uses the t.test function in Excel to run the test. He writes down the degrees of freedom, the T-stat, and the p-value.

  5. Chaz copies his graph to Word and writes a conclusion in context of the problem.

With your group, come up with two or more aspects of Chaz’s analysis that are not reproducible.

Your Turn 2. Go back to the line plot that we made with the fitness data set of steps vs. Start date. Using geom_text(), add a label to the peak at April 14, 2019 that says "half-marathon" and a label to the peak at May 24, 2021 that say "grand canyon".

Your Turn 3. Examine the mtcars data set again. The code chunk below adds the car_name variable to the data set (no need to worry about what this code is doing specifically at this point). With mtcars, create a scatterplot of mpg on the y-axis and wt on the x-axis. Then, label the point for the car in the data set that has the highest mpg with its car_name.

mtcars <- mtcars |> rownames_to_column() |> rename(car_name = rowname)
mtcars
#>               car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> 3           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> 4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> 5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> 6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> 7           Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> 8            Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> 9             Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> 10            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> 11           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> 12          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> 13          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> 14         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> 15  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> 16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> 17   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> 18            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> 19         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> 20      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> 21       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> 22    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> 23         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> 24          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> 25    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> 26           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> 27       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> 28        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> 29      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> 30        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> 31       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> 32          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Your Turn 4. The theme() function is a way to really specialise your plot. We will explore some of these in this exercise. Using only options in theme() or options to change colours, shapes, sizes, etc., create the ugliest possible ggplot2 graph that you can make. You may not change the underlying data for this graph, but your goal is to investigate some of the options given in theme(). A list of theme options is given in this link.

ggplot(data = hpi_df, aes(x = Footprint, y = HappyLifeYears,
                          colour = Region)) +
  geom_point()