Section 7 Homework

Reminder: You are allowed to work with other students on the homework assignments but you must acknowledge who you worked with at the top of your homework assignment.

At the top of your assignment, list any collaborators (if any):

Statement of Integrity: All work submitted is my own, and I have followed all rules for collaboration.

Signature:

On the top of your assignment, copy the entire statement of integrity or just write the phrase “Statement of Integrity” and sign your name to it.

All homeworks should be handwritten (unless otherwise noted).

Exercise 1. Suppose that we are interested in the relationship between driving distance and driving accuracy for all male professional golfers. We obtain data on 193 professional golfers from the 2018 season and record their driving distance (measured in yards) as well as their driving accuracy (measured in percent of fairways hit). We now want to formally assess whether there is evidence for association between driving distance (the response) and accuracy (the predictor), and we want to obtain a confidence interval for the slope parameter.

Fitting linear model: driving_distance ~ driving_accuracy
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	336.9	6.808	49.49	6.94e-111
driving_accuracy	-0.6554	0.1104	-5.939	1.335e-08

Prepare. Write down the hypotheses for the question of interest in statistical notation and in words.

Checkpoint (Check after completing part a)

\(H_0: \beta_1 = 0\). There is no association between driving distance and driving accuracy of professional male golfers.
\(H_a: \beta_1 \neq 0\). There is an association between driving distance and driving accuracy of professional male golfers.

Write down the fitted regression equation from the table of output.

Checkpoint (Check after completing part b)

\(\hat{Y} = 336.9 - 0.6554 X\),

where \(\hat{Y}\) is predicted driving distance and \(X\) is driving accuracy.

We will skip the Check step because we have not discussed what to check in class yet.

Calculate. State the degrees of freedom for the T-distribution. Then, use StatKey to find the \(t^*\) value for a 90% confidence interval.
Using the regression output and what you found on StatKey, calculate a 90% confidence interval for the slope parameter.

Checkpoint (Check after completing part d)

90% confidence interval for the slope parameter: (-0.8379, -0.4729)
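If you would like to double-check the arithmetic behind this checkpoint with software (your submitted work should still be handwritten), the following is a minimal Python sketch of the interval formula, estimate ± \(t^*\) × standard error. The function name is made up for illustration, and the \(t^*\) value in the code is an assumption chosen to agree with the checkpoint interval; replace it with the value you found on StatKey.

```python
def slope_confidence_interval(estimate, std_error, t_star):
    """Return (lower, upper) for estimate ± t_star * std_error."""
    margin = t_star * std_error
    return estimate - margin, estimate + margin


# Estimate and standard error come from the driving_accuracy row of the output.
# t_star is an ASSUMED value consistent with the checkpoint interval;
# substitute the value you found on StatKey.
t_star = 1.6529
lower, upper = slope_confidence_interval(-0.6554, 0.1104, t_star)
print(f"({lower:.4f}, {upper:.4f})")  # (-0.8379, -0.4729)
```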

Verify that the T-statistic for testing the slope in the regression output is the (point estimate minus the null hypothesized value) divided by the (standard error).

Checkpoint (Check after completing part e)

\(T = \frac{-0.6554 - 0}{0.1104} \approx -5.937\), which matches the T-statistic of -5.939 given in the output up to rounding of the estimate and standard error.
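The same check can be scripted. This is only a sketch with a hypothetical helper name; it uses the rounded estimate and standard error printed in the table, so the result can differ from the printed t value in the last digit or two.

```python
def t_statistic(estimate, null_value, std_error):
    """(point estimate minus null hypothesized value) divided by standard error."""
    return (estimate - null_value) / std_error


# Values read from the driving_accuracy row; the null value is 0 under H0.
t_stat = t_statistic(-0.6554, 0.0, 0.1104)
print(round(t_stat, 3))  # -5.937, versus -5.939 in the output (rounded inputs)
```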

Write down the p-value given in the regression output for a hypothesis test on the slope.
Conclude. Interpret your confidence interval in context of the problem.
Write a conclusion for your hypothesis test (you can skip an interpretation of the point estimate in your conclusion because you just interpreted a confidence interval for the slope).

Checkpoint (Check after completing part h)

There is strong evidence that driving distance and driving accuracy are associated (\(T = -5.939\), p-value \(< 0.0001\)).

Exercise 2. Sakai was the university’s “old” Learning Management System. You can think of Sakai as being similar in purpose to Canvas. Sakai kept track of how many times students visited the site for a particular course. For this exercise, we will practice a little bit more with regression and look at the relationship between current grade in STAT 113 (in points) and how many times a student visited the Sakai site (in number of times) for 54 students in a STAT 113 class from a couple of years ago. Below is a scatterplot of the data along with output from a linear regression with grade (points) as the response and number of visits to Sakai as the predictor.

Fitting linear model: Current_Grade ~ Total
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	58.31	6.118	9.532	5.297e-13
Total	0.1614	0.03652	4.418	5.072e-05

Describe the relationship between the Grade and the Number of Visits to Sakai.
Prepare. Write the null and alternative hypotheses in statistical notation for assessing whether there is an association between grade and number of visits to Sakai.
Write the theoretical/population model. Then write the fitted regression equation from the table of output.

Check. We will again skip check because we have not discussed this in class yet. One of the assumptions, however, is linearity, which is slightly violated based on the scatterplot. Regardless, we will continue with the steps for the confidence interval and hypothesis test for practice.

Calculate. State the degrees of freedom for the T-distribution. Then, use StatKey to find the \(t^*\) value for a 90% confidence interval.
Using the regression output and what you found on StatKey, calculate a 90% confidence interval for the slope parameter.
Write down the p-value given in the regression output for a hypothesis test on the slope.
Conclude. Interpret your confidence interval in context of the problem.
Write a conclusion for your hypothesis test (you can skip an interpretation of the point estimate in your conclusion because you just interpreted a confidence interval for the slope).

Exercise 3. Suppose that we want to investigate the association between valence, a measure of song positivity, and song genre (either pop or rock) in a sample of 2166 songs collected from Spotify in 2019 and 2020.

For this example, \(Y\), the response variable, will be valence, measured in points by Spotify, and genre, the explanatory variable, will be a categorical variable that is equal to 1 if a song is rock and 0 if a song is pop.

The first three observations of the data set are given below:

track_name	track_artist	playlist_genre	valence
I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	pop	0.518
Memories - Dillon Francis Remix	Maroon 5	pop	0.693
All the Time - Don Diablo Remix	Zara Larsson	pop	0.613

The output from the regression model, as well as side-by-side boxplots, are given below.

Fitting linear model: valence ~ playlist_genre
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	0.4829	0.005158	93.64	0
playlist_genrerock	-0.03965	0.009601	-4.13	3.766e-05

Define an indicator variable that is equal to a 0 if a song is in the pop genre and a 1 if a song is in the rock genre.

Checkpoint (Check after completing part a)

\[ \text{genre}_{ind} = \begin{cases} 0, & \text{if the song is pop} \\ 1, & \text{if the song is rock} \end{cases} \]

Write down the fitted regression equation using the output above and your indicator variable.

Checkpoint (Check after completing part b)

\(\hat{Y} = 0.4829 - 0.03965 \text{genre}_{ind}\)

Predict the song valence (positivity) for a song that is a pop song. This is the same as the mean valence for all pop songs.

Checkpoint (Check after completing part c)

0.4829 points
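To see why the intercept alone gives the pop prediction, it can help to write the fitted equation as a small function. This is a sketch only: the function and constant names are invented for illustration, and the coefficients are the rounded values from the output.

```python
INTERCEPT = 0.4829   # (Intercept) row of the output
SLOPE = -0.03965     # playlist_genrerock row of the output


def predicted_valence(genre_ind):
    """Fitted equation with the indicator: 0 for a pop song, 1 for a rock song."""
    if genre_ind not in (0, 1):
        raise ValueError("genre_ind must be 0 (pop) or 1 (rock)")
    return INTERCEPT + SLOPE * genre_ind


# For a pop song the indicator is 0, so the slope term drops out.
pop_prediction = predicted_valence(0)
print(pop_prediction)  # 0.4829
```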

Predict the song valence for a song that is a rock song. This is the same as the mean valence for all rock songs.
Perform a hypothesis test to determine if the mean valence for pop songs is different than the mean valence for rock songs.

Prepare: Write the null and alternative hypotheses in notation.

Check: We will again skip check until we discuss this in class.

Calculate: Find the appropriate T-statistic and p-value from the output above. Also, state the degrees of freedom.

Conclude: Write a full conclusion in context of the problem.

Find a 95% confidence interval for the difference in mean valence scores between rock songs and pop songs.

Prepare: What is your point estimate and standard error? Is your point estimate for the difference in sample means of (pop song valence minus rock song valence) or is it for (rock song valence minus pop song valence)?

Calculate: What are the degrees of freedom? Using the degrees of freedom, find the appropriate \(t^*\) value for your interval with StatKey. Then calculate the 95% confidence interval for the difference in mean valence.

Checkpoint (Check after completing the Calculate step)

\(t^* = 1.9611\)
Your 95% confidence interval should be: (-0.058479, -0.02082)
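As an optional check of this checkpoint, the short Python sketch below reproduces the interval from the playlist_genrerock row of the output and the \(t^*\) value given above. The variable names are illustrative only.

```python
estimate = -0.03965    # playlist_genrerock estimate from the output
std_error = 0.009601   # its standard error
t_star = 1.9611        # t* value stated in the checkpoint

margin = t_star * std_error
lower, upper = estimate - margin, estimate + margin
print(f"margin of error: {margin:.6f}")
print(f"interval: ({lower:.6f}, {upper:.6f})")
```

The printed endpoints agree with the checkpoint values once they are rounded to the same number of decimal places.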

Conclude: Interpret the interval in context of the problem.

Exercise 4. Recall the Stroop experiment that we performed in class a while ago. In this experiment, we separated into two groups: a group that completed a test where the colour of the words matched the word meaning (the "control" group) and a group that completed a test where the colour of the words did not necessarily match the word meaning (the "stroop" group). The response variable we recorded was the time that it took to complete the test (in seconds).

We will now formally assess (with a statistical test) whether the time to complete the test and the test group are associated. In other words, we will assess whether there is evidence that the average time to complete the test for the control group is different than the average time to complete the test for the stroop group.

Below is a side-by-side boxplot of the data we collected, along with output from a fitted regression model with the indicator variable defined with control = 0 and stroop = 1.

Fitting linear model: time ~ test
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	19.09	2.206	8.652	7.672e-09
teststroop	12.47	3.006	4.148	0.000362

Note that there are 26 students in the data set.

Write down the fitted regression equation using the output above. Define the indicator variable in your model.
Complete the hypothesis test described in the problem using the following steps.

Prepare: Write the null and alternative hypotheses in notation.

Prepare: What is your point estimate for the difference in means?

Skip Check for now and assume the assumptions for the test hold.

Calculate: Find the appropriate T-statistic and p-value from the output above.

Conclude: Write a full conclusion in context of the problem. You can skip an interpretation of your point estimate in your conclusion because you will interpret a confidence interval for the difference in means next.

Find a 90% confidence interval for the slope parameter, which represents a difference in means for this example.

Prepare: What is your point estimate and standard error? Is your point estimate for the difference in sample means of (time for control group minus time for stroop group) or is it for (time for stroop group minus time for control group)?

Calculate: What are the degrees of freedom? What is the appropriate \(t^*\) value for your interval?

Calculate: Calculate the 90% confidence interval for the difference in mean time.

Conclude: Interpret the interval in context of the problem.

Exercise 5. In this exercise, we will formally assess whether there is evidence for an association between type of Award preferred (Nobel, Olympic, or Academy) and GPA, using data collected from the STAT 113 survey.

Side-by-side boxplots, some summary information, and output from a model are shown below.

Award	GPA_mean	sample_size
Academy Award	3.382	7
Nobel Prize	3.477	30
Olympic Gold	3.377	36

Analysis of Variance Table
	Df	Sum Sq	Mean Sq	F value	Pr(>F)
Award	2	0.176	0.08798	0.6519	0.5242
Residuals	70	9.447	0.135	NA	NA
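The columns of the ANOVA table are tied together by simple arithmetic: each Mean Sq is Sum Sq divided by Df, and the F value is the ratio of the two Mean Sq entries. The Python sketch below rebuilds those quantities from the summary table and the Sum Sq column. The variable names are illustrative, and because the inputs are the rounded values printed above, the results agree with the table only up to rounding.

```python
# Sample sizes from the summary table and Sum Sq values from the ANOVA table.
group_sizes = {"Academy Award": 7, "Nobel Prize": 30, "Olympic Gold": 36}
ss_award, ss_resid = 0.176, 9.447

n = sum(group_sizes.values())   # total number of students
k = len(group_sizes)            # number of groups
df_award = k - 1                # Df for Award
df_resid = n - k                # Df for Residuals

ms_award = ss_award / df_award  # Mean Sq = Sum Sq / Df
ms_resid = ss_resid / df_resid
f_stat = ms_award / ms_resid    # F value = ratio of the Mean Sq entries

print(df_award, df_resid)
print(round(ms_award, 4), round(ms_resid, 4), round(f_stat, 3))
```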

Write the null and alternative hypotheses for the overall ANOVA testing association between award preference and GPA.
Write the sample mean GPA for students who prefer an Olympic medal in statistical notation.
Using the output, write the p-value for the test of association.
Using the p-value, write a conclusion in context of the problem.

Below is output from a Tukey’s Honestly Significant Difference procedure on the data. Complete the following regardless of whether or not you found evidence of association in the overall F-test.

	diff	lwr	upr	p adj
Nobel Prize-Academy Award	0.0955190	-0.2737303	0.4647684	0.8099886
Olympic Gold-Academy Award	-0.0050476	-0.3684288	0.3583336	0.9993902
Olympic Gold-Nobel Prize	-0.1005667	-0.3180313	0.1168979	0.5127744

Locate the Olympic Gold-Nobel Prize row. Write the null and alternative hypotheses, in statistical notation, for this particular hypothesis test.
Interpret the value of diff for this row, in context of the problem.
Interpret the value of lwr and upr for this row, in context of the problem.
Write a conclusion using the p adj value for this row.

Exercise 6. We will now go back to some of the earlier exercises and check some of the conditions for inference with these models. Note that you can skip checking independence for each of the models below, as often you would need a little bit more information about how the data was collected to assess this assumption.

Using the graphs below (scatterplot, residuals vs. total visits, and histogram of residuals), assess the linearity, normality / sample size, and constant variance assumptions for the model with course grade as the response and number of visits to Sakai as the predictor for the 54 students in the data set.

Using the graphs below (residuals vs. test group and histogram of residuals), assess the normality / sample size and constant variance assumptions for the model with time as the response and test group as the predictor for the 26 students.

Using the graphs below (residuals vs. award preference and histogram of residuals), assess the normality / sample size and constant variance assumptions for the model with GPA as the response and Award preference as the predictor for the 73 students.

Exercise 7. Use the following choices to determine which procedure you would use to answer a few questions of interest.

Confidence interval or hypothesis test for a proportion.
Chi-squared goodness of fit test.
Chi-squared test of association.
Confidence interval or hypothesis test for a mean or for a difference in means from paired data.
Confidence interval or hypothesis test for a slope in a regression model.
Confidence interval or hypothesis test for a difference in means in a regression model.
ANOVA hypothesis test.

Recall some of the variables collected in the STAT 113 survey data. Variables include: GPA, Exercise (hours per week), Award (Olympic, Nobel, or Academy), Class year (First-year, Sophomore, Junior, or Senior), Travel (number of hours to get to SLU), Height (in inches), Computer (Mac or PC), Twitter (has Twitter or does not), and Pulse (beats per minute).

Which procedure would you use to answer the following questions?

Is there evidence of an association between Class year and whether or not a student has a Twitter account?
Is there evidence of an association between Height and Award preference?
Is there evidence of an association between Exercise and Height?
Is there evidence that average GPA is different than 3.2 points?
Is there evidence that the percentage of students who have a Twitter is more than 50%?
Is there evidence for a preference for Award wanted?
Is there evidence for an association between Exercise and Pulse?

Exercise 8. The purpose of this final homework question is for you to do a short reflection on this course! Answer each of the following questions in 2-3 sentences. What you will be graded on is (a) whether or not you actually answered the question, (b) that what you answered makes sense (e.g. you did not reference something that we didn’t actually cover!), and (c) whether you answered the question in 2-3 sentences (there’s no penalty for answering in more than 3 sentences, but I’m hoping to keep each answer to 3 sentences or fewer out of respect for your own time!). This should take you no longer than 30 minutes.

You may type the answers to this exercise if you would like.

Tell me a thing or two that you’ve learned about yourself from taking this class. This might be a perceived strength, weakness to work on, or something else entirely!
Why did you take this course? If you took it for a major or a minor requirement, then why do you think that major or minor requires a course in intro statistics?
About 75% of SLU students end up taking Intro Stat. Some students take it for a major requirement: it is required by four of SLU’s five most popular majors (Business, Economics, Biology, and Psychology, though Biology students can take Calc I and II instead) as well as by many other majors. Performance and Communications Arts is the one popular major that doesn’t require it. Why do you think Intro Stat is required by such a wide variety of majors?
Explain one way in which intro stat was similar to what you thought it would be. Explain one way in which intro stat was different than what you thought it would be.
What was your favourite part or parts of learning introductory statistics concepts or of this class? Why? (Even if you hated everything, choose “the best of the worst”).
Explain to someone who has not had intro stat what the main purpose of this class is.
What advice would you give to someone taking this course this spring?