Section 3 Homework

Reminder: You are allowed to work with other students on the homework assignments but you must acknowledge who you worked with at the top of your homework assignment.

At the top of your assignment, list any collaborators:

Statement of Integrity: All work submitted is my own, and I have followed all rules for collaboration. If I used any AI on this assignment, I have clearly stated both (1) what my search prompt(s) was/were and (2) what I used from the AI answer(s).

Signature:

At the top of your assignment, either copy the entire statement of integrity or write the phrase “Statement of Integrity”, and then sign your name.

All homework should be handwritten (unless otherwise noted).

Exercise 1. For this exercise, we will use the data from the survey at the beginning of the semester. In particular, we will model GPA at SLU with no predictors. Below are a histogram and a table of summary statistics for GPA from the survey.

\(\hat{\mu}\)	n
3.418	73

(a) Write the fitted model for GPA with no predictors.

Checkpoint (Check after completing part a)

answer should look like: \(\hat{Y} = 3.42\), where \(\hat{Y}\) is predicted GPA.

(b) Predict the GPA of a male student who plays a varsity sport on campus, owns a Mac, is a Sophomore, and has 3 siblings.
(c) Predict the GPA of a male student who does not play a varsity sport on campus, owns a PC, is a Sophomore, and has 1 sibling.

Checkpoint (Check after completing part c)

hint: your answer for part b and part c should be equal

(d) Suppose that the student in (b) has a GPA of 2.5 (a made-up value, to keep the survey anonymous). Find the residual for this student.

Checkpoint (Check after completing part d)

answer should be close to -0.92 points
recall the formula for calculating a residual is: \(\text{residual} = \text{actual} - \text{predicted}\)

(e) Suppose that the student in (c) has a GPA of 3.75 (again a made-up value, to keep the survey anonymous). Find the residual for this student.
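
A model with no predictors gives every student the same prediction, the sample mean, so a residual is just the observed GPA minus that mean. The following is a minimal Python sketch of that arithmetic, meant only for checking hand calculations. The function names `predict_gpa` and `residual` are illustrative and not part of the course materials, and the observed GPA of 3.0 is a hypothetical value chosen for the example.

```python
# Sample mean GPA from the summary table (n = 73).
MEAN_GPA = 3.418


def predict_gpa(**student_info):
    """Predict GPA from a model with no predictors.

    Any student characteristics passed in are ignored on purpose:
    the fitted model is just the sample mean.
    """
    return MEAN_GPA


def residual(actual, predicted):
    """Residual = actual - predicted."""
    return actual - predicted


# Two students with different characteristics get the same prediction.
first = predict_gpa(varsity=True, computer="Mac", siblings=3)
second = predict_gpa(varsity=False, computer="PC", siblings=1)
print(first == second)  # True

# Residual for a hypothetical observed GPA of 3.0.
print(round(residual(3.0, first), 2))  # -0.42
```

A negative residual here means the hypothetical student's GPA falls below the model's prediction.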

Exercise 2. A data set from the OpenIntro textbook contains observations on 150 births at a hospital in North Carolina. Variables measured for each birth include:

weight, the weight of the baby, in pounds
m_age, the mother’s age, in years
smoke, whether or not the mother was a smoker (0 for nonsmoker, 1 for smoker)

For this exercise, we will fit a regression model using weight of the baby as the response and m_age, the mother’s age, as a predictor. Below is a scatterplot of m_age and weight with a fitted regression line, a table of output from the model, and the raw data from the first two mothers in the data set.

Fitting linear model: weight ~ m_age
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	6.775	0.5397	12.55	4.519e-25
m_age	0.01018	0.01976	0.5151	0.6072

weight	m_age	smoke
6.88	30	smoker
7.69	36	nonsmoker

(a) From the model output, write the equation of the fitted regression line for a model with weight of the baby as the response and mother’s age as the predictor.

Checkpoint (Check after completing part a)

hint: \(\hat{Y}\) is predicted weight, and \(X\) is mother’s age

(b) From the scatterplot, describe the relationship between weight and mother’s age.
(c) For the first mother in the data set, find the predicted weight of their baby, according to the model. Include units in your answer.

Checkpoint (Check after completing part c)

answer should be close to: 7.08 pounds.

(d) For the first mother in the data set, calculate the residual. Include units in your answer.
(e) Is the model over-predicting or under-predicting the weight of the first mother’s baby?

Checkpoint (Check after completing part e)

reminder: a negative residual means the model is over-predicting the observation, a positive residual means the model is under-predicting the observation

(f) For the second mother in the data set, find the predicted weight of their baby, according to the model. Include units in your answer.
(g) For the second mother in the data set, calculate the residual. Include units in your answer.
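
The prediction and residual steps in this exercise follow the same pattern each time: plug the mother’s age into the fitted line, then subtract the prediction from the observed weight. A minimal Python sketch of that pattern is given below for checking hand calculations. The coefficients come from the model output above; the function names `predict_weight`, `residual`, and `describe_residual` are illustrative only.

```python
# Coefficients from the "weight ~ m_age" output.
INTERCEPT = 6.775  # pounds
SLOPE = 0.01018    # pounds per year of mother's age


def predict_weight(m_age):
    """Predicted baby weight (pounds) for a mother of the given age (years)."""
    return INTERCEPT + SLOPE * m_age


def residual(actual, predicted):
    """Residual = actual - predicted, in pounds."""
    return actual - predicted


def describe_residual(value):
    """Negative residual: model over-predicts. Positive: model under-predicts."""
    if value < 0:
        return "over-predicting"
    if value > 0:
        return "under-predicting"
    return "exact"


# First mother in the data set: m_age = 30.
first_prediction = predict_weight(30)
print(round(first_prediction, 2))  # 7.08
```

Calling `residual(observed_weight, predict_weight(m_age))` with a row from the data table, and then `describe_residual` on the result, covers the remaining parts.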

Exercise 3. Suppose that we are interested in the relationship between driving distance and driving accuracy for all male professional golfers. We obtain data on 193 professional golfers from the 2018 season and record their driving distance (measured in yards) as well as their driving accuracy (measured in percent of fairways hit). The first three observations of the data set are given below as well as a scatterplot and output from a regression model with driving distance as the response and percent fairways hit (a metric for driving accuracy) as the predictor.

player_name	ranking	score_average	driving_accuracy	driving_distance
Dustin Johnson	1	68.7	59.46	314
Tommy Fleetwood	2	69.34	64.98	306.9
Justin Thomas	3	69.12	58.41	311.8

Fitting linear model: driving_distance ~ driving_accuracy
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	336.9	6.808	49.49	6.94e-111
driving_accuracy	-0.6554	0.1104	-5.939	1.335e-08

(a) From the model output, write the equation of the fitted regression line for a model with distance as the response and percent of fairways hit as the predictor.

Checkpoint (Check after completing part a)

hint: \(\hat{Y}\) is predicted distance in yards, \(X\) is the percent fairways hit, and the slope is -0.66

(b) Using the model, predict the driving distance for the first golfer in the data set. Include units in your answer.
(c) Calculate the residual for the first golfer in the data set. Include units in your answer.

Exercise 4. Using the birth weight model from Exercise 2 (birth weight of the baby, in pounds, as the response and mother’s age, in years, as the predictor), complete the following.

(a) Interpret the estimated slope in context of the problem.
(b) Interpret the estimated intercept in context of the problem.
(c) Should we interpret the intercept here? Why or why not?

Checkpoint (Check after completing part c)

hint: your answer should include the word extrapolation

(d) The \(R^2\) value for this example is 0.00179. Interpret this \(R^2\) in context of the problem.

Checkpoint (Check after completing part d)

answer should be along the lines of: the percentage of variability in baby weight that can be explained by a model with age of the mother is 0.179%.
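
For reference, \(R^2\) can be computed as \(1 - SSE/SST\), where \(SSE\) is the sum of squared residuals and \(SST\) is the sum of squared deviations of the response from its mean. The Python sketch below illustrates that calculation on a tiny hypothetical data set (the `x` and `y` values are made up and are not the births data), and then shows how a proportion such as 0.00179 is converted to a percentage. The function names are illustrative only.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(actual, predicted):
    """R^2 = 1 - SSE / SST."""
    mean_y = sum(actual) / len(actual)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(actual, predicted))
    sst = sum((yi - mean_y) ** 2 for yi in actual)
    return 1 - sse / sst


# Hypothetical data, chosen only to keep the arithmetic easy to follow.
x = [0, 1, 2, 3]
y = [6, 8, 7, 9]

intercept, slope = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]
print(round(r_squared(y, predicted), 2))  # 0.64

# Converting the R^2 reported in part (d) to a percentage.
print(f"{0.00179:.3%}")  # 0.179%
```

In the hypothetical example, 64% of the variability in `y` is explained by the fitted line; the same reading applies to any \(R^2\) once it is expressed as a percentage.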

Exercise 5. Using the golf model from Exercise 3 (driving distance, in yards, as the response and percent fairways hit as the predictor), complete the following.

(a) Interpret the estimated slope in context of the problem.
(b) Interpret the estimated intercept in context of the problem.
(c) Should we interpret the intercept here? Why or why not?

Checkpoint (Check after completing part c)

hint: your interpretation should be roughly the same as part c of Exercise 4

(d) The \(R^2\) value for this example is 0.1559. Interpret this \(R^2\) in context of the problem.
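
One way to see what the slope and intercept mean numerically is to compare predictions directly: raising the predictor by one unit changes the prediction by exactly the slope, and setting the predictor to zero returns the intercept. The Python sketch below does this with the golf coefficients from the Exercise 3 output. The function name `predict_distance` and the accuracy values 60 and 61 are illustrative choices, not part of the assignment.

```python
# Coefficients from the "driving_distance ~ driving_accuracy" output.
INTERCEPT = 336.9  # yards
SLOPE = -0.6554    # yards per one percentage point of fairways hit


def predict_distance(pct_fairways):
    """Predicted driving distance (yards) for a given percent of fairways hit."""
    return INTERCEPT + SLOPE * pct_fairways


# A one-point increase in accuracy changes the prediction by the slope.
change = predict_distance(61.0) - predict_distance(60.0)
print(round(change, 4))  # -0.6554

# The intercept is the prediction when the predictor equals zero.
print(predict_distance(0.0))  # 336.9
```

The three golfers shown in the data table have accuracies between about 58% and 65%, so a predictor value of zero lies far from those observations, which is worth keeping in mind for part (c).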

Exercise 6. In this exercise, we will use the birth data from North Carolina to model the response variable baby weight (in pounds) with a categorical predictor smoke (0 if the mother is not a smoker, 1 if the mother is a smoker). Recall that data were collected from 150 births at a hospital in North Carolina. Below is a side-by-side boxplot of the data, the fitted regression model, and a table of data for the first two mothers in the data set.

Fitting linear model: weight ~ smoke
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	7.18	0.149	48.18	2.398e-92
smokesmoker	-0.4005	0.2581	-1.552	0.1229

weight	m_age	smoke
6.88	30	smoker
7.69	36	nonsmoker

(a) Write down the fitted regression equation using the output above. Define the indicator variable in your model.

Checkpoint (Check after completing part a)

hint: \(\hat{Y}\) is predicted weight, and \(smoke_{ind}\) is equal to a \(0\) if the mother was not a smoker and a \(1\) if the mother was a smoker.

(b) Predict the baby weight for a mother who does not smoke.
(c) Predict the baby weight for a mother who does smoke.
(d) Find the residual for the first mother in the data set. Include units in your answer.
(e) Interpret the intercept in context of the problem.
(f) Interpret the slope in context of the problem.

Checkpoint (Check after completing part f)

your answer should be along the lines of: we expect the average baby weight for mothers who do smoke to be 0.4005 pounds less than the average baby weight for mothers who do not smoke.
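
With a categorical predictor, the model only ever produces two predictions, one per group, and the coefficient on the indicator is the gap between them. The Python sketch below encodes the indicator as defined in the hint for part (a) and confirms that gap using the coefficients from the output above. The function names `smoke_indicator` and `predict_weight` are illustrative only.

```python
# Coefficients from the "weight ~ smoke" output.
INTERCEPT = 7.18       # pounds
SMOKER_COEF = -0.4005  # pounds, coefficient on the smoker indicator


def smoke_indicator(status):
    """Return 1 if the mother is a smoker and 0 if she is a nonsmoker."""
    if status == "smoker":
        return 1
    if status == "nonsmoker":
        return 0
    raise ValueError(f"unknown smoking status: {status!r}")


def predict_weight(status):
    """Predicted baby weight (pounds) for the given smoking status."""
    return INTERCEPT + SMOKER_COEF * smoke_indicator(status)


# The difference between the two group predictions equals the coefficient.
gap = predict_weight("smoker") - predict_weight("nonsmoker")
print(round(gap, 4))  # -0.4005
```

The same structure applies to any two-level predictor: the intercept is the prediction for the group coded 0, and the indicator coefficient shifts it for the group coded 1.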

Exercise 7. Recall the Stroop effect experiment that we performed in class a couple of weeks ago. In this experiment, we separated into two groups: a control group that completed a test with words whose colours matched the word description, and a stroop group that completed a test with words whose colours did not match the word description. The response variable we recorded was the time it took to complete the test.

Below is a side-by-side boxplot of the data we collected, along with output from a fitted regression model with the indicator variable defined with control = 0 and stroop = 1.

Fitting linear model: time ~ test
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	19.09	2.206	8.652	7.672e-09
teststroop	12.47	3.006	4.148	0.000362

(a) Write down the fitted regression equation using the output above. Define the indicator variable in your model.
(b) Predict the time to complete the test for someone in the control test group.
(c) Predict the time to complete the test for someone in the stroop test group.
(d) Interpret the intercept in context of the problem.
(e) Interpret the slope in context of the problem.