Section 4 Homework

Reminder: You are allowed to work with other students on the homework assignments but you must acknowledge who you worked with at the top of your homework assignment.

At the top of your assignment, list any collaborators (if any):

Statement of Integrity: All work submitted is my own, and I have followed all rules for collaboration.

Signature:

On the top of your assignment, copy the entire statement of integrity or just write the phrase “Statement of Integrity” and sign your name to it.

All homework should be handwritten (unless otherwise noted).

Exercise 1. Choose one of the following four videos to watch. I’ve selected these videos because they do not have a lot of code or technical detail. However, there may be specific things that you don’t understand in each video, which is totally fine: you’ll still be able to understand the general ideas and purpose!

Effective Visualizations: https://resources.rstudio.com/rstudio-conf-2020/effective-visualizations-miriah-meyer
Sports Analytics: https://resources.rstudio.com/rstudio-conf-2020/r-tidyverse-in-sports-namita-nandakumar
Humanitarian Data Science: https://www.rstudio.com/resources/rstudioglobal-2021/humanitarian-data-science-with-r/
ACLU Data Visualization: https://www.rstudio.com/resources/rstudioglobal-2021/trial-and-error-in-data-viz-at-the-aclu/

After you watch the video, write down 3 key messages from the video, each key message being 1 - 2 sentences.

Exercise 2. Standardized testing is still commonly used at many universities as part of the data for university admissions decisions. To investigate the association between test scores and first-year GPA, data were collected on 1000 second-year students at an unnamed institution. The data were obtained from the OpenIntro textbook. Included among the variables collected are

fy_gpa, the student’s first year GPA
sat_v, the student’s SAT Verbal score
sat_m, the student’s SAT Math score
sat_sum, the student’s SAT Verbal score plus their SAT Math score
hs_gpa, the student’s high school GPA
sex, the sex of the student (either female = 0 or male = 1 in this data set).

We are interested in building a predictive model for fy_gpa, a student’s first year GPA at the university. We randomly split the 1000 students into a training sample of 800 students and a test sample of 200 students.

The first model we consider is a model with sat_v, sex, and hs_gpa as predictors for the training data set of 800 students:

Fitting linear model: fy_gpa ~ sat_v + hs_gpa + sex
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	-0.3927	0.1627	-2.414	0.01602
sat_v	0.01988	0.002852	6.968	6.733e-12
hs_gpa	0.6021	0.04326	13.92	1.391e-39
sexmale	-0.0785	0.04398	-1.785	0.07466

Write the fitted regression equation with the three predictors.

Checkpoint (Check after completing part a)

your answer should look like: \(\hat{Y} = -0.3927 + 0.0199 sat_v + 0.6021 hs_{gpa} - 0.0785 sex_{ind}\)

The following table gives the first three students in the test data set.

sex	sat_v	sat_m	sat_sum	hs_gpa	fy_gpa
male	65	62	127	3.4	3.18
female	58	64	122	4	3.33
female	56	60	116	3.75	3.25

Use the model that was fit with the training set to predict the first-year GPA of the first student in the test data set. Then, calculate the prediction error: the observed first-year GPA for the first student minus the first-year GPA that the model predicts for them.

Checkpoint (Check after completing part b)

your answer should be: 0.312 points.
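
If you would like to double-check the arithmetic behind this checkpoint, here is a minimal Python sketch. It is only an illustration, not part of the assignment: the function name predict_fy_gpa is made up for this example, the coefficients are copied from the regression output above, and sex is assumed to be coded as 1 for male and 0 for female, matching the sexmale row.

```python
# Coefficients copied from the output for fy_gpa ~ sat_v + hs_gpa + sex.
INTERCEPT = -0.3927
B_SAT_V = 0.01988
B_HS_GPA = 0.6021
B_SEXMALE = -0.0785


def predict_fy_gpa(sat_v, hs_gpa, male):
    """Predicted first-year GPA; `male` is 1 for male, 0 for female (assumed coding)."""
    return INTERCEPT + B_SAT_V * sat_v + B_HS_GPA * hs_gpa + B_SEXMALE * male


# First student in the test table: male, sat_v = 65, hs_gpa = 3.4, fy_gpa = 3.18.
predicted = predict_fy_gpa(sat_v=65, hs_gpa=3.4, male=1)
error = 3.18 - predicted  # observed minus predicted

print(round(predicted, 3), round(error, 3))
```

The same function can be reused for the other students in the table by changing the three inputs and the observed GPA.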

Use the model that was fit with the training set to predict the first-year GPA of the second student in the test data set. Then, calculate the prediction error: the observed first-year GPA for the second student minus the first-year GPA that the model predicts for them.
Use the model that was fit with the training set to predict the first-year GPA of the third student in the test data set. Then, calculate the prediction error: the observed first-year GPA for the third student minus the first-year GPA that the model predicts for them.
Suppose that, instead of 200 students, your test data set only consisted of these 3 students. Calculate the mean absolute error for this tiny test sample of 3 students.

For the test sample with all 200 students, the mean absolute error is actually 0.4335 points. Interpret this value in context of the problem.

Checkpoint (Check after completing part f)

your answer should be along the lines of: the average error in predicting first-year GPA for a model with SAT Verbal, high school GPA, and sex as predictors is 0.4335 points.
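
As a reminder of how a mean absolute error is put together, here is a short Python sketch. The function name mean_absolute_error and the GPA values below are hypothetical and made up for illustration; they are not taken from the homework data.

```python
def mean_absolute_error(observed, predicted):
    """Average of |observed - predicted| over all cases."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    abs_errors = [abs(o - p) for o, p in zip(observed, predicted)]
    return sum(abs_errors) / len(abs_errors)


# Hypothetical GPAs, made up for illustration (not from the homework data).
observed = [3.0, 2.5, 3.6, 2.0]
predicted = [2.8, 2.9, 3.5, 2.3]

mae = mean_absolute_error(observed, predicted)
print(round(mae, 2))
```

Taking absolute values before averaging keeps positive and negative prediction errors from canceling each other out.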

Now suppose we fit another model (output omitted) with sat_v, hs_gpa, sex, and sat_m as predictors. The mean absolute error for this model on the test data set is 0.4384 points. Which model is the better model, according to the mean absolute error criterion? Give a one phrase reason.
Based on the results, make an argument for why you think the Math SAT should or should not be a part of the admissions applications for universities. You can use reasoning outside of this example, but you should incorporate at least something from this analysis.

Checkpoint (Check after completing part h)

hint: you should compare the model’s mean absolute error with and without the Math SAT.

Exercise 3. Suppose that marathon running times for two different marathon races are normally distributed. The first race, Race A, is run on a road and has a mean time of 230 minutes with a standard deviation of 20 minutes. The second race, Race B, is run on a hilly trail and has a mean time of 270 minutes and a standard deviation of 30 minutes. You run both marathons, finishing the first race (Race A) in 225 minutes and the second race, Race B, in 240 minutes.

Draw the distribution of times for Race A using the 68-95-99 rule. Mark where your time falls on the distribution.
Calculate the Z-score for your Race A time.

Checkpoint (Check after completing part b)

your answer should be: \(Z = -0.25\)
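
The Z-score calculation for this checkpoint can be sketched in Python as follows. The helper name z_score is made up for this example; the inputs are the Race A time, mean, and standard deviation given in the exercise.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# Race A: finishing time 225 minutes, mean 230 minutes, standard deviation 20 minutes.
z_race_a = z_score(225, mean=230, sd=20)
print(z_race_a)
```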

Interpret your Z-score for Race A in context of the problem.

Checkpoint (Check after completing part c)

your answer should be: Our time to finish the race is 0.25 standard deviations below the mean race time.

Using StatKey, find the proportion of runners that ran the race faster (in less time) than your race time.
Draw the distribution of times for Race B using the 68-95-99 rule. Mark where your time falls on the distribution.
Calculate the Z-score for your Race B time.
In which race was your time “more unusual”? Give a one phrase reason.

Exercise 4. Suppose that the scores on an online IQ test are normally distributed with \(\mu = 150\) and \(\sigma = 25\). You have two friends take the IQ test. One friend (Friend A) scores a 175 and the other friend (Friend B) scores a 110.

Calculate the z-scores for both of your friends.

Checkpoint (Check after completing part a)

Friend A: 1
Friend B: -1.6

Interpret the z-score for Friend B in context of the problem.
Which friend had the more “unusual” score on the intelligence test? Give a one sentence explanation.
Using StatKey, find the proportion of people that score higher than Friend A on the test.

Checkpoint (Check after completing part d)

your answer should be: 0.1587
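
StatKey is the tool to use for this exercise, but if you are curious, the same proportion can be sketched with Python's built-in statistics module. This is only a cross-check under the normal model stated in the exercise; the variable names are made up for this example.

```python
from statistics import NormalDist

# Normal model for the online IQ test scores: mean 150, standard deviation 25.
iq = NormalDist(mu=150, sigma=25)

# Proportion of people scoring higher than Friend A's score of 175.
prop_above_a = 1 - iq.cdf(175)
print(round(prop_above_a, 4))
```

Here cdf(175) gives the proportion at or below 175, so subtracting from 1 gives the upper tail.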

Using StatKey, find the proportion of people that score lower than Friend B on the test.
A third friend takes a different online IQ test. This test has a different mean and a different standard deviation than the test your first two friends took. Explain why z-scores would be useful for comparing your third friend’s result with the results of your other two friends.

Checkpoint (Check after completing part f)

hint: what if the third friend’s IQ was measured on a completely different scale?

Exercise 5. Suppose that you want to collect some data on what faculty attitudes are about diversity here at SLU. You have a set of questions you plan to ask faculty members. Note that, if conducting this survey with plans to publicize results, you would need IRB approval to interview human subjects. But, ignoring the need for approval, describe how you would obtain the following:

A simple random sample
A convenience sample
A volunteer sample
A census

Exercise 6. Make an argument for why the volunteer sample you discussed in the previous exercise might not be representative of faculty attitudes about diversity.

Exercise 7.

Suppose that a gardener wants to determine if a new brand of fertilizer improves her tomato plant growth, compared to the old brand of fertilizer. The gardener has 50 pots with 50 tomato plants to use for her study. Design an experiment for the gardener.

Checkpoint (Check after completing the exercise)

your experiment must include randomization: that is, for this to be an experiment (not an observational study), the gardener must randomly select some pots to receive the old fertilizer and some pots to receive the new fertilizer.

Suppose that, instead of the experiment, the gardener performs an observational study and just uses the new fertilizer for the northern half of her garden and uses the old fertilizer for the southern half. What might be a confounding variable in this study? Explain why the variable you chose might be confounding.

Checkpoint (Check after completing the exercise)

answers will vary here, but you must argue that:
- whatever variable you selected is associated with whether or not the plant received the new or old fertilizer (the explanatory).
- whatever variable you selected is associated with growth (the response).
one logical confounding variable to choose would be the amount of sun received.

Exercise 8. We discussed a few ways to collect data in class. In reality, sampling methods are typically more complicated than what we have discussed. There are entire courses devoted to sampling methods, survey methods, and collecting data. Let’s investigate some of these complexities using the job discrimination data that we introduced on the first day of class.

For this exercise, you will need to open the Job_Discrimination_Paper.pdf file on Canvas, which is the published paper on the job discrimination data. To answer the questions below, you will read particular sections of this paper: you do not need to read the entire paper for this question.

Read the Abstract and Introduction.

What do the authors mean when they say, at the top of page 3, “Since applicants’ names are randomly assigned, this gap can only be attributed to the name manipulation”?
Aside from the main finding of racial discrimination, what is one other result the authors state that you found interesting or important?

Read Section 5.1.

We have stated many times that, in experiments, confounding variables are balanced out when randomization into groups occurs. However, in this experiment, there could still be a confounding variable because we aren’t actually randomizing people; we are only randomizing names. What is one possible confounding variable given in this section? Why is it confounding? (Recall our definition of a confounding variable from earlier in the semester.)

Exercise 9. Suppose we are interested in the proportion of students that are right-handed at SLU. We take a convenience sample of STAT 113 students, finding that, of the 148 surveyed, 137 are right-handed.

What is the population of interest?
What is the parameter of interest and what is the notation for this parameter?
What is the point estimate and what is the notation for this point estimate?
Is our sample representative of the population of interest for the handedness characteristic? Why or why not?

Exercise 10. Suppose that we are interested in the proportion of maple trees in St. Lawrence County that are taller than 40 feet. The local government painstakingly measures every single tree in a census, finding that the proportion of all maple trees in St. Lawrence County taller than 40 feet is 0.29.

Even though you know the true proportion of trees in St. Lawrence County that are taller than 40 feet, you decide to take a sample of 20 trees in the county. Describe what sampling uncertainty/variability means in the context of this example.
Describe what the sampling distribution of the sample proportion is in context of this example.

Checkpoint (Check after completing part b)

hint: the sampling distribution of the sample proportion would result from repeatedly sampling 20 trees in St. Lawrence County millions of times.
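
The idea of "repeatedly sampling 20 trees" can be sketched with a small Python simulation. This is only an illustration: it assumes each sampled tree is independently taller than 40 feet with probability 0.29 (reasonable when the county has far more than 20 trees), and the number of repeated samples and the random seed are arbitrary choices made for this example.

```python
import random
from statistics import mean, stdev

TRUE_P = 0.29        # census proportion of trees taller than 40 feet
SAMPLE_SIZE = 20     # trees in each sample
N_SAMPLES = 10_000   # number of repeated samples (arbitrary choice)

rng = random.Random(113)  # fixed seed (arbitrary) so the run is reproducible


def one_sample_proportion():
    """Draw one sample of trees and return the proportion taller than 40 feet."""
    tall = sum(rng.random() < TRUE_P for _ in range(SAMPLE_SIZE))
    return tall / SAMPLE_SIZE


# Each entry is the sample proportion from one simulated sample of 20 trees.
sample_props = [one_sample_proportion() for _ in range(N_SAMPLES)]

print("mean of sample proportions:", round(mean(sample_props), 3))
print("sd of sample proportions:  ", round(stdev(sample_props), 3))
```

The collection of sample proportions in sample_props is a simulated version of the sampling distribution: different samples give different proportions, and the spread of those proportions is the sampling variability.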

Use the sampling distribution app from Handout 4c (https://stlawu.shinyapps.io/samp_dist_conf_int/) to build the sampling distribution of the sample proportion for a sample size of 20. Write down the mean and standard deviation (standard error) of the sampling distribution.
Provide a calculation for the standard deviation of the sampling distribution and verify that your calculation is approximately the same as the standard deviation from the App.
Will the sampling distribution of the sample proportion be approximately normally distributed? Provide a calculation as evidence for your answer.

Checkpoint (Check after completing part e)

hint: according to our success-failure condition, the sampling distribution will be normally distributed if both \(n \cdot p\) and \(n \cdot (1 - p)\) are larger than 10.
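
Both the standard error formula and the success-failure check are short calculations, sketched below in Python. The function names are made up for this example, and the values of p and n are hypothetical, chosen only for illustration; you would substitute the values from the exercise.

```python
from math import sqrt


def standard_error(p, n):
    """Standard deviation of the sampling distribution of a sample proportion."""
    return sqrt(p * (1 - p) / n)


def success_failure_ok(p, n, threshold=10):
    """True when both n*p and n*(1 - p) are larger than the threshold from the hint."""
    return n * p > threshold and n * (1 - p) > threshold


# Hypothetical values chosen only for illustration (not the homework's numbers).
p, n = 0.40, 100
print(round(standard_error(p, n), 4))
print(success_failure_ok(p, n))
```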

Now suppose that your friend takes a sample of 50 trees in St. Lawrence County. Will the spread of your friend’s sampling distribution of the sample proportion be greater than or less than the spread of your sampling distribution of the sample proportion with a sample size of 20 trees? Give a brief reason.
When your friend takes their sample of 50 trees, they find that the sample proportion of trees taller than 40 feet is 0.3. Calculate a 95% confidence interval for the proportion of trees taller than 40 feet using your friend’s sample.

Checkpoint (Check after completing part g)

your 95% confidence interval should look like: (0.1742, 0.4258)

Interpret your interval in context of the problem.
Why does it not make sense to calculate and interpret a confidence interval for this example?

Checkpoint (Check after completing part i)

hint: your answer should include the word census.

Exercise 11. Suppose that you are interested in estimating the proportion of toddlers in the U.S. who are allergic to peanuts. You randomly select 150 toddlers and record whether each one has a peanut allergy. Assume that your sample is representative of all toddlers in the U.S.

You find in your sample that 20 out of the 150 toddlers have a peanut allergy.

Describe what sampling uncertainty/variability means in context of this problem.
What is the shape of the sampling distribution of the sample proportion for this problem? You need to justify your response with an appropriate calculation or explanation.

Checkpoint (Check after completing part b)

hint: your calculation should be about the success-failure condition.

Calculate a 99% confidence interval for the proportion of all toddlers with a peanut allergy.

Checkpoint (Check after completing part c)

your confidence interval should be: (0.0618, 0.2048)
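
The steps behind this checkpoint (point estimate, standard error, critical value, margin of error) can be sketched in Python. This sketch assumes the usual normal-based interval, with the sample proportion used in the standard error; the variable names are made up for this example.

```python
from math import sqrt
from statistics import NormalDist

successes, n = 20, 150   # toddlers with a peanut allergy, sample size
confidence = 0.99

p_hat = successes / n                     # point estimate
se = sqrt(p_hat * (1 - p_hat) / n)        # standard error using the sample proportion
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 2.576 for 99%
margin = z_star * se                      # margin of error

lower, upper = p_hat - margin, p_hat + margin
print(round(lower, 4), round(upper, 4))
```

Changing confidence to 0.95 or 0.90 changes only z_star, which is a quick way to see how the confidence level affects the margin of error.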

Interpret this interval using our standard confidence interval interpretation in context of the problem.
Suppose that the true proportion of toddlers with a peanut allergy in the United States is 0.15. Using this information, explain what “99% confidence” means in context of this problem, using the idea of “repeated sampling” in your explanation.
Mark each of the following statements as True or False. Give a short one phrase or one sentence explanation for each response.

(1): The Margin of Error for a 90% confidence interval is larger than the Margin of Error for a 99% confidence interval.

(2): About 99% of all 99% confidence intervals constructed will contain the proportion \(\frac{20}{150}\).

(3): About 99% of all 99% confidence intervals constructed will contain 0.15.

Checkpoint (Check after completing parts (1), (2), and (3))

answers: (1) is FALSE, (2) is FALSE, (3) is TRUE

Exercise 12. A common topic in the U.S. is the aging of the “baby boomer” generation. Let’s look at the sampling distribution for a sample size of about 130 of the proportion of people in the U.S. who are 65 or older. The true proportion of people 65 or older at the time of the 2010 U.S. census is 0.12.

Using the app that we used for Handout 4d (https://stlawu.shinyapps.io/samp_dist_conf_int/), drag the slider for the true proportion to 0.12 and the sample size to 130. Then, answer the following questions.

Click the “Sampling Distribution” tab and write down the standard error, the standard deviation of the sampling distribution.

Checkpoint (Check after completing part a)

your answer should be approximately 0.0283

Calculate the standard error “by hand” using the formula and verify that it is approximately equal to the standard deviation in the app.
Click on the “Confidence Intervals” tab and drag the “Add and Subtract” slider until about 95% of the intervals cover the true proportion, 0.12. Write down the number on the slider.

Checkpoint (Check after completing part c)

your answer should be approximately 0.056.

By hand, calculate the Margin of Error for a 95% confidence interval for this example and verify that your calculation matches (approximately) the number on the slider in the app.

Exercise 13. Suppose we are interested in the proportion of students that are right-handed at SLU. We take a convenience sample of STAT 113 students, finding that, of the 148 surveyed, 137 are right-handed.

What is the point estimate and what is the notation for this point estimate?

Checkpoint (Check after completing part a)

hint: the notation for this point estimate is: \(\hat{p}\)

Check the success-failure condition.
Calculate a 99% confidence interval for the proportion of students who are right-handed at SLU.
Interpret the 99% confidence interval with the standard interpretation for a confidence interval.