Section 2 Homework

Reminder: You are allowed to work with other students on the homework assignments but you must acknowledge who you worked with at the top of your homework assignment.

At the top of your assignment, list your collaborators (if any):

Statement of Integrity: All work submitted is my own, and I have followed all rules for collaboration. If I used any AI on this assignment, I have clearly stated both (1) what my search prompt(s) was/were and (2) what I used from the AI answer(s).

Signature:

On the top of your assignment, copy the entire statement of integrity or just write the phrase “Statement of Integrity” and sign your name to it.

All homework assignments should be handwritten (unless otherwise noted).

Exercise 1. Consider the STAT 113 survey data set that was given to all STAT 113 students at the beginning of the semester. One of the variables recorded was Exercise, which was how much exercise a student gets in a “typical” week (measured in hours per week). A histogram of this variable is given below.

What is the shape of the distribution?
Provide an educated guess for the median amount of exercise.

Checkpoint (Check after completing part b)

answer should be somewhat close to 10 hours.

Provide an educated guess for the IQR.

Checkpoint (Check after completing part c)

Q1 is approximately 7 hours.
Q3 is approximately 15 hours.

Provide an educated guess for the standard deviation.

Checkpoint (Check after completing part d)

answer should be somewhat close to 5.81 hours.

Would you expect the mean exercise to be larger than, smaller than, or about the same as the median exercise? Give a short reason.
Would you use the mean or the median to describe the center of the distribution? Would you use the standard deviation or the IQR to describe the spread? Give a short reason.

Checkpoint (Check after completing part f)

the answer here must match whether or not you think there is “extreme” skewness or outliers.
- no extreme skewness and no extreme outliers: fine to use mean for center and standard deviation for spread.
- extreme skewness or extreme outliers: median for center and IQR for spread.
what counts as “extreme” is somewhat subjective here!
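To see why extreme outliers push the choice toward the median, here is a minimal sketch using made-up exercise hours (these values are hypothetical, not the survey data). It compares the mean and median before and after adding one extreme observation.

```python
from statistics import mean, median

# Hypothetical weekly exercise hours (made up for illustration, not the survey data).
hours = [4, 6, 7, 8, 9, 10, 11, 12]

# The same values plus one extreme observation.
hours_with_outlier = hours + [60]

mean_before, median_before = mean(hours), median(hours)
mean_after, median_after = mean(hours_with_outlier), median(hours_with_outlier)

print(mean_before, median_before)          # 8.375 8.5
print(round(mean_after, 2), median_after)  # 14.11 9
```

In this made-up example the single extreme value moves the mean by almost 6 hours but moves the median by only half an hour.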

Exercise 2. The following gives a histogram of the Travel time to SLU from the STAT 113 survey.

What is the shape of the distribution?
Would you use the mean or the median to describe the center of the distribution? Would you use the standard deviation or the IQR to describe the spread? Give a short reason.
Which of these two variables do you think would have a larger standard deviation? The amount of pollutants emitted during a 100-mile drive for 50 randomly selected Subaru Crosstreks? Or, the amount of pollutants emitted during a 100-mile drive for 50 randomly selected general vehicles? Give a one sentence reason for your choice.

Exercise 3. The following is a histogram of the GPA values collected from the STAT 113 survey.

What is the shape of the distribution?
Provide an educated guess for the median GPA.
Provide an educated guess for the IQR.
Provide an educated guess for the standard deviation.
Would you expect the mean GPA to be larger than, smaller than, or about the same as the median GPA? Give a short reason.
Would you use the mean or the median to describe the center of the distribution? Would you use the standard deviation or the IQR to describe the spread? Give a short reason.
Sketch a histogram of GPA that has a higher standard deviation than the histogram given above.

Exercise 4. The following are 22 GPA values (these are not from the STAT 113 survey): 1.0, 2.4, 2.5, 2.6, 3.0, 3.0, 3.0, 3.0, 3.2, 3.2, 3.2, 3.2, 3.4, 3.6, 3.6, 3.6, 3.6, 3.75, 3.75, 3.75, 3.75, 4.0.

Some summary statistics of these GPA values are given in the following table.

Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
1	3	3.2	3.186	3.6	4

Construct a boxplot of the GPA values, making sure to explicitly show work for calculation of any outliers.

Checkpoint (Check after completing the exercise)

make sure to use the upper and lower fence calculations to determine if there are any outliers (observations above the upper fence or below the lower fence are outliers and should be marked as asterisks).
if there are outliers, make sure that the “whiskers” of the boxplot extend to the smallest observation that is not an outlier (not to the lower fence!) and the largest observation that is not an outlier (not to the upper fence!).
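The fence and whisker logic can be sketched on a made-up data set (the values and quartiles below are hypothetical and are not the GPA values from this exercise). The sketch assumes the usual 1.5 × IQR rule for the fences and takes Q1 and Q3 as given, as they would be from a summary table.

```python
# Hypothetical data (made up; not the GPA values from the exercise).
values = [2, 10, 11, 12, 12, 13, 14, 15, 16, 30]
q1, q3 = 11, 15  # quartiles taken as given, as from a summary table

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # assumes the usual 1.5 * IQR rule
upper_fence = q3 + 1.5 * iqr

# Observations beyond either fence are outliers.
outliers = [v for v in values if v < lower_fence or v > upper_fence]
non_outliers = [v for v in values if lower_fence <= v <= upper_fence]

# Whiskers stop at the most extreme observations that are NOT outliers,
# not at the fences themselves.
whisker_low = min(non_outliers)
whisker_high = max(non_outliers)

print(lower_fence, upper_fence)   # 5.0 21.0
print(outliers)                   # [2, 30]
print(whisker_low, whisker_high)  # 10 16
```

Note that the whiskers in this example end at 10 and 16, which are actual observations, even though the fences sit at 5 and 21.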

Exercise 5. Is there a difference in the “average” height among SLU students who play a sport and SLU students who do not play a sport? We will use the survey data to explore this question.

Using the graph above, give the median height for SLU students who do not play a sport and the median height for SLU students who do play a sport.
Does it seem like whether or not SLU students play a sport and height are associated? In other words, is the “typical” height different for SLU students who play a sport than for SLU students who do not play a sport? Make sure to answer the question of interest using both a measure of center and how much overlap there seems to be in the two distributions.

Exercise 6. For this exercise, we will examine whether GPA is associated with Award preference (recorded as Olympic, Nobel, and Academy). In this case, our categorical variable has more than 2 levels. We would say that GPA and Award are associated if the average GPA is substantially different for any of the three levels of Award.

Does it seem like award preference and GPA are associated? In other words, is the “typical” GPA different for any of the three Award groups? Make sure to answer the question of interest using both a measure of center and how much overlap there seems to be in the three distributions.
If, in the previous question, you said it does look like award preference and GPA are associated, sketch a set of side-by-side boxplots that shows little evidence of an association between award preference and GPA. If, in the previous question, you said it does not look like award preference and GPA are associated, sketch a set of side-by-side boxplots that shows evidence of an association between award preference and GPA.

Exercise 7. For each of the following settings, identify whether or not the data are paired.

We want to compare stress levels of SLU employees before and after a mindfulness program. To do so, we measure the stress levels of 100 employees before and after the mindfulness program.
We want to compare stress levels of SLU employees before and after a mindfulness program. To do so, we measure the stress levels of 50 employees who just completed the mindfulness program and 50 employees who did not complete the mindfulness program.
We want to compare housing prices in Potsdam with housing prices in Canton, collecting a random sample of 100 homes in each town.
We measure reaction times of 100 athletes before a speed training regime and then again after a speed training regime.

Exercise 8. Suppose that I want to determine if there is statistical evidence that one of the first two assessments in this course was harder than the other. To do so, I obtain scores on the first two assessments in this class for each student in this class. The first three students are shown below (to protect anonymity of all students in the course and because we may not have yet taken the second assessment, the data shown below are fake).

assessment1	assessment2	difference
56	63	7
51	75	24
55	55	0

In one sentence, explain why the data collected here are paired.
Below is a histogram of the differences (assessment2 minus assessment1). Is there evidence for a substantial difference in mean scores in the first two assessments? If so, which assessment seems to have been the more challenging one, on average?

Draw a histogram of differences that does not show evidence that either assessment was harder, on average, than the other.

Checkpoint (Check after completing part c)

for there to be no difference, on average, your histogram of differences must be centered around 0.
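As a small illustration of how paired differences are formed, the sketch below recomputes the difference column for the three fake students shown in the table (assessment2 minus assessment1) and averages it. Three students are far too few to draw conclusions; the point is only the mechanics.

```python
# Fake scores for the first three students, copied from the table in this exercise.
assessment1 = [56, 51, 55]
assessment2 = [63, 75, 55]

# One difference per student: assessment2 minus assessment1.
differences = [a2 - a1 for a1, a2 in zip(assessment1, assessment2)]
mean_difference = sum(differences) / len(differences)

print(differences)                # [7, 24, 0]
print(round(mean_difference, 2))  # 10.33
```

A histogram of differences centered around 0 corresponds to a mean difference close to 0.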

Exercise 9. Go to https://openintro.shinyapps.io/correlation_game/ and change the setting to “Moderate.” Drag the slider to guess the correlation coefficient and click “Submit.” Keep guessing until the app reveals what the actual correlation coefficient is. Repeat this a few times. Then, change the setting to “Hard”, guess the correlation coefficient, and repeat a few times. From what you learn in the app and what we discussed in class, draw quick sketches of scatterplots for generic variables x and y that satisfy the following conditions.

Correlation coefficient of about -0.97.
Correlation coefficient equal to about 0.04.
Relationship between y and x negative, weak, and linear.
Relationship between y and x positive, strong, and linear.
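For intuition about how the sign and strength of a linear relationship show up in the correlation coefficient, here is a minimal sketch that computes the Pearson correlation by hand for three made-up data sets. The helper name `pearson_r` and all values are hypothetical and are not tied to the app or to any course data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, made up for illustration.
x = [1, 2, 3, 4, 5]
y_strong_pos = [2.1, 3.9, 6.2, 8.0, 9.8]  # points close to an increasing line
y_strong_neg = [9.8, 8.0, 6.2, 3.9, 2.1]  # points close to a decreasing line
y_weak_pos = [5, 1, 4, 2, 6]              # scattered, slight upward trend

r_strong_pos = pearson_r(x, y_strong_pos)
r_strong_neg = pearson_r(x, y_strong_neg)
r_weak_pos = pearson_r(x, y_weak_pos)

print(round(r_strong_pos, 3), round(r_strong_neg, 3), round(r_weak_pos, 3))
```

The first two values land close to 1 and -1, while the scattered data give a small positive value.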

Exercise 10. Standardized testing is still commonly used at many universities as part of the data for admissions decisions. To investigate the association between test scores and first year GPA, data were collected on 1000 second-year students at an unnamed institution. The data were obtained from the OpenIntro textbook. Included among the variables collected are

fy_gpa, the student’s first year GPA
sat_v, the student’s SAT Verbal score
sat_m, the student’s SAT Math score
sat_sum, the student’s SAT Verbal score plus their SAT Math score
hs_gpa, the student’s high school GPA

Below is a scatterplot of fy_gpa vs. sat_m:

Describe the relationship between first year GPA and SAT Math score.

Checkpoint (Check after completing part a)

answer should include three components:
- linear.
- weak (or moderate).
- positive.

Provide a guess for the correlation coefficient between SAT Math score and first-year GPA.

Below is a scatterplot of high school GPA and first year GPA.

Describe the relationship between first year GPA and high school GPA.

Exercise 11. Consider again the Sport and Award variables collected in the STAT 113 survey data set. Sport is Yes if the student plays a sport here at SLU and No if the student does not. Award preference is either Academy, Nobel, or Olympic.

	Academy Award	Nobel Prize	Olympic Gold
No	7	32	11
Yes	5	24	75

Of those that do not play a sport, what proportion would want to win an Olympic Medal? Of those that do not play a sport, what proportion would want to win an Academy Award? Of those that do not play a sport, what proportion would want to win a Nobel Prize?

Checkpoint (Check after completing part a)

Medal: 0.22
Academy: 0.14
Nobel: 0.64
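The checkpoint values come from dividing each count in the No row by that row's total. A minimal sketch of that arithmetic, using the counts from the table above (the variable names are only illustrative):

```python
# Counts for students who do not play a sport (the "No" row of the table).
no_sport = {"Academy Award": 7, "Nobel Prize": 32, "Olympic Gold": 11}

row_total = sum(no_sport.values())  # 7 + 32 + 11 = 50

# Each proportion is conditional on being in the "No" row.
row_props = {award: count / row_total for award, count in no_sport.items()}

for award, prop in row_props.items():
    print(award, round(prop, 2))
# Academy Award 0.14
# Nobel Prize 0.64
# Olympic Gold 0.22
```

Because every count is divided by the same row total, the three proportions add up to 1.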

Of those that play a sport, what proportion would want to win an Olympic Medal? Of those that play a sport, what proportion would want to win an Academy Award? Of those that play a sport, what proportion would want to win a Nobel Prize?
Are the proportions you calculated row proportions or column proportions?

A stacked bar chart of the proportions that were calculated is given below.

Based on the stacked barplot and on the proportions in the earlier questions, does there seem to be an association between whether or not STAT 113 SLU students play a sport and the type of Award they would want to win? Explain. (Note that the Award variable has more than 2 levels: we would consider there to be an association if the proportion of those who choose any of the three levels is “substantially” different for those who play a sport vs. those who do not play a sport.)

Checkpoint (Check after completing part d)

Yes, it looks like the variables are associated because the proportion of students who play a sport that responded they would prefer the Olympic medal is much higher than the proportion of students who do not play a sport that responded they would prefer the Olympic medal. We can see this in the bar plot because the distribution of No sport students is substantially different from the distribution of Yes sport students.

Exercise 12. The military data set from the OpenIntro textbook contains data on over one million members of the United States military. Included among the variables collected are:

branch, the branch of the military
gender, collected as either male or female for this data set

Use the following two-way (contingency) table to answer the questions below.

	female	male
air force	64200	267286
army	74845	481004
marine corps	13200	189766
navy	50473	273819

Of the sampled people in the air force, what proportion are female? Of the sampled people in the army, what proportion are female? Of the sampled people in the marine corps, what proportion are female? Of the sampled people in the navy, what proportion are female?

Examine the following stacked bar plot with row proportions on the y-axis and military branch on the x-axis, with bars coloured or shaded by gender.

Does it seem that military branch and gender are associated? Why or why not? Use both the proportions in part a and the bar plot above to support your answer.
Regardless of your answer to part b, sketch a stacked bar plot that shows even more evidence of an association between branch and gender than the actual data does.

Exercise 13. Consider again the baseball data set on Major League Baseball players in 2018. Included among the variables collected for these players are their position and whether or not they hit more than 10 homeruns. For this exercise, we will only look at the 371 players whose positions are in the infield. We want to explore whether there is an association between infield position and whether a player hits more than 10 homeruns. Below is a contingency table of the data.

	10 or less	more than 10
1B	10	41
2B	33	23
3B	26	29
SS	22	21

What are the two categorical variables and how many levels does each variable have?
What percent of the 1B, first base players, have hit more than 10 homeruns? Is this a row or column percent?
What percent of the 2B, second base players, have hit more than 10 homeruns? Is this a row or column percent?

Below is a stacked bar plot of the data.

Based on the stacked bar plot, does it seem like infield position and whether or not the player hit more than 10 homeruns are associated? Give a short reason.
Regardless of your answer to the previous question, sketch a stacked bar plot of position and homerun_cat that would show less of an association than the data that we actually observed.

Exercise 14. Consider a hypothetical data set where a researcher collects information on used cars throughout northern New York. For each used car the researcher samples, they record

the car’s price
the car’s colour
the number of miles driven
the age of the car
whether the car has automatic or manual transmission
the car’s type (sedan, SUV, etc.)
the mileage of the car (in miles per gallon)

For the six types of plots we have discussed, (1) Histogram or Boxplot, (2) Bar plot, (3) Side-by-side Boxplots, (4) Scatterplot, (5) Stacked Barchart, (6) Histogram of Differences or Line Plot, choose the plot that would be most appropriate to explore each of the following questions of interest.

What is the average age of used cars in northern New York?

Checkpoint (Check after completing part a)

Histogram or Boxplot.
Recall that to determine the appropriate graph, you first can determine (1) how many variables are in the question of interest (just 1 here: age) and (2) if each variable is quantitative or categorical (car age is quantitative).

What is the relationship between age of a car and the number of miles driven on the car?
What is the relationship between car colour and car price?
What is the proportion of cars that have each transmission type?
Is there an association between the transmission and the car type (Sedan, SUV, etc.)?
Is there an association between the transmission and the car price?