9  Data Ethics

Goals:

9.1 Ethical Examples

We’ve tried to weave ethical issues into many of the examples already used in this course, but the purpose of this section is to put data ethics in direct focus.

Some questions to consider for any data collected, especially data collected on human subjects:

  • who gets to use data and for what purposes?

  • who collected the data and does that organization have any conflicts of interest?

  • is presentation of an analysis harmful to a particular person or group of people? Are there benefits of an analysis?

  • have the subjects of a data collection procedure been treated respectfully and have they given consent to their information being collected?

    • When is consent needed and when is it not? For example, we have looked at data on professional athletes. Do they need to provide consent or is consent inherent in being in the spotlight?

    • We’ve also scraped data from SLU’s athletics website to look at data pertaining to some of you! Is that ethical? Is there a line you wouldn’t cross pertaining to data collected on named, individual people?

Exercise 1. Read Sections 8.1 - 8.3 in Modern Data Science with R. Then, write a one-paragraph summary of the reading and how it might pertain to the way you use or interpret data.

Exercise 2. Read 2 of the subsections in 8.4 in Modern Data Science with R. These can be any two of the eight subsections of your choosing.

  1. After reading the two subsections, read the 12 principles to guide ethical action in 8.5 in Modern Data Science with R. For each of the two subsections you chose, select one of the 12 principles that you feel is most relevant to each example, explaining your reasoning.

  2. Read the two subsections in 8.5 in Modern Data Science with R corresponding to the two subsections that you read in Section 8.4. What principles do the authors reflect on for your two examples?

Exercise 3. Data Feminism is related to data ethics, though the two terms are not synonymous. Recently, Catherine D’Ignazio and Lauren F. Klein published a book called Data Feminism (https://datafeminism.io/).

Read the following blog post on Data Feminism, focusing on the section on Missing Data: https://teachdatascience.com/datafem/

Pick one example from the bulleted list and write a two-sentence explanation of why it might be important to acknowledge the missing data in an analysis.

Exercise 4. Choose 1 of the following two articles to read

  1. For the LGBTQIA+ article, write a two-sentence summary of the side of the argument that research in facial recognition software to identify members of the LGBTQ+ community should not occur, even if this viewpoint isn’t your own.

     Then, write a two-sentence summary of the side of the argument that research in facial recognition software to identify members of the LGBTQ+ community is okay as long as the results are used responsibly, even if this viewpoint isn’t your own.

  2. For the anti-racist data science article, under Step 2, pick a News Article and read the first few paragraphs. Describe, in 2-3 sentences, what your article’s example of bias is and why the incidence of bias matters.

9.1.1 Data Privacy

Related to data ethics is the idea of data privacy.

  • What data is private and what data is public? For some examples, this may seem obvious, but for others (e.g. data on a government agency that collects data on people), the answer might not be as clear-cut.
  • Is anonymous data truly anonymous?
  • What type of consent should be provided before collecting data on someone?

We will explore some of these issues in class.
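One way to probe whether "anonymous" data is truly anonymous is to count how many records share each combination of seemingly harmless attributes. The following is a minimal Python sketch of that idea, not part of the course materials: the records are made up, and the column names and the helpers `group_sizes`, `smallest_group_size`, and `unique_rows` are hypothetical names chosen for illustration. A combination held by only one record can single out a person even when no name is stored.

```python
from collections import Counter

# Hypothetical, made-up records with names already removed.
records = [
    {"class_year": "2025", "major": "Statistics", "sport": "Soccer"},
    {"class_year": "2025", "major": "Statistics", "sport": "None"},
    {"class_year": "2025", "major": "Biology", "sport": "None"},
    {"class_year": "2026", "major": "Biology", "sport": "None"},
    {"class_year": "2026", "major": "Biology", "sport": "None"},
    {"class_year": "2026", "major": "Economics", "sport": "Hockey"},
]


def group_sizes(rows, columns):
    """Count how many rows share each combination of values in `columns`."""
    return Counter(tuple(row[col] for col in columns) for row in rows)


def smallest_group_size(rows, columns):
    """Size of the rarest combination; 1 means someone is uniquely identifiable."""
    return min(group_sizes(rows, columns).values())


def unique_rows(rows, columns):
    """Rows whose combination of values in `columns` appears exactly once."""
    sizes = group_sizes(rows, columns)
    return [row for row in rows if sizes[tuple(row[col] for col in columns)] == 1]


# Class year alone hides everyone in a group of three.
print(smallest_group_size(records, ["class_year"]))
# Adding major leaves some students alone in their group.
print(smallest_group_size(records, ["class_year", "major"]))
print(unique_rows(records, ["class_year", "major"]))
```

Each added column tends to shrink the groups, so a data set that looks anonymous one variable at a time may not be anonymous once variables are combined.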

9.2 Hypothesis Generation vs. Confirmation

We have focused on hypothesis generation for all data sets in this particular course. Read the following two very short articles, one from our textbook and one from another source, that explain the difference between hypothesis generation and hypothesis confirmation:

We will discuss those readings further in class.

9.3 Practice

9.3.1 Class Exercises

Class Exercise 1. How anonymous are SLU’s course evaluations? We will do an in-class activity to investigate this.

Class Exercise 2. Suppose that I collect data on students in this Data Science class. In each setting 1 through 4, suppose that I give you a data set with the following variables collected on each student in the class. For which settings, if any, would it be ethically okay for me to share the data with all students in the class?

  1. current grade and time spent on Canvas

  2. current grade, class year, and whether or not the student is a statistics major

  3. favorite R package, whether or not the student took STAT 213, whether or not the student took CS 140, and major

  4. favorite R package, whether or not the student took STAT 213, whether or not the student took CS 140, and current grade in the course

9.3.2 Your Turn

Your Turn 1. In your group, explain the difference between hypothesis generation and hypothesis confirmation.

Your Turn 2. In your group, discuss how many times you can use a single observation for hypothesis generation. How many times can you use it for hypothesis confirmation?

Your Turn 3. Again in your group, decide which of the following questions, pertaining to someone’s fitness, sound more suitable to be answered with hypothesis generation and which with hypothesis confirmation.

  1. You want to know if, on average, this person exercises more on weekends or more on weekdays, with no other questions of interest.

  2. You want to look at general trends in the person’s step count and try to determine if various events influenced the step count.

  3. You want to know if the person exercises more in winter or more in summer, and you would also like to investigate other seasonal trends.

Note

Prediction is different from hypothesis confirmation, because you typically don’t really care which variables are associated with your response. You only want a model that gives the “best” predictions. Because of this, if your goal is prediction, you typically have a lot more freedom with how many times you can “use” a single observation. We will talk a little more about prediction later in the semester.
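To make the reuse of observations in prediction concrete, consider k-fold cross-validation, one common way to assess predictions (chosen here as an illustration; the text above does not name a specific method). The minimal Python sketch below uses only the standard library, and the function name `k_fold_indices`, the number of observations, and the number of folds are all illustrative assumptions. It counts how often each observation is used to fit a model and how often it is held out.

```python
import random


def k_fold_indices(n_obs, k, seed=0):
    """Split indices 0..n_obs-1 into k folds and yield (train, test) index lists."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold, so folds do not overlap.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


n_obs, k = 10, 5  # hypothetical sizes chosen for illustration
train_uses = [0] * n_obs
test_uses = [0] * n_obs
for train, test in k_fold_indices(n_obs, k):
    for idx in train:
        train_uses[idx] += 1
    for idx in test:
        test_uses[idx] += 1

print(train_uses)  # each observation helps fit k - 1 = 4 models
print(test_uses)   # each observation is held out exactly once
```

Here every observation is reused in several rounds of model fitting, which is acceptable when the only goal is to judge prediction quality rather than to confirm a hypothesis about a specific variable.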