10 Data Ethics

Goals:

  • explain why data ethics is an important issue in data science using a couple of examples.
  • describe a few issues with data privacy and explain why removing individuals’ names from a data set does not necessarily make the data truly anonymous.
  • explain the difference between hypothesis generation (exploration) and hypothesis confirmation and why the distinction matters.

10.1 Ethical Examples

We’ve tried to interweave issues of ethics throughout many examples used already in this course, but the purpose of this section is to put data ethics in direct focus.

Some questions to consider for any data collected, especially data collected on human subjects:

  • who gets to use data and for what purposes?

  • who collected the data and does that organization have any conflicts of interest?

  • is presentation of an analysis harmful to a particular person or group of people? Are there benefits of an analysis?

  • have the subjects of a data collection procedure been treated respectfully and have they given consent to their information being collected?

    • When is consent needed and when is it not? For example, we have looked at data on professional athletes. Do they need to provide consent or is consent inherent in being in the spotlight?

    • We’ve also scraped data from SLU’s athletics website to look at data pertaining to some of you! Is that ethical? Is there a line you wouldn’t cross pertaining to data collected on named, individual people?

10.1.1 Exercises

Exercises marked with an * indicate that the exercise has a solution at the end of the chapter at 10.5.

  1. Read Sections 8.1-8.3 in Modern Data Science with R. Then, write a one-paragraph summary of the reading and how it might pertain to the way you use or interpret data.

  2. Data Feminism is related to data ethics, though the two terms are certainly not synonymous. Recently, Catherine D’Ignazio and Lauren F. Klein published a book called Data Feminism (https://datafeminism.io/).

Read the following blog post on Data Feminism, focusing on the section on Missing Data: https://teachdatascience.com/datafem/

Pick one example from the bulleted list and write a two-sentence explanation of why it might be important to acknowledge the missing data in an analysis.

  3. Choose one of the following two articles to read.

    a. For the LGBTQIA+ article, write a two-sentence summary of the argument that research on facial recognition software to identify members of the LGBTQ+ community should not occur, even if this viewpoint isn’t your own.

    Then, write a two-sentence summary of the argument that research on facial recognition software to identify members of the LGBTQ+ community is okay as long as the results are used responsibly, even if this viewpoint isn’t your own.

    b. For the anti-racist data science article, under Step 2, pick a news article and read the first few paragraphs. Describe, in 2-3 sentences, what your article’s example of bias is and why the incidence of bias matters.

10.2 Data Privacy

Related to data ethics is the idea of data privacy.

  • What data is private and what data is public? For some examples, this may seem obvious, but for others (e.g. data on a government agency that collects data on people), the answer might not be as clear cut.
  • Is anonymous data truly anonymous?
  • What type of consent should be provided before collecting data on someone?
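
The question of whether anonymous data is truly anonymous can be made concrete with a small sketch. The roster below is entirely hypothetical (the records, variable names, and values are invented for illustration, and Python is used only for the sketch). It counts how many students share each combination of a few seemingly harmless variables; any combination that appears exactly once points to a single person, even though no names are stored.

```python
from collections import Counter

# Hypothetical class roster with names already removed (invented data).
records = [
    {"class_year": 2025, "major": "Statistics", "took_stat_213": True},
    {"class_year": 2025, "major": "Statistics", "took_stat_213": True},
    {"class_year": 2025, "major": "Data Science", "took_stat_213": True},
    {"class_year": 2026, "major": "Data Science", "took_stat_213": False},
    {"class_year": 2026, "major": "Data Science", "took_stat_213": False},
    {"class_year": 2026, "major": "Biology", "took_stat_213": True},
]


def group_sizes(rows, keys):
    """Count how many rows share each combination of values for `keys`."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_rows(rows, keys):
    """Return the rows whose combination of `keys` appears exactly once."""
    sizes = group_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[k] for k in keys)] == 1]


for keys in (["class_year"], ["class_year", "major"]):
    sizes = group_sizes(records, keys)
    singled_out = unique_rows(records, keys)
    # The smallest group size is the "k" in the idea of k-anonymity.
    print(keys, "smallest group:", min(sizes.values()),
          "students singled out:", len(singled_out))
```

In this invented roster, class year alone leaves every student in a group of three, but adding major singles out two students. Any other variable released alongside those two (a current grade, for example) would then be attached to an identifiable person for anyone who knows the class.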

We will explore some of these issues in the following exercises.

10.2.1 Exercises

Exercises marked with an * indicate that the exercise has a solution at the end of the chapter at 10.5.

  1. How anonymous are SLU’s course evaluations? We will do an in-class activity to investigate this.

  2. Suppose that I collect data on students in this Data Science class. In each setting (a) through (d), suppose that I give you a data set with the following variables collected on each student in the class. In which settings, if any, would it be ethically okay for me to share the data with all students in the class?

    a. current grade and time spent on Canvas

    b. current grade, class year, and whether or not the student is a stat major

    c. favorite R package, whether or not the student took STAT 213, whether or not the student took CS 140, and major

    d. favorite R package, whether or not the student took STAT 213, whether or not the student took CS 140, and current grade in the course

10.3 Hypothesis Generation vs. Confirmation

We have focused on hypothesis generation for all data sets in this particular course. Read the following two very short articles, one from our textbook and one from another source, that explain the difference between hypothesis generation and hypothesis confirmation:

10.3.1 Exercises

Exercises marked with an * indicate that the exercise has a solution at the end of the chapter at 10.5.

  1. Explain the difference between hypothesis generation and hypothesis confirmation.

  2. How many times can you use a single observation for hypothesis generation? For hypothesis confirmation?

  3. Which of the following questions, pertaining to someone’s fitness, sound more suitable to be answered with hypothesis generation (exploration)? Which with hypothesis confirmation?

    a. You want to know if, on average, this person exercises more on weekends or more on weekdays, with no other questions of interest.

    b. You want to look at general trends in the person’s step count and try to determine if various events influenced the step count.

    c. You want to know if the person exercises more in winter or more in summer, and you would also like to investigate other seasonal trends.

Note: Prediction is different from hypothesis confirmation because, when predicting, you typically do not care which variables are associated with your response; you only want a model that gives the “best” predictions. As a result, if your goal is prediction, you typically have much more freedom in how many times you can “use” a single observation. We will talk a little more about prediction later in the semester.
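
One common way to keep exploration and confirmation separate is to set aside part of the data before looking at any of it. The sketch below is a minimal Python illustration under stated assumptions: the function name `split_observations`, the 60/20/20 fractions, and the stand-in step-count data are all hypothetical choices made for this example, not a prescribed procedure.

```python
import random


def split_observations(observations, explore_frac=0.6, query_frac=0.2, seed=0):
    """Shuffle once, then cut the data into three non-overlapping pieces.

    The fractions are illustrative assumptions; whatever is left after the
    first two pieces becomes the confirmation set.
    """
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_explore = round(len(shuffled) * explore_frac)
    n_query = round(len(shuffled) * query_frac)
    explore = shuffled[:n_explore]
    query = shuffled[n_explore:n_explore + n_query]
    confirm = shuffled[n_explore + n_query:]
    return explore, query, confirm


# Hypothetical stand-in for 100 days of step-count records (day indices only).
days = list(range(100))
explore, query, confirm = split_observations(days)

# explore: look at it as many times as you like to generate hypotheses.
# query:   compare candidate ideas by hand, but do not tune automatically on it.
# confirm: use exactly once, for the final test of a hypothesis fixed in advance.
print(len(explore), len(query), len(confirm))
```

Because the three pieces do not overlap, no observation in `confirm` has been seen during exploration, so a hypothesis suggested by `explore` can still be tested on data that played no part in suggesting it.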

10.4 Chapter Exercises

There are no chapter exercises for this chapter.

10.5 Exercise Solutions

There are no exercise solutions for this chapter.