Data science ethics

Lecture 13

2024-06-04

Today

Data science ethics:

  • Misrepresentation

  • Data privacy

  • Algorithmic bias

Misrepresentation

Stand your ground

In 2005, the Florida legislature passed the controversial “Stand Your Ground” law that broadened the situations in which citizens can use lethal force to protect themselves against perceived threats. Advocates believed that the new law would ultimately reduce crime; opponents feared an increase in the use of lethal force.

Stand your ground

What does the visualization, published by Reuters on February 16, 2014, say about the number of firearm murders in Florida over the years?


Data privacy

“Your” data

  • Every time we use apps, websites, and devices, our data is being collected and used or sold to others.

  • More importantly, law enforcement, financial institutions, and governments make decisions based on these data that directly affect people’s lives.

Privacy of your data

What pieces of data have you left on the internet today? Think through everything you’ve logged into, clicked on, or checked in to, either actively or automatically, that might be tracking you. Do you know where that data is stored? Who can access it? Whether it’s shared with others?

Sharing your data

What are you OK with sharing?

  • Name
  • Age
  • Email
  • Phone Number
  • List of every video you watch
  • List of every video you comment on
  • How you type: speed, accuracy
  • How long you spend on different content
  • List of all your private messages (date, time, person sent to)
  • Info about your photos (how, where (GPS), and when they were taken)

What does Google think/know about you?

Have you ever thought about why you’re seeing an ad on Google? Google it! Try to figure out if you have ad personalization on and how your ads are personalized.


Your browsing history

Which of the following are you OK with your browsing history being used for?

  • To serve you targeted ads
  • To score you as a candidate for a job
  • To predict your race/ethnicity for voting purposes

Who else gets to use your data?

Suppose you create a profile on a social media site and share your personal information on your profile. Who else gets to use that data?

  • Companies the social media company has a connection to?
  • Companies the social media company sells your data to?
  • Researchers?

OkCupid data breach

  • In 2016, researchers published data on 70,000 OkCupid users, including usernames, political leanings, drug usage, and intimate sexual details
  • The researchers didn’t release users’ real names and pictures, but their identities could easily be uncovered from the details provided, e.g., their usernames

“Some may object to the ethics of gathering and releasing this data. However, all the data found in the dataset are or were already publicly available, so releasing this dataset merely presents it in a more useful form.”

– Researchers Emil Kirkegaard and Julius Daugbjerg Bjerrekær

Algorithmic bias

Garbage in, garbage out

  • In statistical modeling and inference we talk about “garbage in, garbage out” – if you don’t have good (random, representative) data, the results of your analysis will not be reliable or generalizable (see the sketch below).
  • Corollary: Bias in, bias out.
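
As a quick illustration of the first point, here is a minimal simulated sketch (all numbers are invented for the example): the analysis is identical in both cases below, but the unrepresentative sample gives a badly biased answer.

```python
# Minimal simulated sketch of "garbage in, garbage out":
# the analysis is identical in both cases; only the quality of the
# incoming sample differs, and so does the reliability of the result.
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=50, scale=10, size=100_000)  # true mean is ~50

# Good data in: a simple random sample of the population
random_sample = rng.choice(population, size=500, replace=False)

# Garbage in: a convenience sample that over-represents large values
top_slice = np.sort(population)[-20_000:]
convenience_sample = rng.choice(top_slice, size=500, replace=False)

print("population mean:        ", round(population.mean(), 2))
print("random sample mean:     ", round(random_sample.mean(), 2))        # close to truth
print("convenience sample mean:", round(convenience_sample.mean(), 2))   # biased upward
```

No downstream analysis can repair the convenience sample; the bias is baked in before any modeling starts.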

Google Translate

What might be the reason for Google’s gendered translation? How do ethics play into this situation?

Stochastic Parrots

What is meant by “stochastic parrots” in the paper title?

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 (Bender et al., 2021)


Machine Bias

2016 ProPublica article on an algorithm used to rate a defendant’s risk of future crime:

In forecasting who would re-offend, the algorithm made mistakes with black and white defendants at roughly the same rate but in very different ways (see the worked example below).

  • The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants.

  • White defendants were mislabeled as low risk more often than black defendants.
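
To see how “similar overall error rate, very different kinds of errors” can happen, here is a worked example with hypothetical confusion-matrix counts (not ProPublica’s actual numbers):

```python
# Hypothetical counts (NOT ProPublica's numbers) illustrating how two groups
# can have the same overall error rate while the types of errors differ.
def error_rates(fp, fn, tp, tn):
    fpr = fp / (fp + tn)                    # labeled high risk but did not re-offend
    fnr = fn / (fn + tp)                    # labeled low risk but did re-offend
    overall = (fp + fn) / (fp + fn + tp + tn)
    return fpr, fnr, overall

group_a = error_rates(fp=45, fn=15, tp=55, tn=55)   # errors are mostly false positives
group_b = error_rates(fp=20, fn=40, tp=45, tn=65)   # errors are mostly false negatives

for name, (fpr, fnr, overall) in [("group A", group_a), ("group B", group_b)]:
    print(f"{name}: false positive rate={fpr:.2f}, "
          f"false negative rate={fnr:.2f}, overall error={overall:.2f}")
```

In this toy setup both groups are misclassified about 35% of the time overall, but group A’s errors are mostly false positives (flagged as high risk when they do not re-offend) while group B’s are mostly false negatives, mirroring the asymmetry ProPublica described.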

Risk score errors

What do the defendants who were assigned a high risk score for reoffending have in common? What about those assigned a low risk score?

Risk scores

How can an algorithm that doesn’t use race as input data be racist?
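
One possible mechanism, sketched below with entirely synthetic data and hypothetical variable names: a feature the model does use, such as neighborhood, can be strongly correlated with race, and the training labels themselves can encode historically biased measurement (e.g., arrest records shaped by uneven policing). The model then learns the bias through the proxy even though race is never an input.

```python
# Entirely synthetic sketch: race is never given to the model, yet the
# risk scores it produces differ by race because a proxy feature carries it in.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 10_000

race = rng.integers(0, 2, n)                          # unobserved by the model
neighborhood = (race + rng.binomial(1, 0.1, n)) % 2   # strongly correlated with race
prior_arrests = rng.poisson(1 + 2 * neighborhood)     # biased measurement, not behavior

# The training label reflects the biased measurement process
reoffend_label = (prior_arrests + rng.normal(0, 1, n) > 2).astype(int)

X = np.column_stack([neighborhood, prior_arrests])    # note: race is NOT a feature
model = LogisticRegression().fit(X, reoffend_label)
risk_score = model.predict_proba(X)[:, 1]

print("mean risk score, race group 0:", round(risk_score[race == 0].mean(), 2))
print("mean risk score, race group 1:", round(risk_score[race == 1].mean(), 2))
```

Because neighborhood almost perfectly tracks race in this toy setup, dropping race from the inputs does not remove it from the predictions; the information flows in through the proxy and through labels produced by a biased process.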