jedediyah.com/nctm2022



Breaking Bias in Data and Modeling

Jedediyah Williams, PhD
Nantucket High School


September, 2022

@jedediyah




About Jed

Teaching
Astronomy
Robots
Teaching

   

DATA: There is too much to talk about!

  • Power
  • Surveillance
  • Privacy
  • Security
  • Consent
  • Access
  • Fairness
  • Education
  • Energy
  • Military
  • Misuse
  • Adversarial Attacks
  • Disinformation
  • Liberty
  • Discrimination
  • Labor
  • Environment
  • Exploitation
  • Law and Oversight
  • Accountability
  • Justice

  • Data Ethics
  • AI Ethics
  • Fair ML
  • Fair AI
  • Algorithmic Bias
  • Data Bias

axioms

  1. Math is awesome.
  2. Causing unnecessary harm is bad.



Data Modeling Process


Data
Preprocess
Explore
Model
Communicate



  • Modeling with data
  • Teaching and scaffolding modeling with data
  • Critically analyzing data technologies


"Yes, train these young people to get these skills, but integrate into that not only the technical capacity but the critical capacity to question what they're doing and what's happening. To me, it is not true empowerment unless people can have the power to question how these skills are going to be used."

Data Modeling Process

Data
Preprocess
Explore
Model
Communicate


When we approach modeling as a series of design choices, we highlight the assumptions and subjectivity of value judgements made at each stage and begin to expose the inherent biases embedded within our models.

1 minute, talk to your neighbors:


How can
math
cause harm?



When we say that we are teachers of "mathematics", which "mathematics" are we talking about?

Authors | Reading | Watching
Cathy O'Neil | Weapons of Math Destruction (2016) |
Virginia Eubanks | Automating Inequality (2018) | Automating Inequality, PBS 2018
Safiya Umoja Noble | Algorithms of Oppression (2018) |
Meredith Broussard | (2019) |
Janelle Shane | (2019) | The danger of AI is weirder than you think, TED 2019
Hannah Fry | ( ) | Should Computers Run the World?, Royal Institution 2019
Caroline Criado Perez | ( ) | Invisible Women, Engage 2019
Ruha Benjamin | ( ) | Ruha's resources for Race After Tech
Melanie Mitchell | ( ) | The Collapse of Artificial Intelligence, Santa Fe Institute 2019
Sasha Costanza-Chock | Design Justice (2020) |
Kate Crawford | (2021) |
Catherine D'Ignazio & Lauren F. Klein | ( ) |

Wait!
Isn't math objective and neutral?

Let's ignore philosophy, paradoxes, incompleteness, decidability, the unfinished state of mathematics; there is still complexity.

"However, two major discoveries of the twentieth century showed that Laplace's dream of complete prediction is not possible, even in principle...

It was the understanding of chaos that eventually laid to rest the hope of perfect prediction of all complex systems, quantum or otherwise." (Mitchell, 2019, p. 20)
"But even if it were the case that the natural laws had no longer any secret for us, we could still only know the initial situation approximately.
       ...
it may happen that small differences in the initial conditions produce very great ones in the final phenomenon.
       ...
Prediction becomes impossible."
(Poincaré, 1908, as cited in Mitchell, 2019, p. 21)
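
Poincaré's point about initial conditions can be seen numerically. The sketch below is an illustration added here, not something taken from the slides: it iterates the logistic map, a standard textbook example of chaos, from two starting values that differ by one part in ten billion. The function name, the parameter r = 4, and the starting values are all assumptions chosen for the demonstration.

```python
# Logistic map: x_next = r * x * (1 - x). With r = 4 the map behaves chaotically.
def logistic_trajectory(x0, r=4.0, steps=60):
    """Return [x0, x1, ..., x_steps] under repeated application of the map."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# Two starting points that differ by 1e-10 (illustrative values).
a = logistic_trajectory(0.2)
b = logistic_trajectory(0.2 + 1e-10)

# Gap between the two trajectories at each step.
gaps = [abs(x - y) for x, y in zip(a, b)]
for n in (0, 10, 20, 30, 40, 50, 60):
    print(f"step {n:2d}: gap = {gaps[n]:.3e}")
```

The rule is fully deterministic and known exactly, yet the tiny measurement difference grows until the two trajectories have nothing to do with each other.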












https://twitter.com/standupmaths/status/741251532167974912









"The lack of humility before nature that's being displayed here staggers me." - Malcolm, Jurassic Park

Data modeling applications

  • Search engine
  • Recommendation systems
  • Ranking systems
  • Application / resume filtering
  • Computer vision
  • Chat bots
  • Policing
  • Sentencing and parole
  • "Self-driving" vehicles
  • ...
"Our success, happiness, and wellbeing are never fully of our own making. Others' decisions can profoundly affect the course of our lives...

Arbitrary, inconsistent, or faulty decision-making thus raises serious concerns..."

- Fairness and Machine Learning, Barocas, Hardt, and Narayanan

What are some consequences of data technologies?

Some of the more well known harms













https://www.nytimes.com/2019/08/16/technology/ai-humans.html
https://www.attendeeinteractive.com/privacy-policy/
Anatomy of an AI system, Crawford and Joler
Adversarial attack
Algorithms are brittle - Melanie Mitchell
Lack of oversight or auditing
The act, by those in power, of making decisions for us is a display of the imbalance of power.
- Sun-ha Hong, Prediction as Extraction of Discretion
You are being surveilled.
You are being experimented on.

Big Picture

When handing over the tools of mathematics,
we are responsible as educators
for teaching their responsible use.

It is a sin of omission when we fail to acknowledge the consequences of the content we teach, consequences that include ethical and technical pitfalls.

Subtle picture

  • There is no simple solution. No checklist can guarantee that you won't cause harm.
  • Many ethical concerns are technical concerns.
  • Predicting, detecting, and mitigating harm and discrimination in data technologies are complex and active areas of research.


Fayyad et al. (1996). "The KDD Process for Extracting Useful Knowledge from Volumes of Data".
(Knowledge Discovery in Databases)

Chapman et al. (1999), Wirth (2000). "Towards a standard process model for data mining".
1. Obtain: pointing and clicking does not scale
2. Scrub: the world is a messy place
3. Explore: You can see a lot by looking
4. Models: always bad, sometimes ugly
5. iNterpret: "The purpose of computing is insight, not numbers."

Mason and Wiggins (2010). "A Taxonomy of Data Science".

Schutt and O'Neil (2014). "Doing Data Science: Straight talk from the frontline".

GAIMME: Guidelines for Assessment & Instruction in Mathematical Modeling Education (2016).

Guidelines for Assessment and Instruction in Statistics Education (GAISE) Reports
(2020, based on 2007).

Estrellado et al. (2020). Data Science in Education Using R, Section 3.2.

Zico Kolter (2021). Practical Data Science, Introduction.

Common Core / Achieve the Core.
I like the video here!

Many frameworks. Much overlap.

Data
1. Get the data
Preprocess
2. Clean up the data
Explore
3. Explore the data
Model
4. Model it
Communicate
5. Share the results


Data Modeling Process

Data
Preprocess
Explore
Model
Communicate

Design
∘ Turn a problem into a data-problem.
∘ Survey or experimental design
∘ Database infrastructure
Acquire
∘ Survey or experiment
∘ Download the dataset! CSV, API, etc.
∘ Web scraping
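
Once a dataset is downloaded, the first step is usually to read it and notice which rows do not parse. The following is a minimal sketch using Python's standard `csv` module. The CSV text, the column names `drop_height_cm` and `bounce_height_cm`, and the helper `load_rows` are hypothetical, invented for this example.

```python
import csv
import io

# Hypothetical CSV export; the column names and values are made up for illustration.
RAW_CSV = """drop_height_cm,bounce_height_cm
50,38
100,77
150,
200,151
abc,12
"""

def load_rows(text):
    """Parse CSV text into (drop, bounce) float pairs and collect rows that fail to parse."""
    good, rejected = [], []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            good.append((float(row["drop_height_cm"]), float(row["bounce_height_cm"])))
        except (TypeError, ValueError):
            rejected.append(row)  # keep rejected rows so exclusions can be documented
    return good, rejected

rows, rejected = load_rows(RAW_CSV)
print(f"kept {len(rows)} rows, rejected {len(rejected)}")
```

Keeping the rejected rows, rather than silently dropping them, leaves a record of how data was included or excluded.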

Data Modeling Process

Data
Preprocess
Explore
Model
Communicate

Wrangle
∘ Format
∘ Clean and organize
∘ Check data integrity
Prepare
∘ Label
∘ Split into training and testing sets
∘ Normalize

Data Splitting
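
One way to sketch the split and normalize steps in plain Python is shown below. The helper names (`train_test_split`, `fit_scaler`, `normalize`), the 25% test fraction, and the measurements are assumptions made for this example. The design choice worth noticing is that the mean and standard deviation come from the training set only, so nothing about the testing set leaks into preprocessing.

```python
import random
import statistics

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_scaler(values):
    """Return (mean, stdev) computed from the given values."""
    return statistics.mean(values), statistics.stdev(values)

def normalize(values, mean, stdev):
    """Standardize values using previously fitted statistics."""
    return [(v - mean) / stdev for v in values]

# Hypothetical drop heights in cm; the numbers are illustrative.
heights = [20, 40, 60, 80, 100, 120, 140, 160]
train, test = train_test_split(heights)

mean, stdev = fit_scaler(train)          # statistics come from the training set only
train_z = normalize(train, mean, stdev)
test_z = normalize(test, mean, stdev)    # the testing set reuses them and is never refit
```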

Data Modeling Process

Data
Preprocess
Explore
Model
Communicate

Visualize
∘ Plot and familiarize with data
∘ Look for and compare features visually
∘ Consider appropriate models
Inspect
∘ Exploratory data analysis
∘ Descriptive statistics
∘ Identify features analytically
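
As a rough illustration of the Inspect step, descriptive statistics and a correlation coefficient can be computed with the standard library alone. The data values and the helper names `describe` and `pearson_r` are hypothetical.

```python
import statistics

def describe(values):
    """Return a small dictionary of descriptive statistics."""
    return {
        "n": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical measurements in cm.
drop = [50, 100, 150, 200]
bounce = [38, 77, 112, 151]

print(describe(drop))
print(describe(bounce))
print(f"r = {pearson_r(drop, bounce):.4f}")
```

A correlation near 1 suggests a linear model is worth trying, but it is a summary, not a substitute for looking at the plot.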

Data Modeling Process

Data
Preprocess
Explore
Model
Communicate

Model
∘ Try and compare multiple models
∘ Consider bias and variance
∘ Interpret model and performance
Validate
∘ Assess model performance on independent test data
∘ Error analysis and stress-test
∘ Consider consequences
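
The Model and Validate steps above can be sketched by fitting two candidate models on training data and scoring both on data the models never saw. This is one possible illustration, not a prescribed method: the two candidates (a constant and a least-squares line), the error metric, and all data values are assumptions.

```python
import statistics

def fit_constant(xs, ys):
    """Model 1: always predict the mean of the training targets."""
    m = statistics.mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Model 2: ordinary least-squares line y = slope * x + intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and observations."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training and testing measurements.
train_x, train_y = [20, 60, 100, 140, 180], [15, 47, 76, 108, 135]
test_x, test_y = [40, 120, 160], [31, 91, 122]

errors = {}
for name, fit in (("constant", fit_constant), ("line", fit_line)):
    model = fit(train_x, train_y)
    errors[name] = {
        "train": mean_abs_error(model, train_x, train_y),
        "test": mean_abs_error(model, test_x, test_y),
    }
    print(name, errors[name])
```

Reporting training and testing error side by side makes it harder to mistake a model that memorized its data for one that generalizes.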

Data Modeling Process

Data
Preprocess
Explore
Model
Communicate

Reflect
∘ Consider contexts, bias, and consequence
∘ Create audit plan
∘ Document - data and model
Share
∘ Report documentation
∘ Inform policy
∘ Deploy in product

Data Modeling Process


Environment

Data
Preprocess
Explore
Model
Communicate



A framework for critical analysis

Data
• Harmful data collection, lack of consent, insecure / lack of privacy, historical, representational, or measurement bias, ...

Preprocess
• Labor exploitation, labeling by non-experts, incorrect labeling, trauma experienced by labelers, ...

Explore
• Feature selection bias, bias in interpretation of data visualization, data manipulation, feature hacking, ...

Model
• Bias in model choice, model-amplified bias, environmental impact, learning bias, evaluation bias, peripheral modeling, ...

Communicate
• Biased model interpretation, ignoring variance, rejecting model, deploying harmful products, deployment bias, ...

Meta
• "Pernicious feedback loops", runaway homogeneity, susceptibility to adversarial attack, lack of oversight or auditing, ...

"A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle", Harini Suresh and John V. Guttag
https://lighthouse3.com/newsletter/
Critical Questions:
  • What are the motivations for the project?
  • What is the intended use?
  • What is the unintended use or misuse?
  • Where does the data come from?
  • Who collects the data?
  • Who owns the data?
  • How is the data collected?
  • How is the data stored?
  • How old is the data?
  • When will the data expire?
  • How will the data be secured?
  • What happens with the data when the company is sold?
  • Who does the labeling?
  • What labels will they decide to use?
  • Are the labelers experts?
  • Are the labels accurate?
  • What biases are represented in the data?
  • How is data included or excluded?
  • How are outliers addressed?
  • What subpopulations are represented?
  • What subpopulations are over- or underrepresented?
  • What portions of the data are inspected?
  • What features are selected for modeling?
  • What model is chosen?
  • What features do we think are being modeled?
  • What latent features are actually being modeled?
  • What is the domain of the model?
  • What are the consequences of error?
  • What decisions will be made with the model?
  • What biases are perpetuated?
  • Where will the model be deployed?
  • What could go wrong?
  • Who is responsible when things go wrong?
  • How can issues be reported?
  • Will new data be fed back in to update the model?
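
Two of the questions above, which subpopulations are over- or underrepresented and what the consequences of error are, can be given a first numerical look by disaggregating results by group. The sketch below is a hypothetical illustration: the records, the group labels "A" and "B", and the function names are invented, and a real audit would need far more care than this.

```python
from collections import Counter, defaultdict

# Hypothetical (group, true_label, predicted_label) records.
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),
    ("B", 1, 0), ("B", 0, 0),
]

def representation(records):
    """Fraction of the dataset belonging to each group."""
    counts = Counter(group for group, _, _ in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def error_rate_by_group(records):
    """Fraction of wrong predictions within each group."""
    errors, totals = defaultdict(int), defaultdict(int)
    for group, truth, predicted in records:
        totals[group] += 1
        errors[group] += int(truth != predicted)
    return {group: errors[group] / totals[group] for group in totals}

overall_error = sum(t != p for _, t, p in records) / len(records)

print("representation:", representation(records))
print("overall error:", overall_error)
print("error by group:", error_rate_by_group(records))
```

In this made-up data the overall error looks small because the larger group dominates the average, while the smaller group sees a much higher error rate.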

Have you ever read a book in a math class?

Data Modeling Process

Data
Preprocess
Explore
Model
Communicate



  • Modeling with data
  • Teaching and scaffolding modeling with data
  • Critically analyzing data technologies


How high does a bouncy ball bounce?

Data
• Data problem: What will be the bounce height \(h_{bounce}\) of my bouncy ball when dropped from rest from a given drop height \(h_{drop}\)?
• Record several slow-motion videos.

Preprocess
• Randomly choose a subset of videos as the training set.
• Parse the training set videos into a table.

Explore
• Create a scatter plot of \(h_{bounce}(h_{drop})\)
• Look for features! Notice and wonder. Consider models.

Model
• Find a best-fit model on the training data.
• Validate the model on the testing data.

Communicate
• Reflect on the process.
• Share out.
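
The five stages above can be sketched end to end in a few lines. Everything specific here is an assumption for illustration: the measurements are invented, two videos are held out for testing, and the model is a line through the origin, \(h_{bounce} = k \cdot h_{drop}\), fitted by least squares.

```python
import random

# Data: hypothetical (h_drop_cm, h_bounce_cm) pairs read off slow-motion videos.
measurements = [(30, 23), (50, 39), (60, 46), (90, 70),
                (120, 91), (150, 116), (180, 137), (200, 154)]

# Preprocess: randomly hold out two videos as the testing set.
shuffled = list(measurements)
random.Random(1).shuffle(shuffled)
test, train = shuffled[:2], shuffled[2:]

# Model: least-squares fit of h_bounce = k * h_drop on the training set only.
k = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def predict(h_drop_cm):
    """Predicted bounce height in cm for a given drop height in cm."""
    return k * h_drop_cm

def mean_abs_error(pairs):
    """Average absolute prediction error in cm over the given pairs."""
    return sum(abs(predict(x) - y) for x, y in pairs) / len(pairs)

# Validate: compare error on training data with error on held-out data.
train_mae = mean_abs_error(train)
test_mae = mean_abs_error(test)

# Communicate: report the fitted ratio and both errors.
print(f"k = {k:.3f}, train MAE = {train_mae:.2f} cm, test MAE = {test_mae:.2f} cm")
```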

How high does a bouncy ball bounce?

Bounce Prediction Error (figure)

Training Data Testing Data
https://reproducible.cs.princeton.edu/#rep-failures
Break models
"How high does a bouncy ball bounce?"

becomes:

"How much can we minimize the error of a linear model when predicting how high this particular bouncy ball will bounce in this room on this surface at this temperature and humidity when dropped from rest at a height of no more than two meters?"

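The narrowed question above amounts to a model that comes with a stated domain. One hedged way to make that explicit in code is to have the model refuse inputs outside the range it was fitted on. The ratio `k = 0.77`, the helper `make_bounded_model`, and the 0 to 2 meter domain are hypothetical values chosen to match the example question.

```python
def make_bounded_model(k, domain):
    """Return a predictor for h_bounce = k * h_drop that rejects out-of-domain inputs."""
    low, high = domain

    def predict(h_drop_m):
        if not low <= h_drop_m <= high:
            raise ValueError(
                f"drop height {h_drop_m} m is outside the fitted range "
                f"[{low}, {high}] m; the model was never tested there"
            )
        return k * h_drop_m

    return predict

# Hypothetical fitted ratio, valid only for drops of at most two meters.
predict = make_bounded_model(k=0.77, domain=(0.0, 2.0))

for h in (1.0, 50.0):
    try:
        print(f"{h} m -> {predict(h):.2f} m")
    except ValueError as err:
        print("refused:", err)
```

Without the check, the same line would return an answer for a 50 meter drop just as readily as for a 1 meter drop, which is one simple way to break the model.
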
Data Modeling Projects

  • How high does a bouncy ball bounce?
  • How far will the ball roll?
  • What is the period of a pendulum?
  • When will the water reach 40℃?
  • When is high tide?
  • How much daylight will there be on Jan 1?
  • When will the sun set on Feb 1?
  • What is the best move in Hexapawn?
  • What is the best move in Tic Tac Toe?
  • Which NFL team will win Monday?

I have a project for you!
But first, in summary

  • Model data. It's awesome.
  • Break models. Witness them failing.
  • Critically analyze technology.
Classify these fruit!
Using data pipelines as critical frameworks:


Teaching with ethics at the forefront:
AI Now Institute reports: https://ainowinstitute.org/reports.html
Automating Ambiguity: Challenges and Pitfalls of Artificial Intelligence - Abeba Birhane
On the dangers of stochastic parrots: Can language models be too big? - Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell
Rachael Tatman - YouTube
Rachel Thomas Fast.ai Data Ethics Course
Joy Buolamwini https://www.media.mit.edu/people/joyab/publications/
SERJ special issue: https://iase-web.org/ojs/SERJ/issue/view/28
AIES '22: Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society https://dl.acm.org/doi/proceedings/10.1145/3514094
Education:
Teaching Machine Learning in the Context of Critical Quantitative Information Literacy
Integrating data science ethics into an undergraduate major: A case study
A call for a humanistic stance toward k-12 data science education
Artificial intelligence in education: Addressing ethical challenges in K-12 settings
Provisional Data Science for Social Change Spring 2022 schedule