Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Language models trained with reinforcement learning from human feedback have the capability to “morally self-correct”
Experiments provide evidence of moral self-correction
Capability emerges at 22B model parameters and improves with increasing model size and RLHF training
Language models can follow instructions and learn complex normative concepts of harm like stereotyping, bias, and discrimination
Results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles

Large language models can exhibit harmful social biases
Scaling model size can increase model performance
Hypothesis: larger models may have the capability to morally self-correct
Experiments measure propensity for large language models to use negative stereotypes or discriminate based on protected demographic attributes
Capacity for moral self-correction emerges at 22B model parameters
Models can be steered to avoid harmful outputs by instructing them to do so
Bias Benchmark for QA and Winogender benchmark used to measure stereotype bias and occupational gender bias respectively
New benchmark developed to test for racial discrimination
Three simple prompt based interventions used
Results show bias can be reduced with increasing model size
Models can be steered to use gender pronouns that are uncorrelated or correlated with real world statistics
Models can discriminate against or in favor of Black students depending on instruction
Capacity for moral self-correction exists in models with more than 22B parameters and sufficient RLHF training

GPT-2 and T5 language models can self-diagnose stereotype bias and toxicity
Self-diagnosis accuracy increases with model size
An algorithm for self-debiasing has been proposed
Natural language can be used to reduce bias
RoBERTa-large does not produce less biased outputs when instructed to do so with natural language interventions
Larger models trained with RLHF can produce less biased outputs
Prompting GPT-3 can decrease bias on the BBQ benchmark
Complex reasoning tasks emerge with model size

RLHF is a popular technique for reducing harmful behaviors in large language models
Amount of RLHF training can significantly change metrics on a wide range of personality, political preference, and harm evaluations

Tests the effect of natural language instructions on two moral phenomena: stereotyping and discrimination
Uses two well-known stereotyping benchmarks to measure stereotyping
Constructs a new benchmark to measure discrimination based on race in a law school course admission question

BBQ is a set of 58,492 questions designed to test for societal biases against people belonging to protected classes.
Each problem is a multiple choice question with three possible answers.
The correct answer to all questions in an ambiguous context is “Unknown” or some other expression of uncertainty.
Questions also come paired with an additional disambiguated context condition.
BBQ measures accuracy and bias score across both ambiguous and disambiguated contexts for each category.
Experimental conditions include Q, Q+IF, and Q+IF+CoT.

Pearson correlation coefficient, ρ, between the probabilities that the model assigns a female gendered pronoun and the occupational gender statistics from the BLS varies with model size
In the Q condition, ρ ≈ 0.6 at all model sizes
In the Q+IF condition, ρ decreases relative to the Q condition, but only for model sizes ≥ 22B
In the Q+IF+CoT condition, ρ approaches 0 at 175B parameters
Model assigns most of the mass to neutral pronouns and is close to distributing mass equally between male and female pronouns when it does not use a gendered pronoun
In the Q+IF+Match Stats condition, ρ = 1
In the Q+Match stats condition, ρ approaches near 1 at 175B parameters
Increasing RLHF steps has no clear effect on ρ for any intervention
Metric to evaluate for discrimination is difference in the probability that the language model suggests that the law professor admits a student into the class conditioned on race
Expect metric to be 0 for models that do not discriminate
Bias score increases with increasing model size, but Q+IF and Q+IF+CoT interventions significantly reduce the bias
Bias score is strongest in categories such as Age, Disability Status, Nationality, Physical Appearance, Religion, and Socioeconomic status
Accuracy scores in the disambiguated context are consistently high across all experimental conditions

Demographic parity varies with number of model parameters
For models with fewer than 52B parameters, no discrimination between Black and white students
At 52B parameters, Q condition is 15% less likely to admit Black students, Q+IF is 5% more likely
Q+IF+CoT condition has less clear trend with model size, but tends to discriminate in favor of Black students
Increasing RLHF steps has significant effect on demographic parity

Hypothesis: large language models may have the capability to “morally self-correct”
Experiments reveal different facets of moral self-correction
BBQ experiment: instructing models to not be biased reduces bias
Winogender experiment: can steer models to accurately reflect occupational gender statistics or avoid using gendered pronouns
Discrimination experiment: models can achieve demographic parity or discriminate in favor of a historically disadvantaged group
Capability for moral self-correction emerges at 22B parameters
Models rely on two capabilities for moral self-correction: following instructions and learning normative concepts of harm
Focus on American English
Dual-use: techniques can be inverted to create unethical outputs
Prompt engineering: small variations in prompts can yield large changes in model outputs
Increasing accuracy reduces bias
RLHF training has greatest effect for models larger than 22B parameters
Can achieve demographic parity by tuning model size and amount of RLHF steps
Reliable instruction-following and normative concepts of harm present in training data across all languages and cultures