Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Investigated scaling behaviors of red teaming across 3 model sizes and 4 model types
- Released dataset of 38,961 red team attacks for others to analyze and learn from
- Analyzed data and found a variety of harmful outputs, ranging from offensive language to subtly unethical outputs
Paper Content
Introduction
- Large language models can have harmful behaviors
- Examples of harmful behaviors include reinforcing social biases, generating offensive/toxic outputs, leaking personal info, aiding in disinformation campaigns, generating extremist texts, and spreading falsehoods
- As AI systems improve, the scope of possible harms is likely to grow
- Strategies have been developed to address some of these harms
- Red teaming is a tool to address harm, using manual or automated methods to probe a language model for harmful outputs
- Paper describes early efforts to implement manual red teaming
- Investigated scaling behaviors for red teaming across 3 model sizes and 4 model types
- Released dataset of 38,961 red team attacks
- Described instructions, processes, and statistical methodologies for red teaming
- Proposed policy interventions for how to develop shared norms, practices, and technical standards for red teaming
Related work
- We use the same models as in our previous work
- We run additional experiments to determine the influence of model size on susceptibility to red team attacks
- We analyze the content of the attacks to understand the types of harms uncovered by red teaming
- We provide more detail on our red team methods and release the data
- We focus on reinforcement learning from human feedback as our most promising safety intervention
Red team task
- Developed an interface for red team members to have open-ended conversations with an AI assistant
- Provided a brief list of example conversation topics
- Asked participants to enter a description of how they intend to red team the model
- Reviewed literature and conducted interviews to incorporate best practices into task instructions and interface
- Warned participants of sensitive content
- Asked participants to select topics within their own risk tolerance
- Asked participants to select the more harmful of two model-generated responses
- Used dataset of pairs of model responses to train a harmlessness preference model
- Asked participants to rate how successful they were at making the AI assistant say something bad
- Assigned red team members to models at random
Models
- We derive dialogue models from a general language model and a helpful and harmless preference model.
- We train decoder-only transformer models ranging in size from 2.7B to 13B to 52B parameters.
- Our preference model is trained to predict both harmlessness and helpfulness.
- We use 1-shot learning to prompt our general language models to behave as dialogue models.
- We use 14-shot learning to prompt our general language models to be helpful, harmless, and honest.
- We generate 16 samples of AI assistant responses from prompted language models and select the 2 least harmful samples.
- We use reinforcement learning to train a prompted language model to maximize the scores given by the preference model.
Red team
- Recruited 324 US-based crowdworkers from Amazon’s Mechanical Turk and Upwork
- Paid between $7.50 and $9.50 for each set of 5 conversations on MTurk
- Paid $20 per hour on Upwork
- Crowdworker population not fully representative of US population
- 79% of participants self-identify as “White or Caucasian”
- 66% of participants have at least a college degree
- 80% of red team attacks come from 50 out of 300 workers
Data analysis
- Collected 38,961 red team attacks
- Measured 3 variables for each attack
- Red team members self-rated success on a 5-point Likert scale
- Distribution of self-rating was bimodal, with two peaks at 0 and 4
- Used harmlessness preference model to compute harmlessness score of AI assistant’s dialogue
- Found inverse proportion between self-rating of attack success and minimum harmlessness score
Review task
- Collected data across all model types
- Performed follow-up experiment to measure two variables: inter-annotator agreement and content of attack types
- Self-ratings of attack success are subjective and can vary based on elements of attack and red team member
- Wanted to understand variability across different raters
- Computed harmlessness score on task description and assistant utterances
- Aggregated scores using min or max
- Relied on human judgement of attack success on Likert scale
- Ran experiment on 500 red team attacks for two models
- Low level of inter-rater agreement on success of red team attacks
- Fleiss’s Kappa of 0.32 between 4 raters
- Asked reviewers to tag transcripts with up to 2 of 20 total topic tags
- Incorporated findings from literature on Trust & Safety into experiment design
- Used shared communication tool (Slack) to communicate with group
- Developed custom well-being survey and sent it to reviewers after 10 tasks
Results
- Average success rate for control condition and simplest safety intervention is the same
- Rejection sampling makes it difficult to red team language models
- No clear trends with model size for self-reported attack success rate
- RLHF models become increasingly difficult to red team as they increase in size
- Plain LM and prompted LM have little difference
- Rejection sampling is an effective safety intervention
- Harmful outputs from RS and RLHF models
- Visualization of dataset shows clusters of red team attempts
- Some attacks more successful than others
- Crowdworkers use template-based attacks
- Top 5 attack types correspond to discrimination, hate speech, violence, unethical behavior, and bullying
- Less common tags include child abuse, self harm, sexual exploitation, terrorism, and animal abuse
Discussion
Limitations and future work
- Red teaming language models in the form of an AI assistant allows probing of open-ended input and output spaces
- LMs can be used in many applications that don’t require open-endedness
- Crowdworkers generated attacks that required domain expertise
- Data is incomplete due to unknown and unbounded space of possible harms
- Could have asked third party organizations to red team
- Could have given crowdworkers way to indicate domain expertise needed
- Could have noted instructions to interface to encourage creativity
- Red teaming done manually, could be automated in future
- Comparing manual and automated approaches to red teaming in future work
Policy interventions
- Red teaming involves controversial subject matter
- Organizations have counter-incentives to share findings
- Difficulty sharing future risks, failures, and implications of yet-to-be developed systems
- Incentive structure needs to be changed to share findings
- Need to build consensus around how to red team and release findings
- Questions to answer: who should red team, protections, instructions, annotation, analysis, success criteria
- Decision to release data made in a vacuum
- Benefits of release outweigh potential harm
- Informational interviews with Trust & Safety professionals
- Clear and specific warnings, personal risk tolerance, recommended well-being exercises, pay for time, segment tasks, preview to opt out, well-being survey
- Average attack success, correlation between attack success and harmlessness score
- Pros and cons for releasing red team data