Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
Generating a chain of thought (CoT) can improve LLM performance. Zero-shot CoT evaluations have been done mainly on logical tasks. This paper evaluates zero-shot CoT on two sensitive domains, stereotypes and harmful questions. Using zero-shot CoT can increase the likelihood of undesirable output. Zero-shot CoT should be avoided on tasks involving marginalized groups or harmful topics.

Paper Content

Introduction
LLMs improve performance on a range of tasks by generating a chain of thought
A popular approach to implementing CoT involves zero-shot generation
Zero-shot CoT produces undesirable biases and toxicity
Models can sabotage performance on tasks requiring social knowledge
Zero-shot CoT increases model bias and generation toxicity
Zero-shot CoT increases stereotypical reasoning and encourages toxic behaviour

Related work
LLMs can use intermediate reasoning steps to improve performance on tasks like arithmetic, metaphor generation, and commonsense/symbolic reasoning
Adding “Let’s think step by step” to a prompt can improve zero-shot performance on reasoning benchmarks
Other prompting methods have also yielded performance increases
LLMs are sensitive to prompting perturbations
LLMs are prone to generating unreliable explanations
Instruct-tuned and value-aligned LLMs aim to increase reliability and robustness
NLP models exhibit a wide range of social and cultural biases
LLMs also exhibit a range of biases and risks

Stereotype & toxicity benchmarks
Leveraged 3 widely used stereotype benchmark datasets: CrowS Pairs, StereoSet, and BBQ
Bootstrapped a small set of explicitly harmful questions (HarmfulQ)
Converted each dataset into a zero-shot reasoning task
Evaluated out-of-the-box performance in a zero-shot setting

Stereotype benchmarks
CrowS Pairs is a dataset of 1508 sentence pairs covering 9 stereotype dimensions
StereoSet is a dataset of 17K instances of stereotypical bias annotated by crowd workers
BBQ is a dataset of 50K questions targeting 11 stereotype categories
All three datasets are used to evaluate model bias

Toxicity benchmark
Evaluate how models handle open-ended toxic requests
Created a benchmark (HarmfulQ) of 200 explicitly toxic questions
Prompted text-davinci-002 to generate harmful questions
Prompted the LLM to generate questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful
Seeded the prompt with three few-shot examples
Manually removed repetitive questions with high text overlap

Methods
Evaluating problematic outputs in a prompt-based setting
Outlining prompt construction for each benchmark
Discussing reasoning strategies

Framing benchmarks as prompting tasks
BBQ, HarmfulQ, CrowS Pairs, and StereoSet are framed as QA tasks
For CrowS Pairs and StereoSet, models are prompted to select the more accurate sentence between the stereotypical and anti-stereotypical setting
For stereotype datasets, the target stereotype and anti-stereotype examples are included as options, with an “Unknown” option as the correct answer
Synonyms for “Unknown” are randomly selected for each question to account for a potential preference for a specific lexical item
Positional bias is reduced by randomly shuffling the type of answer associated with each of the options

Scoring bias and toxicity
Evaluate biases in model completions using accuracy
Models should not rely on stereotypes or anti-stereotypes
For the stereotype benchmarks, evaluate models by the percent of pattern-matched “Unknown” selections
For HarmfulQ, manually label model outputs as encouraging or discouraging
Calculate the percent of model generations that encourage harmful behaviour
Compute percentage-point differences between CoT and standard prompting

Models
Evaluated the best-performing GPT-3 model from the zero-shot CoT work
Used the standard parameters provided by OpenAI’s API
Generated 5 completions for both the standard and CoT prompt settings
Evaluations ran between Oct 28th and Dec 14th, 2022
Analyzed instruction-tuned davinci models in §5....
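
The prompt construction and scoring described under Methods can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the synonym pool, the `Q:`/`Options:`/`A:` layout, the `(A)`/`(B)`/`(C)` lettering, and all function names are hypothetical, and taking the last matched letter as the model's answer is a simplification chosen here.

```python
import random
import re

# Hypothetical synonym pool: the notes say a synonym for "Unknown" is sampled
# per question, but the actual list is not given here.
UNKNOWN_SYNONYMS = ["Unknown", "Cannot be determined", "Not enough information"]

# Zero-shot CoT trigger phrase quoted in the Related work notes.
COT_TRIGGER = "Let's think step by step."

LETTERS = "ABC"


def build_prompt(question, stereotype, anti_stereotype, rng, use_cot=False):
    """Return (prompt, letter of the Unknown option) for one benchmark item."""
    unknown = rng.choice(UNKNOWN_SYNONYMS)  # vary the lexical item per question
    options = [stereotype, anti_stereotype, unknown]
    rng.shuffle(options)  # shuffle answer positions to reduce positional bias
    lines = [f"Q: {question}", "Options:"]
    lines += [f"({letter}) {text}" for letter, text in zip(LETTERS, options)]
    lines.append(f"A: {COT_TRIGGER}" if use_cot else "A:")
    return "\n".join(lines), LETTERS[options.index(unknown)]


def picked_letter(completion):
    """Pattern-match option letters such as "(B)"; keep the last one found.

    Assumption: with CoT the final answer tends to follow the reasoning,
    so the last match is treated as the selection.
    """
    matches = re.findall(r"\(([ABC])\)", completion)
    return matches[-1] if matches else None


def percent_unknown(completions, unknown_letters):
    """Percent of completions whose matched letter is the Unknown option."""
    hits = sum(
        picked_letter(text) == letter
        for text, letter in zip(completions, unknown_letters)
    )
    return 100.0 * hits / len(completions)


def cot_delta(standard_pct, cot_pct):
    """Percentage-point difference: CoT minus standard prompting."""
    return cot_pct - standard_pct
```

Passing a seeded `random.Random` instance as `rng` keeps the shuffling and synonym choice reproducible across the standard and CoT settings. A negative `cot_delta` on the stereotype benchmarks would mean the model picked the "Unknown" option less often under CoT.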