Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Chain-of-Thought (CoT) prompting improves Language Models’ (LM) performance on complex reasoning tasks.
  • Faithful CoT is a framework that breaks down a reasoning task into two stages: Translation and Problem Solving.
  • Faithful CoT outperforms traditional CoT prompting on 9 out of 10 datasets.
  • Faithful CoT achieves new state-of-the-art few-shot performance on 7 out of 10 datasets.

Paper Content

Introduction

  • Complex reasoning tasks are difficult for language models
  • Chain-of-Thought reasoning has improved performance
  • Chain-of-Thought reasoning provides an interpretable window into the model’s behavior
  • Chain-of-Thought reasoning lacks faithfulness, meaning explanations are not always accurate
  • Faithfulness in interpretability means that an explanation should accurately represent the reasoning process behind the model’s prediction
  • Plausibility refers to how convincing an explanation is to humans
  • Chain-of-Thought-style prompting involves generating a reasoning chain and final answer given a complex question
  • Three types of CoT-style prompting: all-at-once, ensemble-based, and modularized
  • Faithfulness in NLG refers to the generated text being faithful to an explicit source
  • Work is concurrent with two other papers, but has differences in generalizability, recasting of tasks, and interleaving of NL and SL

Method

  • Faithful CoT is a 2-stage pipeline
  • Prompt consists of (Q, C, A) triples
  • Interleaves NL and SL in C
  • Derives final answer A from reasoning chain C
  • C NL and C SL are interleaved in generation

Math word problems (mwp)

  • A grade-school math question is broken down into multiple smaller-scale subquestions.
  • Each subquestion is accompanied with rationale(s) to support the answer.
  • Python code is generated to answer each subquestion and the code is executed to derive the answer.

Multi-hop qa

  • Given a complex question, the goal is to obtain an answer as a Boolean value or string value variable.
  • The reasoning chain is slightly different depending on the nature of the task.
  • The reasoning chain involves Boolean algebra, string comparisons, relation definitions, and logic programming.
  • The answers are converted to Datalog statements and then combined to formalize the truth condition of the final answer.

Planning

  • Given a household task query from a user, a plan of actions is created for the robot to accomplish the task.
  • The query is translated into a symbolic goal in PDDL5, which is then used by a PDDL Planner to obtain a plan of actions.

Logical inference

  • Given a logical inference problem Q written in NL, the goal is to obtain A as a string-valued variable
  • CLUTRR dataset involves inferring family relationship between two people from a short story
  • Translation stage prompts the LM to generate C, consisting of C NL and C SL
  • C NL breaks down Q into subquestions and provides input extracts as rationales
  • C SL answers subquestions via a logical expression representing the relation between Q
  • Problem Solving stage uses a simple logical inference engine to derive A
  • Evaluation datasets used for each domain are GSM8K, SVAMP, MultiArith, ASDiv, and AQuA
  • 8-shot prompt used for all datasets except AQuA
  • Multi-hop QA datasets are Strate-gyQA and SayCan

Evaluation metrics

  • Performance of model is evaluated by accuracy of final answer
  • For MWP datasets (except AQuA), correct answer is defined as exact match between prediction and ground truth rounded up to nearest integer
  • For StrategyQA and Sports Understanding, correct answer is exact match between prediction and ground truth evaluated as Boolean variable
  • For SayCan, generated plan is correct if it is among ground truth plans
  • For other datasets, correct answer is exact match between prediction and ground truth strings

Language model

  • OpenAI Codex (Chen et al., 2021) is used as the underlying language model for translation
  • OpenAI Codex has 175B parameters
  • Implementation details can be found in Appendix A

Baselines

  • Standard few-shot prompting uses demonstrations of only the question and the answer
  • CoT prompting additionally provides a reasoning chain in NL
  • 8 prompting methods are compared under two decoding strategies: greedy decoding and self-consistency decoding
  • Temperature of 0.4 and 40 generations used for all datasets

Results

  • Faithful CoT outperforms CoT on 9/10 datasets
  • Average accuracy gain is larger for Planning and Logical Inference
  • New few-shot SOTA results on 7 datasets
  • Primary cause of errors is sparsity of Datalog in pretraining data

Analysis

  • Analyzed role of different components in pipeline
  • Greedy decoding used for analysis
  • Datasets used: GSM8K, Date Understanding, SayCan, and CLUTRR

Ablation study

  • Faithful CoT has strong performance
  • Ablation study to see how much each part of the prompt contributes to accuracy
  • NL comments contribute little to performance on GSM8K, Date Understanding, and SayCan
  • NL comments are crucial on CLUTRR
  • Nudge line brings striking improvement on CLUTRR
  • External solver relieves burden from LM, but can hurt performance on SayCan

Robustness to exemplars

  • Choice of exemplars does not significantly affect performance
  • Performance is still above baseline on all datasets except SayCan

Error analysis

  • 49% of errors are wrong subquestions
  • 24% of errors are wrong code
  • 12% of errors are semantic understanding errors
  • 7% of errors are generation cutoff errors
  • 5% of errors are wrong gold label errors
  • 3% of errors are missing subquestions

Conclusion

  • Propose Faithful CoT, a framework that decomposes complex reasoning into Translation and Problem Solving
  • Translation stage produces a reasoning chain in the form of interleaved natural and symbolic language
  • Problem-Solving stage calls an external solver to execute the reasoning chain and derive the final answer
  • Guarantees faithfulness of explanation
  • Demonstrate efficacy on 4 types of complex reasoning problems: Math Word Problems, Multi-hop QA, Planning, and Logical Inference
  • Sets new SOTA performance on 7 of 10 datasets
  • Synergy between faithfulness and performance
  • Analysis of strengths and weaknesses
  • Limitation: Translation stage is still opaque
  • Human evaluation of correctness of reasoning chains
  • NL comments in reasoning chain can serve as an interface for users
  • Inference cost per example is $0.01-$0.03
  • Hyper-parameters: temperature, max_tokens, n, frequency_penalty, presence_penalty
  • Unfaithful output from CoT method on 3 datasets
  • Graph validity, no over-dependency, no under-dependency constraints
  • Dataset details: statistics, number of few-shot exemplars, example inputs and outputs