Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Chain-of-Thought (CoT) prompting improves Language Models’ (LM) performance on complex reasoning tasks.
- Faithful CoT is a framework that breaks down a reasoning task into two stages: Translation and Problem Solving.
- Faithful CoT outperforms traditional CoT prompting on 9 out of 10 datasets.
- Faithful CoT achieves new state-of-the-art few-shot performance on 7 out of 10 datasets.
Paper Content
Introduction
- Complex reasoning tasks are difficult for language models
- Chain-of-Thought reasoning has improved performance
- Chain-of-Thought reasoning provides an interpretable window into the model’s behavior
- Chain-of-Thought reasoning lacks faithfulness, meaning explanations are not always accurate
Related work
- Faithfulness in interpretability means that an explanation should accurately represent the reasoning process behind the model’s prediction
- Plausibility refers to how convincing an explanation is to humans
- Chain-of-Thought-style prompting involves generating a reasoning chain and final answer given a complex question
- Three types of CoT-style prompting: all-at-once, ensemble-based, and modularized
- Faithfulness in NLG refers to the generated text being faithful to an explicit source
- Work is concurrent with two other papers, but has differences in generalizability, recasting of tasks, and interleaving of NL and SL
Method
- Faithful CoT is a 2-stage pipeline
- Prompt consists of (Q, C, A) triples
- Interleaves NL and SL in C
- Derives final answer A from reasoning chain C
- C NL and C SL are interleaved in generation
Math word problems (mwp)
- A grade-school math question is broken down into multiple smaller-scale subquestions.
- Each subquestion is accompanied with rationale(s) to support the answer.
- Python code is generated to answer each subquestion and the code is executed to derive the answer.
Multi-hop qa
- Given a complex question, the goal is to obtain an answer as a Boolean value or string value variable.
- The reasoning chain is slightly different depending on the nature of the task.
- The reasoning chain involves Boolean algebra, string comparisons, relation definitions, and logic programming.
- The answers are converted to Datalog statements and then combined to formalize the truth condition of the final answer.
Planning
- Given a household task query from a user, a plan of actions is created for the robot to accomplish the task.
- The query is translated into a symbolic goal in PDDL5, which is then used by a PDDL Planner to obtain a plan of actions.
Logical inference
- Given a logical inference problem Q written in NL, the goal is to obtain A as a string-valued variable
- CLUTRR dataset involves inferring family relationship between two people from a short story
- Translation stage prompts the LM to generate C, consisting of C NL and C SL
- C NL breaks down Q into subquestions and provides input extracts as rationales
- C SL answers subquestions via a logical expression representing the relation between Q
- Problem Solving stage uses a simple logical inference engine to derive A
- Evaluation datasets used for each domain are GSM8K, SVAMP, MultiArith, ASDiv, and AQuA
- 8-shot prompt used for all datasets except AQuA
- Multi-hop QA datasets are Strate-gyQA and SayCan
Evaluation metrics
- Performance of model is evaluated by accuracy of final answer
- For MWP datasets (except AQuA), correct answer is defined as exact match between prediction and ground truth rounded up to nearest integer
- For StrategyQA and Sports Understanding, correct answer is exact match between prediction and ground truth evaluated as Boolean variable
- For SayCan, generated plan is correct if it is among ground truth plans
- For other datasets, correct answer is exact match between prediction and ground truth strings
Language model
- OpenAI Codex (Chen et al., 2021) is used as the underlying language model for translation
- OpenAI Codex has 175B parameters
- Implementation details can be found in Appendix A
Baselines
- Standard few-shot prompting uses demonstrations of only the question and the answer
- CoT prompting additionally provides a reasoning chain in NL
- 8 prompting methods are compared under two decoding strategies: greedy decoding and self-consistency decoding
- Temperature of 0.4 and 40 generations used for all datasets
Results
- Faithful CoT outperforms CoT on 9/10 datasets
- Average accuracy gain is larger for Planning and Logical Inference
- New few-shot SOTA results on 7 datasets
- Primary cause of errors is sparsity of Datalog in pretraining data
Analysis
- Analyzed role of different components in pipeline
- Greedy decoding used for analysis
- Datasets used: GSM8K, Date Understanding, SayCan, and CLUTRR
Ablation study
- Faithful CoT has strong performance
- Ablation study to see how much each part of the prompt contributes to accuracy
- NL comments contribute little to performance on GSM8K, Date Understanding, and SayCan
- NL comments are crucial on CLUTRR
- Nudge line brings striking improvement on CLUTRR
- External solver relieves burden from LM, but can hurt performance on SayCan
Robustness to exemplars
- Choice of exemplars does not significantly affect performance
- Performance is still above baseline on all datasets except SayCan
Error analysis
- 49% of errors are wrong subquestions
- 24% of errors are wrong code
- 12% of errors are semantic understanding errors
- 7% of errors are generation cutoff errors
- 5% of errors are wrong gold label errors
- 3% of errors are missing subquestions
Conclusion
- Propose Faithful CoT, a framework that decomposes complex reasoning into Translation and Problem Solving
- Translation stage produces a reasoning chain in the form of interleaved natural and symbolic language
- Problem-Solving stage calls an external solver to execute the reasoning chain and derive the final answer
- Guarantees faithfulness of explanation
- Demonstrate efficacy on 4 types of complex reasoning problems: Math Word Problems, Multi-hop QA, Planning, and Logical Inference
- Sets new SOTA performance on 7 of 10 datasets
- Synergy between faithfulness and performance
- Analysis of strengths and weaknesses
- Limitation: Translation stage is still opaque
- Human evaluation of correctness of reasoning chains
- NL comments in reasoning chain can serve as an interface for users
- Inference cost per example is $0.01-$0.03
- Hyper-parameters: temperature, max_tokens, n, frequency_penalty, presence_penalty
- Unfaithful output from CoT method on 3 datasets
- Graph validity, no over-dependency, no under-dependency constraints
- Dataset details: statistics, number of few-shot exemplars, example inputs and outputs