Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Chain-of-Thought (CoT) prompting improves Language Models’ (LM) performance on complex reasoning tasks.
Faithful CoT is a framework that breaks down a reasoning task into two stages: Translation and Problem Solving.
Faithful CoT outperforms traditional CoT prompting on 9 out of 10 datasets.
Faithful CoT achieves new state-of-the-art few-shot performance on 7 out of 10 datasets.

Paper Content

Introduction

Complex reasoning tasks are difficult for language models
Chain-of-Thought reasoning has improved performance
Chain-of-Thought reasoning provides an interpretable window into the model’s behavior
Chain-of-Thought reasoning lacks faithfulness, meaning explanations are not always accurate

Related work

Faithfulness in interpretability means that an explanation should accurately represent the reasoning process behind the model’s prediction
Plausibility refers to how convincing an explanation is to humans
Chain-of-Thought-style prompting involves generating a reasoning chain and final answer given a complex question
Three types of CoT-style prompting: all-at-once, ensemble-based, and modularized
Faithfulness in NLG refers to the generated text being faithful to an explicit source
This work is concurrent with two other papers, but differs in generalizability, recasting of tasks, and interleaving of natural language (NL) and symbolic language (SL)

Method

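Faithful CoT is a 2-stage pipeline: Translation, then Problem Solving
As in other CoT-style work, the prompt consists of (Q, C, A) triples: question, reasoning chain, and answer
The reasoning chain C interleaves natural language (C_NL) and symbolic language (C_SL)
The final answer A is derived from the reasoning chain C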
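
The two stages can be illustrated with a minimal sketch. This is not the paper's implementation: the LM call is replaced by a hard-coded stub, and every name here (`stub_translate`, `python_solver`, `faithful_cot`) is hypothetical. The only point it makes is that A is computed by running the symbolic part of C, so the chain and the answer cannot disagree.

```python
from typing import Callable, Tuple


def stub_translate(question: str) -> str:
    """Stand-in for the Translation stage (hypothetical; a real system prompts an LM).

    Returns a reasoning chain C: NL comments (C_NL) interleaved with Python (C_SL).
    """
    return (
        "# 1. How many apples does Ann start with? (stated in the question)\n"
        "apples_start = 5\n"
        "# 2. How many apples does she give away? (stated in the question)\n"
        "apples_given = 2\n"
        "# 3. Final answer: how many apples are left? (depends on 1 and 2)\n"
        "answer = apples_start - apples_given\n"
    )


def python_solver(chain: str) -> object:
    """Stand-in for the Problem Solving stage: execute C_SL and read off the answer."""
    namespace: dict = {}
    exec(chain, namespace)  # acceptable for this trusted stub; untrusted output needs a sandbox
    return namespace["answer"]


def faithful_cot(
    question: str,
    translate: Callable[[str], str],
    solve: Callable[[str], object],
) -> Tuple[str, object]:
    chain = translate(question)  # stage 1: Q -> C
    answer = solve(chain)        # stage 2: C -> A
    return chain, answer


chain, answer = faithful_cot(
    "Ann has 5 apples and gives away 2. How many are left?",
    stub_translate,
    python_solver,
)
print(answer)
```

Swapping `python_solver` for a different solver is what lets the same outline cover the other task types.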

Math word problems (MWP)

A grade-school math question is broken down into multiple smaller-scale subquestions.
Each subquestion is accompanied by one or more rationales that support the answer.
Python code is generated to answer each subquestion and the code is executed to derive the answer.
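
As a rough illustration of what such a chain could look like, the sketch below hand-writes a reasoning chain for a grade-school style question and executes it one subquestion at a time. The comment format, the variable names, and the `run_chain` helper are assumptions made for this example, not the paper's exact prompt format.

```python
# Hand-written reasoning chain for an illustrative question:
# "Ducks lay 16 eggs a day. 3 are eaten and 4 are baked. The rest sell for $2 each.
#  How much money is made per day?"
CHAIN = """\
# 1. How many eggs are laid per day? (independent, support: "lay 16 eggs")
eggs_per_day = 16
# 2. How many eggs are used at home per day? (independent, support: "3 are eaten and 4 are baked")
eggs_used = 3 + 4
# 3. How many eggs are sold per day? (depends on 1 and 2)
eggs_sold = eggs_per_day - eggs_used
# 4. Final answer: how much money is made per day? (depends on 3, support: "$2 each")
answer = eggs_sold * 2
"""


def run_chain(chain: str):
    """Execute each code line and record (subquestion, variable, value) for inspection."""
    namespace: dict = {}
    trace = []
    subquestion = None
    for line in chain.splitlines():
        if line.startswith("#"):
            subquestion = line.lstrip("# ").strip()  # NL part of the step
        elif line.strip():
            exec(line, namespace)                    # symbolic part of the step
            name = line.split("=")[0].strip()
            trace.append((subquestion, name, namespace[name]))
    return namespace["answer"], trace


answer, trace = run_chain(CHAIN)
for subquestion, name, value in trace:
    print(f"{name} = {value}    <- {subquestion}")
```

The trace makes every intermediate value visible next to the subquestion that produced it.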

Multi-hop QA

Given a complex question, the goal is to obtain an answer as a Boolean or string-valued variable.
The reasoning chain is slightly different depending on the nature of the task.
The reasoning chain involves Boolean algebra, string comparisons, relation definitions, and logic programming.
The answers are converted to Datalog statements and then combined to formalize the truth condition of the final answer.
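
The idea of stating facts and rules and then evaluating a truth condition can be mimicked in plain Python. The sketch below is only an analogy to a Datalog program: it is not Datalog, the question and the density values are illustrative assumptions, and the `sinks` helper is hypothetical.

```python
# Illustrative question: "Would a pear sink in water?"

# Facts, as the Translation stage might assert them (values are rough, for illustration only).
FACTS = {
    ("density", "pear", 0.6),   # assumed value, g/cm^3
    ("density", "water", 1.0),
}


def sinks(facts):
    """Rule, in Datalog-like reading: Sinks(X, Y) :- Density(X, dx), Density(Y, dy), dx > dy."""
    densities = [(obj, d) for (relation, obj, d) in facts if relation == "density"]
    return {(x, y) for (x, dx) in densities for (y, dy) in densities if dx > dy}


# Truth condition of the final answer: the pair (pear, water) is derivable from the rule.
answer = ("pear", "water") in sinks(FACTS)
print(answer)
```

Because the answer is whatever the rule evaluation returns, a wrong answer can be traced back to a wrong fact or a wrong rule.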

Planning

Given a household task query from a user, a plan of actions is created for the robot to accomplish the task.
The query is translated into a symbolic goal in PDDL, which is then used by a PDDL Planner to obtain a plan of actions.
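
To make the planner's role concrete, here is a small breadth-first planner over sets of facts. It is a toy stand-in for a PDDL planner, not PDDL itself, and the action names, preconditions, and effects are invented for this example.

```python
from collections import deque
from typing import NamedTuple, Optional


class Action(NamedTuple):
    name: str
    pre: frozenset     # facts that must hold before the action
    add: frozenset     # facts made true by the action
    delete: frozenset  # facts made false by the action


def plan(init, goal, actions) -> Optional[list]:
    """Breadth-first search for a shortest action sequence that reaches the goal."""
    start, goal = frozenset(init), frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for action in actions:
            if action.pre <= state:
                nxt = (state - action.delete) | action.add
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [action.name]))
    return None  # goal unreachable


f = frozenset
ACTIONS = [
    Action("find(coke)", f(), f({"at_coke"}), f({"at_table"})),
    Action("pick(coke)", f({"at_coke"}), f({"holding_coke"}), f()),
    Action("find(table)", f(), f({"at_table"}), f({"at_coke"})),
    Action("put(coke, table)", f({"holding_coke", "at_table"}),
           f({"coke_on_table"}), f({"holding_coke"})),
]

# Symbolic goal that a query like "bring a coke to the table" might translate to.
steps = plan(init=set(), goal={"coke_on_table"}, actions=ACTIONS)
print(steps)
```

In this division of labor the LM only has to produce the goal; the search for a valid action sequence is left to the solver.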

Logical inference

Given a logical inference problem Q written in NL, the goal is to obtain A as a string-valued variable
The CLUTRR dataset involves inferring the family relationship between two people from a short story
The Translation stage prompts the LM to generate C, consisting of C_NL and C_SL
C_NL breaks Q down into subquestions and provides input extracts as rationales
C_SL answers each subquestion with a logical expression representing the relation between the people mentioned in Q
The Problem Solving stage uses a simple logical inference engine to derive A
The MWP evaluation datasets are GSM8K, SVAMP, MultiArith, ASDiv, and AQuA
An 8-shot prompt is used for all MWP datasets except AQuA
The Multi-hop QA datasets are StrategyQA, Date Understanding, and Sports Understanding; SayCan is the Planning dataset and CLUTRR is the Logical Inference dataset
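
For the CLUTRR-style Problem Solving stage, a simple inference engine can be pictured as a lookup table of relation compositions that is folded over a chain of relations. The sketch below is an assumption-laden illustration: the `COMPOSE` table has only four hand-picked rules, and it assumes that siblings share parents.

```python
from functools import reduce

# COMPOSE[(r1, r2)] = r means: if B is A's r1 and C is B's r2, then C is A's r.
# Only a few illustrative rules are listed; a usable table would be much larger.
COMPOSE = {
    ("mother", "brother"): "uncle",
    ("mother", "mother"): "grandmother",
    ("brother", "mother"): "mother",   # assumes siblings share a mother
    ("uncle", "daughter"): "cousin",
}


def infer(path):
    """Collapse a chain of relations, left to right, into a single relation."""
    return reduce(lambda r1, r2: COMPOSE[(r1, r2)], path)


# Story (illustrative): "Tom's brother is Sam. Sam's mother is Carol. Carol's brother is Mike."
# Question: how is Mike related to Tom?
answer = infer(["brother", "mother", "brother"])
print(answer)
```

A missing rule raises `KeyError`, which makes gaps in the rule table visible instead of silently guessing.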

Evaluation metrics

Performance of model is evaluated by accuracy of final answer
For MWP datasets (except AQuA), correct answer is defined as exact match between prediction and ground truth rounded up to nearest integer
For StrategyQA and Sports Understanding, correct answer is exact match between prediction and ground truth evaluated as Boolean variable
For SayCan, generated plan is correct if it is among ground truth plans
For other datasets, correct answer is exact match between prediction and ground truth strings
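
The per-dataset correctness rules above can be written down as small scoring functions. This is a sketch with hypothetical function names; in particular, it reads "rounded up" literally as `math.ceil`, which is an assumption (replace it with `round` if nearest-integer rounding is intended).

```python
import math


def mwp_correct(pred: float, gold: float) -> bool:
    """MWP datasets except AQuA: exact match after rounding both values up."""
    return math.ceil(pred) == math.ceil(gold)


def boolean_correct(pred, gold) -> bool:
    """StrategyQA and Sports Understanding: compare both values as Booleans."""
    return bool(pred) == bool(gold)


def plan_correct(pred_plan, gold_plans) -> bool:
    """SayCan: the generated plan must be one of the ground-truth plans."""
    return list(pred_plan) in [list(p) for p in gold_plans]


def string_correct(pred: str, gold: str) -> bool:
    """Other datasets: exact string match."""
    return pred == gold


def accuracy(flags) -> float:
    """Fraction of examples judged correct."""
    flags = list(flags)
    return sum(flags) / len(flags)


results = [
    mwp_correct(17.2, 18),
    boolean_correct(1, True),
    plan_correct(["find coke", "pick coke"], [["find coke", "pick coke"]]),
    string_correct("uncle", "aunt"),
]
print(accuracy(results))
```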

Language model

OpenAI Codex (Chen et al., 2021) is used as the underlying language model for translation
OpenAI Codex has 175B parameters
Implementation details can be found in Appendix A

Baselines

Standard few-shot prompting uses demonstrations of only the question and the answer
CoT prompting additionally provides a reasoning chain in NL
8 prompting methods are compared under two decoding strategies: greedy decoding and self-consistency decoding
Temperature of 0.4 and 40 generations used for all datasets
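
Self-consistency decoding amounts to sampling several reasoning chains and taking a majority vote over their final answers. The sketch below shows only the voting step on made-up sampled answers; the sampling itself (temperature 0.4, 40 generations) is not performed here, and treating failed executions as `None` is an assumption of this example.

```python
from collections import Counter

TEMPERATURE = 0.4      # sampling temperature stated above (not used in this sketch)
NUM_GENERATIONS = 40   # number of sampled chains per question stated above


def majority_vote(answers):
    """Return the most frequent answer, ignoring chains that produced no answer."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Made-up final answers from 40 sampled chains; None marks a chain that failed to execute.
sampled = [18] * 25 + [20] * 10 + [None] * 5
assert len(sampled) == NUM_GENERATIONS
final_answer = majority_vote(sampled)
print(final_answer)
```

Greedy decoding corresponds to the degenerate case of a single chain, where the vote is trivial.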

Results

Faithful CoT outperforms CoT on 9/10 datasets
Average accuracy gain is larger for Planning and Logical Inference
New few-shot SOTA results on 7 datasets
On StrategyQA, the one dataset where Faithful CoT does not outperform CoT, the likely primary cause of errors is the sparsity of Datalog in the pretraining data

Analysis

Analyzed role of different components in pipeline
Greedy decoding used for analysis
Datasets used: GSM8K, Date Understanding, SayCan, and CLUTRR

Ablation study

Faithful CoT has strong performance
Ablation study to see how much each part of the prompt contributes to accuracy
NL comments contribute little to performance on GSM8K, Date Understanding, and SayCan
NL comments are crucial on CLUTRR
Nudge line brings striking improvement on CLUTRR
External solver relieves burden from LM, but can hurt performance on SayCan

Robustness to exemplars

Choice of exemplars does not significantly affect performance
Performance is still above baseline on all datasets except SayCan

Error analysis

49% of errors are wrong subquestions
24% of errors are wrong code
12% of errors are semantic understanding errors
7% of errors are generation cutoff errors
5% of errors are wrong gold label errors
3% of errors are missing subquestions

Conclusion

Propose Faithful CoT, a framework that decomposes complex reasoning into Translation and Problem Solving
Translation stage produces a reasoning chain in the form of interleaved natural and symbolic language
Problem-Solving stage calls an external solver to execute the reasoning chain and derive the final answer
Guarantees faithfulness of explanation
Demonstrate efficacy on 4 types of complex reasoning problems: Math Word Problems, Multi-hop QA, Planning, and Logical Inference
Sets new SOTA performance on 7 of 10 datasets
Synergy between faithfulness and performance
Analysis of strengths and weaknesses
Limitation: Translation stage is still opaque
Human evaluation of correctness of reasoning chains
NL comments in reasoning chain can serve as an interface for users
Inference cost per example is $0.01-$0.03
Hyper-parameters: temperature, max_tokens, n, frequency_penalty, presence_penalty
Unfaithful output from CoT method on 3 datasets
Graph validity, no over-dependency, no under-dependency constraints
Dataset details: statistics, number of few-shot exemplars, example inputs and outputs
