Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- ProofNet is a benchmark for autoformalization and formal proving of undergraduate-level mathematics.
- ProofNet consists of 371 examples, each with a formal theorem statement, natural language theorem statement, and natural language proof.
- The problems are from popular undergraduate pure mathematics textbooks and cover topics such as real and complex analysis, linear algebra, abstract algebra, and topology.
- The goal of ProofNet is to drive progress in autoformalization and automatic theorem proving.
- Baseline results are reported on statement autoformalization via in-context learning.
- Two novel statement autoformalization methods are introduced: prompt retrieval and distilled backtranslation.
Paper Content
Introduction
- Creation of an automatic mathematician is a longstanding challenge
- Neural generative language modelling is a promising approach
- Treating mathematical reasoning in natural language as a sequence learning task
- Abundance of natural language mathematics data on the internet
- Using language models to guide formal proof-search in an interactive theorem prover
- Autoformalization seeks to build a bridge between informal and formal mathematical reasoning
- Lack of standard benchmarks to guide progress in the field
- ProofNet benchmark consists of parallel natural language and formal mathematics
- 371 parallel formal theorem statements, natural language theorem statements, and natural language proofs
- PROOFGPT language models trained on 8 billion token dataset of mathematical text
- Establish baselines for ProofNet theorem autoformalization
- Two novel theorem autoformalization methods
- Problems in the ProofNet benchmark drawn from exercises in popular undergraduate mathematics textbooks
- Self-containment, naturality of formalization, low risk of train-test overlap criteria
- PROOFGPT models outperform Pythia base models and GPT-2
Methodology and experiments
- Evaluated capabilities of pre-trained language models on autoformalizing and informalizing theorem statements
- Investigating formal theorem proving and proof autoformalization left to future work
Autoformalization methods
- In-context learning is used as a baseline for autoformalization of theorem statements
- Two novel methods are introduced to boost autoformalization performance: prompt retrieval and distilled backtranslation
- In-context learning is a simple and powerful method for adapting language models
- OpenAI API’s Code-davinci-002 endpoint and proofGPT 1.3B and 6.7B models are used
- Prompts are listed in Appendix C
- Autoformalizations and informalizations are judged by a human expert
- Few-shot learning performance is sensitive to the prompt used
- Prompt retrieval procedure is implemented to retrieve the most relevant few-shot examples
- Knowledge-base K is created with 90,530 statements from Lean mathlib
- OpenAI API’s embedding-ada-002 endpoint is used to generate text embeddings
- Distilled backtranslation is used to fine-tune language models on autoformalization of theorem statements
- Fine-tuning hyperparameters are described in Appendix D
Results and discussion
In-context learning
- Autoformalization accuracy, typecheck rate, and compile rate are reported
- BLEU scores are calculated by splitting on whitespace and using BLEU-4 with smoothing
- Informalization accuracy is higher than formalization accuracy
- Large pre-trained language models have a strong grasp of the semantics of formal mathematics
- Among Code-davinci-002’s typechecked generations, roughly half are correct formalizations
Prompt retrieval and distilled backtranslation
- Applying prompt retrieval to Code-davinci-002 model increases accuracy and typecheck rate
- Distilled backtranslation improves performance of PROOFGPT 1.3B model
- Typecheck rate is a good predictor of autoformalization performance
- BLEU metric is a poor guide to formalization performance
- Code-davinci-002 makes lexical and logical errors
Related work
- Language models have been used in theorem proving in natural language and ITPs
- Popular benchmarks for evaluating language model-based provers exist
- Autoformalization with language models has been demonstrated
- Work in formal theorem proving and autoformalization depends on libraries of formalized mathematics
- Autoformalization faces many of the same challenges as unsupervised machine translation
- Distilled backtranslation method inspired by Han et al. [2021a]
Conclusion
- Introduced ProofNet benchmarking
- Pre-trained large language models achieved non-trivial performance
- Proposed prompt retrieval and distilled backtranslation to improve autoformalization performance
- Trained and open-sourced PROOFGPT models
- Results of few-shot learning with LLMs on formalization and informalization of ProofNet statements