Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

ProofNet is a benchmark for autoformalization and formal proving of undergraduate-level mathematics.
ProofNet consists of 371 examples, each with a formal theorem statement, natural language theorem statement, and natural language proof.
The problems are from popular undergraduate pure mathematics textbooks and cover topics such as real and complex analysis, linear algebra, abstract algebra, and topology.
The goal of ProofNet is to drive progress in autoformalization and automatic theorem proving.
Baseline results are reported on statement autoformalization via in-context learning.
Two novel statement autoformalization methods are introduced: prompt retrieval and distilled backtranslation.

Creation of an automatic mathematician is a longstanding challenge
Neural generative language modelling is a promising approach
Treating mathematical reasoning in natural language as a sequence learning task
Abundance of natural language mathematics data on the internet
Using language models to guide formal proof-search in an interactive theorem prover
Autoformalization seeks to build a bridge between informal and formal mathematical reasoning
Lack of standard benchmarks to guide progress in the field
ProofNet benchmark consists of parallel natural language and formal mathematics
371 parallel formal theorem statements, natural language theorem statements, and natural language proofs
PROOFGPT language models trained on 8 billion token dataset of mathematical text
Establish baselines for ProofNet theorem autoformalization
Two novel theorem autoformalization methods
Problems in the ProofNet benchmark drawn from exercises in popular undergraduate mathematics textbooks
Self-containment, naturality of formalization, low risk of train-test overlap criteria
PROOFGPT models outperform Pythia base models and GPT-2

Evaluated capabilities of pre-trained language models on autoformalizing and informalizing theorem statements
Investigating formal theorem proving and proof autoformalization left to future work

In-context learning is used as a baseline for autoformalization of theorem statements
Two novel methods are introduced to boost autoformalization performance: prompt retrieval and distilled backtranslation
In-context learning is a simple and powerful method for adapting language models
OpenAI API’s Code-davinci-002 endpoint and proofGPT 1.3B and 6.7B models are used
Prompts are listed in Appendix C
Autoformalizations and informalizations are judged by a human expert
Few-shot learning performance is sensitive to the prompt used
Prompt retrieval procedure is implemented to retrieve the most relevant few-shot examples
Knowledge-base K is created with 90,530 statements from Lean mathlib
OpenAI API’s embedding-ada-002 endpoint is used to generate text embeddings
Distilled backtranslation is used to fine-tune language models on autoformalization of theorem statements
Fine-tuning hyperparameters are described in Appendix D

Autoformalization accuracy, typecheck rate, and compile rate are reported
BLEU scores are calculated by splitting on whitespace and using BLEU-4 with smoothing
Informalization accuracy is higher than formalization accuracy
Large pre-trained language models have a strong grasp of the semantics of formal mathematics
Among Code-davinci-002’s typechecked generations, roughly half are correct formalizations

Applying prompt retrieval to Code-davinci-002 model increases accuracy and typecheck rate
Distilled backtranslation improves performance of PROOFGPT 1.3B model
Typecheck rate is a good predictor of autoformalization performance
BLEU metric is a poor guide to formalization performance
Code-davinci-002 makes lexical and logical errors

Language models have been used in theorem proving in natural language and ITPs
Popular benchmarks for evaluating language model-based provers exist
Autoformalization with language models has been demonstrated
Work in formal theorem proving and autoformalization depends on libraries of formalized mathematics
Autoformalization faces many of the same challenges as unsupervised machine translation
Distilled backtranslation method inspired by Han et al. [2021a]

Introduced ProofNet benchmarking
Pre-trained large language models achieved non-trivial performance
Proposed prompt retrieval and distilled backtranslation to improve autoformalization performance
Trained and open-sourced PROOFGPT models
Results of few-shot learning with LLMs on formalization and informalization of ProofNet statements