Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • ProofNet is a benchmark for autoformalization and formal proving of undergraduate-level mathematics.
  • ProofNet consists of 371 examples, each with a formal theorem statement, natural language theorem statement, and natural language proof.
  • The problems are from popular undergraduate pure mathematics textbooks and cover topics such as real and complex analysis, linear algebra, abstract algebra, and topology.
  • The goal of ProofNet is to drive progress in autoformalization and automatic theorem proving.
  • Baseline results are reported on statement autoformalization via in-context learning.
  • Two novel statement autoformalization methods are introduced: prompt retrieval and distilled backtranslation.

Paper Content

Introduction

  • Creation of an automatic mathematician is a longstanding challenge
  • Neural generative language modelling is a promising approach
  • Treating mathematical reasoning in natural language as a sequence learning task
  • Abundance of natural language mathematics data on the internet
  • Using language models to guide formal proof-search in an interactive theorem prover
  • Autoformalization seeks to build a bridge between informal and formal mathematical reasoning
  • Lack of standard benchmarks to guide progress in the field
  • ProofNet benchmark consists of parallel natural language and formal mathematics
  • 371 parallel formal theorem statements, natural language theorem statements, and natural language proofs
  • PROOFGPT language models trained on 8 billion token dataset of mathematical text
  • Establish baselines for ProofNet theorem autoformalization
  • Two novel theorem autoformalization methods
  • Problems in the ProofNet benchmark drawn from exercises in popular undergraduate mathematics textbooks
  • Self-containment, naturality of formalization, low risk of train-test overlap criteria
  • PROOFGPT models outperform Pythia base models and GPT-2

Methodology and experiments

  • Evaluated capabilities of pre-trained language models on autoformalizing and informalizing theorem statements
  • Investigating formal theorem proving and proof autoformalization left to future work

Autoformalization methods

  • In-context learning is used as a baseline for autoformalization of theorem statements
  • Two novel methods are introduced to boost autoformalization performance: prompt retrieval and distilled backtranslation
  • In-context learning is a simple and powerful method for adapting language models
  • OpenAI API’s Code-davinci-002 endpoint and proofGPT 1.3B and 6.7B models are used
  • Prompts are listed in Appendix C
  • Autoformalizations and informalizations are judged by a human expert
  • Few-shot learning performance is sensitive to the prompt used
  • Prompt retrieval procedure is implemented to retrieve the most relevant few-shot examples
  • Knowledge-base K is created with 90,530 statements from Lean mathlib
  • OpenAI API’s embedding-ada-002 endpoint is used to generate text embeddings
  • Distilled backtranslation is used to fine-tune language models on autoformalization of theorem statements
  • Fine-tuning hyperparameters are described in Appendix D

Results and discussion

In-context learning

  • Autoformalization accuracy, typecheck rate, and compile rate are reported
  • BLEU scores are calculated by splitting on whitespace and using BLEU-4 with smoothing
  • Informalization accuracy is higher than formalization accuracy
  • Large pre-trained language models have a strong grasp of the semantics of formal mathematics
  • Among Code-davinci-002’s typechecked generations, roughly half are correct formalizations

Prompt retrieval and distilled backtranslation

  • Applying prompt retrieval to Code-davinci-002 model increases accuracy and typecheck rate
  • Distilled backtranslation improves performance of PROOFGPT 1.3B model
  • Typecheck rate is a good predictor of autoformalization performance
  • BLEU metric is a poor guide to formalization performance
  • Code-davinci-002 makes lexical and logical errors
  • Language models have been used in theorem proving in natural language and ITPs
  • Popular benchmarks for evaluating language model-based provers exist
  • Autoformalization with language models has been demonstrated
  • Work in formal theorem proving and autoformalization depends on libraries of formalized mathematics
  • Autoformalization faces many of the same challenges as unsupervised machine translation
  • Distilled backtranslation method inspired by Han et al. [2021a]

Conclusion

  • Introduced ProofNet benchmarking
  • Pre-trained large language models achieved non-trivial performance
  • Proposed prompt retrieval and distilled backtranslation to improve autoformalization performance
  • Trained and open-sourced PROOFGPT models
  • Results of few-shot learning with LLMs on formalization and informalization of ProofNet statements