Proof Automation with Large Language Models

Minghai Lu, Benjamin Delaware, and Tianyi Zhang

Abstract

Interactive theorem provers such as Coq can help users establish strong, foundational guarantees about the correctness of software. Unfortunately, these guarantees often come at the cost of significant manual proof effort from expert users. While Large Language Models (LLMs) have shown promise in automatically generating informal natural language proofs, they have proven less effective at generating formal proofs in interactive theorem provers.

In this paper, we first conduct a formative study to identify common mistakes made by LLMs when asked to generate formal proofs using the Coq proof assistant. By analyzing 520 proof generation errors made by GPT-3.5, we find that it often produces a proof with a correct high-level structure but with many incorrect low-level details. Based on this observation, we propose PALM, a novel generate-then-repair approach that first prompts an LLM to generate an initial proof and then leverages targeted symbolic methods to iteratively repair low-level problems.
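To make the generate-then-repair loop concrete, the sketch below shows one way such a pipeline could be structured. All helper names (generate_proof_with_llm, check_with_coq, repair) are hypothetical placeholders introduced for illustration; this is a sketch of the workflow described above, not PALM's actual implementation or API.

```python
# A minimal sketch of a generate-then-repair proof pipeline.
# Every helper below is a hypothetical stub, not PALM's real interface.

from dataclasses import dataclass


@dataclass
class CoqResult:
    success: bool
    failing_step: int | None = None  # index of the first failing tactic
    message: str = ""                # Coq's error message, if any


def generate_proof_with_llm(theorem: str) -> list[str]:
    """Hypothetical: ask an LLM for an initial tactic-by-tactic proof script."""
    raise NotImplementedError


def check_with_coq(theorem: str, tactics: list[str]) -> CoqResult:
    """Hypothetical: replay the tactic script in Coq and report the first error."""
    raise NotImplementedError


def repair(tactics: list[str], result: CoqResult) -> list[str] | None:
    """Hypothetical: apply a targeted symbolic fix at the failing step,
    e.g., substituting an alternative tactic; returns None if no fix applies."""
    raise NotImplementedError


def prove(theorem: str, max_repairs: int = 10) -> list[str] | None:
    """Generate an initial proof, then iteratively repair low-level errors."""
    tactics = generate_proof_with_llm(theorem)
    for _ in range(max_repairs):
        result = check_with_coq(theorem, tactics)
        if result.success:
            return tactics  # Coq accepts the proof as-is
        tactics = repair(tactics, result)
        if tactics is None:
            break  # no applicable repair; give up on this theorem
    return None
```

The key design point this sketch illustrates is that the LLM is queried once for the overall proof skeleton, while the inner loop relies on symbolic feedback from Coq to localize and patch individual failing steps.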

We evaluate PALM on a large dataset that includes more than 10K theorems. Our results show that PALM significantly outperforms other state-of-the-art approaches, successfully proving between 76% and 180% more theorems than its competitors. Moreover, PALM proves 1270 theorems beyond the reach of existing approaches. We also demonstrate the generalizability of PALM across different LLMs.