Interactive theorem provers such as Coq can help
users establish strong, foundational guarantees
about the correctness of
software. Unfortunately, these guarantees often
come at the cost of significant manual proof
effort from expert users. While Large Language
Models (LLMs) have shown promise in
automatically generating informal natural
language proofs, they have proven less effective
at generating formal proofs in interactive
theorem provers.
In this paper, we first conduct a formative
study to identify common mistakes made by LLMs
when asked to generate formal proofs using the
Coq proof assistant. By analyzing 520 proof-generation
errors made by GPT-3.5, we find that it often produces
proofs whose high-level structure is correct but whose
low-level details are wrong. Based on this
lower-level details are wrong. Based on this
observation, we propose PALM, a novel
generate-then-repair approach that first prompts
an LLM to generate an initial proof and then
leverages targeted symbolic methods to
iteratively repair low-level problems.
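To make the generate-then-repair idea concrete, the following is a minimal conceptual sketch of such a loop, assuming hypothetical helpers llm_generate_proof, coq_check, and repair_step; these names are illustrative and do not reflect PALM's actual interface:

```python
# Conceptual generate-then-repair loop. All helpers below are
# hypothetical placeholders, not PALM's real implementation.
from typing import Optional


def llm_generate_proof(theorem: str) -> list[str]:
    """Placeholder: prompt an LLM for an initial tactic script."""
    return ["intros.", "reflexivity."]


def coq_check(theorem: str, tactics: list[str]) -> Optional[int]:
    """Placeholder: run the proof checker; return the index of the
    first failing tactic, or None if the proof is accepted."""
    return None


def repair_step(tactics: list[str], failed_at: int) -> list[str]:
    """Placeholder: apply a targeted symbolic repair to the
    failing tactic while keeping the rest of the script intact."""
    return tactics[:failed_at] + ["auto."] + tactics[failed_at + 1:]


def generate_then_repair(theorem: str, max_rounds: int = 5) -> Optional[list[str]]:
    tactics = llm_generate_proof(theorem)        # initial LLM proof
    for _ in range(max_rounds):
        failed_at = coq_check(theorem, tactics)  # locate first error
        if failed_at is None:
            return tactics                       # proof accepted
        tactics = repair_step(tactics, failed_at)  # fix low-level detail
    return None                                  # repair budget exhausted
```

The sketch reflects the division of labor described above: the LLM supplies the high-level proof structure once, and the iterative symbolic repairs only touch the low-level steps that fail to check.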
We evaluate PALM on a large dataset that
includes more than 10K theorems. Our results
show that PALM significantly outperforms other
state-of-the-art approaches, successfully
proving between 76% and 180% more theorems than
its competitors. Moreover, PALM proves 1270
theorems beyond the reach of existing
approaches. We also demonstrate the
generalizability of PALM across different LLMs.