\addbibresource

references.bib

CoqPyt: Proof Navigation in Python in the Era of LLMs

Pedro Carrott [email protected] 0000-0003-4316-928X Imperial College LondonLondonUK , Nuno Saavedra [email protected] 0000-0003-4148-5991 INESC-ID/IST, University of LisbonLisbonPortugal , Kyle Thompson 0000-0002-2868-7612 [email protected] University of California San DiegoSan DiegoUSA , Sorin Lerner 0000-0003-3957-0628 [email protected] University of California San DiegoSan DiegoUSA , João F. Ferreira 0000-0002-6612-9013 [email protected] INESC-ID/IST, University of LisbonLisbonPortugal and Emily First 0000-0002-2896-2928 [email protected] University of California San DiegoSan DiegoUSA

(2024)

Abstract.

Proof assistants enable users to develop machine-checked proofs regarding software-related properties. Unfortunately, the interactive nature of these proof assistants imposes most of the proof burden on the user, making formal verification a complex, and time-consuming endeavor. Recent automation techniques based on neural methods address this issue, but require good programmatic support for collecting data and interacting with proof assistants. This paper presents CoqPyt, a Python tool for interacting with the Coq proof assistant. CoqPyt improves on other Coq-related tools by providing novel features, such as the extraction of rich premise data. We expect our work to aid development of tools and techniques, especially LLM-based, designed for proof synthesis and repair. A video describing and demonstrating CoqPyt is available at: https://youtu.be/fk74o0rePM8.

Coq, Data extraction, Theorem proving, Retrieval augmentation

^†^†conference: Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering; July 15–19, 2024; Ipojuca (Pernambuco), Brazil^†^†booktitle: Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE Companion ’24), July 15–19, 2024, Ipojuca (Pernambuco), Brazil^†^†copyright: rightsretained^†^†journalyear: 2024^†^†doi: 10.1145/3663529.3663814^†^†isbn: 979-8-4007-0658-5/24/07^†^†ccs: Software and its engineering Software notations and tools^†^†ccs: Software and its engineering Development frameworks and environments

1. Introduction

Formal software verification is an incredibly effective method for develo** high quality software, as it ensures that a software program adheres to a predefined formal specification. For example, a study compared industrial standard compilers (e.g., GCC and LLVM) to CompCert, a C compiler verified using the Coq proof assistant (coq), and CompCert was the only one for which no bugs were found (Yang11a). Unfortunately, even though formal software verification provides valuable guarantees, its development is still too costly. CompCert took 100,000 lines of Coq code and 6 person-years to formally verify (leroy2016compcert). For this reason, it is important to find methods that decrease the cost of formal verification.

One recent approach to address this is the use of neural methods to automate proof synthesis. Neural theorem provers, given a partial proof and the proof state, use neural networks to predict the next likely proof steps. A proof assistant is then used to evaluate the candidate proof steps, returning either new proof states or errors. Neural theorem provers iterate on this procedure, performing proof search to traverse the space of possible proofs. A complementary task is automated proof repair (ringer2021proof), which is necessary to perform when specifications or dependencies change and result in broken proofs. To train these neural models and implement proof search, sufficient programmatic support is thus required for collecting training examples and driving the theorem prover.

In this paper, our target proof assistant is Coq due to its popularity for building verified software systems. Existing tools, such as Coq Serapy (sanchez2020generating) and PyCoq (pycoq), provide programmatic support for interacting with Coq in Python, and all current neural theorem provers for Coq utilize them. These tools implement Python bindings for Coq SerAPI (GallegoArias2016SerAPI), which is a library for machine-to-machine interaction with Coq. However, with the announcement that Coq SerAPI will be deprecated and replaced with Coq LSP (coq-lsp), a language server for Coq, the need arises for a new tool to continue supporting existing features. Fortunately, this also creates a great opportunity to support new features that prior tools did not.

In the emerging era of large language models (LLMs), additional functionality is desirable in tools for interfacing with Coq in Python in order to promote LLM-based approaches for Coq proof synthesis and repair. Namely, the ability to collect rich contextual data is not supported by any of the existing programmatic tools for controlling Coq. LLM-based neural theorem provers for Lean and Isabelle proof assistants rely on retrieval augmentation, wherein they retrieve premises, such as lemmas and definitions, that are relevant to the proof goal, and condition the next tactic generation on those premises (yang2023leandojo; mikuła2023magnushammer). For the approach to be effective, they collect fine-grained annotations of which premises are used in proofs as well as which premises are accessible from a proof. However, existing tools for Coq do not collect that type of premise data.

Refer to caption — Figure 1. UML diagram for a simplified view of the domain of CoqPyt. The selected attributes and methods are all public.

We introduce CoqPyt, a tool that enables interaction with Coq in Python using Coq LSP as the backend. CoqPyt continues to support existing capabilities essential to training data collection and proof search, but goes beyond state-of-the-art tools for Coq by supporting the ability to collect rich contextual data, among other novel features. Our tool supports the newest versions of Coq (8.17 to 8.19), which will facilitate mining software repositories for new, enriched training and evaluation data. We foresee that our work will help promote future research into neural proof synthesis and repair for Coq, especially as LLM capabilities continue to evolve. CoqPyt is open-source and available at: https://github.com/sr-lab/coqpyt.

2. CoqPyt

CoqPyt is a Python tool that enables developers and researchers to interact programmatically with Coq. It also allows users to extract textual and structural information from Coq files. We now describe the structure and main functionalities of CoqPyt, providing an illustrative example on how to interact with the tool.

2.1. The Coq Proof Assistant

Coq provides an interactive environment with a rich type system suitable for theorem proving. A theorem in Coq is a type definition, which can be proven by constructing a proof term with the stated theorem type. Since writing a proof term directly is difficult, Coq allows users to write a proof script consisting of a sequence of high-level tactics (e.g., induction or reflexivity). Each tactic guides Coq in a search for the desired proof term, refining the state until no new obligations hold. After a tactic application, Coq displays the current proof state, which includes the goals to prove and the local context. When executed, a complete proof script generates the proof term.

2.2. Coq Language Server Protocol

To interact with Coq files, we use Coq LSP (coq-lsp), a language server that provides Coq-related features. Coq LSP was developed to replace Coq SerAPI (GallegoArias2016SerAPI), a well-known tool in the Coq community that implements a protocol for interacting specifically with Coq files. Unlike Coq SerAPI, Coq LSP follows a standardized protocol for interacting with text documents known as Language Server Protocol (LSP) (lsp). LSP servers, such as Coq LSP, provide useful language features, such as auto complete. Client applications, such as IDEs, can use these features by running their own LSP client.

Besides the usual LSP features, Coq LSP provides Coq-specific functionality, namely requesting the proof state at any given point of a Coq file. Coq LSP also allows users to request the abstract syntax tree (AST) of the entire document, which corresponds to a list of steps in the file, each with its own AST representation. A step can be a term definition (e.g., definitions and theorems) or a proof step (e.g., mid-proof tactic application), determined by the AST contents of each step. For erroneous Coq files, Coq LSP is able to ignore erroneous steps and evaluate the remaining steps of the file, allowing any subsequent non-erroneous steps to define new terms. Coq LSP also enables project-wise imports by saving compiled .vo files, which may also benefit from this error-ignoring approach, given that valid files may still import terms from invalid files.

2.3. Interaction with Coq Files

CoqPyt implements a client for Coq LSP providing the interface shown in Figure 1. This client implementation is encapsulated within CoqFile, a class that abstracts the state of an actual Coq file, hiding the interaction with Coq LSP. For instance, the is_valid boolean flag indicates if the underlying Coq file is valid and the save_vo method allows the user to compile the file.

A CoqFile object provides an exec method which allows users to take an arbirary number of forward and backward steps through the file. The method returns a list of Steps corresponding to the steps that were executed or backtracked during the call. Each step contains its textual representation in the file and the AST contents.

Given that steps may be term definitions, a FileContext object is kept within the CoqFile, indexing by name all terms that have been defined until the last executed step. Any term defined inside a module has its name prefixed by the module path. Although notations are nameless terms, these are also stored in the context, indexed by the string pattern used to define them. For each defined term, a Term object is created containing its corresponding step, as well as its term type. This type depends on whether the term is a theorem, a notation, a tactic or any other definable Coq construct.

2.4. Proof Navigation

With CoqPyt, it is also possible to track the proof state throughout the file. A CoqFile can be instantiated through the ProofFile subclass, which allows users to manage the file’s proof context. As steps are taken, one can enter (or leave) proof mode, which will activate (or deactivate) the in_proof boolean flag. If there are any on-going goals, these can be accessed via the current_goals attribute. Goals are represented as a GoalAnswer object, which mimics the structure of goals in the Coq LSP response (coq-lsp).

In addition to allowing users to manage a file’s proof context, ProofFile instances also fetch all of the file’s imported terms. Thus, a ProofFile captures terms in a file and in its dependencies. In practice, the ProofFile context is initialized by instantiating a CoqFile for each library imported by the file and extracting the context of each library to the new ProofFile instance. Since creating a ProofFile requires instantiating multiple CoqFiles, it is more expensive to create a ProofFile than a CoqFile.

A ProofFile holds information about all proofs found in the file until the last executed step. Proofs with open goals are kept in the open_proofs attribute. Multiple open proofs occur in files with nested proofs, as we need to open inner proofs before closing outer proofs. Each proof is a ProofTerm, a Term with a context attribute, listing all terms found in the current context used to define the proof term. This contextual information is valuable for neural-based models and is not accessible from prior Coq-related tools.

A ProofTerm also contains an ordered list of ProofSteps, corresponding to all steps taken until the proof is closed (or until the last executed step if the proof is open). The ProofStep class enables a map** from each proof step to the intermediate proof goals it attempts to solve or simplify. A ProofStep is thus constructed by augmenting the respective Step with the intermediate proof goals before that step is taken. Similarly to ProofTerm, a ProofStep also has a context attribute listing the terms used in the step.

2.5. Proof Modification

CoqPyt allows steps to be added to or deleted from the file. Allowing such modifications on Coq files yields an interface well suited for proof development. Invalid changes, such as adding inexisting tactics, may lead to explicit errors or undefined behaviour, depending on the nature of the error. For erroneous modifications, the file will revert to its original state, ignoring the requested changes.

Through the add_step method, it is possible to define terms, introduce theorems and apply tactics to solve proof obligations. Conversely, delete_step enables the removal of steps, such as added tactics which do not alter the proof goals as intended. Both methods allow editing in arbitrary parts of the file: delete_step requires an index i to delete the i ${}^{\text{th}}$ step in the file; add_step requires the index of the step which precedes the new step. To add a new step, its textual representation must also be provided.

Modifications may also be batched as transactions. A sequence of CoqChanges can be provided to the change_steps method, which will perform all modifications. A CoqChange can be either a CoqAdd or a CoqDelete, which serve as data wrappers for the parameters of add_step and delete_step, respectively. During the transaction, intermediate invalid states are allowed as long as the final state is valid. For example, it is possible to add a Qed before adding the actual proof steps, which cannot be done through successive add_step calls. To directly modify a proof, ProofFile provides the methods append_step and pop_step for single step modifications, as well as a change_proof method for a transaction of ProofChanges.

2.6. Use Cases

The features that CoqPyt offers are particularly useful for proof search and learning-based tasks, such as neural theorem proving. To get a sense of CoqPyt’s interface for collecting data and conducting proof search, consider the file test.v in Listing 1, where we have a Coq file that defines a property about reversing a list. Listing 2 shows how we can leverage CoqPyt to collect data from the file shown in Listing 1. We can see which terms are available to the file via the pf.context attribute (line 3). For example, all the terms from the imported package List.v are available to the proofs in the file. Likewise, we show how to retrieve information from each proof step in the file via the ProofStep attributes (line 6).

⬇

1Require Import List.

2Lemma rev_append: forall {a} (l1 l2: list a),

3 rev (l1 ++ l2) = rev l2 ++ rev l1.

4Proof.

5intros a l1 l2. induction l1; intros.

6 - simpl. rewrite app_nil_r. reflexivity.

7 - simpl. rewrite IHl1.

8Admitted.

Listing 1: Example Coq file test.v with a property of rev.

⬇

1with ProofFile("test.v") as pf:

2 pf.exec(len(pf.steps))

3 print(pf.context)

4 for proof in pf.proofs:

5 for step in proof.steps:

6 print(step.text, step.ast, step.context, step.goals)

Listing 2: Proof data available with CoqPyt.

Develo** neural-based theorem provers requires the ability to attempt possibly erroneous proofs during proof search. For example, suppose we wanted to complete the proof rev_append from Listing 1. Listing 3 shows how we can use the change_proof method to conduct proof attempts. We first obtain the ProofTerm object of the proof rev_append (line 5). We then delete the ‘‘Admitted.’’ step from the proof of rev_append (line 7) and add the steps associated with our new proof attempt (line 9). When we attempt to add the erroneous steps from the proof attempt incorrect (line 11), an error is returned and test.v will remain in its original state. In turn, when the valid steps from the proof attempt correct are added (line 11), the changes are applied to test.v.

⬇

1incorrect = [" reflexivity.", "\nQed."]

2correct = [" rewrite app_assoc."] + incorrect

3with ProofFile("test.v") as pf:

4 pf.exec(len(pf.steps))

5 unproven = pf.unproven_proofs[0]

6 for attempt in [incorrect, correct]:

7 changes = [ProofPop()] # Admitted

8 for s in attempt:

9 changes.append(ProofAppend(s))

10 try:

11 pf.change_proof(unproven, changes)

12 print("Proof succeeded!")

13 break

14 except InvalidChangeException:

15 print("Proof attempt not valid.")

Listing 3: Proof attempts with CoqPyt.

Table 1. Evaluation of CoqPyt’s performance on CompCert. PCC stands for Pearson correlation coefficient.

	Count		Execution time (s)
Metric/Feature					Add step		Delete step		Change steps (Add)		Change steps (Delete)
Metric/Feature	Proofs	Steps	Load file	Execute file	Beginning	On pointer	Beginning	On pointer	Beginning	On pointer	Beginning	On pointer
$\mu(\sigma^{2})$	32.3 (48.1)	609.6 (828.1)	6.6 (7.3)	31.2 (63.8)	1.1 (1.8)	0.9 (1.5)	1.0 (1.7)	0.9 (1.6)	1.5 (2.7)	1.0 (1.7)	1.1 (1.8)	0.8 (1.5)
$\tilde{x}$	12.0	258.0	3.9	9.6	0.4	0.3	0.3	0.3	0.5	0.3	0.4	0.3
PCC (steps)	-	1.00	0.59	0.72	0.90	0.92	0.94	0.92	0.87	0.92	0.90	0.91

3. Related Work

Most neural theorem provers for Coq use either Coq Serapy (sanchez2020generating; thakur2023languageagent), PyCoq (tactician), or a custom Python class (yang2019learning; First20oopsla; Sanchez-Stern22passport; First22icse) to create training examples and perform proof search. The neural theorem prover GamePad (Huang2019Gamepad) instead modifies Coq itself to record intermediate proof states. The CoqGym benchmark dataset (yang2019learning), collected using a Python class, is the state-of-the-art benchmark for Coq. However, it was collected in 2019 and has only been updated to support versions 8.10 and 8.12 of Coq, while the newest version is 8.19. CoqPyt will render neural theorem provers useful for new Coq developments since it can be used for mining software repositories for newer Coq training data and for evaluation.

LeanDojo (yang2023leandojo) and PISA (jiang2021lisa) are learning environments for Lean and Isabelle, allowing for data extraction and proof search interaction. The neural theorem provers that use these environments are LLM-based and use retrieval augmentation (Jiang2022Thor; mikuła2023magnushammer; yang2023leandojo). CoqPyt would enable retrieval augmented LLM approaches in Coq.

Pretrained LLMs have been shown to have quantitative reasoning capabilities (palm; touvron2023llama; openai2023gpt4), especially when they undergo continued training on math data (Lewkowycz2022Minerva; azerbayev2023llemma). To synthesize proofs, they can be fine-tuned on proof data (first2023baldur; yang2023leandojo; jiang2021lisa), few-shot prompted (azerbayev2023llemma; jiang2023draft; zhang2023getting), or even zero-shot prompted (thakur2023languageagent; yang2023leandojo). However, the question of test leakage arises in any evaluation that uses pretrained LLMs. CoqPyt will allow for data collection after the pretraining cutoff date.

During proof development, proof engineers are constantly performing proof repair (ringer2020replica). The first work in automating this task uses symbolic tools for automated proof repair in Coq (ringer2021proof), and has since been applied to other proof systems (masci2022proof). Baldur is the first work to use neural methods (in particular, LLMs) to repair proofs, but as part of its proof synthesis approach (first2023baldur). With the creation of a large-scale dataset of proof repair instances (reichel2023proof), the creation of LLM methods to automate real-world proof repairs is forthcoming.

4. Evaluation

Table 2. Feature comparison between CoqPyt and similar tools (✓full support,

\sim

partial support, ✗ no support).

Feature	CoqPyt	Coq LSP	Coq Serapy
Get proof state	✓	✓	✓
Check file validity	✓	✓	✓
Execute/Modify steps	✓/✓	$\sim$ /✗	✓/ $\sim$
Extract step context	✓	✗	✗
Track modules/terms/proofs	✓/✓/✓	✗/✓/✗	✓/✗/ $\sim$

Feature Evaluation

Table 2 compares the features of CoqPyt to similar tools. When compared to Coq LSP, CoqPyt supports a more extensive range of features by using the data supplied by Coq LSP and constructing additional components of Coq program logic upon it. For instance, Coq LSP provides partial support for operating with steps by enabling retrieval of the proof state at any position of the file. However, it lacks implementation for navigating backward or forward from a specific step, which is provided by CoqPyt. Compared to Coq Serapy, CoqPyt offers complete support for an additional four features: (1) modifying steps – while Coq Serapy supports the addition or deletion of steps at the current point of execution, it does not offer the capability to add or delete steps at arbitrary positions within the file; (2) extract step context – Coq Serapy does not collect the terms used in each step; (3) track terms – Coq Serapy does not retain a record of the terms defined up to the current execution point; (4) track proofs – even though Coq Serapy supports commands such as navigating to the next proof, it does not maintain a record of all the proofs already defined in a file.

Performance Evaluation

To evaluate CoqPyt’s performance, we used the project CompCert (leroy2016compcert) since it is a popular project with over 1,700 stars on GitHub and 259 Coq files with considerable complexity. We ran the experiments inside a Docker container on an Intel Xeon Silver 4210R CPU @ 2.40GHz. Table 1 summarizes our results. CoqPyt was able to run on 193 files, 58 files contained Coq-related errors, 6 files generated out-of-memory errors, and 2 files crashed for unknown reasons. The column Count describes metrics for the number of steps and proofs of CompCert. We ran all the features in each Coq file. For features that modify steps, we performed the modification in the beginning of the file and at the current point of execution, after executing the whole file. We performed each case 5 times for each feature. This approach is driven by the understanding that the number of required operations for these features increases as the distance to the point of execution rises. To evaluate change_steps, we performed a single addition or deletion to make it simpler to compare to the other features. The results for file execution are dependent on a cache that saves our loading of Coq libraries. As more libraries were cached, the time to execute the files decreased. By calculating the Pearson correlation coefficient, we conclude that the execution time for all features is positively correlated to the number of steps in a file. A replication package is available at: https://zenodo.org/records/10292580.

5. Conclusion

We developed CoqPyt, a Python tool for interacting with Coq. It offers valuable features previously unavailable in similar Coq-related tools, namely per-step context extraction and arbitrary file and proof modification. CoqPyt does not yet explore the full potential of context extraction, as it does not consider a small subset of Coq’s rich environment (e.g., module types and section-local constructs) nor does it fetch context recursively, i.e., for each term in a step context we ignore the context of the term itself. Nonetheless, we have shown that CoqPyt provides rich features and reasonable execution times, enabling an expressive interaction style with Coq. These contributions are expected to ease the automation of machine-checked proof synthesis and repair in the emerging era of LLMs.

Acknowledgements

This work was partially funded by NSF, National Science Foundation grants CCF-1955457 and CCF-2220892, and by FCT, Fundação para a Ciência e a Tecnologia, under grant BD/04736/2023 and project UIDB/50021/2020 (DOI:10.54499/UIDB/50021/2020).

\printbibliography