-
DafnyBench: A Benchmark for Formal Software Verification
Authors:
Chloe Loughridge,
Qinyi Sun,
Seth Ahrenbach,
Federico Cassano,
Chuyue Sun,
Ying Sheng,
Anish Mudide,
Md Rakib Hossain Misu,
Nada Amin,
Max Tegmark
Abstract:
We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved 68% succe…
▽ More
We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved 68% success rate, and we quantify how this rate improves when retrying with error message feedback and how it deteriorates with the amount of required code and hints. We hope that DafnyBench will enable rapid improvements from this baseline as LLMs and verification techniques grow in quality.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Opening the AI black box: program synthesis via mechanistic interpretability
Authors:
Eric J. Michaud,
Isaac Liao,
Vedang Lad,
Ziming Liu,
Anish Mudide,
Chloe Loughridge,
Zifan Carl Guo,
Tara Rezaei Kheirkhah,
Mateja Vukelić,
Max Tegmark
Abstract:
We present MIPS, a novel method for program synthesis based on automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code. We test MIPS on a benchmark of 62 algorithmic tasks that can be learned by an RNN and find it highly complementary to GPT-4: MIPS solves 32 of them, including 13 that are not solved by G…
▽ More
We present MIPS, a novel method for program synthesis based on automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code. We test MIPS on a benchmark of 62 algorithmic tasks that can be learned by an RNN and find it highly complementary to GPT-4: MIPS solves 32 of them, including 13 that are not solved by GPT-4 (which also solves 30). MIPS uses an integer autoencoder to convert the RNN into a finite state machine, then applies Boolean or integer symbolic regression to capture the learned algorithm. As opposed to large language models, this program synthesis technique makes no use of (and is therefore not limited by) human training data such as algorithms and code from GitHub. We discuss opportunities and challenges for scaling up this approach to make machine-learned models more interpretable and trustworthy.
△ Less
Submitted 7 February, 2024;
originally announced February 2024.
-
Confirming the Labels of Coins in One Weighing
Authors:
Isha Agarwal,
Paul Braverman,
Patrick Chen,
William Du,
Kaylee Ji,
Akhil Kammila,
Tanya Khovanova,
Shane Lee,
Alicia Li,
Anish Mudide,
Jeffrey Shi,
Maya Smith,
Isabel Tu
Abstract:
There are $n$ bags with coins that look the same. Each bag has an infinite number of coins and all coins in the same bag weigh the same amount. Coins in different bags weigh 1, 2, 3, and so on to $n$ grams exactly. There is a unique label from the set 1 through $n$ attached to each bag that is supposed to correspond to the weight of the coins in that bag. The task is to confirm all the labels by u…
▽ More
There are $n$ bags with coins that look the same. Each bag has an infinite number of coins and all coins in the same bag weigh the same amount. Coins in different bags weigh 1, 2, 3, and so on to $n$ grams exactly. There is a unique label from the set 1 through $n$ attached to each bag that is supposed to correspond to the weight of the coins in that bag. The task is to confirm all the labels by using a balance scale once.
We study weighings that we call downhill: they use the numbers of coins from the bags that are in a decreasing order. We show the importance of such weighings. We find the smallest possible total weight of coins in a downhill weighing that confirms the labels on the bags. We also find bounds on the smallest number of coins needed for such a weighing.
△ Less
Submitted 30 June, 2020;
originally announced June 2020.