-
Automated Unit Test Improvement using Large Language Models at Meta
Authors:
Nadia Alshahwan,
Jubin Chheda,
Anastasia Finegenova,
Beliz Gokkaya,
Mark Harman,
Inna Harper,
Alexandru Marginean,
Shubho Sengupta,
Eddy Wang
Abstract:
This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests. TestGen-LLM verifies that its generated test classes successfully clear a set of filters that assure measurable improvement over the original test suite, thereby eliminating problems due to LLM hallucination. We describe the deployment of TestGen-LLM at Meta test-a-thons for the Instagram and Facebook platforms. In an evaluation on Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage. During Meta's Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which it was applied, with 73% of its recommendations being accepted for production deployment by Meta software engineers. We believe this is the first report on industrial scale deployment of LLM-generated code backed by such assurances of code improvement.
Submitted 14 February, 2024;
originally announced February 2024.
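The abstract above reports three successive checks on each generated test class: it builds, it passes reliably, and it increases coverage. A minimal sketch of such a filter chain follows. The `CandidateTest` record, its fields, and the function names are hypothetical illustrations, not TestGen-LLM's actual interfaces, and the abstract does not say how each measurement is made.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CandidateTest:
    """Hypothetical record of what was measured for one generated test class."""
    name: str
    builds: bool                # did the test class compile?
    run_outcomes: List[bool]    # pass/fail result of each repeated execution
    coverage_gain: float        # coverage added over the existing suite


def passes_reliably(candidate: CandidateTest) -> bool:
    # Assumption: "reliable" means executed at least once and never failed.
    return bool(candidate.run_outcomes) and all(candidate.run_outcomes)


# Filters are applied in order; the first failure discards the candidate.
FILTERS: List[Tuple[str, Callable[[CandidateTest], bool]]] = [
    ("builds", lambda c: c.builds),
    ("passes reliably", passes_reliably),
    ("increases coverage", lambda c: c.coverage_gain > 0.0),
]


def first_failed_filter(candidate: CandidateTest) -> Optional[str]:
    """Return the label of the first filter the candidate fails, or None."""
    for label, check in FILTERS:
        if not check(candidate):
            return label
    return None


def surviving(candidates: List[CandidateTest]) -> List[CandidateTest]:
    """Keep only candidates that clear every filter."""
    return [c for c in candidates if first_failed_filter(c) is None]
```

Ordering the filters from cheapest to most expensive means a candidate that does not build is never executed, and a flaky candidate is never measured for coverage.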
-
Observation-based unit test generation at Meta
Authors:
Nadia Alshahwan,
Mark Harman,
Alexandru Marginean,
Rotem Tal,
Eddy Wang
Abstract:
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution. We describe the development and deployment of TestGen at Meta. In particular, we focus on the scalability challenges overcome during development in order to deploy observation-based test carving at scale in industry. So far, TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults. Meta is currently in the process of more widespread deployment. Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests. Testing on 16 Kotlin Instagram app-launch-blocking tasks demonstrated that the TestGen tests would have trapped 13 of these before they became launch blocking.
Submitted 8 February, 2024;
originally announced February 2024.
-
Assured LLM-Based Software Engineering
Authors:
Nadia Alshahwan,
Mark Harman,
Inna Harper,
Alexandru Marginean,
Shubho Sengupta,
Eddy Wang
Abstract:
In this paper we address the following question: How can we use Large Language Models (LLMs) to improve code independently of a human, while ensuring that the improved code
- does not regress the properties of the original code?
- improves the original in a verifiable and measurable way?
To address this question, we advocate Assured LLM-Based Software Engineering, a generate-and-test approach inspired by Genetic Improvement. Assured LLMSE applies a series of semantic filters that discard code that fails to meet these twin guarantees. This overcomes the potential problem of LLMs' propensity to hallucinate. It allows us to generate code using LLMs, independently of any human. The human plays the role only of final code reviewer, as they would do with code generated by other human engineers.
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
Submitted 6 February, 2024;
originally announced February 2024.
-
Brave new world: Artificial Intelligence in teaching and learning
Authors:
Adrian Groza,
Anca Marginean
Abstract:
We exemplify how Large Language Models are used in both teaching and learning. We also discuss the AI incidents that have already occurred in the education domain, and we argue for the urgent need to introduce AI policies in universities and for the ongoing strategies to regulate AI. Regarding policy for AI, our view is that each institution should have a policy for AI in teaching and learning. This is important for at least two reasons: (i) to raise awareness of the numerous educational tools that can both positively and negatively affect education; (ii) to minimise the risk of AI incidents in education.
Submitted 27 September, 2023;
originally announced October 2023.
-
Exploring the MIT Mathematics and EECS Curriculum Using Large Language Models
Authors:
Sarah J. Zhang,
Samuel Florin,
Ariel N. Lee,
Eamon Niknafs,
Andrei Marginean,
Annie Wang,
Keith Tyser,
Zad Chin,
Yann Hicke,
Nikhil Singh,
Madeleine Udell,
Yoon Kim,
Tonio Buonassisi,
Armando Solar-Lezama,
Iddo Drori
Abstract:
We curate a comprehensive dataset of 4,550 questions and solutions from problem sets, midterm exams, and final exams across all MIT Mathematics and Electrical Engineering and Computer Science (EECS) courses required for obtaining a degree. We evaluate the ability of large language models to fulfill the graduation requirements for any MIT major in Mathematics and EECS. Our results demonstrate that GPT-3.5 successfully solves a third of the entire MIT curriculum, while GPT-4, with prompt engineering, achieves a perfect solve rate on a test set excluding questions based on images. We fine-tune an open-source large language model on this dataset. We employ GPT-4 to automatically grade model responses, providing a detailed performance breakdown by course, question, and answer type. By embedding questions in a low-dimensional space, we explore the relationships between questions, topics, and classes and discover which questions and classes are required for solving other questions and classes through few-shot learning. Our analysis offers valuable insights into course prerequisites and curriculum design, highlighting language models' potential for learning and improving Mathematics and EECS education.
Submitted 24 June, 2023; v1 submitted 15 June, 2023;
originally announced June 2023.
-
A Machine Learning Enhanced Approach for Automated Sunquake Detection in Acoustic Emission Maps
Authors:
Vanessa Mercea,
Alin Razvan Paraschiv,
Daniela Adriana Lacatus,
Anca Marginean,
Diana Besliu-Ionescu
Abstract:
Sunquakes are seismic emissions visible on the solar surface, associated with some solar flares. Although discovered in 1998, they have only recently become a more commonly detected phenomenon. Despite the availability of several manual detection guidelines, to our knowledge, the astrophysical data produced for sunquakes is new to the field of Machine Learning. Detecting sunquakes is a daunting task for human operators and this work aims to ease and, if possible, to improve their detection. Thus, we introduce a dataset constructed from acoustic egression-power maps of solar active regions obtained for Solar Cycles 23 and 24 using the holography method. We then present a pedagogical approach to the application of machine learning representation methods for sunquake detection using AutoEncoders, Contrastive Learning, Object Detection and recurrent techniques, which we enhance by introducing several custom domain-specific data augmentation transformations. We address the main challenges of the automated sunquake detection task, namely the very high noise patterns in and outside the active region shadow and the extreme class imbalance given by the limited number of frames that present sunquake signatures. With our trained models, we find temporal and spatial locations of peculiar acoustic emission and qualitatively associate them to eruptive and high energy emission. While noting that these models are still in a prototype stage and there is much room for improvement in metrics and bias levels, we hypothesize that their agreement on example use cases has the potential to enable detection of weak solar acoustic manifestations.
Submitted 13 December, 2022;
originally announced December 2022.
-
Predicting the Geoeffectiveness of CMEs Using Machine Learning
Authors:
Andreea-Clara Pricopi,
Alin Razvan Paraschiv,
Diana Besliu-Ionescu,
Anca-Nicoleta Marginean
Abstract:
Coronal mass ejections (CMEs) are the most geoeffective space weather phenomena. They are associated with large geomagnetic storms and have the potential to cause telecommunication disturbances, satellite network disruptions, and power grid damage and failures. Thus, considering these storms' potential effects on human activities, accurate forecasts of the geoeffectiveness of CMEs are paramount. This work focuses on experimenting with different machine learning methods trained on white-light coronagraph datasets of close-to-Sun CMEs, to estimate whether such a newly erupting ejection has the potential to induce geomagnetic activity. We developed binary classification models using logistic regression, K-Nearest Neighbors, Support Vector Machines, feed-forward artificial neural networks, as well as ensemble models. At this time, we limited our forecast to exclusively use solar onset parameters, to ensure extended warning times. We discuss the main challenges of this task, namely the extreme imbalance between the number of geoeffective and ineffective events in our dataset, along with their numerous similarities and the limited number of available variables. We show that even in such conditions, adequate hit rates can be achieved with these models.
Submitted 22 June, 2022;
originally announced June 2022.
-
Indexing Operators to Extend the Reach of Symbolic Execution
Authors:
Earl T. Barr,
David Clark,
Mark Harman,
Alexandru Marginean
Abstract:
Traditional program analysis analyses a program language, that is, all programs that can be written in the language. There is a difference, however, between all possible programs that can be written and the corpus of actual programs written in a language. We seek to exploit this difference: for a given program, we apply a bespoke program transformation Indexify to convert expressions that current SMT solvers do not, in general, handle, such as constraints on strings, into equisatisfiable expressions that they do handle. To this end, Indexify replaces operators in hard-to-handle expressions with homomorphic versions that behave the same on a finite subset of the domain of the original operator, and return bottom denoting unknown outside of that subset. By focusing on what literals and expressions are most useful for analysing a given program, Indexify constructs a small, finite theory that extends the power of a solver on the expressions a target program builds.
Indexify's bespoke nature necessarily means that its evaluation must be experimental, resting on a demonstration of its effectiveness in practice. We have developed Indexify, a tool that implements the Indexify transformation. We demonstrate its utility and effectiveness by applying it to two real-world benchmarks: string expressions in coreutils and floats in fdlibm53. Indexify reduces time-to-completion on coreutils from Klee's 49.5m on average to 6.0m. It increases branch coverage on coreutils from 30.10% for Klee and 14.79% for Zesti to 66.83%. When indexifying floats in fdlibm53, Indexify increases branch coverage from 34.45% to 71.56% over Klee. For a restricted class of inputs, Indexify permits the symbolic execution of program paths unreachable with previous techniques: it covers more than twice as many branches in coreutils as Klee.
Submitted 26 June, 2018;
originally announced June 2018.
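The abstract above describes operators that agree with the original on a finite subset of its domain and return bottom (unknown) elsewhere. The sketch below illustrates only that behaviour, using a lookup table in Python. It is not the Indexify transformation itself, which rewrites program expressions for an SMT solver; `restrict_to_index`, `BOTTOM`, and the sample string literals are assumptions made for illustration.

```python
from typing import Callable, Dict, Hashable, Iterable


class _Bottom:
    """Sentinel for 'unknown': the input lies outside the indexed subset."""

    def __repr__(self) -> str:
        return "BOTTOM"


BOTTOM = _Bottom()


def restrict_to_index(
    op: Callable[[Hashable], object],
    indexed_inputs: Iterable[Hashable],
) -> Callable[[Hashable], object]:
    """Build a version of `op` that is defined only on `indexed_inputs`.

    On the indexed inputs it returns exactly what `op` returns;
    on anything else it returns BOTTOM.
    """
    # Precompute the operator on the finite subset.
    table: Dict[Hashable, object] = {value: op(value) for value in indexed_inputs}

    def indexed_op(value: Hashable) -> object:
        return table.get(value, BOTTOM)

    return indexed_op


# Hypothetical example: string length restricted to three literals.
indexed_len = restrict_to_index(len, ["", "-n", "--help"])
```

Here `indexed_len("-n")` agrees with `len("-n")`, while `indexed_len("anything else")` yields `BOTTOM`. A dedicated sentinel is used instead of `None` so that an operator whose legitimate result is `None` stays distinguishable from "unknown".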