-
You Can REST Now: Automated Specification Inference and Black-Box Testing of RESTful APIs with Large Language Models
Authors:
Alix Decrop,
Gilles Perrouin,
Mike Papadakis,
Xavier Devroey,
Pierre-Yves Schobbens
Abstract:
RESTful APIs are popular web services, requiring documentation to ease their comprehension, reusability and testing practices. The OpenAPI Specification (OAS) is a widely adopted and machine-readable format used to document such APIs. However, manually documenting RESTful APIs is a time-consuming and error-prone task, resulting in unavailable, incomplete, or imprecise documentation. As RESTful API…
▽ More
RESTful APIs are popular web services, requiring documentation to ease their comprehension, reusability and testing practices. The OpenAPI Specification (OAS) is a widely adopted and machine-readable format used to document such APIs. However, manually documenting RESTful APIs is a time-consuming and error-prone task, resulting in unavailable, incomplete, or imprecise documentation. As RESTful API testing tools require an OpenAPI specification as input, insufficient or informal documentation hampers testing quality.
Recently, Large Language Models (LLMs) have demonstrated exceptional abilities to automate tasks based on their colossal training data. Accordingly, such capabilities could be utilized to assist the documentation and testing process of RESTful APIs.
In this paper, we present RESTSpecIT, the first automated RESTful API specification inference and black-box testing approach leveraging LLMs. The approach requires minimal user input compared to state-of-the-art RESTful API inference and testing tools; Given an API name and an LLM key, HTTP requests are generated and mutated with data returned by the LLM. By sending the requests to the API endpoint, HTTP responses can be analyzed for inference and testing purposes. RESTSpecIT utilizes an in-context prompt masking strategy, requiring no model fine-tuning. Our evaluation demonstrates that RESTSpecIT is capable of: (1) inferring specifications with 85.05% of GET routes and 81.05% of query parameters found on average, (2) discovering undocumented and valid routes and parameters, and (3) uncovering server errors in RESTful APIs. Inferred specifications can also be used as testing tool inputs.
△ Less
Submitted 7 February, 2024;
originally announced February 2024.
-
Ethical Adversaries: Towards Mitigating Unfairness with Adversarial Machine Learning
Authors:
Pieter Delobelle,
Paul Temple,
Gilles Perrouin,
Benoît Frénay,
Patrick Heymans,
Bettina Berendt
Abstract:
Machine learning is being integrated into a growing number of critical systems with far-reaching impacts on society. Unexpected behaviour and unfair decision processes are coming under increasing scrutiny due to this widespread use and its theoretical considerations. Individuals, as well as organisations, notice, test, and criticize unfair results to hold model designers and deployers accountable.…
▽ More
Machine learning is being integrated into a growing number of critical systems with far-reaching impacts on society. Unexpected behaviour and unfair decision processes are coming under increasing scrutiny due to this widespread use and its theoretical considerations. Individuals, as well as organisations, notice, test, and criticize unfair results to hold model designers and deployers accountable. We offer a framework that assists these groups in mitigating unfair representations stemming from the training datasets. Our framework relies on two inter-operating adversaries to improve fairness. First, a model is trained with the goal of preventing the guessing of protected attributes' values while limiting utility losses. This first step optimizes the model's parameters for fairness. Second, the framework leverages evasion attacks from adversarial machine learning to generate new examples that will be misclassified. These new examples are then used to retrain and improve the model in the first step. These two steps are iteratively applied until a significant improvement in fairness is obtained. We evaluated our framework on well-studied datasets in the fairness literature -- including COMPAS -- where it can surpass other approaches concerning demographic parity, equality of opportunity and also the model's utility. We also illustrate our findings on the subtle difficulties when mitigating unfairness and highlight how our framework can assist model designers.
△ Less
Submitted 1 September, 2020; v1 submitted 14 May, 2020;
originally announced May 2020.
-
An SMT-Based Concolic Testing Tool for Logic Programs
Authors:
Sophie Fortz,
Fred Mesnard,
Etienne Payet,
Gilles Perrouin,
Wim Vanhoof,
German Vidal
Abstract:
Concolic testing mixes symbolic and concrete execution to generate test cases covering paths effectively. Its benefits have been demonstrated for more than 15 years to test imperative programs. Other programming paradigms, like logic programming, have received less attention. In this paper, we present a concolic-based test generation method for logic programs. Our approach exploits SMT-solving for…
▽ More
Concolic testing mixes symbolic and concrete execution to generate test cases covering paths effectively. Its benefits have been demonstrated for more than 15 years to test imperative programs. Other programming paradigms, like logic programming, have received less attention. In this paper, we present a concolic-based test generation method for logic programs. Our approach exploits SMT-solving for constraint resolution. We then describe the implementation of a concolic testing tool for Prolog and validate it on some selected benchmarks.
△ Less
Submitted 17 February, 2020;
originally announced February 2020.
-
Search-based Crash Reproduction using Behavioral Model Seeding
Authors:
Pouria Derakhshanfar,
Xavier Devroey,
Gilles Perrouin,
Andy Zaidman,
Arie van Deursen
Abstract:
Search-based crash reproduction approaches assist developers during debugging by generating a test case which reproduces a crash given its stack trace. One of the fundamental steps of this approach is creating objects needed to trigger the crash. One way to overcome this limitation is seeding: using information about the application during the search process. With seeding, the existing usages of c…
▽ More
Search-based crash reproduction approaches assist developers during debugging by generating a test case which reproduces a crash given its stack trace. One of the fundamental steps of this approach is creating objects needed to trigger the crash. One way to overcome this limitation is seeding: using information about the application during the search process. With seeding, the existing usages of classes can be used in the search process to produce realistic sequences of method calls which create the required objects. In this study, we introduce behavioral model seeding: a new seeding method which learns class usages from both the system under test and existing test cases. Learned usages are then synthesized in a behavioral model (state machine). Then, this model serves to guide the evolutionary process. To assess behavioral model-seeding, we evaluate it against test-seeding (the state-of-the-art technique for seeding realistic objects) and no-seeding (without seeding any class usage). For this evaluation, we use a benchmark of 124 hard-to-reproduce crashes stemming from six open-source projects. Our results indicate that behavioral model-seeding outperforms both test seeding and no-seeding by a minimum of 6% without any notable negative impact on efficiency.
△ Less
Submitted 10 December, 2019;
originally announced December 2019.
-
Towards Quality Assurance of Software Product Lines with Adversarial Configurations
Authors:
Paul Temple,
Mathieu Acher,
Gilles Perrouin,
Battista Biggio,
Jean-marc Jezequel,
Fabio Roli
Abstract:
Software product line (SPL) engineers put a lot of effort to ensure that, through the setting of a large number of possible configuration options, products are acceptable and well-tailored to customers' needs. Unfortunately, options and their mutual interactions create a huge configuration space which is intractable to exhaustively explore. Instead of testing all products, machine learning techniq…
▽ More
Software product line (SPL) engineers put a lot of effort to ensure that, through the setting of a large number of possible configuration options, products are acceptable and well-tailored to customers' needs. Unfortunately, options and their mutual interactions create a huge configuration space which is intractable to exhaustively explore. Instead of testing all products, machine learning techniques are increasingly employed to approximate the set of acceptable products out of a small training sample of configurations. Machine learning (ML) techniques can refine a software product line through learned constraints and a priori prevent non-acceptable products to be derived. In this paper, we use adversarial ML techniques to generate adversarial configurations fooling ML classifiers and pinpoint incorrect classifications of products (videos) derived from an industrial video generator. Our attacks yield (up to) a 100% misclassification rate and a drop in accuracy of 5%. We discuss the implications these results have on SPL quality assurance.
△ Less
Submitted 16 September, 2019;
originally announced September 2019.
-
Test them all, is it worth it? Assessing configuration sampling on the JHipster Web development stack
Authors:
Axel Halin,
Alexandre Nuttinck,
Mathieu Acher,
Xavier Devroey,
Gilles Perrouin,
Benoit Baudry
Abstract:
Many approaches for testing configurable software systems start from the same assumption: it is impossible to test all configurations. This motivated the definition of variability-aware abstractions and sampling techniques to cope with large configuration spaces. Yet, there is no theoretical barrier that prevents the exhaustive testing of all configurations by simply enumerating them, if the effor…
▽ More
Many approaches for testing configurable software systems start from the same assumption: it is impossible to test all configurations. This motivated the definition of variability-aware abstractions and sampling techniques to cope with large configuration spaces. Yet, there is no theoretical barrier that prevents the exhaustive testing of all configurations by simply enumerating them, if the effort required to do so remains acceptable. Not only this: we believe there is lots to be learned by systematically and exhaustively testing a configurable system. In this case study, we report on the first ever endeavour to test all possible configurations of an industry-strength, open source configurable software system, JHipster, a popular code generator for web applications. We built a testing scaffold for the 26,000+ configurations of JHipster using a cluster of 80 machines during 4 nights for a total of 4,376 hours (182 days) CPU time. We find that 35.70% configurations fail and we identify the feature interactions that cause the errors. We show that sampling strategies (like dissimilarity and 2-wise): (1) are more effective to find faults than the 12 default configurations used in the JHipster continuous integration; (2) can be too costly and exceed the available testing budget. We cross this quantitative analysis with the qualitative assessment of JHipster's lead developers.
△ Less
Submitted 11 June, 2018; v1 submitted 22 October, 2017;
originally announced October 2017.
-
State Machine Flattening: Map** Study and Assessment
Authors:
Xavier Devroey,
Gilles Perrouin,
Maxime Cordy,
Axel Legay,
Pierre-Yves Schobbens,
Patrick Heymans
Abstract:
State machine formalisms equipped with hierarchy and parallelism allow to compactly model complex system behaviours. Such models can then be transformed into executable code or inputs for model-based testing and verification techniques. Generated artifacts are mostly flat descriptions of system behaviour. \emph{Flattening} is thus an essential step of these transformations. To assess the importanc…
▽ More
State machine formalisms equipped with hierarchy and parallelism allow to compactly model complex system behaviours. Such models can then be transformed into executable code or inputs for model-based testing and verification techniques. Generated artifacts are mostly flat descriptions of system behaviour. \emph{Flattening} is thus an essential step of these transformations. To assess the importance of flattening, we have defined and applied a systematic map** process and 30 publications were finally selected. However, it appeared that flattening is rarely the sole focus of the publications and that care devoted to the description and validation of flattening techniques varies greatly. Preliminary assessment of associated tool support indicated limited tool availability and scalability on challenging models. We see this initial investigation as a first step towards generic flattening techniques and scalable tool support, cornerstones of reliable model-based behavioural development.
△ Less
Submitted 21 March, 2014;
originally announced March 2014.
-
Towards Statistical Prioritization for Software Product Lines Testing
Authors:
Xavier Devroey,
Maxime Cordy,
Gilles Perrouin,
Pierre-Yves Schobbens,
Axel Legay,
Patrick Heymans
Abstract:
Software Product Lines (SPL) are inherently difficult to test due to the combinatorial explosion of the number of products to consider. To reduce the number of products to test, sampling techniques such as combinatorial interaction testing have been proposed. They usually start from a feature model and apply a coverage criterion (e.g. pairwise feature interaction or dissimilarity) to generate trac…
▽ More
Software Product Lines (SPL) are inherently difficult to test due to the combinatorial explosion of the number of products to consider. To reduce the number of products to test, sampling techniques such as combinatorial interaction testing have been proposed. They usually start from a feature model and apply a coverage criterion (e.g. pairwise feature interaction or dissimilarity) to generate tractable, fault-finding, lists of configurations to be tested. Prioritization can also be used to sort/generate such lists, optimizing coverage criteria or weights assigned to features. However, current sampling/prioritization techniques barely take product behavior into account. We explore how ideas of statistical testing, based on a usage model (a Markov chain), can be used to extract configurations of interest according to the likelihood of their executions. These executions are gathered in featured transition systems, compact representation of SPL behavior. We discuss possible scenarios and give a prioritization procedure illustrated on an example.
△ Less
Submitted 23 January, 2014; v1 submitted 9 October, 2013;
originally announced October 2013.
-
Bypassing the Combinatorial Explosion: Using Similarity to Generate and Prioritize T-wise Test Suites for Large Software Product Lines
Authors:
Christopher Henard,
Mike Papadakis,
Gilles Perrouin,
Jacques Klein,
Patrick Heymans,
Yves Le Traon
Abstract:
Software Product Lines (SPLs) are families of products whose commonalities and variability can be captured by Feature Models (FMs). T-wise testing aims at finding errors triggered by all interactions amongst t features, thus reducing drastically the number of products to test. T-wise testing approaches for SPLs are limited to small values of t -- which miss faulty interactions -- or limited by the…
▽ More
Software Product Lines (SPLs) are families of products whose commonalities and variability can be captured by Feature Models (FMs). T-wise testing aims at finding errors triggered by all interactions amongst t features, thus reducing drastically the number of products to test. T-wise testing approaches for SPLs are limited to small values of t -- which miss faulty interactions -- or limited by the size of the FM. Furthermore, they neither prioritize the products to test nor provide means to finely control the generation process. This paper offers (a) a search-based approach capable of generating products for large SPLs, forming a scalable and flexible alternative to current techniques and (b) prioritization algorithms for any set of products. Experiments conducted on 124 FMs (including large FMs such as the Linux kernel) demonstrate the feasibility and the practicality of our approach.
△ Less
Submitted 23 November, 2012;
originally announced November 2012.