-
Towards debiasing code review support
Authors:
Tobias Jetzen,
Xavier Devroey,
Nicolas Matton,
Benoît Vanderose
Abstract:
Cognitive biases appear during code review. They significantly impact the creation of feedback and how it is interpreted by developers. These biases can lead to illogical reasoning and decision-making, violating one of the main hypotheses supporting code review: developers' accurate and objective code evaluation. This paper explores harmful cases caused by cognitive biases during code review and p…
▽ More
Cognitive biases appear during code review. They significantly impact the creation of feedback and how it is interpreted by developers. These biases can lead to illogical reasoning and decision-making, violating one of the main hypotheses supporting code review: developers' accurate and objective code evaluation. This paper explores harmful cases caused by cognitive biases during code review and potential solutions to avoid such cases or mitigate their effects. In particular, we design several prototypes covering confirmation bias and decision fatigue. We rely on a developer-centered design approach by conducting usability tests and validating the prototype with a user experience questionnaire (UEQ) and participants' feedback. We show that some techniques could be implemented in existing code review tools as they are well accepted by reviewers and help prevent behavior detrimental to code review. This work provides a solid first approach to treating cognitive bias in code review.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
You Can REST Now: Automated Specification Inference and Black-Box Testing of RESTful APIs with Large Language Models
Authors:
Alix Decrop,
Gilles Perrouin,
Mike Papadakis,
Xavier Devroey,
Pierre-Yves Schobbens
Abstract:
RESTful APIs are popular web services, requiring documentation to ease their comprehension, reusability and testing practices. The OpenAPI Specification (OAS) is a widely adopted and machine-readable format used to document such APIs. However, manually documenting RESTful APIs is a time-consuming and error-prone task, resulting in unavailable, incomplete, or imprecise documentation. As RESTful API…
▽ More
RESTful APIs are popular web services, requiring documentation to ease their comprehension, reusability and testing practices. The OpenAPI Specification (OAS) is a widely adopted and machine-readable format used to document such APIs. However, manually documenting RESTful APIs is a time-consuming and error-prone task, resulting in unavailable, incomplete, or imprecise documentation. As RESTful API testing tools require an OpenAPI specification as input, insufficient or informal documentation hampers testing quality.
Recently, Large Language Models (LLMs) have demonstrated exceptional abilities to automate tasks based on their colossal training data. Accordingly, such capabilities could be utilized to assist the documentation and testing process of RESTful APIs.
In this paper, we present RESTSpecIT, the first automated RESTful API specification inference and black-box testing approach leveraging LLMs. The approach requires minimal user input compared to state-of-the-art RESTful API inference and testing tools; Given an API name and an LLM key, HTTP requests are generated and mutated with data returned by the LLM. By sending the requests to the API endpoint, HTTP responses can be analyzed for inference and testing purposes. RESTSpecIT utilizes an in-context prompt masking strategy, requiring no model fine-tuning. Our evaluation demonstrates that RESTSpecIT is capable of: (1) inferring specifications with 85.05% of GET routes and 81.05% of query parameters found on average, (2) discovering undocumented and valid routes and parameters, and (3) uncovering server errors in RESTful APIs. Inferred specifications can also be used as testing tool inputs.
△ Less
Submitted 7 February, 2024;
originally announced February 2024.
-
Basic Block Coverage for Search-based Unit Testing and Crash Reproduction
Authors:
Pouria Derakhshanfar,
Xavier Devroey,
Andy Zaidman
Abstract:
Search-based techniques have been widely used for white-box test generation. Many of these approaches rely on the approach level and branch distance heuristics to guide the search process and generate test cases with high line and branch coverage. Despite the positive results achieved by these two heuristics, they only use the information related to the coverage of explicit branches (e.g., indicat…
▽ More
Search-based techniques have been widely used for white-box test generation. Many of these approaches rely on the approach level and branch distance heuristics to guide the search process and generate test cases with high line and branch coverage. Despite the positive results achieved by these two heuristics, they only use the information related to the coverage of explicit branches (e.g., indicated by conditional and loop statements), but ignore potential implicit branchings within basic blocks of code. If such implicit branching happens at runtime (e.g., if an exception is thrown in a branchless-method), the existing fitness functions cannot guide the search process. To address this issue, we introduce a new secondary objective, called Basic Block Coverage (BBC), which takes into account the coverage level of relevant basic blocks in the control flow graph. We evaluated the impact of BBC on search-based unit test generation (using the DynaMOSA algorithm) and search-based crash reproduction (using the STDistance and WeightedSum fitness functions). Our results show that for unit test generation, BBC improves the branch coverage of the generated tests. Although small (around 1.5%), this improvement in the branch coverage is systematic and leads to an increase of the output domain coverage and implicit runtime exception coverage, and of the diversity of runtime states. In terms of crash reproduction, in the combination of STDistance and WeightedSum, BBC helps in reproducing 3 new crashes for each fitness function. BBC significantly decreases the time required to reproduce 43.5% and 45.1% of the crashes using STDistance and WeightedSum, respectively. For these crashes, BBC reduces the consumed time by 71.7% (for STDistance) and 68.7% (for WeightedSum) on average.
△ Less
Submitted 4 March, 2022;
originally announced March 2022.
-
JUGE: An Infrastructure for Benchmarking Java Unit Test Generators
Authors:
Xavier Devroey,
Alessio Gambi,
Juan Pablo Galeotti,
René Just,
Fitsum Kifetew,
Annibale Panichella,
Sebastiano Panichella
Abstract:
Researchers and practitioners have designed and implemented various automated test case generators to support effective software testing. Such generators exist for various languages (e.g., Java, C#, or Python) and for various platforms (e.g., desktop, web, or mobile applications). Such generators exhibit varying effectiveness and efficiency, depending on the testing goals they aim to satisfy (e.g.…
▽ More
Researchers and practitioners have designed and implemented various automated test case generators to support effective software testing. Such generators exist for various languages (e.g., Java, C#, or Python) and for various platforms (e.g., desktop, web, or mobile applications). Such generators exhibit varying effectiveness and efficiency, depending on the testing goals they aim to satisfy (e.g., unit-testing of libraries vs. system-testing of entire applications) and the underlying techniques they implement. In this context, practitioners need to be able to compare different generators to identify the most suited one for their requirements, while researchers seek to identify future research directions. This can be achieved through the systematic execution of large-scale evaluations of different generators. However, the execution of such empirical evaluations is not trivial and requires a substantial effort to collect benchmarks, setup the evaluation infrastructure, and collect and analyse the results. In this paper, we present our JUnit Generation benchmarking infrastructure (JUGE) supporting generators (e.g., search-based, random-based, symbolic execution, etc.) seeking to automate the production of unit tests for various purposes (e.g., validation, regression testing, fault localization, etc.). The primary goal is to reduce the overall effort, ease the comparison of several generators, and enhance the knowledge transfer between academia and industry by standardizing the evaluation and comparison process. Since 2013, eight editions of a unit testing tool competition, co-located with the Search-Based Software Testing Workshop, have taken place and used and updated JUGE. As a result, an increasing amount of tools (over ten) from both academia and industry have been evaluated on JUGE, matured over the years, and allowed the identification of future research directions.
△ Less
Submitted 28 October, 2022; v1 submitted 14 June, 2021;
originally announced June 2021.
-
Pandemic Programming: How COVID-19 affects software developers and how their organizations can help
Authors:
Paul Ralph,
Sebastian Baltes,
Gianisa Adisaputri,
Richard Torkar,
Vladimir Kovalenko,
Marcos Kalinowski,
Nicole Novielli,
Shin Yoo,
Xavier Devroey,
Xin Tan,
Minghui Zhou,
Burak Turhan,
Rashina Hoda,
Hideaki Hata,
Gregorio Robles,
Amin Milani Fard,
Rana Alkadhi
Abstract:
Context. As a novel coronavirus swept the world in early 2020, thousands of software developers began working from home. Many did so on short notice, under difficult and stressful conditions. Objective. This study investigates the effects of the pandemic on developers' wellbeing and productivity. Method. A questionnaire survey was created mainly from existing, validated scales and translated into…
▽ More
Context. As a novel coronavirus swept the world in early 2020, thousands of software developers began working from home. Many did so on short notice, under difficult and stressful conditions. Objective. This study investigates the effects of the pandemic on developers' wellbeing and productivity. Method. A questionnaire survey was created mainly from existing, validated scales and translated into 12 languages. The data was analyzed using non-parametric inferential statistics and structural equation modeling. Results. The questionnaire received 2225 usable responses from 53 countries. Factor analysis supported the validity of the scales and the structural model achieved a good fit (CFI = 0.961, RMSEA = 0.051, SRMR = 0.067). Confirmatory results include: (1) the pandemic has had a negative effect on developers' wellbeing and productivity; (2) productivity and wellbeing are closely related; (3) disaster preparedness, fear related to the pandemic and home office ergonomics all affect wellbeing or productivity. Exploratory analysis suggests that: (1) women, parents and people with disabilities may be disproportionately affected; (2) different people need different kinds of support. Conclusions. To improve employee productivity, software companies should focus on maximizing employee wellbeing and improving the ergonomics of employees' home offices. Women, parents and disabled persons may require extra support.
△ Less
Submitted 20 July, 2020; v1 submitted 3 May, 2020;
originally announced May 2020.
-
Generating Class-Level Integration Tests Using Call Site Information
Authors:
Pouria Derakhshanfar,
Xavier Devroey,
Annibale Panichella,
Andy Zaidman,
Arie van Deursen
Abstract:
Search-based approaches have been used in the literature to automate the process of creating unit test cases. However, related work has shown that generated unit-tests with high code coverage could be ineffective, i.e., they may not detect all faults or kill all injected mutants. In this paper, we propose CLING, an integration-level test case generation approach that exploits how a pair of classes…
▽ More
Search-based approaches have been used in the literature to automate the process of creating unit test cases. However, related work has shown that generated unit-tests with high code coverage could be ineffective, i.e., they may not detect all faults or kill all injected mutants. In this paper, we propose CLING, an integration-level test case generation approach that exploits how a pair of classes, the caller and the callee, interact with each other through method calls. In particular, CLING generates integration-level test cases that maximize the Coupled Branches Criterion (CBC). Coupled branches are pairs of branches containing a branch of the caller and a branch of the callee such that an integration test that exercises the former also exercises the latter. CBC is a novel integration-level coverage criterion, measuring the degree to which a test suite exercises the interactions between a caller and its callee classes. We implemented CLING and evaluated the approach on 140 pairs of classes from five different open-source Java projects. Our results show that (1) CLING generates test suites with high CBC coverage, thanks to the definition of the test suite generation as a many-objectives problem where each couple of branches is an independent objective; (2) such generated suites trigger different class interactions and can kill on average 7.7% (with a maximum of 50%) of mutants that are not detected by tests generated at the unit level; (3) CLING can detect integration faults coming from wrong assumptions about the usage of the callee class (32 for our subject systems) that remain undetected when using automatically generated unit-level test suites.
△ Less
Submitted 13 September, 2022; v1 submitted 13 January, 2020;
originally announced January 2020.
-
Search-based Crash Reproduction using Behavioral Model Seeding
Authors:
Pouria Derakhshanfar,
Xavier Devroey,
Gilles Perrouin,
Andy Zaidman,
Arie van Deursen
Abstract:
Search-based crash reproduction approaches assist developers during debugging by generating a test case which reproduces a crash given its stack trace. One of the fundamental steps of this approach is creating objects needed to trigger the crash. One way to overcome this limitation is seeding: using information about the application during the search process. With seeding, the existing usages of c…
▽ More
Search-based crash reproduction approaches assist developers during debugging by generating a test case which reproduces a crash given its stack trace. One of the fundamental steps of this approach is creating objects needed to trigger the crash. One way to overcome this limitation is seeding: using information about the application during the search process. With seeding, the existing usages of classes can be used in the search process to produce realistic sequences of method calls which create the required objects. In this study, we introduce behavioral model seeding: a new seeding method which learns class usages from both the system under test and existing test cases. Learned usages are then synthesized in a behavioral model (state machine). Then, this model serves to guide the evolutionary process. To assess behavioral model-seeding, we evaluate it against test-seeding (the state-of-the-art technique for seeding realistic objects) and no-seeding (without seeding any class usage). For this evaluation, we use a benchmark of 124 hard-to-reproduce crashes stemming from six open-source projects. Our results indicate that behavioral model-seeding outperforms both test seeding and no-seeding by a minimum of 6% without any notable negative impact on efficiency.
△ Less
Submitted 10 December, 2019;
originally announced December 2019.
-
Test them all, is it worth it? Assessing configuration sampling on the JHipster Web development stack
Authors:
Axel Halin,
Alexandre Nuttinck,
Mathieu Acher,
Xavier Devroey,
Gilles Perrouin,
Benoit Baudry
Abstract:
Many approaches for testing configurable software systems start from the same assumption: it is impossible to test all configurations. This motivated the definition of variability-aware abstractions and sampling techniques to cope with large configuration spaces. Yet, there is no theoretical barrier that prevents the exhaustive testing of all configurations by simply enumerating them, if the effor…
▽ More
Many approaches for testing configurable software systems start from the same assumption: it is impossible to test all configurations. This motivated the definition of variability-aware abstractions and sampling techniques to cope with large configuration spaces. Yet, there is no theoretical barrier that prevents the exhaustive testing of all configurations by simply enumerating them, if the effort required to do so remains acceptable. Not only this: we believe there is lots to be learned by systematically and exhaustively testing a configurable system. In this case study, we report on the first ever endeavour to test all possible configurations of an industry-strength, open source configurable software system, JHipster, a popular code generator for web applications. We built a testing scaffold for the 26,000+ configurations of JHipster using a cluster of 80 machines during 4 nights for a total of 4,376 hours (182 days) CPU time. We find that 35.70% configurations fail and we identify the feature interactions that cause the errors. We show that sampling strategies (like dissimilarity and 2-wise): (1) are more effective to find faults than the 12 default configurations used in the JHipster continuous integration; (2) can be too costly and exceed the available testing budget. We cross this quantitative analysis with the qualitative assessment of JHipster's lead developers.
△ Less
Submitted 11 June, 2018; v1 submitted 22 October, 2017;
originally announced October 2017.
-
State Machine Flattening: Map** Study and Assessment
Authors:
Xavier Devroey,
Gilles Perrouin,
Maxime Cordy,
Axel Legay,
Pierre-Yves Schobbens,
Patrick Heymans
Abstract:
State machine formalisms equipped with hierarchy and parallelism allow to compactly model complex system behaviours. Such models can then be transformed into executable code or inputs for model-based testing and verification techniques. Generated artifacts are mostly flat descriptions of system behaviour. \emph{Flattening} is thus an essential step of these transformations. To assess the importanc…
▽ More
State machine formalisms equipped with hierarchy and parallelism allow to compactly model complex system behaviours. Such models can then be transformed into executable code or inputs for model-based testing and verification techniques. Generated artifacts are mostly flat descriptions of system behaviour. \emph{Flattening} is thus an essential step of these transformations. To assess the importance of flattening, we have defined and applied a systematic map** process and 30 publications were finally selected. However, it appeared that flattening is rarely the sole focus of the publications and that care devoted to the description and validation of flattening techniques varies greatly. Preliminary assessment of associated tool support indicated limited tool availability and scalability on challenging models. We see this initial investigation as a first step towards generic flattening techniques and scalable tool support, cornerstones of reliable model-based behavioural development.
△ Less
Submitted 21 March, 2014;
originally announced March 2014.
-
Towards Statistical Prioritization for Software Product Lines Testing
Authors:
Xavier Devroey,
Maxime Cordy,
Gilles Perrouin,
Pierre-Yves Schobbens,
Axel Legay,
Patrick Heymans
Abstract:
Software Product Lines (SPL) are inherently difficult to test due to the combinatorial explosion of the number of products to consider. To reduce the number of products to test, sampling techniques such as combinatorial interaction testing have been proposed. They usually start from a feature model and apply a coverage criterion (e.g. pairwise feature interaction or dissimilarity) to generate trac…
▽ More
Software Product Lines (SPL) are inherently difficult to test due to the combinatorial explosion of the number of products to consider. To reduce the number of products to test, sampling techniques such as combinatorial interaction testing have been proposed. They usually start from a feature model and apply a coverage criterion (e.g. pairwise feature interaction or dissimilarity) to generate tractable, fault-finding, lists of configurations to be tested. Prioritization can also be used to sort/generate such lists, optimizing coverage criteria or weights assigned to features. However, current sampling/prioritization techniques barely take product behavior into account. We explore how ideas of statistical testing, based on a usage model (a Markov chain), can be used to extract configurations of interest according to the likelihood of their executions. These executions are gathered in featured transition systems, compact representation of SPL behavior. We discuss possible scenarios and give a prioritization procedure illustrated on an example.
△ Less
Submitted 23 January, 2014; v1 submitted 9 October, 2013;
originally announced October 2013.