Systematic Literature Review on Application of Learning-based Approaches in Continuous Integration

Citation: Arani, A. K., M., Le, T. H., M., Zahedi & Babar, M. A. (2023, May). Systematic Literature Review on Application of Learning-based Approaches in Continuous Integration. arXiv preprint (2024)

Ali Kazemi Arani
CREST-The Centre for Research
on Engineering Software Technologies
University of Adelaide
Adelaide, SA 5005, Australia
[email protected]
Triet Huynh Minh Le
CREST-The Centre for Research
on Engineering Software Technologies
University of Adelaide
Adelaide, SA 5005, Australia
[email protected]
Mansooreh Zahedi
School of Computing and Information Systems
University of Melbourne
Melbourne, VIC 3010, Australia
[email protected]
M. Ali Babar
CREST-The Centre for Research
on Engineering Software Technologies
University of Adelaide
Adelaide, SA 5005, Australia
[email protected]

Abstract

Context: Machine learning (ML) and deep learning (DL) analyze raw data to extract valuable insights in specific phases. The rise of continuous practices in software projects emphasizes automating Continuous Integration (CI) with these learning-based methods, while the growing adoption of such approaches underscores the need for systematizing knowledge. Objective: Our objective is to comprehensively review and analyze existing literature concerning learning-based methods within the CI domain. We endeavour to identify and analyse various techniques documented in the literature, emphasizing the fundamental attributes of training phases within learning-based solutions in the context of CI. Method: We conducted a Systematic Literature Review (SLR) involving 52 primary studies. Through statistical and thematic analyses, we explored the correlations between CI tasks and the training phases of learning-based methodologies across the selected studies, encompassing a spectrum from data engineering techniques to evaluation metrics. Results: This paper presents an analysis of the automation of CI tasks utilizing learning-based methods. We identify and analyze nine types of data sources, four steps in data preparation, four feature types, nine subsets of data features, five approaches for hyperparameter selection and tuning, and fifteen evaluation metrics. Furthermore, we discuss the latest techniques employed, existing gaps in CI task automation, and the characteristics of the utilized learning-based techniques. Conclusion: This study provides a comprehensive overview of learning-based methods in CI, offering valuable insights for researchers and practitioners developing CI task automation. It also highlights the need for further research to advance these methods in CI.

Keywords Continuous Integration $\cdot$ Machine Learning $\cdot$ Model Training $\cdot$ Automation $\cdot$ Systematic Literature Review

1 Introduction

Continuous Integration (CI), the software development practice of automatically integrating code changes through frequent automated builds, has gained popularity for enhancing software delivery speed and reliability through early issue detection [1]. CI enables rapid testing, building, and software preparation at any time [2]. This facilitates detecting and fixing bugs, resulting in faster delivery and improved software quality. It also reduces development costs and enhances customer satisfaction [1].

Frequent changes in CI environments produce large amounts of data. Extracting and analyzing these data can be difficult and demanding in terms of time and resources, particularly in large-scale projects [3]. CI is also expensive and contributes significantly to overall software development costs [4]. Notably, Google and Mozilla reported monthly CI process costs in the millions [2, 5]. The CI phase consumes more than half of the resources in software development [6], posing a barrier for smaller companies. Thus, enhancing CI pipeline performance and reducing associated costs are crucial for software development.

CI’s popularity, complex data, and costs drive the adoption of learning-based methods, including Machine Learning (ML) and Deep Learning (DL) techniques, for analysis that aids software engineers with valuable insights. These insights, drawn from development data, test logs, CI phase outcomes, and operational environment data, enhance the CI feedback loop [7].

Learning-based methods use mathematical models to acquire knowledge, make decisions, or improve performance based on data and experience [8]. These techniques can efficiently predict complex task outcomes. In Continuous Integration (CI) environments, learning-based methods can enhance task performance by predicting software defects without executing the current version. They provide benefits such as accurately predicting outcomes of unit tests [9] and regression tests [10]. They also aid software engineers in timely decision-making, such as task assignment to different profiles [11].

Given the broad scope of learning-based methods and their diverse applications in CI, understanding their impact on CI processes is crucial. This includes discerning the decision-making steps in developing these methods, termed Machine Learning for Continuous Integration (ML4CI). An exploration of this area offers cross-disciplinary insights, emphasizing the necessity of a systematic literature review in ML4CI.

This study endeavours to address the existing gap by conducting a comprehensive Systematic Literature Review (SLR) that concentrates on learning-based methodologies within CI development phases. Through the analysis of 52 carefully selected papers spanning from 2000 to August 2023, the objective is to elucidate the progress made in the automation of CI tasks through the application of learning-based techniques. The developmental steps involve data collection/preparation, feature engineering, training, tuning, and evaluation of these methods.

Comprehensively analyzing data from all CI tasks offers researchers and practitioners a synthesized overview of information sources and applied techniques. This equips them to leverage these resources in CI pipelines, potentially automating processes and guiding future endeavours.

Insights from this review benefit researchers and practitioners, offering a foundational platform to advance CI phases. By reusing or refining learning-based methods, more efficient practices can emerge. This research is significant for those employing existing approaches or developing new solutions for CI, considering identified gaps. Additionally, our discussion addresses opportunities to optimize current solutions or address neglected areas in real-world CI settings.

In summary, the main contributions of this paper are threefold:

• First, the paper provides a clear and comprehensive understanding of the six identified phases and ten tasks in CI that can be automated by using learning-based methods, along with their sequence and relationships in the CI pipeline.

• Second, the paper analyses the data and techniques used for data preparation, feature engineering, training, tuning and evaluation of the learning-based methods concerning the CI phases and tasks.

• Finally, the paper discusses the future directions in applying learning-based solutions to CI and the existing inconsistencies, which can serve as a guide for researchers and practitioners in the field.

In previous research, our focus was on using ML methods to extract automated CI tasks. We specifically looked at the training approaches used in state-of-the-art (SOTA) learning-based methods for each CI task [12]. This paper extends our previous work by updating the list of papers through searches in two additional indexing databases and reviewing earlier publications. Furthermore, we conduct an analysis specifically focusing on each training phase of ML methods. Additionally, we employ thematic analysis to systematically extract information from the published works and conduct deeper analysis. The objective is to offer a more comprehensive and detailed account, providing additional information and insights into learning-based methods within the context of CI.

The rest of this paper is organized as follows: Section 2 provides background and related works, followed by the research methodology in Section 3. Section 4 presents synthesized data based on the proposed research questions (RQs). Section 5 outlines the study’s limitations. Finally, Section 6 discusses the findings, and Section 7 concludes the study.

Henceforth, we will use the term ‘ML’ as an abbreviation for the combined term ‘Learning-based’.

2 Background and Related Works

In this section, we will introduce the essential concepts of CI and the applications of ML in software engineering, which are the primary focus of this SLR.

2.1 Overview of Continuous Integration

CI encompasses the processes of building, testing, validating software, and managing commit-related actions, including addressing reported bugs and messages from developers, before advancing to the deployment phase [13]. A development team integrates and merges their source code frequently, often multiple times a day, to build the software [1]. Automated testing and quick feedback are crucial to preventing issues from propagating to the delivery phase or affecting the development process of other team members [13]. The validation phase in CI provides feedback to developers on detected bugs or performance issues. Developers also monitor the deployed software and the related materials in operations and version control systems (VCS) to maintain its performance [14].

Figure 1: The four phases of the ML life cycle. Note: the required steps for training an ML model are distinguished by numbers.

2.2 Machine Learning in Software Engineering

ML methods enhance software product quality by automating tasks through pattern analysis and learning from historical data [15]. During training, models use a subset of input data to recognize patterns, making predictions or classifications for new data based on their training methodology, which includes supervised, unsupervised [16], semi-supervised [17], and Reinforcement Learning (RL) [18].

For instance, a supervised ML model predicts build outcomes with features like changed classes and committer experience [19]. Unsupervised learning clusters data based on natural language features to assign tasks to related profiles [11]. Section 4.5 further explores these learning types, highlighting their common development process.
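To make the supervised case concrete, the following is a minimal sketch of predicting a build outcome from two features of the kind mentioned above (number of changed classes and committer experience). The historical records, the feature scales, and the choice of a k-nearest-neighbour vote are illustrative assumptions, not the method of [19]; in practice the features would also need scaling.

```python
from collections import Counter
from math import dist

# Hypothetical labelled history:
# (changed_classes, committer_experience_years) -> build outcome.
HISTORY = [
    ((2, 8.0), "pass"),
    ((3, 6.5), "pass"),
    ((1, 4.0), "pass"),
    ((25, 0.5), "fail"),
    ((30, 1.0), "fail"),
    ((22, 2.0), "fail"),
]


def predict_build(features, history=HISTORY, k=3):
    """Label a new build by majority vote among its k nearest past builds."""
    nearest = sorted(history, key=lambda row: dist(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two hypothetical incoming commits.
small_change_by_veteran = predict_build((2, 7.0))
large_change_by_newcomer = predict_build((28, 0.8))
```

The supervision comes from the labelled outcomes in the history: the model never executes the new build, it only compares its features with those of past builds whose results are known.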

Previous studies have explored ML techniques in software engineering, with a notable focus on ML methods exclusively in Test Case Selection and Prioritization (TSP) [20]. This systematic review included 29 studies but did not examine data-related concepts. Their primary focus was on comparisons between studies and considerations related to reproducibility and repeatability in the TSP domain.

In a broader perspective, a systematic mapping conducted by Durelli et al. [21] explored the application of machine learning methods in software testing, emphasizing their scalability and effectiveness in addressing complex issues within software testing systems.

Unlike the mentioned studies concentrating on the testing phase, Shafiq et al. [8] conducted a comprehensive literature review covering various aspects of ML methods in software engineering, including software requirements, design, construction, quality, and maintenance. They highlighted demographic data and identified challenges, such as the uncertain nature of ML techniques, data availability, and the increasing complexity of software products.

Zhang and Tsai [22] compiled a comprehensive list of SE problems addressed with ML techniques, discussing their application and the pros and cons of using ML models in SE. Ali and Gravino [23] conducted a systematic review of 75 studies to assess commonly used ML models, datasets, and accuracy measurement methods in software development effort estimation (SDEE).

Our study exclusively explores the application of ML within the CI domain. Unlike other studies that address broader areas of software engineering, our research offers a detailed analysis of ML-based solutions developed to be employed in CI.

The literature proposes four primary steps in ML model development for CI [24]: Data Collection and Data Engineering (Step 1), Feature Engineering (Step 2), Training and Hyper-parameter Tuning (Step 3), and Model Evaluation (Step 4). The representation in Figure 1 corresponds to these phases as outlined in reference [24].

During the initial Data Collection and Engineering step, data can be gathered from various sources, including raw data from CI tools and monitoring sensors [25]. Post-collection, it is crucial to extract relevant features and adapt them for the subsequent training phase [26]. For effective ML model training, selecting an appropriate algorithm and adjusting hyperparameters for performance optimization is essential. Finally, thorough evaluation and validation are necessary to ensure the models’ effectiveness for real-world applications [27]. Regularly repeating these steps is important to keep models updated with new data.
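The four steps can be sketched end to end on a toy problem. Everything in this example is an assumption made for illustration: the build records are invented, the single "change size" feature is hypothetical, and the model is a plain threshold rule whose only hyperparameter is tuned by exhaustive search. It is meant to show the order of the steps, not a recommended model.

```python
# Step 1: data collection and engineering (hypothetical CI build records).
RAW_BUILDS = [
    {"files_changed": 1, "tests_changed": 0, "failed": False},
    {"files_changed": 2, "tests_changed": 1, "failed": False},
    {"files_changed": 12, "tests_changed": 3, "failed": True},
    {"files_changed": 15, "tests_changed": 0, "failed": True},
    {"files_changed": 3, "tests_changed": 0, "failed": False},
    {"files_changed": 9, "tests_changed": 2, "failed": True},
    {"files_changed": 2, "tests_changed": 1, "failed": False},
    {"files_changed": 11, "tests_changed": 1, "failed": True},
    {"files_changed": None, "tests_changed": 0, "failed": False},  # incomplete
]


def prepare(records):
    """Drop records with missing values."""
    return [r for r in records if None not in r.values()]


# Step 2: feature engineering (one hand-crafted feature: total change size).
def to_sample(record):
    return record["files_changed"] + record["tests_changed"], record["failed"]


# Step 3: training and hyperparameter tuning.
def accuracy(threshold, samples):
    """Share of samples where 'size > threshold' matches the real outcome."""
    hits = sum((size > threshold) == failed for size, failed in samples)
    return hits / len(samples)


def tune(samples, candidates):
    """Pick the first candidate threshold with the best training accuracy."""
    return max(candidates, key=lambda t: accuracy(t, samples))


samples = [to_sample(r) for r in prepare(RAW_BUILDS)]
train, held_out = samples[:6], samples[6:]
best_threshold = tune(train, candidates=range(1, 16))

# Step 4: model evaluation on data not used for tuning.
held_out_accuracy = accuracy(best_threshold, held_out)
```

Re-running the same script after new build records arrive corresponds to the regular repetition of the steps described above.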

3 Research Methodology

To gain insights into the developmental stages of ML-based methods in CI, this SLR followed Kitchenham’s guidelines [28]. We retrieved relevant studies by running a specified search string on the ACM (https://dl.acm.org/), IEEE (https://www.ieee.org/), and Scopus (https://www.scopus.com/) indexing systems. These databases were chosen for their extensive coverage of journals and conferences in the Software Engineering and Computer Science domains [29].

Previous studies affirm the effectiveness of this search strategy in collecting high-quality papers [29]. A summary of the research methodology and the number of studies in each phase is presented in Figure 2. Further details are provided in the subsequent section.

3.1 Research Questions

Table 1: Research questions and motivations.

Research Questions	Motivation
RQ1: What CI tasks can be automated using ML-based approaches?	In this first research question, we seek to explain the specific CI tasks and phases that ML methods can effectively automate. Also, we present the output and input of these CI phases. This insight not only enhances our understanding of the integration process but also sheds light on underexplored CI phases and tasks that practitioners can further investigate.
RQ2: What datasets and data preparation techniques are used in automating CI tasks using ML?	Given the data-driven nature of ML models, the characteristics and preparation of input datasets significantly influence the efficacy of ML-based solutions in CI. This research question investigates the frequently employed datasets and associated data engineering methodologies, including approaches to address challenges like class imbalance.
RQ3: What feature extraction strategies are utilized to train ML models for automating the CI tasks?	Beyond raw data, the selection and engineering of features play a pivotal role in shaping the effectiveness of ML models. By identifying the array of feature types and techniques for transforming raw data into machine-readable inputs, we contribute to a repository of knowledge that can be used in future studies. This knowledge empowers researchers and practitioners to optimize the design of ML models for automating CI tasks.
RQ4: What ML modeling and tuning techniques are used to automate CI tasks?	This research question focuses on the connection between ML model types and CI tasks. By uncovering the relationship between these two elements, we aim to pinpoint potential gaps in the application of ML models for automating CI tasks. Moreover, we extract the common approaches for tuning these models, shedding light on hyper-parameter tuning methods that enhance their performance.
RQ5: How are the ML models evaluated when automating the CI tasks?	In the final research question, we focus on the evaluation of ML model performance within the realm of CI. Our objective is to categorize the commonly employed evaluation metrics and techniques, elucidating their correlation with the four distinct ML algorithm categories outlined in Section 2.2. This categorization enables researchers and practitioners to conduct informed comparisons of various ML methodologies for automating the CI tasks, fostering a well-grounded perspective based on existing literature.

To initiate this SLR, five research questions (RQs) were formulated to identify the phases of CI and the ML model development techniques used for these tasks. Section 4.2 provides an overview of the ML models and CI tasks identified based on these RQs. You can find a detailed list of these RQs and their motivations in Table 1.

3.2 Search Strategy

The next step in this SLR is designing an appropriate search strategy for finding relevant studies [28]. The search strategy configuration for extracting studies from databases is detailed below.

Table 2: Search string used to extract relevant papers on the application of ML methods in CI. The search string is composed of two segments, including “Machine Learning” and its associated synonyms and “Continuous Integration”.

A	TITLE-ABS-KEY ((“Machine Learning” OR “Deep Learning” OR “Reinforcement Learning” OR “Supervised Learning” OR “Unsupervised Learning” OR “Semi-supervised Learning” OR “Data Mining” OR “Text Mining” OR “Natural Language Processing” )
B	AND (“Continuous Integration”)

3.2.1 Search String

Our search string consists of two segments: A) “Machine Learning” and related synonyms, and B) “Continuous Integration”. These choices were informed by a review of previous studies and SLRs [1, 30, 31, 32], with the complete search string detailed in Table 2. The search string was executed on 26th July 2023, and snowballing was conducted on 7th August 2023.
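As an illustration of how the two segments combine, the sketch below assembles a query from the terms listed in Table 2. The helper names are hypothetical, and the `TITLE-ABS-KEY` wrapper follows the form shown in Table 2; other databases are assumed to need their own field syntax, which is not reproduced here.

```python
# Segment A: "Machine Learning" and its synonyms, as listed in Table 2.
ML_TERMS = [
    "Machine Learning", "Deep Learning", "Reinforcement Learning",
    "Supervised Learning", "Unsupervised Learning", "Semi-supervised Learning",
    "Data Mining", "Text Mining", "Natural Language Processing",
]
# Segment B: the CI term.
CI_TERMS = ["Continuous Integration"]


def or_group(terms):
    """Quote each term and join the terms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(segment_a, segment_b):
    """Require one term from each segment in title, abstract, or keywords."""
    return f"TITLE-ABS-KEY ({or_group(segment_a)} AND {or_group(segment_b)})"


query = build_query(ML_TERMS, CI_TERMS)
```

Keeping the terms in lists makes the iterative refinement described in this section a matter of appending a missing synonym and regenerating the query.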

Before implementing our search string to identify relevant literature, the first and third authors conducted a preliminary search on Google Scholar using keywords such as ‘machine learning,’ ‘Continuous Integration’, and related terms. This search aimed to identify initial relevant studies on the application of machine learning (ML) methods in Continuous Integration (CI). This preliminary search served as a validation set of papers. Subsequently, we iteratively refined our search string to retrieve all the papers included in our validation set. However, due to its tendency to yield numerous irrelevant results [1], Google Scholar was not used for our primary search.

To refine our search string, we iteratively ran it on ACM, IEEE, and Scopus databases, identifying and adding missing search terms for comprehensive coverage in the initial phase. The refined search string was then executed on these three databases, resulting in 1189 hits.

3.2.2 Inclusion and Exclusion Criteria

Table 3: Inclusion and exclusion criteria.

ID	Inclusion criteria
I1	Studies that employed ML-based methods in the CI environment.
I2	Related to software development and engineering.
ID	Exclusion criteria
E1	The paper is tagged as non-English language.
E2	Non-research papers including conference reviews and reports, notes, and short surveys.
E3	Short papers (i.e., less than five pages)
E4	Duplicate papers

To collect a list of high-quality studies, we established both inclusion and exclusion criteria. The selection process and these criteria are shown in Figure 2 and Table 3, respectively. We applied these criteria to all 1189 studies from the three databases, involving several steps in the paper selection process. Initially, non-English papers were excluded. Non-research documents, such as reports and notes, were disregarded. Additionally, papers with fewer than five pages were excluded because they are unlikely to explain their methods comprehensively. To prevent redundancy, duplicated papers across all databases were removed before reviewing the content.
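The mechanical part of this screening (criteria E1 to E4 in Table 3) can be sketched as a simple filter. The record fields, the sample entries, and the title-based duplicate check are assumptions made for illustration; the inclusion criteria I1 and I2 require reading the paper and are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    title: str
    language: str
    kind: str   # e.g. "research", "report", "note", "short survey"
    pages: int


NON_RESEARCH = {"conference review", "report", "note", "short survey"}


def passes_exclusion(paper):
    """Apply E1 (language), E2 (non-research), and E3 (short papers)."""
    return (
        paper.language == "English"
        and paper.kind not in NON_RESEARCH
        and paper.pages >= 5
    )


def screen(papers):
    """Keep papers that pass E1-E3, dropping duplicates by title (E4)."""
    seen, kept = set(), []
    for paper in papers:
        key = paper.title.strip().lower()
        if passes_exclusion(paper) and key not in seen:
            seen.add(key)
            kept.append(paper)
    return kept


# Hypothetical search hits.
CANDIDATES = [
    Paper("ML for Build Prediction", "English", "research", 12),
    Paper("ml for build prediction ", "English", "research", 12),  # duplicate
    Paper("Ein Bericht", "German", "research", 10),                # E1
    Paper("CI Workshop Notes", "English", "note", 8),              # E2
    Paper("Tiny Idea Paper", "English", "research", 4),            # E3
]
kept = screen(CANDIDATES)
```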

In subsequent stages, the first and second authors reviewed paper titles and abstracts, categorizing them into two lists and curating a selection of potential papers relevant to ML in CI. A comprehensive examination of the full text followed, ensuring relevance and content quality, with the cooperation of the first and second authors and continuous input from the third and fourth authors. According to Wohlin’s guidelines [33], the selected papers must be published in peer-reviewed venues.

To ensure comprehensive coverage of relevant literature, we employed snowballing on the initial set of 50 papers [33], with the assistance of researchers whose names are presented in the acknowledgment section. Backward and forward snowballing involved reviewing references in selected papers and papers citing them, respectively. We repeated this process and reviewed steps until no more related papers could be identified.
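The repeat-until-nothing-new character of snowballing can be expressed as a fixed-point iteration over a citation graph. This is a sketch under stated assumptions: the two dictionaries stand in for reference lists and citing-paper lookups, and `is_relevant` stands in for the manual review against the inclusion and exclusion criteria.

```python
def snowball(seed, references, cited_by, is_relevant):
    """Expand a seed set until a round finds no new relevant paper."""
    selected = set(seed)
    frontier = set(seed)
    while frontier:
        candidates = set()
        for paper in frontier:
            candidates |= set(references.get(paper, ()))  # backward step
            candidates |= set(cited_by.get(paper, ()))    # forward step
        # Only papers not yet selected and judged relevant start a new round.
        frontier = {p for p in candidates - selected if is_relevant(p)}
        selected |= frontier
    return selected


# Hypothetical citation graph: A cites B and X, B cites C,
# D cites A, and E cites C. X is judged irrelevant.
REFERENCES = {"A": ["B", "X"], "B": ["C"]}
CITED_BY = {"A": ["D"], "C": ["E"]}
RELEVANT = {"B", "C", "D", "E"}

final_set = snowball({"A"}, REFERENCES, CITED_BY, RELEVANT.__contains__)
```

The loop terminates because each round can only add papers that were not selected before, and the graph is finite.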

In backward snowballing, 11 papers were identified, and in forward snowballing, four potential papers were found before full-text review. After reviewing these 15 potential papers, we filtered out 13 based on inclusion and exclusion criteria. In the end, 52 papers on the application of ML in CI were selected. The list and corresponding IDs are presented in the appendices, specifically in Table 22.

3.3 Data Extraction and Data Synthesis

Data extraction: Following Kitchenham’s guidelines [28], we designed a form to systematically collect information, including demographic data, from selected studies. The first author organized the extracted data in a Microsoft Excel spreadsheet according to the extraction form provided in Table 23, with random checks performed by the second author.

Data synthesis: This study aims to comprehensively analyze, classify, and report both qualitative and quantitative data. Thematic analysis was utilized for synthesizing qualitative data [34], a method recognized for its applicability in the software engineering domain [35]. Quantitative data were presented in a raw format, including demographic information, or were derived from the synthesized qualitative data. The process of qualitative thematic analysis comprised six steps, as outlined below. The initial execution of these steps was undertaken by the first author and thoroughly discussed with the second author for potential modifications. The third and fourth authors oversaw this process and offered feedback for necessary adjustments.

1. Familiarizing with data: We reviewed and annotated each field of the extracted data.

2. Generating initial codes: At this stage, we broke down the data into smaller parts and assigned codes, i.e., meaningful words or phrases acting as labels. This step involved iterative merging and revision of codes.

3. Searching for themes: After finalizing the codes, we reorganized and gathered relevant data for each code, defining potential themes.

4. Reviewing themes: We reviewed potential themes for relevance with the extracted data and codes for each research question.

5. Defining and naming themes: In this step, we defined coherent and precise names for each theme, along with clear definitions.

6. Reporting: Our analysis and findings were mapped to each research question and reported in Section 4.

4 Findings

In this section, we present the findings derived from our comprehensive analysis of demographic data and synthesized data, and we will provide answers to our five research questions as outlined in Table 1.

4.1 Demographic Data

This section presents the demographic data concerning the four research design attributes, namely (1) publication year, (2) publisher venue and type of the venues: conference, journal or workshop, (3) study context and research methods, and (4) types of learning algorithms. The demographic information is valuable for researchers who wish to conduct research in this area, as it provides insights into where relevant studies can be found and their trends over the years, as well as critical areas by reviewing the list of keywords. Furthermore, it offers an overview of the study context and research methods employed in conducted studies [1].

4.1.1 Distribution of Studies

Because we ran the search string on July 26th, 2023 and conducted the snowballing on August 7th, 2023, the 52 selected studies cover publications between 2000 and August 2023. As shown in Figure 3, the first study on the application of ML techniques in CI was published in 2006 [36] and focused on predicting build validation in the CI phase. Notably, the demographic data reveal that no further paper was published on this topic between 2006 and 2014. Since 2014, however, the application of ML methods in CI has gradually gained more attention from researchers, and the number of studies has been increasing. Additionally, the results indicate that a majority of papers (65.4%) were published at conferences. It is worth mentioning that, from 2021 onwards, all the studies focused only on enhancing regression testing and predicting build validation. This underscores the importance of these two phases in the CI pipeline.

4.1.2 Distribution of Keywords

The concept of CI encompasses various areas and tools. To illustrate the areas that researchers have focused on, we can examine the keywords used in the selected studies. Figure 4 displays the frequency of each keyword in the primary studies and highlights the most frequent ones. The domains covered in this SLR are diverse, including Software Engineering concepts (e.g. Continuous Integration, Test Case Prioritization, Monitoring), software tools (e.g. Travis CI), and ML concepts (e.g. Classifier, Reward Function). The most common keywords in the selected studies are Continuous Integration, Machine Learning, Regression Testing, and Test Case Prioritization. This reflects the importance of the regression testing phase of CI and especially its test case prioritization task.

Table 4: Distribution of the 52 selected primary studies on publication venues. The color intensity in this table corresponds to the number of studies, with lighter shades indicating lower counts and darker blues representing higher counts.

Publication Venue	#	%
IEEE/ACM International Conference on Automated Software Engineering (ASE)	3	5.8
International Conference on Software Engineering (ICSE)	3	5.8
International Conference on Software Testing, Verification and Validation Workshops (ICSTW)	3	5.8
Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis (STA)	3	5.8
IEEE Transactions on Software Engineering (TSE)	3	5.8
Proceedings of the Asia-Pacific Symposium on Internetware (APSI)	2	3.8
CEUR Workshop Proceedings (CEUR)	2	3.8
Proceedings of the Annual International Conference on Computer Science and Software Engineering (CSSE)	2	3.8
IEEE Conference on Software Testing, Validation and Verification (ICST)	2	3.8
International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems (IEA/AIE)	2	3.8
Journal of Software: Evolution and Process (JSEP)	2	3.8
IEEE International Conference on Software Analysis, Evolution and Re-engineering (SANER)	2	3.8
Others (These venues only published one study)	23	44.2

Table 4 displays the distribution of published papers on the application of ML in CI across 35 venues, with 12 venues publishing more than one study. Among these 12 venues, IEEE/ACM International Conference on Automated Software Engineering (ASE), International Conference on Software Engineering (ICSE), International Conference on Software Testing, Verification and Validation Workshops (ICSTW), Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis (STA), and IEEE Transactions on Software Engineering (TSE) stand out as the top venues for publishing research on the application of ML in CI, with each venue publishing three papers. It is worth noting that a large share of the papers (44.2%) were published in 23 different venues that each published a single study, with Software Engineering venues being responsible for the majority of papers (65.2%). The other domains of venues include Computer Science, Artificial Intelligence, Web and databases.

4.1.3 Study Context and Research Methods

Table 5: The number of papers about each research and data analysis type. The color intensity in this table corresponds to the number of studies, with lighter shades indicating lower counts and darker blues representing higher counts. The bolded numbers represent the count of studies and their percentage in total.

	Industry	Open-Source	Simulation
Study IDs (Number of studies, percentage)	S1, S4, S8, S9, S13, S18, S21, S24, S26, S27, S28, S29, S30, S37, S39, S42, S43, S44, S46, S47, S49 (21, 40.4%)	S2, S3, S5, S6, S7, S10, S11, S14, S15, S16, S17, S20, S22, S23, S25, S26, S31, S32, S33, S34, S35, S36, S37, S38, S40, S41, S45, S48, S50, S51, S52 (31, 59.6%)	S12, S19 (2, 3.8%)

Table 5 provides a classification of the study context of the reviewed papers into three groups: “Industry”, “Open-Source”, and “Simulation” (non-industry). Industrial studies were conducted with real-world data sets from software companies, such as projects at Microsoft (S1) [37], to verify the applicability of the proposed methods in a practical closed-source software environment. On the other hand, studies in the “Open-Source” category validated their methods on real-world data from industrial open-source software, such as the Apache projects (S10) [11]. The distribution of the studies across the Industry and Open-Source categories indicates the applicability of the ML methods in different CI environments.

Based on Table 5, it can be observed that only two studies (S12, S19) conducted research in simulated environments, while the majority of studies (96.2%) were situated in the Industry and Open-source categories. Simulation-based data sets refer to studies that implemented and evaluated their methods in simulated environments. In S19, the authors used the Gym library to simulate a CI environment by using execution logs of test cases and training a Reinforcement Learning (RL) agent. The simulator recreates a similar environment as a real environment for testing execution history. In S12, the authors simulated a cloud environment by installing packages manually and validated their ML method for discovering the installed software on containers and Virtual Machines (VMs) in testing the whole system. This highlights the usefulness of simulation-based data sets in evaluating and testing ML-based methods in a controlled and reproducible environment.
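To illustrate the idea of replaying execution logs as a simulated CI environment, the sketch below defines a small environment that an agent can interact with. It does not use the Gym library or reproduce the setup of S19: the class, its reset/step methods, the reward of 1 for selecting a test that failed in the log, and the test names are all hypothetical and only loosely follow the reset/step convention common to RL environments.

```python
class CIReplayEnv:
    """Replays logged test verdicts; the agent picks which test to run next."""

    def __init__(self, logged_failures):
        # Maps test name -> whether it failed in the logged CI cycle.
        self.logged_failures = dict(logged_failures)
        self.remaining = []

    def reset(self):
        """Start a new cycle and return the tests still to be scheduled."""
        self.remaining = sorted(self.logged_failures)
        return tuple(self.remaining)

    def step(self, test_name):
        """Run one test; reward 1.0 if the log says it revealed a failure."""
        self.remaining.remove(test_name)
        reward = 1.0 if self.logged_failures[test_name] else 0.0
        done = not self.remaining
        return tuple(self.remaining), reward, done


def run_episode(env, policy):
    """Let a policy order all tests and collect the reward per position."""
    observation, done, rewards = env.reset(), False, []
    while not done:
        observation, reward, done = env.step(policy(observation))
        rewards.append(reward)
    return rewards


# Hypothetical execution log for one CI cycle.
LOG = {"test_login": True, "test_cart": False, "test_search": False}

# A naive policy that always runs the alphabetically first remaining test.
rewards = run_episode(CIReplayEnv(LOG), policy=lambda obs: obs[0])
```

Because the verdicts come from a fixed log, every episode is reproducible, which is the property that makes such simulated environments convenient for comparing policies.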

Moreover, we conducted a clustering analysis of the primary studies based on the types of research methods used. Typically, studies are categorized into two groups, namely quantitative and qualitative research, and we assigned the studies to each group based on their presented results and conclusions. Quantitative research methods involve the collection of empirical data through measurements and procedures, while qualitative studies are more descriptive and leave more room for interpretation [38].

In evaluating the effectiveness of quantitative research methods, researchers often rely on metrics such as the accuracy and performance of ML models, whereas qualitative methods may involve surveys, interviews, and other techniques to gather feedback from users of the trained ML models [39]. The results show that none of the studies in this area employed qualitative methods for evaluating their solutions. Hence, a significant gap exists in the use of qualitative analysis in the area of ML for CI. Furthermore, in this SLR, we provide additional information on the evaluation methods of quantitative research, which can assist researchers and practitioners in understanding how quantitative results can be validated and which evaluation metrics are most commonly used in the area of ML for CI.

Summary: $\bullet$ The number of published studies related to the application of ML in CI has been on the rise, and gradually narrowed down to only regression testing and build validation over time. $\bullet$ All of the studies included in this SLR utilized quantitative research methods. However, a lack of qualitative assessment of the results is visible in the literature.

4.2 RQ1: Continuous Integration Phases and Tasks

Our analysis uncovered six distinct ML-enhanced phases within CI, interconnected as illustrated in Figure 5. These phases are presented in the order in which they occur within the CI pipeline. To highlight the automated tasks within each CI phase, we underline them in the description of each phase. Table 6 lists the ten identified automated CI tasks across these six phases. Because of the limited number of studies and the use of different evaluation metrics, we compare state-of-the-art methods only for the Test Optimization and Build Prediction tasks.

Table 6: List of papers in each CI phase with related tasks and total count in parentheses. The table’s color intensity corresponds to the number of studies, with lighter shades indicating lower counts and darker blues representing higher counts. The bolded numbers represent the count of studies and their percentage in total.

CI Phase	CI Task	Studies (count, % of total)
Unit Test	Unit Test Prediction	S2 (1, 1.9%)
Integration Test	Branch Coverage Prediction	S3, S15 (2, 3.9%)
Integration Test	Integration Test Prediction	S18, S23 (2, 3.9%)
Regression Test	Test Optimization	S7, S8, S11, S14, S16, S19, S20, S22, S26, S28, S29, S30, S31, S33, S34, S35, S36, S37, S39, S41, S45, S46, S47, S48, S50, S52 (26, 50%)
Regression Test	Defect Prediction	S9, S13, S17, S32 (4, 7.7%)
Regression Test	Flaky Test Detection	S40 (1, 1.9%)
Build Validation	Build Prediction	S1, S5, S6, S21, S24, S25, S27, S38, S42, S43, S44, S51 (12, 23.1%)
System Test	Installed Software Discovery	S12 (1, 1.9%)
System Test	Performance Test Optimization	S4 (1, 1.9%)
Process Management	Activity Management	S10, S17 (2, 3.9%)

Unit Test (UT): According to Stolberg’s definition [40], the Unit Test phase serves as the initial step in the CI pipeline. This phase involves validating newly developed code within an isolated environment. It occurs whenever a developer commits new or modified compiled code. The code built during this step is referred to as the “Developer build”, as depicted in Figure 5.

In CI, the master branch is typically kept free from failures, and code changes are made on a separate branch, which we refer to as the “developer branch” throughout this study. Among the 52 studies, only S2 focused on this phase, conducting Unit Test Prediction. The authors improved the accuracy of static code checkers by using ML models to predict false positives, that is, safe code wrongly flagged by the checker (S2), reducing the risk of releasing buggy software.

Integration Test (IT): After validating units with Unit Test in the developer branch, the next step is to integrate the new module with other developed modules on the same branch [41]. This phase involves testing various modules and new functionalities in combination with the entire system or sub-systems. ML-based methods in this CI phase aim to predict test outcomes (S18) and reduce the computational load of integration tests by skipping safe commits (S23), executing fewer test cases, and reducing overall costs (S3, S15) through ML-based Integration Test Prediction or Branch Coverage Prediction.

Table 7: Comparison of the three SOTA methods in optimizing the regression testing. Note: TCP is Test Case Prioritization and TCS is Test Case Selection Tasks; RL is Reinforcement Learning, NN is Neural Networks, and MAB is Multi-Armed Bandit.

SID	Method Name	Strategy	Model Properties	Minimum Required Data	Advantages	Disadvantages
S7	RETECS	TCP & TCS	RL (NN-based agent)	60 test cycles	Language-free; source code not required	Not appropriate for large-scale datasets
S20	COLEMAN	TCP	RL (MAB-based agent)	100 test cycles	Language-free; lightweight	Cold start
S43	DeepOrder	TCP	DL	4 test cycles	Language-free; efficient for large datasets; time-efficient	Non-interpretable

Regression Test (RT): Following integrated code validation and testing, the software product’s current version undergoes comprehensive testing based on previously designed test cases [42]. Regression tests encompass various types, such as structural and functional testing [43]. Given the time and resource intensity of regression testing [44], the majority of studies (31 out of 52) focus on this area. The main ML-based solutions explored by 26 studies are Test Case Prioritization (TCP) and Test Case Selection (TCS), contributing to Test Optimization. In TCP, ML-based solutions expedite the detection of software defects by ordering test cases based on the likelihood of revealing faults more quickly [45]. TCS focuses on selecting test cases that assess code changes while consuming fewer resources and less time [46]. Additionally, selected studies introduced other ML-based solutions, including Defect Prediction (S9, S13, S17, and S32) and Flaky Test Detection (S40).

Among the 26 papers primarily focused on Test Optimization, we selected three as State-Of-The-Art (SOTA) approaches based on the accuracy of their trained ML models: S7, S20, and S43. These papers introduced the Reinforced Test Case Prioritization and Selection (RETECS), Combinatorial VOlatiLE Multi-Armed BANdit (COLEMAN), and Deep learning for test case prioritization (DeepOrder) methods, respectively. According to their authors, both COLEMAN and DeepOrder outperformed the RETECS method, which had previously been recognized as the SOTA approach for test optimization in regression testing. Notably, both the COLEMAN and DeepOrder studies employed the IOF/ROL dataset. On this dataset, DeepOrder yielded slightly better Normalized Average Percentage of Fault Detection (NAPFD) (see section 4.6) results than COLEMAN. However, we cannot conclude that DeepOrder universally outperforms COLEMAN, as further examination is required.
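Since the comparison above rests on NAPFD, a short sketch of how the metric can be computed may help. It assumes the commonly cited form NAPFD = p − (Σ TF_i)/(n·m) + p/(2n), where n is the number of test cases in the suite, m the total number of faults, p the fraction of faults detected, and TF_i the 1-based position of the first executed test revealing fault i (0 if the fault is not detected). The function name and input layout are hypothetical, and the definition used in section 4.6 remains authoritative.

```python
from typing import Dict, List, Set


def napfd(ordering: List[str], faults: Dict[str, Set[str]], total_tests: int) -> float:
    """Compute NAPFD for one prioritized (and possibly truncated) test ordering.

    ordering:    executed test-case ids in execution order, without duplicates
    faults:      fault id -> set of test-case ids that reveal this fault
    total_tests: number of test cases in the full suite (n)
    """
    m = len(faults)
    if m == 0 or total_tests == 0:
        raise ValueError("NAPFD is undefined without faults or test cases")
    position = {test: i + 1 for i, test in enumerate(ordering)}
    ranks = []
    for revealing_tests in faults.values():
        hits = [position[t] for t in revealing_tests if t in position]
        ranks.append(min(hits) if hits else 0)  # 0 marks an undetected fault
    p = sum(1 for r in ranks if r > 0) / m
    return p - sum(ranks) / (total_tests * m) + p / (2 * total_tests)
```

Higher values indicate that faults are revealed earlier in the ordering; when every test is executed and every fault is detected, the value coincides with the classical APFD.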

For a comprehensive overview of these three SOTA methods and their respective advantages, please refer to Table 7. Importantly, all these SOTA methods share a common characteristic: they do not require access to the source code for test case prioritization, making them language-independent. Their sole prerequisite is the availability of test case metadata, reducing training time and facilitating the use of larger datasets.

Regarding other tasks in the RT phase, only one study focuses on Flaky Test Detection, while four studies address Defect Prediction CI tasks. Our comparative analysis will exclusively concentrate on the body of published works concerning Defect Prediction tasks within the RT phase.

However, it is imperative to acknowledge that these studies have diverse objectives and methodological properties. The wide range of research goals and methodologies within this limited corpus prevents direct comparisons. Here, we present the objectives of these studies to afford a comprehensive understanding of their respective aims and contributions.

Performance Enhancement for Low Defect Percentage Datasets (S9): This study proposes methods with good performance on datasets characterized by a low defect percentage.

Impact of Feature Extraction Methods (S13): Investigations in this category assess the influence of feature extraction techniques, such as bag-of-words and word embedding, on ML method effectiveness.

Transfer Learning for New Project Implementation (S17): Studies in this category explore transfer learning techniques for using trained ML models in new projects to predict defects.

Continuous Defect Prediction Based on Code Change Features (S32): Research efforts in this context introduce ML methods specifically designed to continuously predict defects by leveraging code change features.

Table 8: Comparison of the three SOTA methods in build outcome validation. Note: DNN is Deep Neural Network, LSTM is Long Short Term Memory.

SID	Method Name	Strategy	Model Properties	Results	Advantages
S21	SmartBuildSkip	Predicting pass builds	Random Forest	Outperforms SOTA (S24); saves 30% on building time	Lightweight
S25	BuildFast	Predicting fail builds	XGBoost	Outperforms SOTA (S6 and S24); improves the F1-score of the SOTA method by 47.5%	Chronological; lightweight
S38	DL-CIBuild	Predicting build outcome	DNN (LSTM)	Improves the F1-score of the best ML methods by 10%	Chronological; no feature engineering required; works within and across projects

Build Validation (BV): Following the preceding CI phases, developers generally have confidence in the functionalities and performance of developed software units. At this stage, the software version is ready for merging into the master branch, representing the final product for building based on the developed units within this branch. The Build Validation phase focuses on ensuring the stability of integrated codes before releasing the software for system testing [47]. Given the significant computational cost associated with building software products [3], ML-based solutions in this phase primarily target Build Prediction and reduce building efforts.

Among the 12 studies concentrating on the BV task, eleven predicted the build outcome using data from the same project (within-project). In contrast, study S5 used building information from various projects to train a cross-project model, allowing build prediction for projects lacking sufficient data [48]. Given the substantial focus on build validation, we reviewed published studies in this area and identified three SOTA methods, namely S21, S25, and S38, which report the three most accurate ML models and each employ a distinct strategy.

S21 introduced a method to predict passing builds and save time by skipping unnecessary building tests (SmartBuildSkip) using a Random Forest method. Their approach reduced the time spent on building tests by 30%, outperforming the reviewed ML method in predicting passing builds. It is worth noting that S21 considered a build as likely to fail if it had failed in the previous execution, excluding failed builds from evaluations.

In contrast, S25 focused on predicting failing builds, utilizing a decision-tree-based method (XGBoost) for their BuildFast approach. Results showed a remarkable 47.5% improvement in the F1-score for predicting failed builds compared to existing SOTA methods. An advantage of BuildFast over SmartBuildSkip is its consideration of the chronological order of input data during model training, enabling its use in real-world environments.
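The point about chronological order can be illustrated with a small sketch of a time-ordered train/test split, in which the model is only ever evaluated on builds that occurred after all of its training builds. This is an illustration of the general idea rather than the BuildFast procedure; the function name and the `timestamp` field are assumptions.

```python
from typing import Dict, List, Tuple

Build = Dict[str, float]  # hypothetical record; only "timestamp" is required here


def chronological_split(builds: List[Build],
                        train_fraction: float = 0.8) -> Tuple[List[Build], List[Build]]:
    """Split builds so that every training build precedes every test build."""
    ordered = sorted(builds, key=lambda b: b["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```

A random shuffle before splitting would instead let information from future builds leak into training, which a deployed CI predictor never has access to.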

The latest SOTA method in build validation, DL-CIBuild (S38), employs Long Short-Term Memory (LSTM), a deep learning-based approach. DL-CIBuild, a chronological method, improved on the F1-score of existing SOTA ML-based methods by 10%. DL-CIBuild has advantages over SmartBuildSkip and BuildFast: it does not require feature engineering and can be applied to both intra-project and cross-project build validation tasks. Table 8 offers a concise overview of these three SOTA methods in build validation in CI.

System Test (ST): In the final step, System Test, the complete software system undergoes testing to ensure all aspects, including performance, functionality, and compatibility, are correct. ML-based solutions in this phase are limited, mainly focusing on Installed Software Discovery for compliance, security, and efficiency (S12) and generating test cases for Performance Test Optimization to detect defects in the system (S4).

Process Management (PM): This step involves communication among developers and the production of numerous documents. ML methods are applied here for Activity Management. S10 classifies CI environment information for project management committee members, while S17 proposes an ML-based approach for automatically labeling reported issues, aiding software engineers in assigning the correct issues to developers. Notably, this phase primarily relies on natural language-based information, and the limited use of Large Language Models (LLMs) is noteworthy.

In industrial projects, the sequence of Integration and Regression tests may be altered or combined based on project requirements and priorities, deviating from the sequence in Figure 5. Selected studies for each CI phase and task are listed in Table 6.

Summary: $\bullet$ Six phases and ten tasks are identified in CI pipelines: Unit Test, Integration Test, Regression Test, Build Validation, System Test, and Process Management. Detailed descriptions of their sequence and input/output are provided. $\bullet$ Out of the 52 selected studies, 31 focused on Regression Testing and 12 on Build Validation due to their high costs, emphasizing their critical roles in ensuring software quality. Other CI phases had nine studies collectively, highlighting the significance of RT and BV. $\bullet$ Key tasks include Test Optimization in Regression Testing and Build Prediction in Build Validation. SOTA methods for these tasks are RETECS, COLEMAN, DeepOrder, SmartBuildSkip, BuildFast, and DL-CIBuild, demonstrating advancements in CI.

4.3 RQ2: Data Sets and Data Engineering Methods

Input data selection, data engineering, and model training are crucial factors influencing ML model performance [49]. The nature and availability of input data vary across CI tasks, emphasizing the importance of understanding data sources and engineering methods.

To facilitate informed decisions, this section offers insights into data sources, types, and engineering techniques employed in selected studies. Analyzing this information aids researchers and practitioners in understanding diverse approaches in ML4CI and discovering potential research directions.

4.3.1 Data Sources

In the reviewed literature, 67 diverse data sources were utilized across the 52 selected ML-based CI studies. These sources differ in characteristics like lines of code (LOC), number of builds, and project specificity. For instance, study S3 used Google Dagger (848 LOC), contrasting with S2 employing a Samsung dataset with over 27 million LOC. Similarly, builds in different studies varied significantly, e.g., the Jazz project in S44 had 199 builds, while S5 used the TravisTorrent dataset with over 300,000 builds. These variations emphasize the need to comprehend data source characteristics in ML-based CI research.

Table 9: This table summarizes frequently used study data sources, their correlation with identified CI tasks in each CI phase, and study IDs. Color intensity in the table shows numerical values, with lighter shades indicating lower and darker blues representing higher values. The bolded numbers represent the count of studies and their percentage in total. Acronyms: UT: Unit Test, IT: Integration Test, RT: Regression Test, BV: Build Validation, ST: System Test and PM: Process Management.

CI Dataset	# Test Cases	# CI Cycles	# Test Outcomes	Fail Test Rate	CI Tasks and Studies
IOF/ROL	1941	320	32260	28.79%	Test Optimization: S7, S11, S14, S19, S20, S22, S28, S31, S33, S35, S37, S48 (12, 23.1%)
Paint Control	89	352	25594	19.36%	Test Optimization: S7, S11, S14, S19, S20, S22, S31, S33, S35, S37, S48 (11, 21.2%)
Apache Commons	457	434	332650	0.02%	Branch Coverage Prediction: S3, S15 (2, 3.8%); Test Optimization: S19, S20, S22, S33, S35, S48 (6, 11.6%); Activity Management: S10 (1, 1.9%)
GSDTSR	5555	336	1260618	0.25%	Test Optimization: S7, S14, S16, S22, S31, S33, S35, S37, S48 (9, 17.3%)
Google Guava	568	1312	694395	0.03%	Branch Coverage Prediction: S3, S15 (2, 3.8%); Test Optimization: S20, S28, S48 (3, 5.8%)
Rails	2010	3263	781273	0.62%	Test Optimization: S20, S22, S33, S35, S48 (5, 9.7%)
TravisTorrent	N/A	N/A	N/A	N/A	Build Prediction: S5, S6, S21, S29 (4, 7.7%)
MyBatis	303	988	815598	0.15%	Test Optimization: S20, S33, S48 (3, 5.8%); Defect Prediction: S32 (1, 1.9%)
Google Closure	360	2257	663470	0.07%	Test Optimization: S20, S33, S48 (3, 5.8%)
Google Auto	46	638	14601	0.19%	Test Optimization: S20, S33, S48 (3, 5.8%)
Dspace	106	3813	204161	1.82%	Test Optimization: S20, S33, S48 (3, 5.8%)
Google Dagger	N/A	N/A	N/A	N/A	Branch Coverage Prediction: S3, S15 (2, 3.8%)

To underscore the applicability of the data sources, we present the 11 datasets that were utilized in more than one study and were reported as publicly available at the time of the studies’ publication. The number in parentheses indicates how many studies used each dataset.

IOF/ROL (12), Paint-Control (11) from ABB Robotics company, Apache-Commons (9), GSDTSR (9), Google Guava (5), Rails (5), TravisTorrent CI projects (5), MyBatis (4), Google Closure (3), Google Auto (3), DSpace (3), and Google Dagger (2).

Table 9 provides a summary of the properties of these datasets, with particular emphasis on datasets related to testing data in CI. The data properties of TravisTorrent and Google Dagger are presented as N/A because they were unavailable at the time of writing this paper.

To assist researchers in selecting suitable datasets based on their defined research problems, this section presents key statistics and summaries of selected datasets. For instance, ABB company, a prominent industrial robot supplier, provides robot software and equipment, with datasets such as Paint Control (PC) and IOF/ROL containing historical information on test results and over 300 CI cycles. CI cycles encompass all development tasks, including coding, building, and testing, occurring continuously to ensure code integration and quality throughout the software development lifecycle before preparing a software product for deployment [50]. PC includes 89 test cases, 352 CI cycles, 25,594 verdicts, and a 19.36% failure rate, while IOF/ROL consists of 1941 test cases, 320 CI cycles, 32,260 verdicts, and a 28.79% failure rate. The GSDTSR dataset, an open resource from Google, has 336 CI cycles, with a notably low failure rate of around 0.25%. Considering these datasets alongside the previously discussed sources enables researchers to make informed choices based on their research objectives and requirements, facilitating evaluations in diverse dataset scenarios.
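The dataset properties quoted above (test cases, CI cycles, verdicts, failure rate) can be derived directly from raw execution records. The sketch below shows one way to do so, assuming a hypothetical record layout of `(ci_cycle, test_case, failed)`; the actual file formats of the ABB and Google datasets may differ.

```python
from typing import Dict, List, Tuple

# Each record is (ci_cycle, test_case, failed); this layout is an assumption.
Record = Tuple[int, str, bool]


def summarize(records: List[Record]) -> Dict[str, float]:
    """Derive Table 9 style properties from a list of test execution records."""
    cycles = {cycle for cycle, _, _ in records}
    tests = {test for _, test, _ in records}
    failures = sum(1 for _, _, failed in records if failed)
    return {
        "test_cases": len(tests),
        "ci_cycles": len(cycles),
        "verdicts": len(records),
        "failure_rate_pct": 100.0 * failures / len(records) if records else 0.0,
    }
```

Computing these figures before training makes the degree of class imbalance explicit, which matters for datasets such as GSDTSR where failures are rare.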

Table 9 highlights notable characteristics of datasets. The IOF/ROL and PC datasets stand out for their higher failure rates, a rarity in CI, making them popular among researchers. The Apache datasets gain popularity for their good structure and quality [51]. Additionally, the GSDTSR, Google Guava, Rails, MyBatis, and Google Closure datasets find frequent use due to their substantial test outcome volume, as indicated in Table 9.

Table 10: Definition of the identified data types in the selected studies. Acronyms: MD: Meta Data.

Data Type	Description	ML-based Example
Source Code	Actual code written by developers that forms the basis of a software application. Usually, it requires preprocessing to be understandable for ML methods.	Detecting text similarity or source-code coverage by tests to predict the outcome of a test through tokenizing it or building Abstract Syntax Trees (ASTs).
Code MD	This data type extracts more information than the Source Code data type by considering specific characteristics of the source code.	Predicting test outcomes by calculating the changes and complexity of the code (e.g., the number of changed LOC, or the depth of the changed module in the inheritance tree).
Test Case	This data type covers the textual information of test cases, including names and test codes.	Prioritizing the execution of tests based on analyzing the test codes.
Test MD	This data type is related to the result of test executions and is gathered by analyzing test logs.	Extracting metadata information such as the time and duration of the test, result, and code coverage of each test case, and selecting a portion of them for execution.
Commit MD	This data type refers to all development changes to the system under test (SUT) except the text of the submitted codes.	Developing a model by using the number of commits, commit time, the branch of code, and the committer’s experience as input data.
Build Logs	These logs present historical information about the outcome of previous builds.	Training an ML model for validating future builds based on analyzing the text of build configurations and their outcomes.
Project MD	This data type depicts a holistic view of the project and is usually used in combination with other data types for training ML models.	The size of the development team and the age of the project are examples of this data type.
Texts	This data type includes all texts except source codes, logs, and test codes, and requires text processing techniques.	Documents, user stories, reported issues, and commit messages are some examples of this data source.
System Logs	System logs are files that show the behavior of the system in different situations and the impact of taking actions in the system.	Discovering the installed software on cloud systems by analyzing the tree of files and paths.

TravisTorrent and Apache datasets are extensively employed in CI research. TravisTorrent, spanning 1,359 projects (402 Java, 898 Ruby, 59 in other languages), encompasses 2,640,825 builds. The Apache dataset comprises diverse projects like Cassandra, Ivy, Lang, Drill, and Math, featuring CI cycles ranging from 55 to 438. Researchers favor these datasets for their versatility, enabling evaluations across a broad spectrum of scenarios.

Table 9 illustrates the correlation between frequently used data sources and identified CI tasks in selected studies. Apache datasets prove versatile and are employed across various CI phases due to their project diversity. Conversely, TravisTorrent is notably applied in Build Prediction, and Google Dagger in predicting branch coverage during the Integration Test phase.

4.3.2 Data Types

This section categorizes and explains data types within CI pipelines. Nine categories, including Source Code, Code Meta-Data (MD), Test Case, Test MD, Commit MD, Build Logs, Project MD, Textual Data or Texts, and System Logs, are identified. Table 10 provides a detailed overview of these data types, offering valuable insights for research purposes. Integrating innovative data types could unlock further exploration opportunities.

Here we aim to elucidate the correlation between data types and CI tasks, providing valuable insights into how different data types can enhance the efficiency of each task. This understanding informs the development and deployment of CI strategies, guiding the selection of data types for future research in automating CI tasks.

Table 11: Relation between data types and CI tasks. Color intensity in the table shows numerical values, with lighter shades indicating lower and darker blues representing higher values. The bolded numbers represent the count of studies and their percentage in total. Acronyms: MD: Meta Data, UT: Unit Test, IT: Integration Test, RT: Regression Test, BV: Build Validation, ST: System Test and PM: Process Management.

CI Phases	CI Tasks	Source Code	Code MD	Test Case	Test MD	Commit MD	Build Logs	Project MD	Texts	System Logs
UT	Unit Test Prediction	S2 (1, 1.9%)	–	–	–	–	–	–	–	–
IT	Branch Coverage Prediction	S15 (1, 1.9%)	S3, S15 (2, 3.8%)	–	–	–	–	–	–	–
IT	Integration Test Prediction	S18 (1, 1.9%)	S18 (1, 1.9%)	–	S18 (1, 1.9%)	S23 (1, 1.9%)	–	–	S23 (1, 1.9%)	–
RT	Test Optimization	S8, S39, S47 (3, 5.8%)	S19, S36, S52 (3, 5.8%)	S36, S39 (2, 3.8%)	S7, S8, S11, S16, S19, S22, S26, S28, S30, S31, S33, S35, S37, S39, S41, S45, S46, S48, S50, S52 (20, 38.5%)	S28 (1, 1.9%)	S36 (1, 1.9%)	S30, S31 (2, 3.8%)	S45 (1, 1.9%)	–
RT	Defect Prediction	S13 (1, 1.9%)	S9, S17, S32 (3, 5.8%)	–	S13, S14, S32 (3, 5.8%)	S9 (1, 1.9%)	–	–	–	–
RT	Flaky Test Detection	–	–	S40 (1, 1.9%)	S40 (1, 1.9%)	–	–	–	–	–
BV	Build Prediction	S5 (1, 1.9%)	S6, S21, S24, S25, S38, S42, S44, S51 (8, 15.4%)	–	S1, S5, S25, S38 (4, 7.7%)	S1, S5, S6, S21, S24, S25, S43, S51 (8, 15.4%)	S24, S25, S27, S43 (4, 7.7%)	S21 (1, 1.9%)	–	–
ST	Installed Software Discovery	–	–	–	–	–	–	–	–	S12 (1, 1.9%)
ST	Performance Test Optimization	–	–	–	S4 (1, 1.9%)	–	–	–	–	–
PM	Activity Management	–	–	–	–	S10 (1, 1.9%)	–	–	S10, S49 (2, 3.8%)	–
Total usage in studies		8, 15.4%	17, 32.7%	3, 5.8%	30, 57.7%	12, 23.1%	5, 9.6%	3, 5.8%	4, 7.7%	1, 1.9%

Table 11 highlights the prevalent use of source code metadata over raw source code data in the selected studies. This tendency is likely attributed to the growing availability of code analysis tools like the CK tool (https://github.com/mauricioaniche/ck), which automates the computation of established metrics such as the Chidamber and Kemerer (CK) metrics. These tools have significantly streamlined the extraction of source code metadata for researchers. Notably, S3, S9, and S15 explicitly mentioned utilizing these tools for extracting CK metrics.

S15, S40, and S44 utilized Halstead metrics with available tools to delve into source code attributes, enriching the development of ML-based CI solutions. Extracting source code metadata from source code is cost-effective and straightforward, involving metrics like comments, methods, lines of code, and parameters. These metrics are easily obtainable and do not demand extensive computational resources. Moreover, these data types are well-suited for ML models as they can be readily converted into numerical values.
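To illustrate how cheaply such metadata can be obtained, the sketch below counts lines of code, comment lines, methods, and parameters for a piece of Python source using the standard `ast` module. It is only an analogy: the reviewed studies mostly analyze Java code with tools such as CK, and the function name and the returned keys are hypothetical.

```python
import ast
from typing import Dict


def code_metadata(source: str) -> Dict[str, int]:
    """Return simple numeric code metadata for a Python source string."""
    lines = source.splitlines()
    tree = ast.parse(source)
    functions = [node for node in ast.walk(tree)
                 if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]
    return {
        "loc": sum(1 for line in lines if line.strip()),  # non-blank lines
        "comment_lines": sum(1 for line in lines if line.strip().startswith("#")),
        "methods": len(functions),
        "parameters": sum(len(f.args.args) for f in functions),
    }
```

Every value in the result is already numeric, which is why this data type can be fed to ML models with little further preprocessing.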

Table 11 reveals the prevalent use of source code metadata in studies related to Build Prediction, Test Optimization, and Defect Prediction in Regression Testing, as well as in Integration Testing within the CI pipeline. However, it is notably absent in the context of Unit Test Prediction, signifying a research gap in this area.

Table 11 underscores the limited utilization of data types such as test case codes, build logs, project metadata, textual data, and system logs in automating CI tasks. This underscores the potential benefits of exploring combinations of various data types for each CI task. For example, combining system logs with code metadata could enhance the accuracy of the Build Prediction task in CI. Furthermore, it is noteworthy that test metadata and commit metadata are predominantly employed in the Build Prediction task, indicating areas for further research in ML-based CI approaches.

4.3.3 Data Preparation

In this section of our literature review, we mainly focus on the data-related techniques used to modify and prepare raw data for ML models. Data cleaning, a fundamental aspect, was not examined separately due to limited information in the studies. However, data cleaning can be considered a filtering technique within the data preparation process.

Table 12 shows that the selected studies utilized ten distinct data preparation techniques. The choice of these techniques and the data’s characteristics depends on the overall strategy for addressing the research problem. The table summarizes the employed techniques and their subgroups. The next section provides a comprehensive description and examples for each data preparation group and subgroup.

Table 12: Correlation between the most commonly used datasets and data preparation methods. Color intensity in the table shows numerical values, with lighter shades indicating lower and darker blues representing higher values. The bolded numbers represent the count of studies and their percentage in total.

Dataset Names	Conditioning			Building		Balancing			Filtering	
Dataset Names	Manual Data Division	Clustering	Objective Data Selection	Data Augmentation	Data Manipulation	Re-sampling	Oversampling	Undersampling	Data Pruning	Selective Data Filtering
IOF/ROL	–	–	–	S35 (1, 1.9%)	–	S35 (1, 1.9%)	S28, S37 (2, 3.8%)	S37 (1, 1.9%)	S11, S14 (2, 3.8%)	S20 (1, 1.9%)
Paint Control	–	–	–	S35 (1, 1.9%)	–	S35 (1, 1.9%)	S37 (1, 1.9%)	S37 (1, 1.9%)	S11, S14 (2, 3.8%)	S20 (1, 1.9%)
Apache Commons	–	–	–	S15, S35 (2, 3.8%)	S10 (1, 1.9%)	S35 (1, 1.9%)	–	–	–	S20 (1, 1.9%)
GSDTSR	–	–	S16 (1, 1.9%)	S35 (1, 1.9%)	–	S35 (1, 1.9%)	S37 (1, 1.9%)	S37 (1, 1.9%)	S14 (1, 1.9%)	–
Google Guava	–	–	–	S15 (1, 1.9%)	–	–	S28 (1, 1.9%)	–	–	S20 (1, 1.9%)
Rails	–	–	–	S35 (1, 1.9%)	–	S35 (1, 1.9%)	–	–	–	S48 (1, 1.9%)
TravisTorrent	S5, S21 (2, 3.8%)	–	–	S29 (1, 1.9%)	–	–	–	–	–	S6 (1, 1.9%)
MyBatis	–	–	–	–	–	–	–	–	S32 (1, 1.9%)	S20 (1, 1.9%)
Google Closure	–	–	–	–	–	–	–	–	–	S20 (1, 1.9%)
Google Auto	–	–	–	–	–	–	–	–	–	S20 (1, 1.9%)
Dspace	–	–	–	–	–	–	–	–	–	S20 (1, 1.9%)
Google Dagger	–	–	–	S15 (1, 1.9%)	–	–	–	–	–	–
Other	S42 (1, 1.9%)	S1, S24, S45 (3, 5.8%)	S2, S40 (2, 3.8%)	S8, S12, S13, S26, S36, S49, S43, S46 (8, 15.4%)	S23 (1, 1.9%)	–	S38, S44 (2, 3.8%)	S30 (1, 1.9%)	S9, S34, S51, S49 (4, 7.7%)	S24, S25 (2, 3.8%)
Num of studies	3, 5.8%	3, 5.8%	3, 5.8%	11, 21.2%	2, 3.8%	1, 1.9%	4, 7.7%	2, 3.8%	7, 13.5%	5, 9.6%

DP1) Conditioning: This technique adjusts the dataset according to its characteristics and the proposed solution. In the case of manual data division, as demonstrated in study S5, the TravisTorrent dataset was partitioned using strategies such as the Burak filter [48] and the Bellwether filter [48]. In another instance (S21), data was segregated based on previous build results (pass or fail), and separate machine learning models were trained for each data partition.

Moreover, clustering methods like k-means can divide data based on feature values or data distribution, as in studies S24 and S45. Objective data selection involves choosing specific data points or segments according to the objective of the proposed solution. For example, when changed code is the objective, the lines before and after each change can be selected for inclusion in the dataset.
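Objective data selection around changed code can be sketched as a simple context window over the source lines. The function name `select_context` and the `radius` parameter are hypothetical; the reviewed studies do not prescribe a particular window size.

```python
from typing import Iterable, List


def select_context(lines: List[str], changed: Iterable[int], radius: int = 2) -> List[str]:
    """Keep each changed line (0-based index) plus `radius` lines before and after it."""
    keep = set()
    for index in changed:
        low = max(0, index - radius)
        high = min(len(lines) - 1, index + radius)
        keep.update(range(low, high + 1))
    # Return the selected lines in their original order, without duplicates.
    return [lines[i] for i in sorted(keep)]
```

Overlapping windows are merged automatically because the selected indices are collected in a set before the lines are emitted.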

Utilizing conditioning for data preparation presents a clear advantage in reducing the computational overhead during ML model training, especially in CI environments with significant data volumes [42].
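As a minimal sketch of the partitioning idea described for S21, the snippet below splits build records by the outcome of the preceding build, so that a separate model could be trained on each partition. The record layout (`prev_result`, `files_changed`, `result`) and the helper name `partition_by_previous_result` are illustrative assumptions and are not taken from the reviewed studies.

```python
from collections import defaultdict


def partition_by_previous_result(builds):
    """Group build records by the outcome of the preceding build."""
    partitions = defaultdict(list)
    for build in builds:
        partitions[build["prev_result"]].append(build)
    return dict(partitions)


# Hypothetical build records; the field names are assumptions for illustration.
builds = [
    {"id": 1, "prev_result": "pass", "files_changed": 3, "result": "pass"},
    {"id": 2, "prev_result": "pass", "files_changed": 12, "result": "fail"},
    {"id": 3, "prev_result": "fail", "files_changed": 1, "result": "fail"},
    {"id": 4, "prev_result": "fail", "files_changed": 2, "result": "pass"},
]

partitions = partition_by_previous_result(builds)
for prev_result, subset in partitions.items():
    # Each subset would serve as the training data of its own model.
    print(prev_result, [b["id"] for b in subset])
```

Because each partition is smaller than the full dataset, each model is trained on less data, which is in line with the reduced computational overhead noted above.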

DP2) Building: This methodology transforms data to enhance its structure for better compatibility with ML techniques, especially in diverse CI environments. It consists of two sub-groups.

The first sub-group, data augmentation, is the most commonly used method. Researchers employ it to modify data structures or simulate real-world conditions. For instance, in S13, input data is padded with zeros for consistency and easier processing. Techniques like creating graphs or trees based on regression test results (S29, S36, S46) or simulating cloud-based real environments (S12) fall into this sub-group.

The second sub-group, data manipulation, involves methods such as text processing (S23) and merging distinct parts of a dataset by identifying correlations, such as combining issue tracking and VCS (S10).

DP3) Balancing: Many commonly used data sources display imbalances, which can substantially impact the performance and accuracy of classifiers [52]. Addressing this issue involves employing re-sampling techniques or training separate models for each class of input data. Re-sampling encompasses both oversampling, which involves increasing instances in the minority class (e.g., SMOTE), and under-sampling, as demonstrated in S30, which reduces the number of instances in the majority class. Additionally, combining both methods, as exemplified in SMOGN in S37 [53], presents another viable approach.
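The two re-sampling directions can be sketched as follows. This is a simplified illustration: random duplication is used here as a stand-in for SMOTE, which instead synthesizes new minority samples by interpolation. The field name `result` and both helper names are assumptions made for the example.

```python
import random


def group_by_label(rows, label_key):
    """Collect rows into one list per class label."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    return by_class


def random_undersample(rows, label_key, seed=0):
    """Drop majority-class rows at random until all classes have equal size."""
    rng = random.Random(seed)
    by_class = group_by_label(rows, label_key)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    return balanced


def random_oversample(rows, label_key, seed=0):
    """Duplicate minority-class rows at random until all classes have equal size."""
    rng = random.Random(seed)
    by_class = group_by_label(rows, label_key)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    return balanced


# Hypothetical imbalanced build data: 8 passing builds and 2 failing builds.
rows = [{"id": i, "result": "pass"} for i in range(8)]
rows += [{"id": i, "result": "fail"} for i in range(8, 10)]

under = random_undersample(rows, "result")
over = random_oversample(rows, "result")
print(len(under), len(over))
```

Under-sampling discards data but keeps training cheap, whereas over-sampling preserves all observations at the cost of a larger training set.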

DP4) Filtering: Data pruning, a form of data cleaning, is vital in CI, addressing high computational workloads in large-scale datasets and improving the performance of the ML models [54]. Seven studies used this method, e.g., S11 removed unexecuted tests, S51 eliminated errored builds, and S9 removed incomplete test data. Issues like these arise from evolving software practices [55] and human errors [54]. Filtering is crucial in build validation studies, tackling interrupted builds [46]. Selective data filtering can enhance the quality of the training model and reduce training costs by prioritizing data segments that closely resemble the practical environment where the trained ML model will be deployed. For example, in study S24, TravisTorrent was constrained to Java-based projects employing Ant, Maven, and Gradle CI build tools, while Ruby-based projects were excluded.
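A filtering step of this kind might look like the sketch below, which keeps only Java builds and drops builds that ended in an exceptional state. The record fields (`language`, `status`) and the status names are assumptions for illustration; real datasets such as TravisTorrent use their own schemas.

```python
ALLOWED_LANGUAGES = {"java"}
EXCLUDED_STATUSES = {"errored", "canceled"}


def filter_builds(builds):
    """Keep builds in an allowed language whose status is not exceptional."""
    return [
        b
        for b in builds
        if b["language"] in ALLOWED_LANGUAGES and b["status"] not in EXCLUDED_STATUSES
    ]


# Hypothetical build records.
builds = [
    {"id": 1, "language": "java", "status": "passed"},
    {"id": 2, "language": "ruby", "status": "failed"},
    {"id": 3, "language": "java", "status": "errored"},
    {"id": 4, "language": "java", "status": "failed"},
]

kept = filter_builds(builds)
print([b["id"] for b in kept])
```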

Summary: $\bullet$ 12 datasets, nine data types, and ten processing techniques are identified in the reviewed studies. $\bullet$ Clustering datasets can reduce computation overhead and enhance ML model accuracy due to the high volume of CI data. $\bullet$ Addressing class imbalance is crucial to improve ML model accuracy in CI data. $\bullet$ Filtering, addressing exceptional cases like cancelled or errored processes, enhances ML model performance and reduces computational overhead.

4.4 RQ3: Extracted Features

This section delves into the feature types and feature engineering techniques employed in the reviewed studies. Given that ML models are data-driven, the choice of feature types and engineering methods directly impacts the performance of these models [49]. Through thematic analysis, we have categorized Features (FT) into four primary groups and nine sub-categories, presented as follows:

Relational: Individual and Spatial; Statistical: Numerical, Components, and Context; Lexical: Content and Syntactical; and Epochal: Temporal and Narrative.

In the following, the details regarding these features are presented. A brief explanation and examples for these features and the relationship between feature types and CI tasks are presented in Tables 13 and 14, respectively. Note that five studies lacked comprehensive feature descriptions. Additionally, feature combinations that were not employed in any of the selected studies have been excluded from Table 14. These omitted combinations encompass the use of Relational features alone, the combination of Relational and Statistical features, the combination of Relational and Lexical features, and the combination of Statistical and Lexical features.

Table 13: Features employed in the studies and their descriptions

Feature types	Sub groups	Description	Example
Relational	Individual	These features focus on the behaviour and experiences of individuals.	(S1) Percentage of ownership of a developer on a file
	Spatial	These features encompass the communication pathways and interdependencies among distinct components within the space of software system.	(S3) Source code coverage of a testing code
Statistical	Components	These features encapsulate statistical data pertaining to various aspects of software components, offering insights into their quantitative characteristics.	(S3) Depth of inheritance tree (DIT)
	Context	These features represent the statistical information within a code including the minimum, maximum, and average of operations or the complexity of components.	(S44) Number of operands and operators in the committed code
	Numerical	These features are based on calculating straightforward metrics not included in the Components and Context feature types, and they are independent of content.	(S26) Number of commits on a file
Lexical	Content	These features denote the information about the terms and tags in texts including source code, log files, and other text files and reports.	(S8) Determining text similarities in Java codes via TF-IDF
	Syntactical	These features only focus on information about the specific programming language reserved words.	(S15) Number of each Java reserved word in a source code
Epochal	Temporal	These features represent the time-dependent attributes of any software components from last changes until the present time.	(S21) The time gap since the last build
	Narrative	These features represent the historical actions taken and their outcomes leading up to a particular event within the software product.	(S36) The durations of the previous executions of a test case

FT1) Relational: This category encompasses features that depict relationships among elements, such as actions, individuals, and components, influencing the outcomes of CI tasks. It consists of two sub-categories:

Individual: These features focus on individual-related factors, such as developers and test designers, including their experience in software development and file ownership percentages [37]. Although pivotal in predicting build outcomes (S43) and testing results of code commits (S23), only four studies employed these features.

Spatial: Spatial features relate to code concepts like coverage, coupling, inheritance, and cohesion, extractable using tools like IntelliJ Idea [56], Rational Software Analyzer (RSA) [47], and Aniche [57] with low computational overhead. Notably, three studies utilized the Chidamber and Kemerer (CK) indices [58]. This feature type found diverse applications, including predicting branch coverage (S3, S15), identifying code defects (S9), detecting flaky tests (S40), predicting build outcomes (S44), and optimizing test cases (S36, S39, S45, S46).

FT2) Statistical:

These features are computed from statistical data. More broadly, they capture the complexity and scope of projects or committed changes. Table 14 highlights the usage of these features in the Build Prediction task, primarily due to their lower computational overhead, especially in the case of the Numerical feature type. Notably, because build prediction covers all tests, it involves large volumes of data from numerous CI cycles, so minimizing computational resources is a serious concern in this task.

Components: Statistical insights into software product composition, such as the number of concrete and abstract classes, functions, and software package dependencies, are provided by Components features. They are useful in predicting branch coverage (S3, S15), defect prediction (S32), build outcomes (S42), flaky test detection (S40), and test optimization based on mutation testing (S29).

Context: This category yields statistical properties offering information about the complexity of software and testing source code. It computes attributes from the source code, including Osmax and Osavg (maximum and average operation size), WMC (weighted method complexity), NOAC (number of operations added), Ocmax and Ocavg (maximum and average operation complexity), Opavg (average operation parameters), CSO (class size operations), CSOA (class size operations attributes), CSA (class size attributes), Query (number of queries), NAAC (number of attributes added), NOIC (number of operations inherited), NOOC (number of operations overridden). It also incorporates the well-known Halstead metrics designed to show program complexity by examining its operators and operands (N, E, V, D, B, n) [59]. These features are prevalently used in predicting build outcomes (S27, S42, S43, S44), test branch coverage (S15), and detecting flaky tests (S40).

Numerical: Known for computational simplicity, Numerical features are utilized in 23 out of 52 studies for tasks like test optimization (S22, S26, S29, S30, S36, S39), build prediction (S1, S5, S6, S21, S24, S25, S42, S43, S44), test outcome prediction (S23), code defect prediction (S9, S17, S32), CI data management (S10), and flaky test detection (S40). These features encompass various attributes like counts of code line changes, files, sub-systems, classes, methods, active authors on the same file, commits, pull requests in VCS, and comments in source codes and VCS.
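To make the Halstead metrics mentioned under Context features concrete, the sketch below computes them from pre-tokenized operators and operands using the standard textbook definitions. The tokenization is done by hand here and is an assumption of the example; the reviewed studies may rely on dedicated tools. The bug estimate B = V / 3000 is one common variant of the formula.

```python
import math


def halstead_metrics(operators, operands):
    """Compute basic Halstead metrics from lists of operator and operand tokens."""
    n1, n2 = len(set(operators)), len(set(operands))  # distinct operators/operands
    N1, N2 = len(operators), len(operands)            # total operators/operands
    n = n1 + n2                      # vocabulary
    N = N1 + N2                      # length
    V = N * math.log2(n)             # volume
    D = (n1 * N2) / (2 * n2)         # difficulty (assumes at least one operand)
    E = D * V                        # effort
    B = V / 3000                     # estimated delivered bugs (common variant)
    return {"n": n, "N": N, "V": V, "D": D, "E": E, "B": B}


# Tokens of the statement `z = x + y * x`, split by hand for illustration.
operators = ["=", "+", "*"]
operands = ["z", "x", "y", "x"]

metrics = halstead_metrics(operators, operands)
print(metrics)
```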

Table 14: Correlation between CI tasks and feature types. Color intensity in the table shows numerical values, with lighter shades indicating lower and darker blues representing higher values. The bolded numbers represent the count of studies and their percentage in total. The \faCircle symbol marks the feature types that have been used, while the \faCircleO symbol marks the feature types that are not used. Acronyms: UT: Unit Test, IT: Integration Test, RT: Regression Test, BV: Build Validation, ST: System Test and PM: Process Management.

Feature Types

Relational	Statistical	Lexical	Epochal	Unit Test Prediction	Branch Coverage Prediction	Integration Test Prediction	Test Optimization	Defect Prediction	Flaky Test Detection	Build Prediction	Installed Software Discovery	Performance Test Optimization	Activity Management	Number of studies

\faCircleO

\faCircle

\faCircleO

–

\CenterstackS29 (1, 1.9%)

–

\CenterstackS5, S25, S42,

S44 (4, 7.7%)

–

\CenterstackS10

(1, 1.9%)

\Centerstack6

11.5%

\cdashline1-15 \faCircleO

\faCircleO

\faCircle

\faCircleO

\CenterstackS2

(1, 1.9%)

–

\CenterstackS18

(1, 1.9%)

\CenterstackS47

(1, 1.9%)

\CenterstackS13

(1, 1.9%)

–

\CenterstackS12

(1, 1.9%)

–

\CenterstackS49

(1, 1.9%)

\Centerstack6

11.5%

\cdashline1-15 \faCircleO

\faCircleO

\faCircle

–

\CenterstackS7, S11, S14, S16,

S19, S28, S31, S33,

S35, S37, S48, S50

(12, 23.1%)

–

\CenterstackS4

(1, 1.9%)

–

\Centerstack13

25%

\cdashline1-15 \faCircle

\faCircleO

\faCircle

–

\CenterstackS45, S46

(2, 3.8%)

\CenterstackS17, S32

(2, 3.8%)

–

\Centerstack4

7.7%

\cdashline1-15 \faCircleO

\faCircle

\faCircleO

\faCircle

–

\CenterstackS22, S52

(2, 3.8%)

–

\CenterstackS6, S21, S24,

S27 (4, 7.7%)

–

\Centerstack6

11.5%

\cdashline1-15 \faCircleO

\faCircleO

\faCircle

–

\CenterstackS8

(1, 1.9%)

–

\Centerstack1

1.9%

\cdashline1-15 \faCircle

\faCircle

\faCircleO

–

\CenterstackS15

(1, 1.9%)

–

\Centerstack1

1.9%

\cdashline1-15 \faCircle

\faCircle

\faCircleO

\faCircle

–

\CenterstackS36, S39

(2, 3.8%)

\CenterstackS9

(1, 1.9%)

\CenterstackS40

(1, 1.9%)

–

\Centerstack4

7.7%

\cdashline1-15 \faCircle

\faCircleO

\faCircle

–

\CenterstackS1, S43

(2, 3.8%)

–

\Centerstack2

3.8%

\cdashline1-15 \faCircleO

\faCircle

–

\CenterstackS26

(1, 1.9%)

–

\Centerstack1

1.9%

\cdashline1-15 \faCircle

\faCircle

–

\CenterstackS3

(1, 1.9%)

\CenterstackS23

(1, 1.9%)

\CenterstackS30

(1, 1.9%)

–

\Centerstack3

5.8%

FT3) Lexical: Software project source code and log files often contain valuable hidden information, extractable through lexical analysis, which involves two distinct approaches:

Content: Content-based feature extraction methods are employed in 10 out of 52 studies. Their popularity stems from direct data extraction without computationally intensive processes. This feature is versatile as it is related to text analysis, making it applicable to all programming languages through various textual analysis techniques. These techniques include TF-IDF (Term Frequency–Inverse Document Frequency) (S8, S23, S49), lexical tokenization (S2, S12, S26, S30), Bag-of-Words (S13, S47), and Word Embedding (S13).

Syntactical: These features are used in two CI studies, specifically for estimating branch coverage in automated testing (S3, S15). Analysis in this category is based on a predefined list of words, often reserved keywords, making them suitable for extracting semantic information from data.
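The two lexical approaches can be illustrated with a short sketch: counting reserved words (Syntactical) and computing TF-IDF weights (Content). The keyword set is a small illustrative subset of Java reserved words, the tokenizer is deliberately simplistic, and the TF-IDF variant (term frequency normalized by document length, unsmoothed logarithmic IDF) is one of several common formulations; all three are assumptions of this example rather than the setup of any reviewed study.

```python
import math
import re
from collections import Counter

# Illustrative subset of Java reserved words (assumption for the example).
JAVA_KEYWORDS = {"if", "else", "for", "while", "return", "class", "public"}


def tokenize(text):
    """Split text into lowercase identifier-like tokens."""
    return re.findall(r"[A-Za-z_]\w*", text.lower())


def keyword_counts(source):
    """Syntactical feature: occurrences of each reserved word."""
    return dict(Counter(t for t in tokenize(source) if t in JAVA_KEYWORDS))


def tf_idf(documents):
    """Content feature: one {term: weight} dictionary per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["if (x) return y;", "for (i) return i;"]
vectors = tf_idf(docs)
print(keyword_counts(docs[0]))
print(vectors[1])
```

Terms that occur in every document, such as `return` here, receive a weight of zero, which is why TF-IDF tends to emphasize tokens that distinguish one file or log from another.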

FT4) Epochal: Epochal features, tied to temporal and narrative information, dominate with 35 out of 52 CI studies incorporating them. This prevalence is particularly notable in studies emphasizing test case optimization, underlining the significance of analyzing prior execution history for optimizing test case execution.

Temporal: Temporal features cannot be used on their own for decision-making and model training, but they reveal hidden temporal patterns in data when combined with other features. They encompass work habits (time, day, and month of actions) (S21, S32), time intervals between the current and the previous event (S21, S43), and the duration of the last event (S37), and thus concentrate on events immediately preceding an occurrence. This differs from the broader historical context considered by “Narrative” features. Temporal features have also contributed to research on order-dependent and non-order-dependent flaky tests (S40).

Narrative: Narrative features, present in 32 out of 52 studies, are extensively utilized due to their ability to encapsulate valuable insights from past experiences and lessons learned in ML-based CI enhancements [60]. For example, in test case prioritization, the likelihood of test cases failing in the future is higher if they have previously failed, justifying their elevated priority in subsequent tests [61]. Narrative features include attributes like changes in files or text, outcomes of prior test executions, the ratio of failed-to-pass tests, and test execution durations. Notably, some of these features are straightforward to calculate, such as the Failure Distance attribute (FD, the number of builds since the last failed build) used in S21.
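Two of the Narrative features named above can be derived from an ordered outcome history, as in the following sketch. The convention that FD is 0 when the most recent build failed, and `None` when no failure has been observed, is an assumption of this example; the reviewed studies may define the offset differently.

```python
def failure_distance(history):
    """Number of builds since the last failed build.

    `history` is ordered from oldest to newest. Returns 0 if the latest
    build failed and None if no failure has been observed.
    """
    for distance, outcome in enumerate(reversed(history)):
        if outcome == "fail":
            return distance
    return None


def failure_rate(history):
    """Share of past executions that failed (0.0 for an empty history)."""
    if not history:
        return 0.0
    return sum(1 for outcome in history if outcome == "fail") / len(history)


# Hypothetical outcome history of one build pipeline or test case.
history = ["pass", "fail", "pass", "pass"]

fd = failure_distance(history)
rate = failure_rate(history)
print(fd, rate)
```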

Table 14 shows that relational features are often combined with other feature types, with only three studies employing all four feature types in their analysis. Additionally, lexical features find extensive application across various tasks.

4.4.1 Feature Engineering Techniques

Feature engineering is a crucial step in ML model development [26]. Its significance lies in two key factors. First, feature engineering techniques significantly impact the accuracy of trained ML models [62]. Second, these techniques transform raw data into a format that is interpretable and meaningful for ML models [26].

In our analysis of the reviewed papers, we identified five groups of feature engineering (FE) techniques. However, among the 52 studies in this systematic literature review, 18 did not provide details on their feature engineering techniques.

Table 15: Relation between the most frequently used datasets and feature engineering techniques. Note: Studies using the MyBatis, Google Closure, Google Auto, and Dspace datasets did not report any feature engineering techniques, so these datasets are not presented in this table. The bolded numbers represent the count of studies and their percentage in total.

Datasets	\CenterstackImputation
and
Elimination	\CenterstackFeature
Enhancement
and Labeling	\CenterstackOutliers
Handling	\CenterstackFeature
Scaling	\CenterstackFeature
Tagging and
Encoding
IOF/ROL	\CenterstackS11, S37
(2, 3.8%)	–	–	\CenterstackS19, S22, S37
(3, 5.8%)	\CenterstackS19
(1, 1.9%)
Paint Control	\CenterstackS11, S37
(2, 3.8%)	–	–	\CenterstackS19, S22, S37
(3, 5.8%)	\CenterstackS19
(1, 1.9%)
Apache Commons	\CenterstackS10, S15
(2, 3.8%)	\CenterstackS10
(1, 1.9%)	–	\CenterstackS3, S15, S19, S22
(4, 7.7%)	\CenterstackS19
(1, 1.9%)
GSDTSR	\CenterstackS37
(1, 1.9%)	–	–	\CenterstackS22, S37
(2, 3.8%)	–
Google Guava	\CenterstackS15
(1, 1.9%)	–	–	\CenterstackS3, S15
(2, 3.8%)	–
Rails	–	–	–	\CenterstackS22
(1, 1.9%)	–
TravisTorrent	\CenterstackS5
(1, 1.9%)	\CenterstackS24
(1, 1.9%)	\CenterstackS21
(1, 1.9%)	\CenterstackS24
(1, 1.9%)	–
Google Dagger	\CenterstackS15
(1, 1.9%)	–	–	\CenterstackS3, S15
(2, 3.8%)	–

FE1) Imputation and Elimination: This technique addresses missing values in a dataset by replacing them with calculated or similar values, preventing data loss due to missing features [63]. In CI, it is particularly useful for handling exceptions. For example, S9 calculated the LOC difference when the “Number of Modified Lines” was unavailable. It can also derive values by considering interdependencies among data rows, such as finding the start and end times of test cases based on the log of previous execution times. Moreover, to reduce features and lower ML model training costs, elimination methods like the chi-square method can be used to remove closely related features. For instance, authors retained one of two similar features: the number of modified lines or added lines per data entry [53]. Given the large CI data volume, dimensionality reduction techniques, as seen in S13, can further reduce computational overhead and input features in ML methods.
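The sketch below illustrates both halves of this technique under stated assumptions. The imputation step fills a missing `modified_lines` value with the absolute LOC difference, loosely following the S9 example. The elimination step drops one of two features when their Pearson correlation is high; this is a simpler stand-in for the chi-square method mentioned above, not an implementation of it. All field names and the 0.95 threshold are hypothetical.

```python
import math


def impute_modified_lines(rows):
    """Fill missing `modified_lines` with the absolute LOC difference."""
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input is left untouched
        if row["modified_lines"] is None:
            row["modified_lines"] = abs(row["loc_after"] - row["loc_before"])
        filled.append(row)
    return filled


def pearson(xs, ys):
    """Pearson correlation; assumes neither feature is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def drop_redundant(rows, feature_a, feature_b, threshold=0.95):
    """Drop `feature_b` when it is strongly correlated with `feature_a`."""
    r = pearson([row[feature_a] for row in rows], [row[feature_b] for row in rows])
    if abs(r) >= threshold:
        return [{k: v for k, v in row.items() if k != feature_b} for row in rows]
    return rows


# Hypothetical commit records with one missing value.
rows = [
    {"modified_lines": 10, "added_lines": 9, "loc_before": 100, "loc_after": 110},
    {"modified_lines": None, "added_lines": 19, "loc_before": 200, "loc_after": 220},
    {"modified_lines": 30, "added_lines": 29, "loc_before": 300, "loc_after": 330},
]

filled = impute_modified_lines(rows)
reduced = drop_redundant(filled, "modified_lines", "added_lines")
print(reduced[1])
```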

FE2) Feature Enhancement and Labeling: In CI environments, the presence of diverse data types and their high volume poses a challenge for data analysis. To tackle this challenge, various strategies are employed to enhance data comprehensibility for training ML models and improving their performance. In S10, a new labeling scheme categorized software management profile activities into H (high), M (medium), and L (low). S46 categorized test cases as entirely or partially redundant based on coverage analysis, augmenting data with additional features. This methodology allows for enriching datasets with contextual information; for instance, S37 labeled commit dates to indicate holidays or regular days. Additionally, S16 introduced a new feature representing the percentage of detected failures relative to the total failures for each test case.

FE3) Outliers Handling: In CI environments, unexpected data may arise due to errors or non-repeatable situations, often attributed to human errors or specified limitations. For instance, test case durations may be constrained, and exceeding this time limit results in exceptional values, known as outliers. Outliers can be detected by defining thresholds or cutoff parameters, as demonstrated in S13 and S18.
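Threshold-based handling of this kind can be sketched in a few lines. The 600-second cutoff and the choice between dropping and clipping outliers are assumptions for illustration; the reviewed studies do not prescribe these values.

```python
def split_outliers(durations, cutoff):
    """Separate values at or below a fixed cutoff from those above it."""
    normal = [d for d in durations if d <= cutoff]
    outliers = [d for d in durations if d > cutoff]
    return normal, outliers


def clip_outliers(durations, cutoff):
    """Alternative to removal: cap every value at the cutoff."""
    return [min(d, cutoff) for d in durations]


# Hypothetical test durations in seconds, with an assumed 600-second limit.
durations = [12, 45, 3600, 30, 601]

normal, outliers = split_outliers(durations, cutoff=600)
clipped = clip_outliers(durations, cutoff=600)
print(normal, outliers, clipped)
```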

FE4) Feature Scaling: Feature values exhibit diverse distributions and ranges, necessitating normalization through statistical methods. This technique rescales data to fit within a specific range (commonly [0, 1]) or standardizes values to the same magnitude using methods like log transformation, which mitigates the impact of extremely high or low values. For instance, S3 applied log transformation in conjunction with z-score, although they did not specify which feature required this transformation. Feature scaling is widely employed, with 20 out of 34 studies explaining their feature engineering techniques using this method, primarily due to the diversity of feature values and the prevalence of numerical features in CI environments.
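The two scaling approaches mentioned above can be sketched as follows. Min-max scaling maps values to [0, 1], and the second helper applies a log transformation followed by a z-score. Using `log1p` (so that zero values are handled) and the population standard deviation are assumptions of this example, since S3 does not specify these details.

```python
import math
import statistics


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def log_z_score(values):
    """Apply log(1 + x), then standardize to zero mean and unit variance."""
    logged = [math.log1p(v) for v in values]
    mean = statistics.fmean(logged)
    std = statistics.pstdev(logged)
    if std == 0:
        return [0.0 for _ in logged]
    return [(v - mean) / std for v in logged]


# Hypothetical feature: lines changed per commit, with one extreme value.
lines_changed = [2, 4, 10, 5000]

scaled = min_max_scale([2, 4, 10])
standardized = log_z_score(lines_changed)
print(scaled, standardized)
```

The log transformation compresses the extreme value before standardization, so it no longer dominates the scaled feature.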

FE5) Feature Tagging and Encoding: CI datasets often involve nominal values, such as test results like “pass”, “fail”, or “canceled”. To facilitate the training of ML models, input data must be made comprehensible, as these models rely on mathematical formulas [27]. One approach is to assign 0 and 1 values to class labels or represent feature values with binary strings or tags based on predefined rules [62, 64].

While encoding applies to various data types, it is particularly common in text-based datasets, involving the assignment of tokens or tags to different text segments. Effective tokenization requires a thorough understanding of the data and the selection of appropriate tokens for each distinct group or category. For example, in S13, authors defined specific tokens (tags) for various groups in source code, such as variables, operands, data types, spaces, and more.
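The encoding and tagging ideas can be illustrated with the sketch below, which encodes nominal test results as integers or one-hot vectors and assigns coarse tags to source code tokens. The tag names (`<TYPE>`, `<OP>`, `<VAR>`, `<OTHER>`) and the token sets are hypothetical and are not the scheme used in S13.

```python
LABELS = ["pass", "fail", "canceled"]

# Hypothetical token groups for a Java-like language.
DATA_TYPES = {"int", "float", "boolean", "String"}
OPERATORS = {"+", "-", "*", "/", "="}


def label_encode(values, labels=LABELS):
    """Map each nominal value to its integer index."""
    index = {label: i for i, label in enumerate(labels)}
    return [index[v] for v in values]


def one_hot(values, labels=LABELS):
    """Map each nominal value to a binary indicator vector."""
    return [[1 if v == label else 0 for label in labels] for v in values]


def tag_token(token):
    """Assign a coarse tag to a source code token."""
    if token in DATA_TYPES:
        return "<TYPE>"
    if token in OPERATORS:
        return "<OP>"
    if token.isidentifier():
        return "<VAR>"
    return "<OTHER>"


results = ["pass", "fail", "pass", "canceled"]
encoded = label_encode(results)
vectors = one_hot(results)
tags = [tag_token(t) for t in "int x = y + 1".split()]
print(encoded, vectors, tags)
```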

In Table 15, we illustrate the relationship between feature engineering methods and the most commonly used datasets in the literature. It is noteworthy that certain datasets, in comparison with those presented in Table 9, are omitted here because the referenced studies did not report their feature engineering methods.

Summary: $\bullet$ Narrative, Numerical, Content, and Temporal feature types are used more frequently in the selected studies in comparison with other feature types. $\bullet$ Tools like IntelliJ Idea, RSA, and Aniche can extract features from code with low computational overhead. $\bullet$ New features can be defined through more complex computations and raw code analysis such as tagging or defining tree-based structures. $\bullet$ Feature engineering techniques like scaling, outlier handling, and elimination are still useful in CI environments. They cut computational load and boost ML model accuracy and performance.

4.5 RQ4: Model Training and Tuning

The performance and training time of ML methods, as well as their interactions with input data, are influenced by their inherent characteristics and predefined parameters [65]. In this section, our emphasis is on improving ML model training in the context of CI and refining algorithms through hyperparameter tuning.

4.5.1 Types of Learning Algorithms

Various ML algorithms, including supervised, unsupervised, semi-supervised, and reinforcement learning, play vital roles in ML-based approaches. Brief introductions to these methods are presented in the following.

Supervised Learning: This approach involves training a model on labelled data [17], requiring human effort or pre-existing data labelling before model training [9]. Supervised learning comprises two primary classes: classification, where the model categorizes outputs into fixed or discrete classes, and regression, where the model predicts continuous values [66].

Table 16: Mapping ML Algorithms to CI Tasks. The table color intensity represents study counts, with lighter shades indicating fewer studies and darker blues representing higher counts. The bolded numbers represent the count of studies and their percentage in total. Acronyms: NN: Neural Network, SVM: Support Vector Machine, DT: Decision Tree, RL: Reinforcement Learning, KM: K-Means, KNN: K-Nearest Neighbors, LR: Linear Regression, NB: Naive Bayes, TL: Transfer learning, UT: Unit Test, IT: Integration Test, RT: Regression Test, BV: Build Validation, ST: System Test and PM: Process Management.

\Centerstack Algorithms	\Centerstack [l]Unit Test Prediction	\Centerstack [l]Branch Coverage Prediction	\Centerstack [l]Integration Test Prediction	Test Optimization	Defect Prediction	\Centerstack [l]Flaky Test Detection	Build Prediction	\Centerstack [l]Installed Software Discovery	\Centerstack [l]Performance Test Optimization	\Centerstack [l]Activity Management	\Centerstack [l]Total number of studies
	UT	IT		RT			BV	ST		PM
NN	\CenterstackS2
(1, 1.9%)	\CenterstackS3, S15
(2, 3.8%)	\CenterstackS18
(1, 1.9%)	\CenterstackS7, S11, S26, S37,
S39, S41 (6, 11.5%)	\CenterstackS13, S32
(2, 3.8%)	–	\CenterstackS38
(1, 1.9%)	–	\CenterstackS4
(1, 1.9%)	–	\Centerstack14
26.9%
Huber	–	\CenterstackS3
(1, 1.9%)	–	–	–	–	–	–	–	–	\Centerstack1
1.9%
SVM	–	\CenterstackS3, S15
(2, 3.8%)	–	\CenterstackS8, S26, S36
(3, 5.8%)	\CenterstackS32
(1, 1.9%)	–	–	–	–	\CenterstackS49
(1, 1.9%)	\Centerstack7
13.5%
DT	–	\CenterstackS15
(1, 1.9%)	\CenterstackS18, S23
(2, 3.8%)	\CenterstackS16, S26, S29, S30,
S36, S46, S47
(7, 13.5%)	\CenterstackS9, S13, S32
(3, 5.8%)	\CenterstackS40
(1, 1.9%)	\CenterstackS1, S5, S21, S24,
S25, S27, S42,
S43, S44, S51
(10, 19.2%)	–	–	\CenterstackS49
(1, 1.9%)	\Centerstack25
48.1%
RL	–	–	–	\CenterstackS7, S14, S20, S24,
S27, S28, S31, S33,
S34, S35, S50, S48
(12, 23.1%)	–	–	–	\CenterstackS12
(1, 1.9%)	–	–	\Centerstack13
25%
KM	–	–	–	\CenterstackS45
(1, 1.9%)	–	–	–	–	–	\CenterstackS10
(1, 1.9%)	\Centerstack2
3.8%
KNN	–	–	–	\CenterstackS16, S30
(2, 3.8%)	–	–	\CenterstackS5
(1, 1.9%)	–	–	–	\Centerstack3
5.8%
LR	–	–	–	\CenterstackS16, S26, S30
(3, 5.8%)	\CenterstackS32
(1, 1.9%)	–	\CenterstackS1, S5
(2, 3.8%)	–	–	–	\Centerstack6
11.5%
NB	–	–	–	\CenterstackS16
(1, 1.9%)	\CenterstackS32
(1, 1.9%)	–	\CenterstackS5
(1, 1.9%)	–	–	\CenterstackS49
(1, 1.9%)	\Centerstack4
7.7%
TL	–	–	–	\CenterstackS52
(1, 1.9%)	\CenterstackS17
(1, 1.9%)	–	–	–	–	–	\Centerstack2
3.8%

Unsupervised Learning: Unlike supervised learning, unsupervised learning algorithms do not require data labeling before training. These models uncover data relationships and cluster data points [17]. Unsupervised clustering algorithms are particularly suited for large datasets where manual data labeling efforts are impractical [17].

Semi-supervised Learning: Semi-supervised learning algorithms leverage labeled data to classify unlabeled data by identifying underlying data relationships [67], making them more practical when dealing with imprecise or noisy datasets containing both labeled and unlabeled data [68].

Reinforcement Learning (RL): RL algorithms learn through trial and error, receiving rewards or penalties at each step until they achieve the desired output or accuracy [66]. RL algorithms benefit from continuous model updates, making them suitable for dynamically changing CI environments and data.

Among the primary studies, supervised learning is the most commonly employed approach, with 44 out of 52 studies utilizing it. Unsupervised learning is the second most used, with five out of 52 studies employing this approach, and one study using both supervised and unsupervised ML methods. Additionally, four studies applied semi-supervised learning algorithms, while 12 studies employed RL algorithms. Our observation also indicates a significant increase in the usage of RL algorithms in “Test Optimization” tasks since 2020. Notably, two out of the three state-of-the-art methods in this task (COLEMAN and RETECS, see Table 7) are RL-based methods, underscoring the growing applicability of RL-based algorithms in the realm of test optimization.

Our analysis also reveals that classification algorithms are the most widely employed methods in the CI context, with 32 out of 52 studies utilizing them. This preference can be attributed to the characteristics of testing, where ML-based binary classifiers excel in predicting test outcomes and build results without the need for explicit execution. Additionally, one study (S49) in the CI environment employed multi-class classification methods to categorize reported issues into five classes, aiding developers in identifying and addressing bug-related issues based on the most frequent labels in the training dataset.

Table 16 demonstrates that Decision Tree (DT) algorithms are the prevailing classification method in the CI environment, with 25 out of 52 studies employing them. The popularity of DT algorithms can be attributed to three factors, two of which are derived from the selected studies, while the third is drawn from the literature.

First, CI environments continuously generate vast amounts of data, and ML models require frequent updates [69, 70]. Training and updating DT algorithms demand low computational resources, making them a feasible choice in CI settings.

Second, DT algorithms have high performance in classifying unseen data [47].

Third, DT algorithms are interpretable and easily comprehensible for human users [71].

Table 16 shows that, except for one study (S12), RL algorithms have primarily been used in regression testing (RT) tasks within CI environments. In RT, a predefined set of test cases is established, and new test cases are incrementally added as new features are developed for the final software product. The objective of RT is to identify test cases with failure outcomes and prioritize their execution before those with passing outcomes. RL algorithms are well-suited for RT because this CI phase can be formulated as a sequential decision-making problem, a key characteristic of RL-based solutions [44]. Moreover, RL algorithms adapt to new data more effectively with frequent updates compared to other supervised and unsupervised ML algorithms [62].

However, researchers and practitioners should weigh the advantages and disadvantages of RL-based models. For instance, the ROCKET solution [72] faces scalability issues in the form of long runtimes, while the RETECS method [73], an RL-based solution, requires a substantial amount of time for training.

Alongside the widespread adoption of DT and RL algorithms in regression testing, Table 16 highlights the utilization of Neural Network (NN) algorithms in 14 studies. Remarkably, NN algorithms have been applied in 7 out of 10 CI tasks, underscoring their flexibility in addressing various challenges within CI. The appeal of NN algorithms, despite their need for significant computational resources during training, lies in their capacity to automatically extract features from datasets and their high predictive accuracy [70].

4.5.2 Hyper-Parameter Tuning

In general, ML models comprise a basic formula that requires configuration by identifying the optimal values for their hyper-parameters [65]. Selected studies present varying perspectives on hyperparameter tuning. Some emphasize the importance of hyperparameter tuning in achieving the optimal performance of trained models [43], considering skipping this process as a potential threat that can affect model accuracy [6, 45]. However, other studies, like [74], found that tuning hyperparameters did not significantly impact model accuracy. Additionally, Al-Sabbagh et al. [41] reported that automatic hyperparameter tuning tools can be time-consuming, prompting them to manually tune hyperparameters to save time.

It is worth noting that hyperparameter tuning can affect model training speed. For instance, adjusting the ‘training rate’ hyperparameter can either shorten training at the cost of greater model instability, or slow training down while making model performance more stable [27]. We therefore investigate how the selected studies, which apply ML algorithms within CI environments, performed hyperparameter tuning.

Based on the data extracted from the selected studies, we categorized the hyper-parameter tuning strategies (HT) into five groups.

HT1) The first group used searching methods to find the best hyperparameters. These studies employed methods such as Grid Search (GS), Bayesian Search (BS), and Genetic Algorithm (GA).

In GS, researchers fix the domain of hyperparameters, and the algorithm finds the best combination from these fixed values [75]. In BS, the next hyperparameter values are determined based on the evaluation results of the previous values, avoiding unnecessary evaluations [76]. In GA, hyperparameters are adjusted iteratively by evaluating the model’s performance and making slight changes [77].
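Grid Search, as described above, can be sketched as an exhaustive loop over a fixed hyperparameter domain. The `toy_score` function is a hypothetical stand-in for a real validation score (such as cross-validated accuracy), and the hyperparameter names are chosen only for illustration.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every combination in the grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in for a validation score; peaks at max_depth=5, min_samples_leaf=2."""
    return -((params["max_depth"] - 5) ** 2) - (params["min_samples_leaf"] - 2) ** 2


# Fixed domain of hyperparameter values chosen by the researcher.
grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 2, 4]}

best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The number of evaluations grows multiplicatively with each added hyperparameter, which is the cost that Bayesian Search tries to avoid by choosing the next candidate from previous results.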

In S38, a unique Genetic Algorithm-based hyperparameter tuning method was utilized. Random values were assigned to each hyperparameter, and a single-point crossover operator generated two new offspring in the Crossover step. The best solutions were retained using the elitism method and fitness function calculation. In the Mutation step, hyperparameter values were slightly modified, and the Crossover step was repeated. This process continued until it reached the predetermined stopping criteria. Table 17 details the hyperparameter tuning methods used for ML models, acknowledging that one tuning method might be applied to multiple ML models in studies utilizing more than one ML method.
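The genetic procedure described above can be sketched roughly as follows. This is not the implementation of S38: the `fitness` function is a hypothetical stand-in for a validation score, and the bounds, population size, elite size, mutation step, and fixed generation budget are all assumptions chosen for illustration.

```python
import random

BOUNDS = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]  # hypothetical ranges for three hyperparameters


def fitness(individual):
    """Hypothetical stand-in for a validation score; peaks when every gene is 0.3."""
    return -sum((gene - 0.3) ** 2 for gene in individual)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover producing two offspring."""
    point = rng.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])


def mutate(individual, rng, step=0.05):
    """Slightly perturb one randomly chosen gene, clipped to its bounds."""
    child = list(individual)
    i = rng.randrange(len(child))
    low, high = BOUNDS[i]
    child[i] = min(high, max(low, child[i] + rng.uniform(-step, step)))
    return child


def genetic_search(pop_size=20, elite=4, generations=30, seed=0):
    rng = random.Random(seed)
    # Random initial values for each hyperparameter.
    population = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(pop_size)]
    history = []
    for _ in range(generations):  # stopping criterion: fixed generation budget
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[:elite]  # elitism keeps the best solutions unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            for child in crossover(a, b, rng):
                children.append(mutate(child, rng))
        population = (survivors + children)[:pop_size]
    return max(population, key=fitness), history


best, history = genetic_search()
print(best, history[-1])
```

Because the elite individuals are carried over unchanged, the best fitness recorded in `history` never decreases from one generation to the next.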

Table 17: Summary of employed hyper-parameter strategies and mapping to the ML methods. Color intensity corresponds to study counts, with lighter shades indicating lower and darker blues representing higher counts. The bolded numbers represent the count of studies and their percentage in total. Acronyms: NN: Neural Network, SVM: Support Vector Machine, DT: Decision Tree, RL: Reinforcement Learning, KM: K-Means, KNN: K-Nearest Neighbors, LR: Linear Regression, NB: Naive Bayes, TL: Transfer Learning, GS: Grid Search, GA: Genetic Algorithm, and BS: Bayesian Search.

Alg.	Search methods	Default	Literature	Test-and-Trial	Formula	Did not report
NN	S15(GS), S38(GA), S39(BS) (3, 5.8%)	S26 (1, 1.9%)	S7 (1, 1.9%)	S2, S3, S4, S13, S18, S37, S41 (7, 13.4%)	S11 (1, 1.9%)	S32 (1, 1.9%)
Huber	–	–	–	S3 (1, 1.9%)	–	–
SVM	S15(GS) (1, 1.9%)	S26 (1, 1.9%)	–	S3, S49 (2, 3.8%)	–	S8, S32, S36 (3, 5.7%)
DT	S15(GS), S29(GS), S30(BS) (3, 5.7%)	S17, S23, S25, S26, S47, S51 (6, 11.5%)	–	S9, S13, S49, S18, S46, S16 (6, 11.5%)	S24, S44 (2, 3.8%)	S1, S5, S21, S27, S32, S36, S40, S42, S43 (9, 17.3%)
RL	–	S19, S35 (2, 3.8%)	S20, S48, S7, S28, S34 (5, 9.6%)	S14, S22 (2, 3.8%)	–	S12, S31, S33 (3, 5.8%)
KM	–	S10 (1, 1.9%)	–	–	–	S45 (1, 1.9%)
KNN	S30(BS) (1, 1.9%)	–	–	S16 (1, 1.9%)	–	S5 (1, 1.9%)
LR	S30(BS) (1, 1.9%)	S26 (1, 1.9%)	–	S16 (1, 1.9%)	–	S1, S5, S32 (3, 5.8%)
NB	–	–	–	S49 (1, 1.9%)	–	S5, S32 (2, 3.8%)
TL	–	S17, S52 (2, 3.8%)	–	–	–	–

In Table 17, the hyperparameter tuning methods used for ML models are detailed. Considering that several studies employed more than one ML method, a single tuning method might be applied to multiple ML models.

HT2) Among the 52 reviewed studies, 10 papers opted for the Default hyperparameter values, determined by ML training libraries like scikit-learn. This approach allows efficient training without investing time in hyperparameter tuning. Default hyperparameters are commonly established through expert knowledge, serving as a dependable initial configuration and aiding in the prevention of overfitting issues [78].

HT3) Four studies reduced ML model training time by utilizing hyperparameters tuned from Literature. Given the frequent retraining needs of ML models in CI environments, both HT2 and HT3 methods contribute to significant time savings.

HT4) The largest group of studies that reported a strategy (11 out of 52) manually explored hyperparameter values for ML models, determining optimal settings through the Test and Trial method. Similar to Random Search (RS), this method allows parallel and independent evaluation of candidate hyperparameters, facilitating quicker identification of suitable solutions in the agile CI environment, even before the arrival of new data [79].

HT5) The fifth strategy involves defining a Formula for hyperparameter tuning, observed in three studies: S44 calculated the Hoeffding bound for navigating Hoeffding decision tree nodes based on observations and confidence parameters; S11 determined memory units for LSTM, Sigmoid, and Tanh using a specified formula; and S24 adjusted the K value for clustering build logs with the K-Means algorithm as $\sqrt{n/2}$ during data preparation, where n is the number of build logs, without providing evidence for this assumption.
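To make the formula-based strategy concrete, the sketch below computes the $\sqrt{n/2}$ rule for K and the textbook form of the Hoeffding bound, $\epsilon=\sqrt{R^{2}\ln(1/\delta)/(2n)}$. The function names, the rounding of K to an integer, and the example values are assumptions; the exact parameterization used in S44 and S24 is not reported here.

```python
import math


def kmeans_k_rule_of_thumb(n_logs):
    """K = sqrt(n / 2); rounding to an integer >= 1 is an assumption of this sketch."""
    return max(1, round(math.sqrt(n_logs / 2)))


def hoeffding_bound(value_range, delta, n_observations):
    """epsilon = sqrt(R^2 * ln(1 / delta) / (2 * n)) for range R and confidence 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n_observations))


print(kmeans_k_rule_of_thumb(200))       # sqrt(200 / 2) = 10
print(hoeffding_bound(1.0, 1e-7, 1000))  # shrinks as more observations arrive
```

In a Hoeffding tree, a node is typically split once the gap between the two best candidate attributes exceeds this bound, so more observations allow finer distinctions.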

Notably, 14 studies did not report how they adjusted hyperparameters in their studies.

While hyperparameter tuning enhances model accuracy, it may lead to overfitting [80]. Overfitting occurs when a model learns excessively from the training data, such as by increasing the number of layers in neural networks. This includes capturing noise or random patterns, which hinders its performance with new data and reduces its generalizability [81]. To address this issue, researchers commonly employ K-fold cross-validation [46]. This technique involves iterative training and evaluation on different data subsets, allowing hyperparameter adjustment for improved model performance. Notably, the selected studies do not address the impact of hyperparameter tuning techniques on ML model performance in CI tasks.

Summary: $\bullet$ Decision tree algorithms are favored in agile CI due to high accuracy and low computational overhead. $\bullet$ Reinforcement learning suits agile settings for sequential decision-making challenges. $\bullet$ Neural network algorithms, with high accuracy, find broad application across five CI tasks. $\bullet$ Hyperparameter tuning, despite increasing training time, enhances ML model accuracy. $\bullet$ Five strategies, including search methods, default settings, literature-tuned parameters, manual tuning, and formulas, are identified for hyperparameter tuning. $\bullet$ 14 studies in the literature did not report their hyperparameter tuning methods.

4.6 RQ5: Evaluation Methods

In practical ML applications, performance evaluation is crucial. In this section, we summarize the validation techniques and performance metrics used in the selected studies. For assessing supervised ML methods, predictions were compared with actual labels from untrained data, revealing seven distinct evaluation methods (EM) in the literature as presented in Table 18.

Table 18: Relationships between seven evaluation methods and CI phases are illustrated in the table. The color intensity in this table corresponds to the number of studies, with lighter shades indicating lower counts and darker blues representing higher counts. The bolded numbers represent the count of studies and their percentage in total.

Evaluation technique	Selection method	Total papers	Unit Test	Integration Test	Regression Test	Build Validation	System Test	Process Management
K-Fold	Sorted	5 (9.6%)	–	–	S26, S30, S32, S40 (4, 7.7%)	S38 (1, 1.9%)	–	–
K-Fold	Random	16 (30.8%)	S2 (1, 1.9%)	S3, S15, S23 (3, 5.8%)	S13, S32, S36, S46, S47 (5, 9.6%)	S21, S27, S43, S44, S51 (5, 9.6%)	S12 (1, 1.9%)	S49 (1, 1.9%)
Percentage	Sorted	4 (7.7%)	–	–	S8, S16, S29 (3, 5.8%)	S25 (1, 1.9%)	–	–
Percentage	Random	4 (7.7%)	–	S18 (1, 1.9%)	S39 (1, 1.9%)	S24, S42 (2, 3.8%)	–	–
Constant Number	Sorted	3 (5.8%)	–	–	S11, S41 (2, 3.8%)	S6 (1, 1.9%)	–	–
Time	Sorted	4 (7.7%)	–	–	S17, S37, S52 (3, 5.8%)	S1 (1, 1.9%)	–	–
Version	Sorted	1 (1.9%)	–	–	S9 (1, 1.9%)	–	–	–
Gradually Evaluation	Incremental	13 (25%)	–	–	S7, S14, S19, S20, S22, S28, S31, S33, S34, S35, S48, S50 (12, 23.1%)	–	S4 (1, 1.9%)	–
Baseline	Train=Test	2 (3.8%)	–	–	S10 (1, 1.9%)	–	–	S45 (1, 1.9%)

EM1) K-Fold: Generally, data is divided into K equal parts, with ML models trained on K-1 segments and evaluated on the unseen segment.

EM2) Percentage: Data is split into training and testing sets based on a specified percentage (e.g., 80% training and 20% testing).

EM3) Constant number: N samples are designated as the test set, while the model is trained on the remaining samples.

EM4) Time: Dataset division is based on specific dates or time spans, mirroring real-world CI data dynamics.

EM5) Version: Utilized in version-based releases, where older software versions serve as the training data and the latest versions as the test data.

EM6) Gradual Evaluation: Common in RL research, it assesses model performance incrementally through a reward-based approach, reflecting real-world CI conditions.

EM7) Baseline: Unsupervised ML models use the same dataset for training and testing, with evaluation based on comparing model outcomes to the expected solution.
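The first three splitting schemes can be sketched on sample indices as follows. The function names are hypothetical, and placing the test samples at the end of the (assumed chronologically ordered) data in `percentage_split` and `constant_split` is a choice made for this sketch rather than something prescribed by the selected studies.

```python
def k_fold_splits(n_samples, k):
    """EM1: yield (train_indices, test_indices) for K contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def percentage_split(n_samples, train_fraction=0.8):
    """EM2: the first train_fraction of the samples for training, the rest for testing."""
    cut = int(n_samples * train_fraction)
    return list(range(cut)), list(range(cut, n_samples))


def constant_split(n_samples, n_test):
    """EM3: hold out a fixed number of samples (here the last n_test) as the test set."""
    cut = n_samples - n_test
    return list(range(cut)), list(range(cut, n_samples))


folds = list(k_fold_splits(10, 5))
print(folds[0])                # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
print(percentage_split(10))    # ([0, ..., 7], [8, 9])
print(constant_split(10, 3))   # ([0, ..., 6], [7, 8, 9])
```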

Notably, besides gradual evaluation for RL-based methods, S4 employed an online supervised learning approach and assessed model performance by iterative training and evaluating positive predictive values (PPV) as the main effectiveness metric for a test suite.

Table 18 classifies K-Fold and Percentage methods into Sorted and Random types. In Sorted types, models train on older data and are evaluated on later, unseen segments, prioritizing reliability in real-world CI scenarios. Random types segment the data randomly, which enhances solution generalizability but may be less robust for real-world problem-solving [82, 83].
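The difference between the two selection types can be illustrated with a small sketch. The build records, field names, and helper functions below are hypothetical; `leaks_future` simply checks whether any training record is newer than the oldest test record.

```python
import random


def sorted_split(builds, train_fraction=0.8):
    """Train on the oldest builds, test on the most recent ones."""
    ordered = sorted(builds, key=lambda b: b["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def random_split(builds, train_fraction=0.8, seed=0):
    """Shuffle before splitting, ignoring chronological order."""
    shuffled = list(builds)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def leaks_future(train, test):
    """True if any training build is newer than the oldest test build."""
    return max(b["timestamp"] for b in train) > min(b["timestamp"] for b in test)


# Hypothetical build history with one record per time step.
builds = [{"timestamp": t, "failed": t % 3 == 0} for t in range(20)]

print(leaks_future(*sorted_split(builds)))  # False: all test builds are newer
print(leaks_future(*random_split(builds)))
```

A sorted split never trains on builds that are newer than the test builds, whereas a random split almost always does, which is the reliability concern raised above.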

According to Table 18, K-Fold methods are widely employed for evaluating ML techniques. This prevalence may be attributed to the availability of numerous packages capable of calculating evaluation metrics with minimal intervention. However, the reliance on random selection methods for training and testing data, as observed in many studies, can compromise the reliability of the presented methods in real-world scenarios. It is worth mentioning that in CI environments, ML models should always be applied to newly produced, unseen data.

In contrast to random selection in K-Fold methods, as shown in Table 18, a majority (30 out of 36) of the studies utilize sorted selection methods to align their approaches with real-world Continuous Integration (CI) environments.

Additionally, Gradual Evaluation emerges as another commonly utilized evaluation method, particularly notable due to the extensive use of Reinforcement Learning methods in CI’s Regression Testing step.

Our analysis identified six types of K-Fold cross-validation methods (Figure 6), each denoted by a number. K-Fold validation typically divides the dataset into three groups: Training data for model training, Testing data for model evaluation, and a Holdout subset unused for training or evaluation (in some techniques, Testing and Holdout sets are the same).

Figure 6 presents six K-Fold cross-validation types, each addressing specific challenges in CI environments:

Classical K-Fold uses the same testing and holdout sets, risking overfitting in sequential CI data [84].

The second type of K-Fold is ideal for environments with extensive data, addressing the need for training models on large volumes of data in CI settings [85]. Here, testing data is consistently more recent than the training data. In this approach, after dividing the data into K folds, we have two options: we can select one fold as the training set and the next fold as the testing set (first row), or we can divide one fold into both training and testing datasets (second row).

The third K-Fold type, similar to the classical type, uses the most recent data as testing while altering the holdout data. This method lowers the risk of overfitting compared to the first K-Fold type.

The fourth type of K-Fold Cross-Validation, known as Nested Cross-Validation (Nested CV), utilizes two K-Fold procedures to conduct hyperparameter tuning and increase the number of iterations. However, Nested CV incorporates future data during training [84]. This method resembles the third type of K-Fold, with the key difference being that the testing dataset varies in each iteration. Additionally, Nested CV allows for the selection of a new K value for the training dataset. Specifically, the process involves initially reserving one random fold as the testing dataset, followed by model training using a new K-Fold method on the remaining dataset.

The fifth K-Fold type, a modified version of the third type, uses the most recent data for testing and varies the number of holdout folds to determine the amount of data required for acceptable model accuracy [43].

The last K-Fold type employs the last consecutive data fold for testing while adjusting the number of holdout folds, allowing the examination of model stability over time [43].

The chronological order is vital in CI data, rendering many K-Fold techniques employed in the selected studies inadequate for CI environments due to their lack of consideration for the latest data in evaluations [86, 36].
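As a minimal sketch of chronology-aware validation, the code below builds two kinds of time-ordered splits over indices that are assumed to be sorted from oldest to newest. `next_fold_splits` follows the "train on one fold, test on the next" option described for the second type, while `expanding_window_splits` is a common variant added here for comparison and is not claimed to match any specific type in Figure 6. All function names are hypothetical.

```python
def contiguous_folds(n_samples, k):
    """Split indices 0..n_samples-1 (assumed chronological) into k contiguous folds."""
    bounds = [round(i * n_samples / k) for i in range(k + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(k)]


def next_fold_splits(n_samples, k):
    """Train on one fold, test on the fold that immediately follows it."""
    folds = contiguous_folds(n_samples, k)
    return [(folds[i], folds[i + 1]) for i in range(k - 1)]


def expanding_window_splits(n_samples, k):
    """Train on all folds seen so far, test on the next fold."""
    folds = contiguous_folds(n_samples, k)
    splits = []
    for i in range(1, k):
        train = [idx for fold in folds[:i] for idx in fold]
        splits.append((train, folds[i]))
    return splits


for train, test in expanding_window_splits(12, 4):
    print(train, test)
```

In both variants every test index is newer than every training index, so no evaluation step relies on data that would not yet exist in a running CI pipeline.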

4.6.1 Performance Measures

Choosing appropriate performance measures allows us to identify weaknesses in a solution and compare our results with others’ methods detached from the research settings.

Table 19: Performance measurements description and formulas.

Measure	Description	Formula
Precision	The percentage of the detected positive instances that were correct	$\frac{TP}{TP+FP}$
Recall	The proportion of positive instances that were correctly identified	$\frac{TP}{TP+FN}$
F1-score	The harmonic mean of recall and precision	$\frac{2\times Precision\times Recall}{Precision+Recall}$
Accuracy	Percentage of correctly classified instances	$\frac{TP+TN}{TP+FP+FN+TN}$
APFD	Ratio between detected and detectable instances in classes	$1-\frac{\sum_{1}^{m}TF_{i}}{m\times n}+\frac{1}{2n}$
NAPFD	Normalized $APFD$	$p-\frac{\sum_{1}^{m}TF_{i}}{m\times n}+\frac{p}{2n}$
MAE	Average of absolute errors between predicted and actual values	$\frac{\sum_{1}^{n}|y_{i}-\widehat{y}_{i}|}{n}$
RMSE	The square root of the average of squared differences between prediction and actual observation	$\sqrt{\frac{1}{n}\sum_{1}^{n}(y_{i}-\widehat{y}_{i})^{2}}$
PP	The ratio of the detected positive instances and all instances	$\frac{TP+FP}{TP+FP+TN+FN}$
MCC	Correlation coefficient between the observed and predicted binary classifications	$\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
G-Measure	The Harmonic mean of TPR ( $recall$ ) and true negative rate ( $TNR$ )	$\frac{2\times TPR\times TNR}{TPR+TNR}$
NRPA	Measures how close a predicted ranking of items ( ${s_{e}}$ ) is to the optimal ranking ( ${s_{o}}$ )	$\frac{RPA(s_{e})}{RPA(s_{o})}$
AUC	Two-dimensional area under Receiver Operating Characteristic ( $ROC$ ).
CCOFF	Measures the closeness of the predicted values and the testing set	$\rho_{xy}=\frac{Cov(x,y)}{\sigma_{x}\sigma_{y}}$
T-test	Determining mean difference between two sets

In the 52 selected studies, recall, precision, and F1-score are the most frequently used performance metrics, appearing in 30, 24, and 21 studies, respectively. Table 19 provides a summary of performance measurements, including formulas and brief descriptions. Key variables encompass $n$ (number of test cases), $m$ (number of faults), $p$ (ratio of detected faults to total faults), $TF_{i}$ (position of the first test exposing fault $i$ ), $y_{i}$ and $\widehat{y}_{i}$ (actual and predicted values), $\rho$ (Pearson product-moment correlation coefficient), and $\sigma_{x}$ (standard deviation of variable $x$ ).
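The confusion-matrix-based formulas in Table 19 translate directly into code. The sketch below is illustrative: the function name and the example counts are assumptions, and no guard against empty classes (zero denominators) is included.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute the confusion-matrix-based measures listed in Table 19."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also the true positive rate (TPR)
    tnr = tn / (tn + fp)     # true negative rate
    total = tp + fp + tn + fn
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / total,
        "pp": (tp + fp) / total,
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "g_measure": 2 * recall * tnr / (recall + tnr),
    }


# Hypothetical imbalanced outcome: 10 failing builds among 100.
metrics = classification_metrics(tp=6, fp=4, tn=86, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On these counts accuracy is 0.92 while precision, recall, and F1-score are all 0.60, which shows how accuracy alone can look favorable on imbalanced data.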

In the reviewed studies, frequently used performance measurements include Average Percentage of Fault Detection (APFD) in 16 studies, Normalized APFD (NAPFD) in 10 studies, Accuracy in 9 studies, and AUC (Area under the ROC Curve) in 9 studies. AUC quantifies performance by plotting the True Positive Rate (TPR or Recall) against the False Positive Rate ( $FPR=\frac{FP}{FP+TN}$ ) in a ROC curve, where $T$ , $F$ , $P$ , and $N$ stand for True, False, Positive, and Negative in the confusion matrix.
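One way to compute AUC without drawing the ROC curve is to use its equivalent rank interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below uses that formulation; the function name and the example data are assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0, 0]            # 1 = failing build (hypothetical data)
scores = [0.9, 0.8, 0.7, 0.3, 0.1]  # predicted failure probabilities
print(roc_auc(labels, scores))      # 5 of 6 pairs ranked correctly
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch but not for large datasets.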

It is crucial to note that while APFD is a valuable performance metric, researchers are encouraged to prefer Normalized APFD (NAPFD) for enhanced consistency [87]. APFD measures the area under the curve with the y-axis indicating the percentage of faults found and the x-axis showing the percentage of test cases. However, reporting the average APFD value can be misleading, as it equals 1 when no failures occur in a test cycle and may yield high values in imbalanced datasets, a common characteristic in CI datasets [62].
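The two formulas from Table 19 can be compared on a small example. In this sketch the function names and the test cycle are hypothetical, and undetected faults are assumed to contribute $TF_{i}=0$ to the NAPFD sum, so only the positions of detected faults are passed in.

```python
def apfd(first_fail_positions, n_tests):
    """APFD = 1 - sum(TF_i) / (m * n) + 1 / (2n); assumes all m faults are detected."""
    m = len(first_fail_positions)
    return 1 - sum(first_fail_positions) / (m * n_tests) + 1 / (2 * n_tests)


def napfd(first_fail_positions, n_tests, m_total):
    """NAPFD = p - sum(TF_i) / (m * n) + p / (2n), with p = detected / total faults."""
    p = len(first_fail_positions) / m_total
    return p - sum(first_fail_positions) / (m_total * n_tests) + p / (2 * n_tests)


# Hypothetical cycle: 10 tests, 2 faults first exposed at positions 1 and 3.
print(apfd([1, 3], n_tests=10))
print(napfd([1, 3], n_tests=10, m_total=2))  # equals APFD when every fault is detected
print(napfd([1], n_tests=10, m_total=2))     # lower when only one of two faults is found
```

When only part of the suite is executed and some faults go undetected, NAPFD drops accordingly, whereas APFD is not defined for that case. Cycles without any failures (m = 0) need separate handling in either function.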

Table 20: Ratio of ML types to performance metrics (left) and CI phases (right). Numbers under ‘ML Types’ indicate study counts. Study IDs are omitted to prevent table dimension expansion. Color intensity corresponds to percentages, with lighter shades indicating lower and darker blues representing higher percentages. The bolded numbers represent the count of studies and their percentage in each ML type. The percentage in the ML types column is in total.

NAPFD	Accuracy	AUC	APFD	F1-Score	Precision	Recall	ML types	Unit Test	Integration Test	Regression Test	Build Validation	System Test	Process Management
1 (3.0%)	6 (18.2%)	9 (27.3%)	5 (15.2%)	17 (51.5%)	19 (57.6%)	19 (57.6%)	Classification (33, 63.5%)	1 (3.0%)	3 (9.1%)	15 (45.5%)	11 (33.3%)	2 (6.1%)	1 (3.0%)
–	1 (16.7%)	–	2 (33.3%)	–	1 (16.7%)	1 (16.7%)	Regression (6, 11.5%)	–	2 (33.3%)	3 (50.0%)	1 (16.7%)	–	–
4 (33.3%)	–	–	5 (41.7%)	–	–	3 (25.0%)	Reinforcement Learning (12, 23.1%)	–	–	12 (100%)	–	–	–
–	1 (50.0%)	–	1 (50.0%)	1 (50.0%)	2 (100%)	2 (100%)	Clustering (2, 3.8%)	–	–	1 (50.0%)	–	–	1 (50.0%)

Table 20 indicates that classification ML methods often used Recall, Precision, and F1-Score, while RL methods predominantly utilized APFD and NAPFD. The strong association between RL algorithms and APFD metrics is due to RL’s compatibility with Test Case Prioritization (TCP) in CI, leveraging action-reward policies. APFD and NAPFD quantify the weighted mean of fault detection percentages over the test suite’s lifecycle, making them suitable for evaluating test case quality. Notably, studies using APFD or NAPFD mainly focused on TCP, except for S29, which aimed to enhance mutant test precision within their MuDelta approach.

Several studies reported additional performance metrics, including MAE, RMSE, pairwise t-tests, PP, NRPA, MCC, TTF, and G-Measure. Nine studies used simple rate values like misclassification rates and failure detection rates. Combining these simple rate values with more comprehensive measures like F-score and Accuracy is recommended for a more thorough assessment of results. This strategy aids in preventing resource misallocation, as relying solely on the evaluation of results based on True Positive (failing tests) values could lead to the execution of numerous small failing test cases, thereby increasing resource consumption, particularly by resource-intensive passing test cases.

4.6.2 Connection Between CI Tasks and Evaluation Metrics

Table 21: Heat map of the percentage distribution of evaluation metrics across CI tasks. Study IDs are omitted to prevent table dimension expansion. Color intensity corresponds to percentages, with lighter shades indicating lower and darker blues representing higher percentages. The bolded numbers represent the count of studies and their percentage in each CI task.

CI Phases	CI Tasks	Num of papers	Recall	Precision	F1-Score	APFD	NAPFD	Accuracy	AUC	MCC	TTF	MAE	RMSE	NRPA	T-test	PP	G-Measure	Time Cost
UT	Unit Test Prediction	1	1 (100%)	1 (100%)	1 (100%)	–	–	1 (100%)	–	–	–	–	–	–	–	–	–	–	–
IT	Branch Coverage Prediction	2	–	–	–	–	–	–	–	–	–	2 (100%)	–	–	–	–	–	–	–
IT	Integration Test Prediction	2	2 (100%)	2 (100%)	1 (50.0%)	–	–	–	1 (50.0%)	–	–	–	–	–	–	–	–	–	–
RT	Test Optimization	26	14 (53.8%)	8 (30.8%)	6 (23.1%)	15 (57.7%)	10 (38.5%)	6 (23.1%)	2 (7.7%)	1 (3.8%)	3 (11.5%)	–	2 (7.7%)	–	–	–	–	9 (34.6%)	4 (15.4%)
RT	Defect Prediction	4	3 (75%)	2 (50%)	2 (50%)	–	–	1 (25%)	1 (25%)	1 (25%)	–	–	–	–	–	1 (25%)	1 (25%)	–	–
RT	Flaky Test Detection	1	1 (100%)	1 (100%)	1 (100%)	–	–	–	–	–	–	–	–	–	–	–	–	–	1 (100%)
BV	Build Prediction	12	6 (50%)	7 (58.3%)	7 (58.3%)	–	–	1 (8.3%)	4 (33.3%)	1 (8.3%)	–	–	–	–	1 (8.3%)	–	–	–	4 (33.3%)
ST	Installed Software Discovery	1	1 (100%)	1 (100%)	1 (100%)	–	–	–	–	–	–	–	–	–	–	–	–	–	–
ST	Performance Test Optimization	1	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	1 (100%)
PM	Activity Management	2	2 (100%)	2 (100%)	2 (100%)	–	–	–	1 (50%)	–	–	–	–	–	–	–	–	–	–

The selection of evaluation metrics to assess a task’s performance depends on the objectives of the task. Table 21 illustrates the correlation between CI tasks and the metrics applied in the selected studies. Recall, Precision, and F1-Score emerge as the most frequently employed metrics across various tasks. Notably, accuracy is often avoided in CI environments due to data imbalances, which can introduce bias into the results [88].

Furthermore, APFD and NAPFD metrics find primary application in Test Optimization solutions, where ML methods aim to optimize the execution order of test cases. These metrics are specifically designed for the evaluation of test case prioritization and selection techniques [89].

AUC measurements play a crucial role in various CI tasks, offering robust performance assessment in imbalanced datasets. Their application is vital for addressing data imbalances, ensuring fairness, and enhancing the effectiveness of ML-based methodologies in such datasets [90].

Additionally, other metrics are sporadically used across the literature. MAE, for instance, is consistently employed in studies related to the Branch Coverage Prediction task. This metric, measuring the average error between predicted and actual values, provides a more intuitive understanding compared to other error measures like MSE or RMSE, as it utilizes absolute values instead of squared values [91].
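The contrast between MAE and RMSE can be seen on a small example. The coverage values below are hypothetical, and the function names are chosen for this sketch.

```python
import math


def mae(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual = [0.50, 0.60, 0.70, 0.80]     # hypothetical branch coverage values
predicted = [0.50, 0.60, 0.70, 0.40]  # three exact predictions and one large miss

print(mae(actual, predicted))   # about 0.10: the average size of the error
print(rmse(actual, predicted))  # about 0.20: the single large miss is weighted more heavily
```

MAE reads directly as "the prediction is off by 0.10 on average", while RMSE is inflated by the squared outlier, which is why MAE is often considered easier to interpret.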

While evaluating ML methods’ performance, authors often neglected reporting associated time and resource costs and the reduced costs in real-world projects. However, S41 stood out by emphasizing time-related metrics and introducing Normalized Time Reduction (NTR) and Prioritization Time (PT). NTR measures time saved by using ML models to detect the first failure instead of executing all tests, while PT denotes the time required for ML models to prioritize test cases.

Table 20 illustrates that classification methods find application across all steps of CI, notably in regression testing and build validation. Given the structural similarities between regression testing and unit or integration testing, the underutilization of classification methods in these steps highlights a significant gap in the literature.

Additionally, Table 20 indicates that RL methods are predominantly utilized in RT. Given their suitability for continuous adjustment and analysis of high-volume data in RT [92], RL methods also hold the potential for providing just-in-time predictions in other CI steps.

Moreover, Table 20 reveals a notable neglect of various ML types across different CI steps. This underscores an important opportunity for researchers to explore and assess the performance of ML models in these overlooked areas of CI.

Summary: $\bullet$ The sorted data selection method in the evaluation phase provides higher reliability than random selection, simulating real data streams in CI environments. $\bullet$ Researchers should choose K-fold types based on specific ML method requirements, considering the pros and cons of each type and the characteristics of CI data. $\bullet$ F1-score, Precision, and Recall, detailed in Table 19, are frequently employed performance metrics, aiding comparisons with other ML-based CI solutions. $\bullet$ RL and consequently Gradual Evaluation methods are commonly employed in CI, primarily because of their capability for continuous evaluation. $\bullet$ Designed metrics like APFD and NAPFD for Test Optimization and MAE for Branch Coverage Prediction underscore the importance of selecting appropriate metrics for assessing ML-based CI solution performance.

5 Limitations

While adhering to the guidelines outlined by Kitchenham and Charters [28] and Braun and Clarke [34], and taking into account the documented threats in systematic literature reviews (SLRs) within the software engineering domain [93, 94], efforts have been made to minimize potential defects. However, despite these efforts, there remain persistent threats to the validity of this SLR.

One such threat relates to the completeness and inclusiveness of relevant studies. We conducted a preliminary search to identify papers already known to us that have been published. However, it’s important to note that our coverage may be impacted by potential inconsistencies in terminology found within paper titles and abstracts.

We executed our search string on ACM, IEEE, and Scopus indexing systems, renowned for their comprehensive coverage in software engineering and computer science, including well-known venues in these fields [29]. To address potential omissions, we implemented backward and forward snowballing techniques, manually exploring references of selected studies and scrutinizing cited papers. However, limitations may still exist with this approach.

Another validity threat may arise from potential bias in the selection, analysis, and synthesis of data by researchers. To mitigate this threat at every step, multiple researchers were involved, and all steps were supervised by the fourth author.

We presented the CI pipeline and its six identified phases based on reviewing the selected studies in Figure 5. However, the risk remains that uninvestigated CI phases could have been overlooked.

6 Discussion

This study primarily focused on breaking down CI phases into automatable tasks optimized by ML and providing detailed insights into ML method preparation in this emerging domain. The rapid software release cycle associated with CI has sparked substantial interest among software companies and researchers, driving the need for efficient CI practices.

Our study focused on the development steps of ML methods, to assist researchers and practitioners in handling the ever-increasing volume of data in the CI field by reviewing the published studies over the last two decades. The surge in research studies on fault prediction within CI necessitates more structured research to achieve reliable and comprehensive results. The paper discusses key assumptions and areas for future research.

In this study, we conducted a systematic literature review (SLR) to give a comprehensive overview of the use of ML approaches in the context of CI. Our findings focused on numerous elements of the utilization of ML in CI, such as the phases, methodologies, performance indicators, and data sources that are often used. The CI process, with its continuous and quick software release cycle, has presented unique problems to software developers and companies. As a result, researchers are increasingly turning to ML techniques to improve CI procedures. Our evaluation identified major trends and gaps in the existing literature, which we will now examine in depth to provide insights into future research areas and practical implications.

Usage of Large Language Models: Despite high expectations regarding Large Language Models (LLMs), our analysis revealed a significant research gap in this area within the CI domain. Future research avenues might include the generation of new test cases for defect detection, leveraging various data types and historical features. Researchers could also explore the identification of test cases closely related to software requirements. Consideration of automation, agility, and code commit frequency is crucial in CI research to align with CI’s core principles. It is noteworthy that our initial pilot study employed the search query “((‘Large Language Model’ OR ‘LLM’) AND ‘Continuous Integration’)”, which, regrettably, did not return any papers suitable for inclusion in the final list of the selected papers.

Security of Proposed Methods: Most studies reviewed here overlooked the security aspect of their ML methods except S12. ML models are vulnerable to causative and exploratory attacks, particularly in open-source projects [95]. Researchers should evaluate their methods’ security, especially when employing Neural Network (NN) techniques, which lack interpretability. Comparative evaluations that account for security concerns are essential in addressing these identified gaps.

Performance metrics: Our analysis of Research Question 5 (RQ5) in Section 4.6 reveals diverse performance measures employed across the studies. In particular, 12 studies (S3, S4, S6, S7, S15, S22, S27, S36, S42, S43, S44 and S52) solely utilized a single metric, indicating the importance of adopting a comprehensive set of metrics like precision, recall, F1-score, and accuracy for classification methods, RMSE for regression methods, and APFD and NAPFD for test case prioritization and selection tasks. Researchers are encouraged to report multiple metrics to enable better comparisons and informed decisions, alongside qualitative evaluations to enhance method comprehension. Given that the primary objective of automating CI tasks is to enhance the overall efficiency of software developers, it becomes necessary for researchers to not only rely on the metrics provided in Table 21 for assessing the performance of ML models but also to embark on the development of novel metrics designed to quantify the changes in developers’ performance.

Cost benchmark: We emphasize the significance of reporting benchmark performance measures in addition to performance metrics to assess the efficiency of ML models. Metrics like the time needed to train and to run a model provide insights into model effectiveness. By integrating both types of measures, studies can perform cost-benefit analyses. Future research can consider metrics like Average Percentage of Fault Detected with Cost ( $APFD_{c}$ ), accounting for hardware costs, for more thorough evaluations. The $APFD_{c}$ metric is calculated using equation (1), where $n$ is the number of test cases in test suite $T$ , $m$ is the number of faults, $t_{i}$ is the cost of executing test case $i$ , $f_{i}$ is the severity of fault $i$ , and $TF_{i}$ is the position of the first test case that reveals fault $i$ .

$$APFD_{c}=\frac{\sum_{i=1}^{m}\left(f_{i}\times\left(\sum_{j=TF_{i}}^{n}t_{j}-\frac{1}{2}t_{TF_{i}}\right)\right)}{\sum_{i=1}^{n}t_{i}\times\sum_{i=1}^{m}f_{i}}\qquad(1)$$
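Equation (1) can be evaluated with a few lines of code. The sketch below is an illustrative reading of the formula, assuming 1-based test positions and hypothetical cost and severity values; the function name is not taken from any cited work.

```python
def apfd_c(first_fail_positions, severities, costs):
    """Cost-cognizant APFD following equation (1).

    first_fail_positions: 1-based position TF_i of the first test revealing fault i
    severities:           f_i for each fault
    costs:                t_j for each test, in execution order
    """
    numerator = 0.0
    for tf, severity in zip(first_fail_positions, severities):
        remaining_cost = sum(costs[tf - 1:])  # sum of t_j for j = TF_i .. n
        numerator += severity * (remaining_cost - 0.5 * costs[tf - 1])
    return numerator / (sum(costs) * sum(severities))


# With unit costs and unit severities the value matches plain APFD (0.85 here).
print(apfd_c([1, 3], severities=[1, 1], costs=[1] * 10))

# The same fault position scores lower when the revealing test is expensive.
print(apfd_c([1], severities=[1], costs=[5, 1, 1]))
print(apfd_c([1], severities=[1], costs=[1, 1, 5]))
```

Under unit costs and severities the metric reduces to APFD, so it can be reported alongside APFD without changing how existing results are read.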

Diversity of data: Diverse input data is critical, as it increases the generality of solutions [63]. Studies should explore different datasets with various sizes and characteristics. Data sources such as version control and issue-tracking tools, or user requirement files that can support the automation of test design, are valuable for ML-based solutions in CI. Evaluating ML methods on a range of projects enhances their generalizability.

New strategies for solutions: Researchers can enhance their proposed methods by exploring innovative strategies, such as hybrid models that optimize performance for different data segments. Notably, unsupervised ML methods are underutilized, so the distribution of learning types should be considered when selecting ML algorithms. Novelty can also be introduced through solution refinements, such as defining new reward functions for reinforcement learning methods (e.g., S33) or adopting divide-and-conquer solutions (e.g., separate models for predicting passing and failing test cases).

Human interactions: ML methods significantly improve CI tasks, but human involvement remains crucial, as evidenced by 30 out of 34 selected papers in this study from 2020 to 2023 employing supervised ML algorithms, which depend on labeled data. Human effort cannot be entirely replaced, making professional guidance and expertise vital in ML-based solutions for validating the ML models, guiding the data analyst in the training phase, and helping with feature extraction and data collection.

Personalized solutions: Out of the 52 reviewed studies, only 5 addressed individual features for modeling (see Table 14). However, these models cannot be considered personalized as they focus on individual factors within a comparative context, revealing a significant gap in truly personalized ML-based methods for CI tasks.

Data drifting challenge: Agile CI environments introduce data drifting challenges, where new data may differ from training data, affecting various phases of ML model development [96]. While most selected studies considered concept drift, data drifting received little attention, except in studies S4 and S36. These studies mitigated data drifting by retraining ML models at specific intervals in the CI cycle, a crucial consideration for industry partners in determining retraining frequencies.
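The idea of retraining at fixed intervals can be sketched as follows. This is only an illustrative assumption of how such a loop might look, not the procedure used in S4 or S36: the `FailureRateModel` class is a toy stand-in for a real learner, and the parameters `retrain_every` and `window` are hypothetical names for the retraining interval and the amount of recent history kept.

```python
from collections import deque


class FailureRateModel:
    """Toy stand-in model: predicts the failure rate seen in its training data."""

    def __init__(self):
        self.rate = 0.0

    def fit(self, outcomes):
        self.rate = sum(outcomes) / len(outcomes) if outcomes else 0.0

    def predict(self):
        return self.rate


def run_with_periodic_retraining(outcomes, retrain_every=3, window=3):
    """Predict each CI cycle, then retrain every `retrain_every` cycles.

    Only the most recent `window` outcomes are kept, so that older data
    from a different distribution is gradually discarded.
    """
    model = FailureRateModel()
    history = deque(maxlen=window)
    predictions = []
    for cycle, outcome in enumerate(outcomes, start=1):
        predictions.append(model.predict())  # predict before the outcome is known
        history.append(outcome)
        if cycle % retrain_every == 0:
            model.fit(list(history))  # scheduled retraining on recent data
    return predictions


# Toy stream (hypothetical): builds pass (0) at first, then start failing (1).
print(run_with_periodic_retraining([0, 0, 0, 1, 1, 1, 1, 1, 1]))
```

In this sketch the model keeps predicting the old failure rate until the next scheduled retraining, which illustrates why the choice of retraining frequency matters in practice.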

7 Conclusion

In this paper, we conducted a systematic literature review (SLR) to investigate the current state of the application of ML methods in CI. We analyzed a total of 52 primary studies published between 2000 and 2023 and examined the use of ML methods and their properties in this context. While many studies have reported successful results, a few studies suffer from imperfections in certain dimensions and lack detailed information about the presented solutions. These issues can be considered significant research gaps in the application of ML methods in CI.

In summary, our work comprises the following findings:

(1) A comprehensive depiction of the CI pipeline, which begins with the developer's code commit and progresses through various testing phases until the final release of a product that is ready for customers.

(2) An examination of different aspects of data engineering, including data sources, data types, and data preparation.

(3) A thematic classification of the four types of features used in ML-based solutions for CI, along with their subclasses, and an examination of five feature engineering techniques employed in the selected studies.

(4) Presentation of statistics related to the ML techniques implemented and their association with different phases of the CI pipeline, as well as an investigation into the methods of hyper-parameter tuning.

(5) A thorough description of the evaluation methods and metrics commonly used in the selected primary studies. By demonstrating the relationship between the evaluation methods and the ML algorithms used in these studies, our work enables researchers to select the most suitable evaluation methods when comparing their findings with those of other studies in the literature.

Overall, this SLR has presented an overview of ML-based solutions for improving the CI pipeline with respect to speed and resource consumption. However, we believe that further research is necessary for various stages of the CI cycle, as discussed in Section 6. Given that a considerable proportion of the studies we reviewed were conducted in industrial settings, we encourage practitioners to adapt and implement the proposed approaches and solutions in actual CI environments. Furthermore, to make additional progress in this research domain, we suggest the use of standardized research methods and more general approaches, as well as consideration of the security aspects of the studies presented.

In the future, we plan to update this SLR with new studies and extend our search to other digital libraries.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

We acknowledge the contribution of Dr Zohaib Md. Jan and Roshan Namal Rajapakse during the first phase of collecting and analyzing data in this study.

Appendix

Appendix A Selected Studies

Table 22: List of selected studies in this review. Here, ID denotes the study identification number. Note: The number of citations for each study was gathered on 14th August 2023.

ID	Title	Authors	Venue	Cite	Year
S1	FastLane: Test Minimization for Rapidly Deployed Large-Scale Online Services	Philip, A. A., Bhagwan, R., Kumar, R., Maddila, C. S., & Nagappan, N.	IEEE/ACM International Conference on Software Engineering (ICSE)	24	2019
S2	Classifying false positive static checker alarms in continuous integration using convolutional neural networks	Lee, S., Hong, S., Yi, J., Kim, T., Kim, C. J., & Yoo, S.	IEEE Conference on Software Testing, Validation and Verification (ICST)	22	2019
S3	How high will it be? Using machine learning models to predict branch coverage in automated testing	Grano, G., Titov, T. V., Panichella, S., & Gall, H. C.	IEEE workshop on machine learning techniques for software quality evaluation (MaLTeSQuE)	34	2018
S4	Automatic exploratory performance testing using a discriminator neural network	Porres, I., Ahmad, T., Rexha, H., Lafond, S., & Truscan, D.	IEEE international conference on software testing, verification and validation workshops (ICSTW)	14	2020
S5	An empirical study on the cross-project predictability of continuous integration outcomes	Xia, J., Li, Y., & Wang, C.	Web Information Systems and Applications Conference (WISA)	13	2017
S6	Cutting the software building efforts in continuous integration by semi-supervised online AUC optimization	Xie, Z., & Li, M.	International Joint Conference on Artificial Intelligence (IJCAI)	20	2018
S7	Reinforcement learning for automatic test case prioritization and selection in continuous integration	Spieker, H., Gotlieb, A., Marijan, D., & Mossige, M.	Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis	225	2017
S8	Learning for test prioritization: An industrial case study	Busjaeger, B., & Xie, T.	Proceedings of the ACM SIGSOFT International symposium on foundations of software engineering	108	2016
S9	Defect prediction on a legacy industrial software: A case study on software with few defects	Koroglu, Y., Sen, A., Kutluay, D., Bayraktar, A., Tosun, Y., Cinar, M., & Kaya, H.	Proceedings of the International Workshop on Conducting Empirical Studies in Industry	17	2016
S10	SQA-Profiles: Rule-based activity profiles for Continuous Integration environments	Brandtner, M., Müller, S. C., Leitner, P., & Gall, H. C.	IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER)	13	2015
S11	LSTM-based deep learning for spatial–temporal software testing	Xiao, L., Miao, H., Shi, T., & Hong, Y.	Distributed and Parallel Databases	9	2020
S12	Praxi: Cloud Software Discovery That Learns From Practice	Byrne, A., Allen, S. L., Nadgowda, S., & Coskun, A. K.	Proceedings of the International Middleware Conference Demos and Posters	6	2019
S13	Early prediction of test case verdict with bag-of-words vs. word embeddings	Meding, W.	CEUR Workshop Proceedings	1	2020
S14	A time window based reinforcement learning reward for test case prioritization in continuous integration	Wu, Z., Yang, Y., Li, Z., & Zhao, R.	Proceedings of the Asia-Pacific Symposium on Internetware	21	2019
S15	Branch coverage prediction in automated testing	Grano, G., Titov, T. V., Panichella, S., & Gall, H. C.	Journal of Software: Evolution and Process	21	2019
S16	Failure Prediction Using Machine Learning in IBM WebSphere Liberty Continuous Integration Environment	Khan, M. A., Azim, A., Liscano, R., Smith, K., Chang, Y. K., Garcon, S., & Tauseef, Q.	Proceedings of the Annual International Conference on Computer Science and Software Engineering	88	2021
S17	Failure Prediction Using Transfer Learning in Large-Scale Continuous Integration Environments	Mamata, R., Smith, K., Azim, A., Chang, Y. K., Taiseef, Q., Liscano, R., & Seferi, G.	Proceedings of the Annual International Conference on Computer Science and Software Engineering	47	2022
S18	Predicting Test Case Verdicts Using Textual Analysis of Committed Code Churns	Al Sabbagh, K., Staron, M., Hebig, R., & Meding, W.	CEUR Workshop Proceedings	11	2019
S19	Reinforcement Learning for Test Case Prioritization	Bagherzadeh, M., Kahani, N., & Briand, L.	IEEE Transactions on Software Engineering	49	2021
S20	Focus on New Test Cases in Continuous Integration Testing based on Reinforcement Learning	Chen, F., Li, Z., Shang, Y., & Yang, Y.	IEEE International Conference on Software Quality, Reliability and Security (QRS)	8	2022
S21	A cost-efficient approach to building in continuous integration	Jin, X., & Servant, F.	Proceedings of the ACM/IEEE International Conference on Software Engineering	25	2020
S22	Occurrence Frequency and All Historical Failure Information Based Method for TCP in CI	Shang, Y., Li, Q., Yang, Y., & Li, Z.	Proceedings of the International Conference on Software and System Processes	2	2020
S23	A Machine Learning Approach to Improve the Detection of CI Skip Commits	Abdalkareem, R., Mujahid, S., & Shihab, E.	IEEE Transactions on Software Engineering	34	2020
S24	Change-Aware Build Prediction Model for Stall Avoidance in Continuous Integration	Hassan, F., & Wang, X.	ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)	38	2017
S25	BuildFast: History-Aware Build Outcome Prediction for Fast Feedback and Reduced Cost in Continuous Integration	Chen, B., Chen, L., Zhang, C., & Peng, X.	Proceedings of the IEEE/ACM International Conference on Automated Software Engineering	1	2020
S26	Empirically Evaluating Readily Available Information for Regression Test Optimization in Continuous Integration	Elsner, D., Hauer, F., Pretschner, A., & Reimer, S.	Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis	28	2021
S27	Continuous Build Outcome Prediction: A Small-N Experiment in Settings of a Real Software Project	Kawalerowicz, M., & Madeyski, L.	Advances and Trends in Artificial Intelligence. From Theory to Practice: International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE	5	2021
S28	Multi-Armed Bandit Test Case Prioritization in Continuous Integration Environments: A Trade-off Analysis.	Lima, J. A. P., & Vergilio, S. R.	Proceedings of the Brazilian symposium on systematic and automated software testing	25	2020
S29	MuDelta: Delta-Oriented Mutation Testing at Commit Time	Ma, W., Chekam, T. T., Papadakis, M., & Harman, M.	IEEE/ACM International Conference on Software Engineering (ICSE)	11	2021
S30	Supervised Learning for Test Suit Selection in Continuous Integration	Martins, R., Abreu, R., Lopes, M., & Nadkarni, J.	IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)	5	2021
S31	Reinforcement learning based test case prioritization for enhancing the security of software	Shi, T., Xiao, L., & Wu, K.	IEEE International Conference on Data Science and Advanced Analytics (DSAA)	4	2020
S32	Continuous Test Suite Failure Prediction	Pan, C., & Pradel, M.	Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis	13	2021
S33	Weighted Reward for Reinforcement Learning based Test Case Prioritization in Continuous Integration Testing	Li, G., Yang, Y., Wu, Z., Cao, T., Liu, Y., & Li, Z.	IEEE Annual Computers, Software, and Applications Conference (COMPSAC)	1	2021
S34	Learning-based Prioritization of Test Cases in Continuous Integration of Highly-Configurable Software	Lima, J. A. P., Mendonça, W. D., Vergilio, S. R., & Assunção, W. K.	Proceedings of the ACM conference on systems and software product line	14	2020
S35	Dynamic Time Window based Reward for Reinforcement Learning in Continuous Integration Testing	Pan, C., Yang, Y., Li, Z., & Guo, J.	Proceedings of the Asia-Pacific Symposium on Internetware	5	2020
S36	Scalable and Accurate Test Case Prioritization in Continuous Integration Contexts	Yaraghi, A. S., Bagherzadeh, M., Kahani, N., & Briand, L. C.	IEEE Transactions on Software Engineering	14	2022
S37	DeepOrder: Deep Learning for Test Case Prioritization in Continuous Integration Testing	Sharif, A., Marijan, D., & Liaaen, M.	IEEE International Conference on Software Maintenance and Evolution (ICSME)	18	2021
S38	Improving the prediction of continuous integration build failures using deep learning	Saidani, I., Ouni, A., & Mkaouer, M. W.	Automated Software Engineering	27	2022
S39	TCP-Net: Test Case Prioritization using End-to-End Deep Neural Networks	Abdelkarim, M., & ElAdawi, R.	International Conference on Software Testing, Verification and Validation Workshops (ICSTW)	5	2022
S40	Evaluating Features for Machine Learning Detection of Order- and Non-Order-Dependent Flaky Tests	Parry, O., Kapfhammer, G. M., Hilton, M., & McMinn, P.	IEEE Conference on Software Testing, Verification and Validation (ICST)	11	2022
S41	Machine Learning Regression Techniques for Test Case Prioritization in Continuous Integration Environment	Da Roza, E. A., Lima, J. A. P., Silva, R. C., & Vergilio, S. R.	IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)	5	2022
S42	Jaskier: A Supporting Software Tool for Continuous Build Outcome Prediction Practice	Kawalerowicz, M., & Madeyski, L.	Advances and Trends in Artificial Intelligence. From Theory to Practice: International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE	2	2021
S43	Using decision trees to predict the certification result of a build	Hassan, A. E., & Zhang, K.	IEEE/ACM International Conference on Automated Software Engineering (ASE)	11	2006
S44	Data stream mining for predicting software build outcomes using source code metrics	Finlay, J., Pears, R., & Connor, A. M.	Information and Software Technology	50	2014
S45	Enhanced regression testing technique for agile software development and continuous integration strategies	Ali, S., Hafeez, Y., Hussain, S., & Yang, S.	Software Quality Journal	29	2020
S46	A learning algorithm for optimizing continuous integration development and testing practice	Marijan, D., Gotlieb, A., & Liaaen, M.	Software: Practice and Experience	37	2019
S47	The Effect of Class Noise on Continuous Test Case Selection: A Controlled Experiment on Industrial Data	Al-Sabbagh, K. W., Hebig, R., & Staron, M.	International Conference on Product-Focused Software Process Improvement	13	2020
S48	Adaptive Reward Computation in Reinforcement Learning-Based Continuous Integration Testing	Yang, Y., Pan, C., Li, Z., & Zhao, R.	IEEE Access	3	2021
S49	Towards auto-labelling issue reports for pull-based software development using text mining approach	Fazayeli, H., Syed-Mohamad, S. M., & Akhir, N. S. M.	Procedia Computer Science	21	2019
S50	Sparse reward for reinforcement learning-based continuous integration testing	Yang, Y., Li, Z., Shang, Y., & Li, Q.	Journal of Software: Evolution and Process	15	2023
S51	Predicting Build Outcomes in Continuous Integration using Textual Analysis of Source Code Commits	Al-Sabbagh, K., Staron, M., & Hebig, R.	Proceedings of the International Conference on Predictive Models and Data Analytics in Software Engineering	0	2022
S52	Test Case Prioritization using Transfer Learning in Continuous Integration Environments	Mamata, R., Azim, A., Liscano, R., Smith, K., Chang, Y. K., Seferi, G., & Tauseef, Q.	IEEE/ACM International Conference on Automation of Software Test (AST)	0	2023

Appendix B Data Extraction Form

Table 23: The extracted data items from each study and their relationship with research questions.

ID	Data Item	Description
	Demographic data
D1	Paper Title	The title of the study
D2	Authors	Name(s) of the author(s)
D3	Publication Year	Publication year of the paper
D4	Publication Venue	Name of the conference or journal where the paper is published
D5	Publication Type	Publication type i.e., workshop, conference, journal
D6	Number of Citations	How many citations the paper has according to Google Scholar
D7	Keywords	List of keywords of the paper
D8	Context of the Study	The study contexts are categorized into industry and non-industry cases
	RQ1: CI Tasks
D9	CI Phases Addressed by ML	The definition of the CI phases and their role in CI
D10	CI Phase Mediator	Abstraction of the outcome and required input of each CI phase
D11	CI Tasks Enhanced by ML	The definition of the CI tasks enhanced by ML methods
D12	Insights on Underexplored CI Phases/Tasks	We summarized if any underexplored CI phases/tasks are identified by the study
	RQ2: Data Engineering
D13	Commonly Used Datasets	The most frequently employed datasets for each CI task
D14	Data Engineering Methods	The required data preparation and engineering methods
D15	Data Types and Tasks	Correlation between employed data types and the enhanced CI tasks
D16	Data Quality Impact	The impact of data quality on the performance of ML models
	RQ3: Feature Engineering
D17	Feature Types	The features that have been used in studies for training the ML models
D18	Feature Engineering Techniques	The employed techniques for preparing the data as an input of ML models
D19	Feature and Tasks	The correlation between feature types and CI tasks
	RQ4: Model Tuning
D20	ML Learning Types	Learning types of ML models based on the nature of input data
D21	ML Model Types	Types of ML models based on nature of the problem
D22	ML Types and CI Phases	Correlation between ML model types and CI phases
D23	Model Tuning	The employed techniques for tuning the hyper-parameters
	RQ5: Evaluation
D24	Evaluation Metrics	Which evaluation metrics have been used for measuring the performance of ML models
D25	Evaluation Techniques	Data division and evaluation technique
D26	Metrics and ML Models	Correlation between employed evaluation metrics and ML models

References

[1] Mojtaba Shahin, Muhammad Ali Babar, and Liming Zhu. Continuous integration, delivery and deployment: a systematic review on approaches, tools, challenges and practices. IEEE Access, 5:3909–3943, 2017.
[2] Xianhao Jin and Francisco Servant. A cost-efficient approach to building in continuous integration. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), pages 13–25. IEEE, 2020.
[3] Zheng Xie and Ming Li. Cutting the software building efforts in continuous integration by semi-supervised online auc optimization. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2875–2881, 2018.
[4] Rezwana Mamata, Kevin Smith, Akramul Azim, Yee-Kang Chang, Qasim Taiseef, Ramiro Liscano, and Gkerta Seferi. Failure prediction using transfer learning in large-scale continuous integration environments. In Proceedings of the 32nd Annual International Conference on Computer Science and Software Engineering, pages 193–198, 2022.
[5] Michael Hilton, Timothy Tunnell, Kai Huang, Darko Marinov, and Danny Dig. Usage, costs, and benefits of continuous integration in open-source projects. In Proceedings of the 31st IEEE/ACM international conference on automated software engineering, pages 426–437, 2016.
[6] Mohamed Abdelkarim and Reem ElAdawi. Tcp-net: Test case prioritization using end-to-end deep neural networks. In 2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pages 122–129. IEEE, 2022.
[7] Iris Figalist, Andreas Biesdorf, Christoph Brand, Sebastian Feld, and Marie Kiermeier. Supporting the devops feedback loop using unsupervised machine learning. In 2019 IEEE International Symposium on INnovations in Intelligent SysTems and Applications (INISTA), pages 1–6. IEEE, 2019.
[8] Saad Shafiq, Atif Mashkoor, Christoph Mayr-Dorn, and Alexander Egyed. A literature review of machine learning and software development life cycle stages. IEEE Access, 2021.
[9] Seongmin Lee, Shin Hong, Jungbae Yi, Taeksu Kim, Chul-Joo Kim, and Shin Yoo. Classifying false positive static checker alarms in continuous integration using convolutional neural networks. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), pages 391–401. IEEE, 2019.
[10] Yang Yang, Zheng Li, Ying Shang, and Qianyu Li. Sparse reward for reinforcement learning-based continuous integration testing. Journal of Software: Evolution and Process, page e2409, 2021.
[11] Martin Brandtner, Sebastian C Müller, Philipp Leitner, and Harald C Gall. Sqa-profiles: Rule-based activity profiles for continuous integration environments. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), pages 301–310. IEEE, 2015.
[12] Ali Kazemi Arani, Mansooreh Zahedi, Triet Huynh Minh Le, and Muhammad Ali Babar. Sok: Machine learning for continuous integration. In 2023 IEEE/ACM International Workshop on Cloud Intelligence & AIOps (AIOps), pages 8–13. IEEE, 2023.
[13] Brian Fitzgerald and Klaas-Jan Stol. Continuous software engineering: A roadmap and agenda. Journal of Systems and Software, 123:176–189, 2017.
[14] Patrick Debois et al. Devops: A software revolution in the making. Journal of Information Technology Management, 24(8):3–39, 2011.
[15] Tom Michael Mitchell. The discipline of machine learning, volume 9. Carnegie Mellon University, School of Computer Science, Machine Learning …, 2006.
[16] Yalin Baştanlar and Mustafa Özuysal. Introduction to machine learning. In miRNomics: MicroRNA Biology and Computational Analysis, pages 105–128. Springer, 2014.
[17] Stuart J Russell and Peter Norvig. Artificial intelligence: a modern approach. Prentice Hall, 2009.
[18] Alex M Andrew. Reinforcement learning. Kybernetes, 1998.
[19] Foyzul Hassan and Xiaoyin Wang. Change-aware build prediction model for stall avoidance in continuous integration. In 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 157–162. IEEE, 2017.
[20] Rongqi Pan, Mojtaba Bagherzadeh, Taher A Ghaleb, and Lionel Briand. Test case selection and prioritization using machine learning: a systematic literature review. Empirical Software Engineering, 27(2):1–43, 2022.
[21] Vinicius HS Durelli, Rafael S Durelli, Simone S Borges, Andre T Endo, Marcelo M Eler, Diego RC Dias, and Marcelo P Guimarães. Machine learning applied to software testing: A systematic mapping study. IEEE Transactions on Reliability, 68(3):1189–1212, 2019.
[22] Du Zhang and Jeffrey JP Tsai. Machine learning and software engineering. Software Quality Journal, 11(2):87–119, 2003.
[23] Asad Ali and Carmine Gravino. A systematic literature review of software effort prediction using machine learning methods. Journal of Software: Evolution and Process, 31(10):e2211, 2019.
[24] Mihir Gada, Zenil Haria, Arnav Mankad, Kaustubh Damania, and Smita Sankhe. Automated feature engineering and hyperparameter optimization for machine learning. In 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS), volume 1, pages 981–986. IEEE, 2021.
[25] Shichao Zhang, Chengqi Zhang, and Qiang Yang. Data preparation for data mining. Applied artificial intelligence, 17(5-6):375–381, 2003.
[26] Udayan Khurana, Fatemeh Nargesian, Horst Samulowitz, Elias Khalil, and Deepak Turaga. Automating feature engineering. Transformation, 10(10):10, 2016.
[27] Ethem Alpaydin. Introduction to machine learning. MIT press, 2020.
[28] Barbara A Kitchenham and Stuart Charters. Guidelines for performing systematic literature reviews in software engineering technical report. Software Engineering Group, EBSE Technical Report, Keele University and Department of Computer Science University of Durham, 2, 2007.
[29] Barbara Kitchenham, Rialette Pretorius, David Budgen, O Pearl Brereton, Mark Turner, Mahmood Niazi, and Stephen Linkman. Systematic literature reviews in software engineering–a tertiary study. Information and software technology, 52(8):792–805, 2010.
[30] Thomas Van Klompenburg, Ayalew Kassahun, and Cagatay Catal. Crop yield prediction using machine learning: A systematic literature review. Computers and Electronics in Agriculture, 177:105709, 2020.
[31] Simin Wang, Liguo Huang, Amiao Gao, Jidong Ge, Tengfei Zhang, Haitao Feng, Ishna Satyarth, Ming Li, He Zhang, and Vincent Ng. Machine/deep learning for software engineering: A systematic literature review. IEEE Transactions on Software Engineering, 49(3):1188–1231, 2022.
[32] Ruchika Malhotra. A systematic review of machine learning techniques for software fault prediction. Applied Soft Computing, 27:504–518, 2015.
[33] Claes Wohlin. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In Proceedings of the 18th international conference on evaluation and assessment in software engineering, pages 1–10, 2014.
[34] Virginia Braun and Victoria Clarke. Using thematic analysis in psychology. Qualitative research in psychology, 3(2):77–101, 2006.
[35] Daniela S Cruzes and Tore Dyba. Recommended steps for thematic synthesis in software engineering. In 2011 international symposium on empirical software engineering and measurement, pages 275–284. IEEE, 2011.
[36] Ahmed E Hassan and Ken Zhang. Using decision trees to predict the certification result of a build. In 21st IEEE/ACM International Conference on Automated Software Engineering (ASE’06), pages 189–198. IEEE, 2006.
[37] Adithya Abraham Philip, Ranjita Bhagwan, Rahul Kumar, Chandra Sekhar Maddila, and Nachiappan Nagappan. Fastlane: test minimization for rapidly deployed large-scale online services. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pages 408–418. IEEE, 2019.
[38] William A Firestone. Meaning in method: The rhetoric of quantitative and qualitative research. Educational researcher, 16(7):16–21, 1987.
[39] Mary Dixon-Woods, Shona Agarwal, David Jones, Bridget Young, and Alex Sutton. Synthesising qualitative and quantitative evidence: a review of possible methods. Journal of health services research & policy, 10(1):45–53, 2005.
[40] Sean Stolberg. Enabling agile testing through continuous integration. In 2009 agile conference, pages 369–374. IEEE, 2009.
[41] Khaled Walid Al-Sabbagh, Miroslaw Staron, Regina Hebig, Wilhelm Meding, AK Tarhan, and A Coskunçay. Predicting test case verdicts using textual analysis of committed code churns. In IWSM-Mensura, pages 138–153, 2019.
[42] Sadia Ali, Yaser Hafeez, Shariq Hussain, and Shunkun Yang. Enhanced regression testing technique for agile software development and continuous integration strategies. Software Quality Journal, pages 1–27, 2019.
[43] Daniel Elsner, Florian Hauer, Alexander Pretschner, and Silke Reimer. Empirically evaluating readily available information for regression test optimization in continuous integration. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 491–504, 2021.
[44] Chaoyue Pan, Yang Yang, Zheng Li, and Junxia Guo. Dynamic time window based reward for reinforcement learning in continuous integration testing. In 12th Asia-Pacific Symposium on Internetware, pages 189–198, 2020.
[45] Jackson A Prado Lima, Willian DF Mendonça, Silvia R Vergilio, and Wesley KG Assunção. Learning-based prioritization of test cases in continuous integration of highly-configurable software. In Proceedings of the 24th ACM Conference on Systems and Software Product Line: Volume A-Volume A, pages 1–11, 2020.
[46] Ricardo Martins, Rui Abreu, Manuel Lopes, and João Nadkarni. Supervised learning for test suit selection in continuous integration. In 2021 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pages 239–246. IEEE, 2021.
[47] Jacqui Finlay, Russel Pears, and Andy M Connor. Data stream mining for predicting software build outcomes using source code metrics. Information and Software Technology, 56(2):183–198, 2014.
[48] Jing Xia, Yanhui Li, and Chuanqi Wang. An empirical study on the cross-project predictability of continuous integration outcomes. In 2017 14th Web Information Systems and Applications Conference (WISA), pages 234–239. IEEE, 2017.
[49] Mark Schwabacher. A survey of data-driven prognostics. Infotech@ Aerospace, page 7002, 2005.
[50] Carmine Vassallo. Enabling continuous improvement of a continuous integration process. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1246–1249. IEEE, 2019.
[51] Fanliang Chen, Zheng Li, Ying Shang, and Yang Yang. Focus on new test cases in continuous integration testing based on reinforcement learning. In 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS), pages 830–841. IEEE, 2022.
[52] Harsurinder Kaur, Husanbir Singh Pannu, and Avleen Kaur Malhi. A systematic review on imbalanced data challenges in machine learning: Applications and solutions. ACM Computing Surveys (CSUR), 52(4):1–36, 2019.
[53] Aizaz Sharif, Dusica Marijan, and Marius Liaaen. Deeporder: Deep learning for test case prioritization in continuous integration testing. In 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 525–534. IEEE, 2021.
[54] Lei Xiao, Huaikou Miao, Tingting Shi, and Yu Hong. Lstm-based deep learning for spatial–temporal software testing. Distributed and Parallel Databases, 2020.
[55] Zhaolin Wu, Yang Yang, Zheng Li, and Ruilian Zhao. A time window based reinforcement learning reward for test case prioritization in continuous integration. In Proceedings of the 11th Asia-Pacific Symposium on Internetware, pages 1–6, 2019.
[56] Vidhi Vig and Arvinder Kaur. Test effort estimation and prediction of traditional and rapid release models using machine learning algorithms. Journal of Intelligent & Fuzzy Systems, 35(2):1657–1669, 2018.
[57] Giovanni Grano, Timofey V Titov, Sebastiano Panichella, and Harald C Gall. How high will it be? using machine learning models to predict branch coverage in automated testing. In 2018 IEEE Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE), pages 19–24. IEEE, 2018.
[58] Shyam R Chidamber and Chris F Kemerer. A metrics suite for object oriented design. IEEE Transactions on software engineering, 20(6):476–493, 1994.
[59] Maurice H Halstead. Elements of Software Science (Operating and programming systems series). Elsevier Science Inc., 1977.
[60] Jung-Min Kim and Adam Porter. A history-based test prioritization technique for regression testing in resource constrained environments. In Proceedings of the 24th international conference on software engineering, pages 119–129, 2002.
[61] Alireza Haghighatkhah, Mika Mäntylä, Markku Oivo, and Pasi Kuvaja. Test prioritization in continuous integration environments. Journal of Systems and Software, 146:80–98, 2018.
[62] Mojtaba Bagherzadeh, Nafiseh Kahani, and Lionel Briand. Reinforcement learning for test case prioritization. arXiv preprint arXiv:2011.01834, 2020.
[63] Yavuz Koroglu, Alper Sen, Doruk Kutluay, Akin Bayraktar, Yalcin Tosun, Murat Cinar, and Hasan Kaya. Defect prediction on a legacy industrial software: A case study on software with few defects. In 2016 IEEE/ACM 4th International Workshop on Conducting Empirical Studies in Industry (CESI), pages 14–20. IEEE, 2016.
[64] Khaled Walid Al-Sabbagh, Regina Hebig, and Miroslaw Staron. The effect of class noise on continuous test case selection: A controlled experiment on industrial data. In International Conference on Product-Focused Software Process Improvement, pages 287–303. Springer, 2020.
[65] Mark Treveil, Nicolas Omont, Clément Stenac, Kenji Lefevre, Du Phan, Joachim Zentici, Adrien Lavoillotte, Makoto Miyazaki, and Lynn Heidmann. Introducing MLOps. O’Reilly Media, 2020.
[66] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
[67] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.
[68] Alex Ratner, Stephen Bach, Paroma Varma, and Chris Ré. Weak supervision: the new programming paradigm for machine learning. Hazy Research. Available via https://dawn.cs.stanford.edu//2017/07/16/weak-supervision/. Accessed, pages 05–09, 2019.
[69] Ahmadreza Saboor Yaraghi, Mojtaba Bagherzadeh, Nafiseh Kahani, and Lionel Briand. Scalable and accurate test case prioritization in continuous integration contexts. IEEE Transactions on Software Engineering, 2022.
[70] Ivan Porres, Tanwir Ahmad, Hergys Rexha, Sébastien Lafond, and Dragos Truscan. Automatic exploratory performance testing using a discriminator neural network. In 2020 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pages 105–113. IEEE, 2020.
[71] Ian H Witten and Eibe Frank. Data mining: practical machine learning tools and techniques with Java implementations. ACM SIGMOD Record, 31(1):76–77, 2002.
[72] Dusica Marijan, Arnaud Gotlieb, and Sagar Sen. Test case prioritization for continuous regression testing: An industrial case study. In 2013 IEEE International Conference on Software Maintenance, pages 540–543. IEEE, 2013.
[73] Helge Spieker, Arnaud Gotlieb, Dusica Marijan, and Morten Mossige. Reinforcement learning for automatic test case prioritization and selection in continuous integration. arXiv preprint arXiv:1811.04122, 2018.
[74] Amit Kumar Mondal, Banani Roy, and Kevin A Schneider. An exploratory study on automatic architectural change analysis using natural language processing techniques. In 2019 19th International Working Conference on Source Code Analysis and Manipulation (SCAM), pages 62–73. IEEE, 2019.
[75] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems, 24, 2011.
[76] Katharina Eggensperger, Matthias Feurer, Frank Hutter, James Bergstra, Jasper Snoek, Holger Hoos, Kevin Leyton-Brown, et al. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS workshop on Bayesian Optimization in Theory and Practice, volume 10, pages 1–5, 2013.
[77] Chuan-sheng Foo, Andrew Ng, et al. Efficient multiple hyperparameter learning for log-linear models. Advances in Neural Information Processing Systems, 20, 2007.
[78] Rafael Gomes Mantovani, André Luis Debiaso Rossi, Edesio Alcobaça, Jadson Castro Gertrudes, Sylvio Barbon Junior, and André Carlos Ponce de Leon Ferreira de Carvalho. Rethinking default values: a low cost and efficient strategy to define hyperparameters. arXiv preprint arXiv:2008.00025, 2020.
[79] Li Yang and Abdallah Shami. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing, 415:295–316, 2020.
[80] Lei Cheng and Qingjiang Shi. Towards overfitting avoidance: Tuning-free tensor-aided multi-user channel estimation for 3D massive MIMO communications. IEEE Journal of Selected Topics in Signal Processing, 15(3):832–846, 2021.
[81] Osval Antonio Montesinos López, Abelardo Montesinos López, and Jose Crossa. Overfitting, model tuning, and evaluation of prediction performance. In Multivariate statistical machine learning methods for genomic prediction, pages 109–139. Springer, 2022.
[82] Sreeja Ashok, Sangeetha Ezhumalai, and Tanvi Patwa. Remediating data drifts and re-establishing ML models. Procedia Computer Science, 218:799–809, 2023.
[83] Benoit Nougnanke, Yann Labit, Marc Bruyere, Ulrich Aivodji, and Simone Ferlin. ML-based performance modeling in SDN-enabled data center networks. IEEE Transactions on Network and Service Management, 20(1):815–829, 2022.
[84] Giovanni Grano, Timofey V Titov, Sebastiano Panichella, and Harald C Gall. Branch coverage prediction in automated testing. Journal of Software: Evolution and Process, 31(9):e2158, 2019.
[85] Cong Pan and Michael Pradel. Continuous test suite failure prediction. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 553–565, 2021.
[86] Bihuan Chen, Linlin Chen, Chen Zhang, and Xin Peng. BuildFast: History-aware build outcome prediction for fast feedback and reduced cost in continuous integration. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 42–53. IEEE, 2020.
[87] Xiao Qu, Myra B Cohen, and Katherine M Woolf. Combinatorial interaction regression testing: A study of test case generation and prioritization. In 2007 IEEE International Conference on Software Maintenance, pages 255–264. IEEE, 2007.
[88] Ebrahim Mortaz. Imbalance accuracy metric for model selection in multi-class imbalance classification problems. Knowledge-Based Systems, 210:106490, 2020.
[89] Anil Mor. Evaluate the effectiveness of test suite prioritization techniques using APFD metric. IOSR Journal of Computer, 16(4):47–51, 2014.
[90] Zhiyong Yang, Qianqian Xu, Shilong Bao, Xiaochun Cao, and Qingming Huang. Learning with multiclass AUC: Theory and algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7747–7763, 2021.
[91] Timothy O Hodson. Root-mean-square error (RMSE) or mean absolute error (MAE): When to use them or not. Geoscientific Model Development, 15(14):5481–5487, 2022.
[92] Yang Yang, Chaoyue Pan, Zheng Li, and Ruilian Zhao. Adaptive reward computation in reinforcement learning-based continuous integration testing. IEEE Access, 9:36674–36688, 2021.
[93] Apostolos Ampatzoglou, Stamatia Bibi, Paris Avgeriou, and Alexander Chatzigeorgiou. Guidelines for managing threats to validity of secondary studies in software engineering. Contemporary empirical methods in software engineering, pages 415–441, 2020.
[94] Xin Zhou, Yuqin Jin, He Zhang, Shanshan Li, and Xin Huang. A map of threats to validity of systematic literature reviews in software engineering. In 2016 23rd Asia-Pacific Software Engineering Conference (APSEC), pages 153–160. IEEE, 2016.
[95] Xianmin Wang, Jing Li, Xiaohui Kuang, Yu-an Tan, and Jin Li. The security of machine learning in an adversarial setting: A survey. Journal of Parallel and Distributed Computing, 130:12–23, 2019.
[96] Samuel Ackerman, Eitan Farchi, Orna Raz, Marcel Zalmanovici, and Parijat Dube. Detection of data drift and outliers affecting machine learning model performance over time. arXiv preprint arXiv:2012.09258, 2020.