R2 Indicator and Deep Reinforcement Learning Enhanced Adaptive Multi-Objective Evolutionary Algorithm

Farajollah Tahernezhad-Javazm (ORCID 0000-0002-5073-9802), Debbie Rankin (ORCID 0000-0003-2110-0599), Naomi Du Bois (ORCID 0000-0002-9350-2100), Alice E. Smith (ORCID 0000-0001-8808-0663), Life Fellow, IEEE, and Damien Coyle (ORCID 0000-0002-4739-1040), Senior Member, IEEE

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Farajollah Tahernezhad-Javazm, Debbie Rankin, Naomi Du Bois, and Damien Coyle are with Intelligent Systems Research Center, School of Computing, Engineering and Intelligent Systems, Ulster University, Londonderry, BT48 7JL, UK. (email: [email protected])

Damien Coyle and Naomi Du Bois are also with the Bath Institute for Augmented Human, University of Bath BA2 7AY, Bath UK.

Alice E. Smith is with the Department of Industrial and Systems Engineering and Department of Computer Science and Software Engineering, Auburn University, Auburn, AL 36849, USA.

Abstract

Choosing an appropriate optimization algorithm is essential to achieving success in optimization challenges. Here we present a new evolutionary algorithm structure that uses a reinforcement learning-based agent to address this issue. The agent employs a double deep Q-network to choose a specific evolutionary operator based on feedback it receives from the environment during optimization. The algorithm's structure contains five single-objective evolutionary algorithm operators. This single-objective structure is transformed into a multi-objective one using the R2 indicator. This indicator serves two purposes within our structure: first, it renders the algorithm multi-objective, and second, it provides a means to evaluate each algorithm's performance in each generation, which facilitates constructing the reinforcement learning-based reward function. The proposed R2-reinforcement learning multi-objective evolutionary algorithm (R2-RLMOEA) is compared with six other multi-objective algorithms that are based on R2 indicators. These six algorithms include the operators used in R2-RLMOEA as well as an R2 indicator-based algorithm that randomly selects operators during optimization. We benchmark performance using the CEC09 functions, with performance measured by inverted generational distance and spacing. The R2-RLMOEA algorithm outperforms all other algorithms with strong statistical significance ($p<0.001$) in terms of the average spacing metric across all ten benchmarks.

Index Terms:
Evolutionary Algorithms (EAs), Multi-objective Optimization Problem (MOP), R2 indicator, Reinforcement Learning (RL), Double Deep Q-learning (DDQN)

I Introduction

Due to the increasing complexity and difficulty of real-world problems, more reliable optimization techniques have become necessary in recent decades. There are numerous types of optimization problems, including single-objective/multi-objective (many-objective), continuous/discrete, and constrained/unconstrained. In recent literature, effort has been concentrated on multi-objective problems and many-objective problems (MOPs/MaOPs). MOPs/MaOPs can be addressed using two well-known strategies: mathematical-based approaches and evolutionary algorithms (EAs)/swarm-based intelligence (SI) methods. Although EAs and SI are more time-consuming than many mathematical-based methods, they are flexible to implement and more effective in non-differentiable, noisy, and complicated environments [1].

Generally, multi-objective evolutionary algorithms (MOEAs) can be divided into three main groups when dealing with MOPs/MaOPs: domination-based, decomposition-based, and indicator-based methods. Indicator-based algorithms have garnered significant attention recently in the field of MOEAs. They evaluate and rate the performance of each individual in an MOEA using a set of criteria derived from the indicator (convergence and distribution). The first MOEA method to use the indicator concept to solve MOPs is the Indicator-Based Evolutionary Algorithm (IBEA) [2]. IBEA, which applies hypervolume (HV) as the indicator, suffers from high computational cost, especially in MaOPs. To address this issue, many indicators have been introduced in recent years, such as R2 [3], $e^{+}$ [2], $\Delta_{p}$ [4], spacing (SP) [5], and inverted generational distance (IGD) [6]. Approaches based on dominance and decomposition are more prevalent than methods based on indicators; however, indicator-based techniques have recently demonstrated promising performance compared to other methods.

Each EA possesses unique features with respect to the two criteria of exploration and exploitation. Consequently, different EA methods produce different outputs. Selecting an appropriate EA for a specific problem is a time-consuming and tedious task that requires a knowledgeable expert. In addition, the dynamics of the problem change constantly during the optimization process, and applying suitable operators/parameters is crucial for success. Tuning and controlling are two approaches aimed at improving the selection of EA settings. The fundamental distinction between them lies in the timing of parameter/operator selection. In parameter tuning, EA settings are adjusted prior to the optimization process. In controlling methods, also known as 'controlling on the fly' [7, 8], parameters/operators are adjusted during execution. Given the nature of real-world problems, controlling methods, which can be divided into predictive methods and reinforcement-based methods, are better suited to tackling complex situations [7]. Predictive approaches employ statistical and machine learning models to forecast the parameters/operators that will benefit the future optimization process. In effect, these approaches treat the past performance of EAs as a time series to forecast future parameters [9, 10]. Reinforcement Learning (RL) is an optimization technique that follows the principles of the Markov Decision Process (MDP) and operates based on the agent's interactions with its environment [11]. As illustrated in Figure 1, the RL process comprises two main parts: agent and environment. It is a closed-loop process in which the agent performs an action based on feedback acquired from the environment (states and rewards), and the environment subsequently responds to the agent's action. This trial-and-error process continues until the agent learns how to cope with the environment.
The RL agent is responsible for choosing the best action for maximizing the final accumulated reward. RL is an efficient nature-inspired method for modifying EAs, particularly when dealing with complex conditions [12]. In general, RL can be classified into two well-known main categories: temporal difference-based and policy gradient-based methods. Temporal difference-based methods can be utilized in problems with discrete action space (operator selection), while policy gradient-based methods are suitable for continuous action space problems (parameter selection). Many researchers have addressed the combination of EAs with RL; however, the majority concentrated on single-objective evolutionary algorithms (SOEAs) such as the genetic algorithm (GA) [13, 14, 15] or differential evolution (DE) [16, 17, 18, 19] algorithms.

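[Diagram: an RL Agent block and an Environment block; the agent sends an action to the environment, and the environment returns states and a reward to the agent.]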
Figure 1: General reinforcement learning block diagram

Considering the aforementioned issues, we propose an adaptive MOEA structure that can deal with MOPs and is not limited to a single or specific EA. As the dynamics of MOPs vary with each generation of optimization, MOEAs require different operators to handle these dynamics, depending on the criteria of exploration and exploitation. Five EAs were therefore deployed in our work to meet this need. We transformed five single-objective EAs, namely GA [20], evolutionary strategy (ES) [21], teaching learning-based optimization (TLBO) [22], whale optimization algorithm (WOA) [23], and equilibrium optimizer (EO) [24], into multi-objective EAs by using the R2 indicator. EO and WOA are two relatively new EAs known for their fast convergence [25, 26, 27]. In contrast, GA and ES are more established EAs. ES is particularly good at exploration [28, 29], while TLBO has been shown to be effective at balancing the trade-off between exploration and exploitation [30, 31, 32]. In our architecture, a double deep Q-learning Network (DDQN) [33] is used as the hyper meta-heuristic to select the appropriate EA in each generation based on the optimization process's feedback. The proposed algorithm, called R2-RLMOEA, is tested on the CEC09 multi-objective benchmarks, and the results are compared to those of the five MOEAs listed previously, run without guidance from the RL agent. Based on the two indicators of IGD and SP, the results demonstrate the state-of-the-art performance of our proposed structure compared with the other algorithms.

The remainder of the paper is organized as follows: a literature review of recent works in the field of RL-EA is provided in Section II. Generic principles of multi-objective problems and Q-learning-based models are explained in Section III. Section IV introduces the combined EA-RL structure proposed in this work (R2-RLMOEA). Experimental settings and results are described in Section V and Section VI, respectively. The results are discussed in Section VIII. Finally, the conclusion is provided in Section IX.

II Related works

Regarding the combination of RL with EA, DE has received the most attention as a well-known EA in recent years. The essence of DE is based on the principle of mixing population variety to develop superior individuals. Like many other EAs, it employs the three operators of selection, crossover, and mutation, in which the two specific parameters of crossover probability (Cr) and scale factor (F) play a crucial role.
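To make the roles of F and Cr concrete, the following is a minimal sketch of one common DE variant (DE/rand/1/bin) in plain Python. It is an illustration only, not the implementation of any work reviewed here; the function names `de_rand_1_bin` and `sphere`, the population size, and the parameter values are assumptions chosen for the example.

```python
import random

def de_rand_1_bin(pop, F=0.5, Cr=0.9, rng=None):
    """One DE/rand/1/bin variation step; returns trial vectors (no selection)."""
    rng = rng or random.Random()
    n, d = len(pop), len(pop[0])
    trials = []
    for i in range(n):
        # Three distinct donors, all different from the target index i.
        r1, r2, r3 = rng.sample([j for j in range(n) if j != i], 3)
        # Mutation: F scales the difference vector (larger F, larger steps).
        mutant = [pop[r1][k] + F * (pop[r2][k] - pop[r3][k]) for k in range(d)]
        # Binomial crossover: a gene comes from the mutant with probability Cr;
        # position j_rand is always taken from the mutant.
        j_rand = rng.randrange(d)
        trial = [mutant[k] if (rng.random() < Cr or k == j_rand) else pop[i][k]
                 for k in range(d)]
        trials.append(trial)
    return trials

def sphere(x):
    """Toy single-objective test function (an assumption, not from the paper)."""
    return sum(v * v for v in x)

rng = random.Random(0)
pop = [[rng.uniform(-5, 5) for _ in range(4)] for _ in range(20)]
initial_best = min(sphere(p) for p in pop)
for _ in range(50):
    trials = de_rand_1_bin(pop, F=0.5, Cr=0.9, rng=rng)
    # Greedy one-to-one selection: a trial replaces its parent if it is no worse.
    pop = [t if sphere(t) <= sphere(p) else p for t, p in zip(trials, pop)]
final_best = min(sphere(p) for p in pop)
```

Because selection is greedy, `final_best` can never exceed `initial_best`; how quickly it improves depends on F, Cr, and the population size, which is what the RL-based controllers discussed in this section try to adapt.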

Weikang et al. [17] proposed a multi-objective evolutionary algorithm based on an RL controller and the multi-objective evolutionary algorithm based on decomposition (MOEA/D) [34] with Tchebycheff as the scalarization function. Unlike other works, they use the state-action-reward-state-action (SARSA) algorithm [11] instead of Q-learning to choose the number of neighbors and the type of DE mutation operator during optimization. The proposed structure is compared with the fitness-rate-rank-based multi-armed bandit (FRRMAB) [35] using the two metrics of IGD and HV [36] on ten benchmark functions from CEC09. In terms of the HV and IGD metrics, RL-MOEA/D performed better than FRRMAB on most of the test instances, except for UF4, UF5, UF6, and UF10. Although both algorithms achieved similar results, the median values obtained by RL-MOEA/D were better than those obtained by FRRMAB.

Tan et al. [19] introduced reinforcement learning-based hybrid parameters and mutation strategies differential evolution (RL-HPSDE), which is based on Q-learning, to improve the DE optimization process through parameter tuning and operator selection on single-objective benchmark problems from CEC17. Four states are identified based on the two methods of dynamic fitness distance correlation (DFDC) [37] and dynamic ruggedness of information entropy (DRIE) [38]. Actions are defined by combining two distribution functions (Levy and Cauchy distributions) and two mutation operators (DE/current to rand/1 and DE/current to best/1). In addition, they applied a variant (decreasing) population policy to improve the algorithm's performance. RL-HPSDE was compared with five algorithms (JADE, SHADE, LSHADE, iLSHADE [39], and jSO [40]) across different dimensions (10, 30, 50, and 100). The results showed that RL-HPSDE outperformed the other algorithms in 10, 30, and 50 dimensions. However, in 100 dimensions, jSO performed better than RL-HPSDE. Overall, RL-HPSDE is ranked first, followed by jSO; iLSHADE and LSHADE share a similar ranking, while JADE and SHADE are ranked last.

In addition to F and Cr, population size (N) can significantly affect DE performance. A large F leads to more exploration, whereas a small F leads to more exploitation. On the other hand, the best values of Cr and N vary with each particular problem [41, 42, 43]. Jianyong et al. [44] proposed a policy gradient-based RL structure (Learning Differential Evolution [LDE]) in continuous action space for tuning F and Cr. They applied a Long Short-Term Memory (LSTM) network to account for the nature of MDP-based problems and the state dependencies. They utilized CEC13 (for training and testing) and CEC17 (for testing) benchmark functions to demonstrate the effectiveness of their framework. The algorithm was tested on benchmarks with 10 and 30 dimensions. Overall, it outperformed variations of conventional DE algorithms but was inferior to jSO and HSES [45] in most of the benchmark tests, especially to jSO. The drawback of their structure was the long training time, which grew with the number of decision variables. LSTMs are significantly affected by the number of neurons in the hidden layers, and the authors found a complex relationship between the best number of hidden-layer neurons and each benchmark function.

Lue et al. [46] addressed the real-world multi-objective assembly line feeding problem with an RL-based multi-objective DE. They applied a decomposition-based method for converting multi-objective to single-objective problems, with an external archive for storing non-dominated solutions in their framework. They applied the three criteria of success rate, domination, and distribution to represent their states and to control convergence and distribution. Additionally, they utilized three mutation operators: DE/rand/2, DE/current-to-best/1, and DE/current-to-best/1. They determined that the proposed framework outperformed the three MOEAs of NSGA-II, NSGA-III [47], and SMPSO [48] in terms of convergence and solution quality.

Michele and Giovanni [49] proposed a generic model for parameter tuning in continuous action space. Proximal Policy Optimization (PPO) [50], a policy gradient-based RL method, is used as the agent for optimizing the parameters of two well-known EAs: Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [28] and DE. The agent chooses the proper step size (CMA-ES), scaling factor (DE), and crossover rate (DE) during the optimization process. Performance indicators include the area under the curve and the best-of-run fitness [51]. In addition, the reward function is defined based on the inter-generational $\Delta f$ to capture improvement in optimization performance and to remain robust across different objective scales. The results emphasize the importance of reward normalization and observation space design in achieving the best outcomes.

Many publications have explored the integration of RL with EAs for operator and parameter selection, as summarized in Table I. It is evident that most of these works concentrate on SOEAs, and even those addressing MOEAs predominantly focus on integration with DE operators. R2-RLMOEA distinguishes itself by targeting MOEAs through the innovative use of the R2 indicator, enhancing the dynamic decision-making capabilities of RL by incorporating a diverse array of EAs for optimization. This approach not only broadens the applicability of RL-EA integration to complex multi-objective optimization tasks but also leverages the unique strengths of various evolutionary strategies to navigate the optimization landscape more effectively.

TABLE I: Publications in the field of RL-EA (operator/parameter selection)
| Reference | RL algorithm | Type | EA algorithm |
|---|---|---|---|
| [44] | Policy Gradient algorithm | SOEA | DE |
| [49] | PPO | SOEA | CMA-ES & DE |
| [52] | Dueling Double Deep Q-Learning | SOEA | PSO [53] |
| [14], [54] | SARSA | SOEA | GA |
| [55] | Deep Q-Network | SOEA | Hyper-Heuristic Algorithm |
| [15] | Q-Learning & SARSA | SOEA | GA |
| [56] | Q-Learning | SOEA | ACO |
| [57] | Q-Learning | SOEA | memetic search |
| [58] | Q-learning | SOEA | ES, Cellular GA [59], GA MPC [60], IPO-10DDrCMAES [61] |
| [62] | RL principle | SOEA | GWO [63] |
| [64] | Q-Learning | SOEA | Firefly Algorithm [65] |
| [66] | Double Deep Q-learning | SOEA | DE |
| [19], [16] | Q-learning | SOEA | DE |
| [67], [46] | Q-learning | MOEA | DE |
| [68] | Q-learning | MOEA | NSGA-II-DE |
| [17] | SARSA | MOEA | DE operators & MOEA/D |
| R2-RLMOEA | Double Deep Q-learning | MOEA | ES, GA, TLBO, WOA, EO, R2 Indicator |

III Preliminaries

This section explains the algorithms and methods used in our paper. It includes the definition of multi-objective optimization and the concept of the R2 indicator for solving MOPs. Additionally, the fundamental building block of R2-RLMOEA, Double Deep Q-learning, is described.

III-A Multi-Objective Optimization Problem (MOP)

Generally, an MOP $\mathbf{F}(\mathbf{x})$ with $m$ objective functions $(f_{1}(\mathbf{x}),\dots,f_{m}(\mathbf{x}))$ can be written as Equation 1:

$$
\begin{aligned}
\mathrm{minimize}\;\;\mathbf{F}(\mathbf{x}) &= (f_{1}(\mathbf{x}),\dots,f_{m}(\mathbf{x}))^{T} \\
\forall\,\mathbf{x} &\in \Psi\;(\Psi^{n}\to R^{m})
\end{aligned}
\tag{1}
$$

where $\mathbf{x}=(x_{1},x_{2},\dots,x_{n})^{T}$ is a vector with dimension $n$ ($n$ is the number of decision variables), and $\Psi^{n}$ and $R^{m}$ are the search/decision space and objective space, respectively. The goal of Equation 1 is to minimize/maximize all objective functions $(f_{1}(\mathbf{x}),\dots,f_{m}(\mathbf{x}))$ in one run and find the Pareto optimal solutions and the Pareto Front (PF).

A vector $\mathbf{a}=(a_{1},a_{2},\dots,a_{m})^{T}$ dominates a vector $\mathbf{b}=(b_{1},b_{2},\dots,b_{m})^{T}$, denoted by $\mathbf{a}\prec\mathbf{b}$, iff $\forall i\in\{1,2,\dots,m\},\ f_{i}(\mathbf{a})\leq f_{i}(\mathbf{b})$ and $\exists k\in\{1,2,\dots,m\},\ f_{k}(\mathbf{a})<f_{k}(\mathbf{b})$; that is, $\mathbf{a}$ is no worse than $\mathbf{b}$ in every objective and strictly better in at least one. In addition, a vector in decision space $\mathbf{x}^{*}\in\Psi$ is called a Pareto optimal solution if there is no $\mathbf{y}\in\Psi$ such that $\mathbf{y}\prec\mathbf{x}^{*}$.
The Pareto optimal set (PS) is the set of Pareto optimal solutions, and PF is the projection of PS from decision space into objective space ($PF=\{\mathbf{F}(\mathbf{x})\in R^{m}\,|\,\mathbf{x}\in PS\}$).
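As a small illustration of these definitions, the sketch below checks standard Pareto dominance for minimization (no worse in every objective, strictly better in at least one) on objective vectors and filters a set down to its non-dominated members. The function names and the sample points are assumptions made for this example only.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def non_dominated(points):
    """Return the points that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Four hypothetical objective vectors for a two-objective problem.
objs = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = non_dominated(objs)
```

Here `(3.0, 3.0)` is dominated by `(2.0, 2.0)`, while the remaining three points are mutually non-dominated and form the approximation of the front for this toy set.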

III-B R2-Indicator

Although HV is one of the most widely employed indicators, the R2 indicator and its family (R1 and R3) [69] offer an alternative that evaluates both convergence and diversity [70]. Compared to HV, R2 requires substantially fewer computations and provides more distributed solutions [71]. The generic functionality of the R2 indicator is to compare the quality of two populations. To compute the R2 indicator, it is necessary to specify both the scalarization function and the reference point, and different scalarization functions have different effects on the results [69]. We utilized the achievement scalarization function (ASF) as the utility function in our study. R2 can thus be calculated as Equation 2 for the population/set $\mathcal{P}$, the reference/utopian point $\mathbf{z}^{*}$, and the weight vectors $\mathbf{w}=(w_{1},w_{2},\dots,w_{m})\in\Omega$ (with $\parallel\mathbf{w}\parallel_{1}=1$ and $w_{1},w_{2},\dots,w_{m}\geq 0$).

$$
R2(\mathcal{P},\Omega,\mathbf{z}^{*})=\frac{1}{|\Omega|}\sum_{\mathbf{w}\in\Omega}\min_{\mathbf{p}\in\mathcal{P}}\left\{ASF(\mathbf{p}|\mathbf{w},\mathbf{z}^{*})\right\}
\tag{2}
$$

where ASF is the scalarization function and can be defined as Equation 3:

$$
ASF(\mathbf{p}|\mathbf{w},\mathbf{z}^{*})=\max_{1\leq i\leq m}\left\{\frac{\left|p_{i}-z^{*}_{i}\right|}{w_{i}}\right\}
\tag{3}
$$

A smaller value of R2 is better, since it indicates a shorter distance between the reference point $\mathbf{z}^{*}$ and the set $\mathcal{P}$. The reference point is an ideal point that no point in the population can dominate.
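Equations 2 and 3 translate almost directly into code. The following is a minimal sketch for illustration, not the authors' implementation: the helper `uniform_weights_2d`, the small offset `eps` that keeps every weight strictly positive (so the division in the ASF is defined), and the two sample sets are all assumptions.

```python
def asf(p, w, z_star):
    """Achievement scalarization function (Equation 3)."""
    return max(abs(pi - zi) / wi for pi, wi, zi in zip(p, w, z_star))

def r2_indicator(population, weights, z_star):
    """R2 value (Equation 2): mean over weight vectors of the best ASF value."""
    return sum(min(asf(p, w, z_star) for p in population)
               for w in weights) / len(weights)

def uniform_weights_2d(k, eps=1e-6):
    """Hypothetical helper: k evenly spaced two-objective weights summing to 1."""
    ws = []
    for i in range(k):
        w1 = min(max(i / (k - 1), eps), 1.0 - eps)  # keep weights away from 0
        ws.append((w1, 1.0 - w1))
    return ws

z_star = (0.0, 0.0)                       # assumed utopian point
weights = uniform_weights_2d(11)
set_a = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.1)]
set_b = [tuple(2 * v for v in p) for p in set_a]   # same shape, twice as far
r2_a = r2_indicator(set_a, weights, z_star)
r2_b = r2_indicator(set_b, weights, z_star)
```

With the reference point at the origin, doubling every objective value doubles every ASF value, so `r2_b` is twice `r2_a`; the set closer to the reference point receives the smaller (better) R2 value.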

III-C Double Deep Q-learning (DDQN)

One of the most popular types of RL algorithm is Q-learning [72], which is a model-free and off-policy method and, like other types of RL, includes four parts: environment, state, action, and reward. The goal is to learn a policy enabling the agent to take optimal actions and maximize the rewards. In Q-learning, an agent makes decisions based on the action-value function (derived from the Bellman equation), which is updated according to Equation 4.

$$
Q(s_{t},a_{t})\leftarrow Q(s_{t},a_{t})+\alpha\left[r_{t+1}+\gamma\,\max_{a\in\mathcal{A}}Q(s_{t+1},a)-Q(s_{t},a_{t})\right]
\tag{4}
$$

where $\mathcal{A}=[a_{1},a_{2},\dots,a_{n}]$ is the set of possible actions, and $a_{t}$ and $s_{t}$ denote the action and state at time $t$, respectively. After performing action $a_{t}$, $r_{t+1}$ is the earned reward/punishment. $\alpha$ is the learning rate, which specifies how quickly learning proceeds, and $\gamma$ is the discount factor, a value between zero and one. Setting $\gamma=0$ reduces all future rewards to zero, so the agent is driven by immediate rewards only. On the other hand, if $\gamma$ is equal to one, future rewards count as much as immediate ones, which encourages the agent to develop a long-term policy [11]. Q-learning uses a look-up table (Q-table) to map states to actions and addresses problems with discrete states/actions. When the state space is large or continuous, it cannot function properly. Mnih et al. [73] tackled this issue by substituting a neural network for the Q-table. In deep Q-learning (DQN), the neural network acts as the function approximator for estimating the Q-values, from which the subsequent action is selected. Additionally, its performance is enhanced by an experience replay buffer [74]. Despite its success in many problems, DQN can suffer from overestimation of action values (resulting in premature convergence), unstable training, and poor performance. Double deep Q-learning (DDQN) [75] addresses this problem by adding another neural network. In DDQN, the first/main network ($Q^{m}$) is used for action selection, while a second/target network ($Q^{t}$), updated periodically from the main network, is used for action evaluation.

$$
Q^{m}(s_{t},a_{t};\theta)\leftarrow Q^{m}(s_{t},a_{t};\theta)+\alpha\Big[r_{t+1}+\gamma\,Q^{t}\Big(s_{t+1},\underset{a\in\mathcal{A}}{\arg\max}\;Q^{m}(s_{t+1},a;\theta);\theta^{'}\Big)-Q^{m}(s_{t},a_{t};\theta)\Big]
\tag{5}
$$

where $\theta$ and $\theta^{'}$ are the main and target network parameters, respectively. As shown in Algorithm 1, the main network selects the next action while the target network evaluates it; decoupling selection from evaluation in this way, instead of taking a single maximum over the next-state values, reduces the maximization bias.
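The difference between Equations 4 and 5 can be seen in a few lines of code. The sketch below is illustrative only: plain Python tables stand in for the neural networks, and the function names and numeric values are assumptions chosen so that the two targets differ visibly.

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Tabular update of Equation 4; q maps a state to a list of action values."""
    td_target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (td_target - q[s][a])
    return q[s][a]

def double_q_target(q_main, q_target, r, s_next, gamma=0.9):
    """Target term of Equation 5: the main table selects, the target evaluates."""
    values = q_main[s_next]
    a_star = values.index(max(values))      # action selection with Q^m
    return r + gamma * q_target[s_next][a_star]   # action evaluation with Q^t

# Equation 4 on a two-state, two-action toy table.
q = {0: [0.0, 0.0], 1: [1.0, 2.0]}
new_value = q_learning_update(q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)

# Equation 5 versus a single max over the target table.
q_main = {1: [1.0, 2.0]}
q_target = {1: [5.0, 0.5]}                  # action 0 is (wrongly) overestimated
ddqn_y = double_q_target(q_main, q_target, r=1.0, s_next=1, gamma=0.9)
single_max_y = 1.0 + 0.9 * max(q_target[1])
```

In this toy case the single-maximum target inherits the overestimated value of action 0 in the target table, whereas the double-Q target evaluates the action that the main table prefers and is therefore much smaller.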

input : Initialized Qmsuperscript𝑄𝑚Q^{m}italic_Q start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT and Qtsuperscript𝑄𝑡Q^{t}italic_Q start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, update limit (τ𝜏\tauitalic_τ), counter (c=0𝑐0c=0italic_c = 0)
maximum iteration (𝒩gamesubscript𝒩𝑔𝑎𝑚𝑒\mathcal{N}_{game}caligraphic_N start_POSTSUBSCRIPT italic_g italic_a italic_m italic_e end_POSTSUBSCRIPT), empty experience replay buffer (\mathcal{R}caligraphic_R)
output : trained Qmsuperscript𝑄𝑚Q^{m}italic_Q start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT
1 for n𝒩game𝑛subscript𝒩𝑔𝑎𝑚𝑒n\in\mathcal{N}_{game}italic_n ∈ caligraphic_N start_POSTSUBSCRIPT italic_g italic_a italic_m italic_e end_POSTSUBSCRIPT do
2       Observe stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and choose at𝒜subscript𝑎𝑡𝒜a_{t}\in\mathcal{A}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ caligraphic_A
3       apply $a_t$, go to $s_{t+1}$ and receive $r_{t+1}$
4       store $(s_t, a_t, r_{t+1}, s_{t+1}) \in \mathcal{R}$
5 for each update step do
6       $c \leftarrow c + 1$
7       sample a batch of $(s_t, a_t, r_{t+1}, s_{t+1}) \in \mathcal{R}$
8       $\mathcal{P} = r_{t+1} + \gamma\; Q^{t}(s_{t+1}, \underset{a\in\mathcal{A}}{\arg\max}\; Q^{m}(s_{t+1}, a; \theta), \theta^{\prime}) - Q^{m}(s_t, a_t, \theta)$
9       $Q^{m}(s_t, a_t) \leftarrow Q^{m}(s_t, a_t) + \alpha\; \mathcal{P}$
10       if $c$ equals $\tau$ then
11             $\theta^{\prime} \leftarrow \theta$
12             $c \leftarrow 0$
Algorithm 1 Double Deep Q-learning
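As a rough illustration of line 8 of Algorithm 1, the following Python sketch computes the double-DQN temporal-difference error for a sampled batch. It is a minimal sketch under assumptions: the two Q-functions are toy stand-ins rather than deep networks, and the names `ddqn_td_errors`, `q_main`, and `q_target` are hypothetical. The discount factor of 0.9 matches the value of $\gamma$ reported in Table IV.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
Transition = Tuple[State, int, float, State]    # (s_t, a_t, r_{t+1}, s_{t+1})
QFunction = Callable[[State], Sequence[float]]  # state -> one Q-value per action


def ddqn_td_errors(q_main: QFunction, q_target: QFunction,
                   batch: List[Transition], gamma: float = 0.9) -> List[float]:
    """TD error (Algorithm 1, line 8) for each transition in a sampled batch."""
    errors = []
    for s, a, r, s_next in batch:
        q_next_main = q_main(s_next)
        # The main network selects the greedy next action ...
        a_star = max(range(len(q_next_main)), key=q_next_main.__getitem__)
        # ... and the target network evaluates that action.
        target = r + gamma * q_target(s_next)[a_star]
        errors.append(target - q_main(s)[a])
    return errors


# Toy stand-ins for the two networks (hypothetical: two actions, 1-D state).
def q_main(s: State) -> List[float]:
    return [1.0 * s[0], 2.0 * s[0]]


def q_target(s: State) -> List[float]:
    return [0.5 * s[0], 1.5 * s[0]]


batch = [((1.0,), 0, 1.0, (2.0,)), ((2.0,), 1, 0.0, (1.0,))]
td = ddqn_td_errors(q_main, q_target, batch)
```

Separating action selection (main network) from action evaluation (target network) is what distinguishes the double variant from standard deep Q-learning. In a full implementation the errors would drive a gradient step on $\theta$, and $\theta^{\prime}$ would be overwritten with $\theta$ every $\tau$ update steps, as in lines 10 to 12.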

IV Proposed Algorithm (R2-RLMOEA)

This section presents the structure of R2-RLMOEA, which consists of two main parts: the EA part and the RL part. The general structure of the algorithm is depicted in Figure 2. After the EA parameters and the first population are initialized, the R2 ranking operator is applied and the reference points are updated. The RL states (the inputs of the deep neural network) are then calculated, and the RL agent chooses an action based on these inputs. Each action is associated with a specific evolutionary algorithm. Finally, the reward is computed based on the selected EA and the current generation, and the deep network is updated. This process continues until the maximum specified number of RL iterations (the maximum number of games) is reached. The final output is a trained network that can be used to solve a benchmark problem.

[Flow chart: Start; Initialize EA parameters; Initialize random population; Apply R2 rank operator and update reference points; Observe states; Select an action/EA; Compute the reward; Update Q-network; Apply R2 rank operator and update reference points; Termination? (yes: End; no: continue)]
Figure 2: R2-RLMOEA flow chart

The EA part includes the operators of five single-objective algorithms: GA, ES, TLBO, WOA, and EO. Because we apply the single-objective form of the EAs, the framework is generic, and other EAs might be used instead or in addition. In each generation, to convert the generated single-objective population into a multi-objective population, we apply the R2 ranking operator described in Algorithm 2.

input : Population members ($\mathcal{P}$), and weight vector ($\mathcal{W}$)
output : Sorted population members ($\mathcal{P}_{\mathrm{sorted}}$)
1 Compute the $L_2$-norm of $p.Obj$ ($p \in \mathcal{P}$)
2 for $\mathbf{w} \in \mathcal{W}$ do
3       for $p \in \mathcal{P}$ do
4             $p.Sval \leftarrow ASF(p.Obj \mid 0, \mathbf{u})$   # $p.Sval$: scalarization value, $p.Obj$: objectives
5       $\mathcal{P}_{\mathrm{sorted}}$ = Sort($\mathcal{P}$, $p.Sval^{1st}$, $L_2\text{-norm}^{2nd}$)   # 1st: first priority, 2nd: second priority
Algorithm 2 Population ranking structure

where $p.Sval$ is the scalar value of each individual under the ASF scalarization function, and the $L_2$-norm is the Euclidean norm of the objectives of each population member. $\mathcal{P}$ denotes the current population, whose members are ranked first by the ASF scalarization value and then by the $L_2$-norm. To improve the diversity of the solutions, we use the reference point updating method applied in [71]. This method reduces the objective space and subsequently decreases the number of outliers.
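The ranking idea of Algorithm 2 can be sketched in Python as follows. This is an illustration under several assumptions rather than the authors' implementation: the ASF is taken to be the weighted Chebyshev form $\max_i |f_i - z_i| / w_i$ with the reference point fixed at 0, the weight vector passed to the ASF is taken to be the current loop vector, and the per-weight orderings are merged by keeping each individual's best position (a rule borrowed from MOMBI-style R2 ranking, since Algorithm 2 does not spell out the merge). The names `asf` and `r2_rank` are hypothetical.

```python
import math
from typing import List, Sequence


def asf(obj: Sequence[float], ref: Sequence[float], w: Sequence[float]) -> float:
    """Achievement scalarizing function, assumed form: max_i |f_i - z_i| / w_i."""
    eps = 1e-6  # guards against zero weight components
    return max(abs(f - z) / max(wi, eps) for f, z, wi in zip(obj, ref, w))


def r2_rank(objs: List[Sequence[float]], weights: List[Sequence[float]]) -> List[int]:
    """Return population indices sorted by (best position over weights, L2 norm)."""
    n = len(objs)
    ref = [0.0] * len(objs[0])                       # reference point 0
    l2 = [math.sqrt(sum(f * f for f in o)) for o in objs]
    best_pos = [n] * n
    for w in weights:
        sval = [asf(o, ref, w) for o in objs]
        # First priority: scalarization value; second priority: L2 norm.
        order = sorted(range(n), key=lambda i: (sval[i], l2[i]))
        for pos, i in enumerate(order):
            best_pos[i] = min(best_pos[i], pos)
    return sorted(range(n), key=lambda i: (best_pos[i], l2[i]))


objs = [(0.2, 0.8), (0.5, 0.5), (0.9, 0.9), (0.8, 0.1)]
weights = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
ranking = r2_rank(objs, weights)
```

In this toy example the point (0.9, 0.9), which is dominated by (0.5, 0.5), ends up last, while the three mutually non-dominated points share the best position and are ordered by their $L_2$-norm.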

The RL part comprises two phases: online training and offline testing. To implement DDQN, we use two deep networks: the main network (used in both the online and offline phases) and the target network (used only in online training). The network inputs, also called states, consist of twenty parameters that represent the features of the dynamic environment in each optimization generation. All the states are listed in Table II.

TABLE II: States representation.
      Index       State       Explanation
      $s_1$       $\frac{f_{Q1}-f_{min}}{f_{max}-f_{min}}$       $f_{Q1}$: lower quartile of population performance; $f_{min}$: minimum performance; $f_{max}$: maximum performance
      $s_2$       $\frac{f_{Q2}-f_{min}}{f_{max}-f_{min}}$       $f_{Q2}$: median of population performance
      $s_3$       $\frac{f_{Q3}-f_{min}}{f_{max}-f_{min}}$       $f_{Q3}$: upper quartile of population performance
      $s_4$       $\frac{f_{Qmean}-f_{min}}{f_{max}-f_{min}}$       $f_{Qmean}$: average of all quartiles
      $s_5$       $\frac{\mathrm{SD}(\mathcal{P})}{\mathrm{SD}(\mathcal{P}_{minmax})}$       $\mathrm{SD}(\mathcal{P})$: standard deviation of population performance; $\mathrm{SD}(\mathcal{P}_{minmax})$: maximum performance standard deviation (half of $\mathcal{P}_{minmax}$ has the minimum performance and the rest the maximum performance)
      $s_6$       $\frac{G_{max}-G_t}{G_{max}}$       $G_{max}$: maximum generation; $G_t$: current generation
      $s_7$       $\frac{\mathrm{ED}(x_{Q1},x_{min})}{\mathrm{ED}(x_{max},x_{min})}$       $\mathrm{ED}(.)$: Euclidean distance
      $s_8$       $\frac{\mathrm{ED}(x_{Q2},x_{min})}{\mathrm{ED}(x_{max},x_{min})}$       $x_{min}$: decision variables related to $f_{min}$; $x_{max}$: decision variables related to $f_{max}$
      $s_9$       $\frac{\mathrm{ED}(x_{Q3},x_{min})}{\mathrm{ED}(x_{max},x_{min})}$
      $s_{10}$       $\frac{\mathrm{ED}(x_{Qmean},x_{min})}{\mathrm{ED}(x_{max},x_{min})}$
      $s_{11}$       $\frac{\mathrm{EO_{count}}}{G_{max}}$       $\mathrm{EO_{count}}$: number of times the EO operator has been selected up to $G_t$
      $s_{12}$       $\frac{\mathrm{WOA_{count}}}{G_{max}}$       $\mathrm{WOA_{count}}$: number of times the WOA operator has been selected up to $G_t$
      $s_{13}$       $\frac{\mathrm{TLBO_{count}}}{G_{max}}$       $\mathrm{TLBO_{count}}$: number of times the TLBO operator has been selected up to $G_t$
      $s_{14}$       $\frac{\mathrm{ES_{count}}}{G_{max}}$       $\mathrm{ES_{count}}$: number of times the ES operator has been selected up to $G_t$
      $s_{15}$       $\frac{\mathrm{GA_{count}}}{G_{max}}$       $\mathrm{GA_{count}}$: number of times the GA operator has been selected up to $G_t$
      $s_{16}$       $\frac{\mathrm{SuEO_{count}}}{\mathrm{EO_{count}}+\varepsilon}$       $\mathrm{SuEO_{count}}$: number of times the selected EO operator beat the previous operator; $\varepsilon$: $1.0\times10^{-6}$
      $s_{17}$       $\frac{\mathrm{SuWOA_{count}}}{\mathrm{WOA_{count}}+\varepsilon}$       $\mathrm{SuWOA_{count}}$: number of times the selected WOA operator beat the previous operator
      $s_{18}$       $\frac{\mathrm{SuTLBO_{count}}}{\mathrm{TLBO_{count}}+\varepsilon}$       $\mathrm{SuTLBO_{count}}$: number of times the selected TLBO operator beat the previous operator
      $s_{19}$       $\frac{\mathrm{SuES_{count}}}{\mathrm{ES_{count}}+\varepsilon}$       $\mathrm{SuES_{count}}$: number of times the selected ES operator beat the previous operator
      $s_{20}$       $\frac{\mathrm{SuGA_{count}}}{\mathrm{GA_{count}}+\varepsilon}$       $\mathrm{SuGA_{count}}$: number of times the selected GA operator beat the previous operator

where $f_{Q1}$, $f_{Q2}$, and $f_{Q3}$ represent the first, second, and third quartiles, respectively, of the current population performance. The performance of each individual is calculated by adding its $R_2$ rank and its $L_2$-norm, and the overall population cost is then obtained by summing all individual performances. In each generation, the lowest quartile (for a minimization problem) reflects the best performance of each algorithm. $f_{Qmean}$ is the average of all quartiles. $f_{min}$ and $f_{max}$ represent the minimum and maximum performance found in each RL game up to the current EA generation ($G_t$). $s_5$ is the ratio of the standard deviation of the current population performance to the maximum standard deviation. The maximum standard deviation is calculated by assuming that half of the population has the maximum performance (in that generation), while the rest of the population has the minimum performance. $s_6$ is the fraction of generations remaining.
$s_7$ to $s_{10}$ compute the Euclidean distance between each quartile ($s_7$ to $s_9$), or the average of the quartiles ($s_{10}$), and the decision variables ($x_{min}$) related to $f_{min}$. $s_{11}$ to $s_{15}$ give the number of times each specific EA has been selected up to the current generation, divided by the maximum number of generations ($G_{max}$). In addition, $s_{16}$ to $s_{20}$ give the number of selections of each specific EA that led to improved results, relative to its total number of selections (results refer to the overall population cost, including the average of all population performance). The agent's criterion for evaluating the performance of each EA is the average of all quartiles. Therefore, it is used as the basis of the reward, as shown in Equation 6.
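To make the performance-based states concrete, the following Python sketch computes $s_1$ to $s_6$ of Table II from a list of per-individual performance values. It is only an illustration under assumptions: the function name `performance_states`, the use of `statistics.quantiles` (default exclusive method) for the quartiles, and the small constant guarding the divisions are choices made here, not details given in the paper.

```python
import statistics
from typing import List, Sequence


def performance_states(perf: Sequence[float], f_min: float, f_max: float,
                       g_t: int, g_max: int) -> List[float]:
    """States s1..s6 of Table II from per-individual performance values.

    f_min / f_max: minimum and maximum performance seen so far in the RL game.
    """
    span = (f_max - f_min) or 1e-6                  # avoid division by zero
    q1, q2, q3 = statistics.quantiles(perf, n=4)    # lower quartile, median, upper quartile
    q_mean = (q1 + q2 + q3) / 3.0
    s1_to_s4 = [(q - f_min) / span for q in (q1, q2, q3, q_mean)]

    # s5: SD of the population over the SD of a hypothetical population in which
    # half of the members sit at this generation's minimum and half at its maximum.
    lo, hi = min(perf), max(perf)
    half = len(perf) // 2
    sd_minmax = statistics.pstdev([lo] * half + [hi] * (len(perf) - half)) or 1e-6
    s5 = statistics.pstdev(perf) / sd_minmax

    s6 = (g_max - g_t) / g_max                      # fraction of generations remaining
    return s1_to_s4 + [s5, s6]


perf = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
states = performance_states(perf, f_min=1.0, f_max=8.0, g_t=25, g_max=100)
```

Because every quantity is divided by its largest attainable value, the six entries fall in the range 0 to 1 for this input, which keeps the network inputs on a comparable scale.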

$$Reward = \begin{cases} \mathcal{V} & \text{if } Q(t)_{mean} > Q(t-1)_{mean} \\ 0.0 & \text{if } Q(t)_{mean} \leq Q(t-1)_{mean} \end{cases} \tag{6}$$

where $Q(t)_{mean}$ and $Q(t-1)_{mean}$ are the averages of the quartiles for the current and previous generations, respectively. Because the performance of the EAs in the first generations differs from their performance in the last generations, the reward is scaled by a value function ($\mathcal{V}$) that ranges from a specified minimum to a specified maximum value (Equation 7).

$$\mathcal{V} = \left(\frac{G_{max}-G_t}{G_{max}}\right)^{p}\left(c_{initial}-c_{final}\right)+c_{final} \tag{7}$$

where $c_{initial}$ indicates the initial reward value and $c_{final}$ specifies the final reward value (for the final generation). Because the reward does not change linearly over the generations, we also add a non-linearity index ($p$) to the reward function.
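Equations 6 and 7 translate directly into a few lines of Python. The sketch below is illustrative: the function names are hypothetical, and the defaults ($c_{initial}=1$, $c_{final}=5$, $p=3$, $G_{max}=100$) are taken from the parameter values reported in the experimental settings of this paper.

```python
def value_function(g_t: int, g_max: int = 100, c_initial: float = 1.0,
                   c_final: float = 5.0, p: float = 3.0) -> float:
    """Equation 7: reward magnitude at EA generation g_t."""
    return ((g_max - g_t) / g_max) ** p * (c_initial - c_final) + c_final


def reward(q_mean_now: float, q_mean_prev: float, g_t: int, g_max: int = 100) -> float:
    """Equation 6: V if Q(t)_mean > Q(t-1)_mean, otherwise 0."""
    return value_function(g_t, g_max) if q_mean_now > q_mean_prev else 0.0


# Reward magnitude at the start, middle, and end of a 100-generation game.
schedule = [value_function(g) for g in (0, 50, 100)]
```

With these defaults the magnitude starts at $c_{initial}$, rises quickly in the early generations, and flattens as it approaches $c_{final}$, so a successful step late in the run earns more than one early in the run.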

V Experimental Settings and Evaluation Metrics

To evaluate R2-RLMOEA, the CEC09 (UF1-UF10) benchmark problems with two and three objectives are considered. Table III [76] shows the characteristics of each test problem. R2-RLMOEA is compared with the five R2-based MOEAs whose operators are used in R2-RLMOEA (R2-GA [MOMBI-II], R2-ES, R2-TLBO, R2-WOA, and R2-EO) and with a random MOEA in which operators are selected randomly in each generation. The comparison shows how well our RL-based agent selects among the MOEAs during optimization, relative to both using a single fixed algorithm and selecting operators at random. In addition, two performance metrics, inverted generational distance (IGD) [77] and spacing (SP), are used to assess our algorithm's efficacy compared to the other MOEAs.

TABLE III: CEC09 benchmark characteristics.
Name Objective/Dimension Characteristics
UF1 2/30 Concave PF, Complex PS
UF2 2/30 Concave PF, Complex PS
UF3 2/30 Concave PF, Complex PS
UF4 2/30 Convex PF, Complex PS
UF5 2/30 Discrete PF, Complex PS
UF6 2/30 Discrete PF, Complex PS
UF7 2/30 Complex PS
UF8 3/30 Concave and Parabolic PF, Complex PS
UF9 3/30 Discrete and Planar PF, Complex PS
UF10 3/30 Concave and Parabolic PF

IGD is a popular indicator for measuring the convergence and diversity of non-dominated solutions. For a set of distributed reference points $\mathbf{R}=(r_1,r_2,...,r_m)$, IGD computes the average distance from each reference point to its nearest point in the set $\mathcal{P}$ (Equation 8).

$$\mathrm{IGD}(\mathcal{P},\mathbf{R}) = \frac{\sum_{r\in\mathbf{R}} \underset{x\in\mathcal{P}}{\min} \sqrt{\sum_{i=1}^{m}(f(x_i)-r_i)^2}}{|\mathbf{R}|} \tag{8}$$

where the square-root term is the Euclidean distance between a point of $\mathcal{P}$ and a point in $\mathbf{R}$. We also apply the spacing criterion to evaluate the distribution of the obtained solutions; it is based on the distance (generally the Euclidean distance) between each solution and its nearest neighbor in the decision variable space. Based on Equation 9, a smaller $\mathrm{SP}$ value indicates more evenly distributed solutions.

$$\mathrm{SP} = \frac{\sqrt{\sum_{i=1}^{n}(\mathcal{D}_i-\mathcal{D}_m)^2}}{n \ast \mathcal{D}_m} \tag{9}$$

where $n$ denotes the number of obtained solutions, $\mathcal{D}_i$ is the Euclidean distance between the $i^{th}$ solution and its closest neighbor, and $\mathcal{D}_m$ ($\frac{1}{n}\sum_{i=1}^{n}\mathcal{D}_i$) is the average of these distances. All implementations are coded in Python. To implement GA, ES, and some multi-objective functions, we used the Platypus package [78]. The initial parameters for the R2 indicator, TLBO, WOA, and EO are taken from their respective papers [71, 22, 23, 24]. The initial parameters for the general EA and RL structure are shown in Table IV.
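For readers who want to reproduce the two metrics, the sketch below implements Equations 8 and 9 with the Python standard library only. It is a minimal illustration rather than the evaluation code used in the paper: the function names `igd` and `spacing` are hypothetical, points are plain tuples, and `igd` expects the objective vectors of the obtained solutions.

```python
import math
from typing import Sequence

Point = Sequence[float]


def igd(solutions: Sequence[Point], reference: Sequence[Point]) -> float:
    """Equation 8: mean distance from each reference point to its nearest solution."""
    return sum(min(math.dist(r, x) for x in solutions) for r in reference) / len(reference)


def spacing(solutions: Sequence[Point]) -> float:
    """Equation 9: spread of nearest-neighbour distances, normalised by n * D_m."""
    n = len(solutions)
    d = [min(math.dist(x, y) for j, y in enumerate(solutions) if j != i)
         for i, x in enumerate(solutions)]
    d_mean = sum(d) / n
    return math.sqrt(sum((di - d_mean) ** 2 for di in d)) / (n * d_mean)


front = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
igd_value = igd(front, front)     # solutions coincide with the reference points
sp_value = spacing(front)         # neighbours are equally far apart
```

Both metrics are to be minimized: IGD approaches zero when every reference point has a solution close to it, and SP approaches zero when all nearest-neighbor distances are equal.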

TABLE IV: General EAs and RL parameters
RL parameters
$\mathcal{N}_{game}$: $1.0\times10^{5}$       $\gamma$: 0.9       $\mathcal{R}_{size}$: $1.0\times10^{5}$       $n\mathcal{H}$: 100
$n\mathcal{L}$: 2       $\epsilon_{initial}$: 0.9       $\epsilon_{final}$: $1.0\times10^{-3}$       $p$: 3
$\mathcal{B}_{size}$: 64       $\mathcal{U}_{freq}$: $1.0\times10^{3}$       $n_{action}$: 5       $n_{states}$: 20
EA parameters
$\mathcal{N}_{pop}$: 100       $\mathcal{G}_{max}$: $1.0\times10^{2}$

where $\mathcal{N}_{game}$ denotes the maximum number of RL games and $\mathcal{R}_{size}$ is the replay memory size. In our deep model, we use two neural networks, each with two hidden layers ($n\mathcal{L}$) of 100 nodes ($n\mathcal{H}$), and a batch size of 64 ($\mathcal{B}_{size}$). In addition, each RL game executes 100 EA generations ($\mathcal{G}_{max}$) with a population size of 100 ($\mathcal{N}_{pop}$). In RL online training, actions are selected using the epsilon-greedy policy, whereas in offline mode (test mode), the agent selects actions using the greedy policy. In the epsilon-greedy policy, epsilon varies from $9.0\times10^{-1}$ to $1.0\times10^{-3}$ with a power of $3$ over the game iterations. The same variation pattern is used in the RL reward to make the reward function more applicable. As shown in Figure 3, the reward increases non-linearly from $1$ to $5$ over the $100$ EA generations (based on Equations 6 and 7).
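The exploration schedule and action selection described above might look like the following Python sketch. It rests on an assumption: the text states only that epsilon follows the same power-law pattern as the reward, so the decay formula here simply reuses the form of Equation 7 with the game index in place of the generation index. The names `epsilon` and `select_action` are hypothetical, and the defaults come from Table IV.

```python
import random
from typing import Optional, Sequence


def epsilon(game: int, n_games: int = 100_000, eps_initial: float = 0.9,
            eps_final: float = 1e-3, p: float = 3.0) -> float:
    """Exploration rate decaying from eps_initial to eps_final (assumed Eq. 7 form)."""
    return ((n_games - game) / n_games) ** p * (eps_initial - eps_final) + eps_final


def select_action(q_values: Sequence[float], eps: float,
                  rng: Optional[random.Random] = None) -> int:
    """Epsilon-greedy choice among the EA operators; eps = 0 is the greedy test policy."""
    rng = rng if rng is not None else random.Random()
    if rng.random() < eps:
        return rng.randrange(len(q_values))                      # explore: random operator
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit: best Q-value


q = [0.1, 0.7, 0.2, 0.0, 0.3]               # one Q-value per operator (made-up numbers)
train_action = select_action(q, epsilon(game=0), random.Random(0))
test_action = select_action(q, eps=0.0)     # offline mode: always greedy
```

Passing `eps=0.0` reproduces the offline behavior, in which the trained network's highest-valued operator is always chosen.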

Figure 3: RL reward variation over EA generations.
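The epsilon and reward schedules described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the exact functional form is given by Equation 6 earlier in the paper, and the polynomial form used here, the function names `power_decay`, `power_growth`, and `select_action`, and all argument names are assumptions.

```python
import random


def power_decay(start, end, step, max_step, power=3):
    """Decay from `start` (step 0) to `end` (step == max_step).

    Assumed form: end + (start - end) * (1 - step / max_step) ** power.
    """
    frac = min(max(step / max_step, 0.0), 1.0)
    return end + (start - end) * (1.0 - frac) ** power


def power_growth(low, high, step, max_step, power=3):
    """Grow from `low` (step 0) to `high` (step == max_step).

    Assumed form: low + (high - low) * (step / max_step) ** power.
    """
    frac = min(max(step / max_step, 0.0), 1.0)
    return low + (high - low) * frac ** power


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice; epsilon = 0 gives the greedy (test-mode) policy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: random operator index
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

For example, `power_decay(0.9, 0.001, game, n_games)` would give the exploration rate for a given game index, and `power_growth(1, 5, generation, 100)` a reward scale for a given EA generation; `n_games` is a placeholder for $\mathcal{N}_{game}$.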

Providing an exact computational complexity in our algorithm is challenging due to the variability in operations of each EA, as well as the dynamics of the DDQN agent. However, as a high-level approximation, we can express the overall complexity as Equation 10:

\begin{align}
C_{total} &= \mathcal{G}_{max} \times (C_{RL} + C_{EA}) \tag{10} \\
C_{total} &= \mathcal{G}_{max} \times \left(O(n\mathcal{H}^{2}) + O(\mathcal{N}_{pop} \times \mathcal{O}_{op})\right) \notag
\end{align}

where $C_{RL}$ is the complexity of the neural network in the RL agent, which can be approximated as $O(n\mathcal{H}^{2})$, and $C_{EA}$ is the average computational cost of the EA, equal to $O(\mathcal{N}_{pop} \times \mathcal{O}_{op})$. Here, $\mathcal{O}_{op}$ denotes the average number of operations per individual, which varies with the specific EA. Because of the DDQN component, the computational complexity of our algorithm is higher than that of a simple MOEA, and this is particularly evident during the training of the network. However, since the trained network can be applied in offline mode, the benefits of RL compensate for the additional training time and complexity.
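As a rough worked instance of Equation 10, the settings reported above ($\mathcal{G}_{max}=100$, $n\mathcal{H}=100$, $\mathcal{N}_{pop}=100$) can be substituted directly. Dropping the constant factors hidden by the $O(\cdot)$ notation is an assumption made purely for illustration, so the result is an order-of-magnitude estimate rather than an operation count.

```latex
C_{total} \approx 100 \times \left(100^{2} + 100 \times \mathcal{O}_{op}\right)
          = 10^{6} + 10^{4}\,\mathcal{O}_{op}
```

Under this simplification, the network term and the EA term are of similar size when $\mathcal{O}_{op}$ is on the order of $100$ operations per individual; for cheaper operators the network term dominates, and for more expensive ones the EA term does.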

After playing 80% of the maximum number of games in each benchmark, we save the top five trained networks based on the maximum rewards they earned. These networks are then tested offline, and the ones with the best performances are reported. For each specific test function, we ran the selected network (in offline mode) 30 times for statistical analysis.
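One possible reading of this checkpoint selection step is sketched below. The function name `top_checkpoints`, its arguments, and the interpretation that only games played after the 80% mark are eligible are assumptions, not details confirmed by the paper.

```python
import heapq


def top_checkpoints(game_rewards, warmup_fraction=0.8, k=5):
    """Return (game_index, reward) pairs for the k best games after warm-up.

    game_rewards: total reward earned in each RL game, in play order.
    Only games played after `warmup_fraction` of all games are considered.
    """
    first_eligible = int(len(game_rewards) * warmup_fraction)
    eligible = enumerate(game_rewards[first_eligible:], start=first_eligible)
    # nlargest returns the pairs sorted from highest to lowest reward.
    return heapq.nlargest(k, eligible, key=lambda item: item[1])
```

The networks saved at the returned game indices would then be evaluated offline, each over 30 independent runs, as described above.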

To determine the appropriate statistical tests, we evaluated the distribution of the average performance across all benchmarks and algorithms compared. Since a power analysis found that a sample of 28 would be required to detect a large effect at $80\%$ power and an alpha level of $0.05$, the non-parametric Friedman test was performed to determine whether there was a significant difference in algorithm performance across all benchmarks, as this test is robust against outliers and does not assume a Gaussian distribution. However, to examine the performance on individual benchmarks, given the distribution of the data and the sample size, a paired mixed ANOVA is the most appropriate statistical test. For the mixed ANOVA, the between-group factor is algorithm (MOMBI-II, R2-ES, R2-TLBO, R2-WOA, R2-EO, Random Opt, and R2-RLMOEA), the within-group factors are benchmark (ten levels; UF1-10) and indicator (two levels; IGD and SP), and the dependent measure is the model performance (lower scores indicate better performance).
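For readers unfamiliar with the Friedman test, the statistic can be sketched in a few lines of pure Python. This is the textbook form without a tie-correction factor, not the statistical software used in the study; the function names and the data layout (one row per block, one column per algorithm) are assumptions.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(blocks):
    """Friedman chi-square statistic (no tie correction).

    blocks: list of rows, one per block (e.g. a benchmark/indicator pair),
    each holding one score per treatment (e.g. one per algorithm).
    """
    n = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
```

Because the statistic depends only on within-block ranks, a single extreme score changes it no more than any other score of the same rank, which is the robustness property referred to above.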

VI Results

The mean, standard deviation, and minimum of the IGD and SP values over the 30 evaluations of each algorithm are presented in Table V. The Friedman test results showed a significant difference in algorithm performance depending on the benchmark applied and the indicator used as a measure ($\eta^{2}(139)=4022.61$, $p<0.001$). The highlighted cells in the table indicate which algorithm performed best for each benchmark.

TABLE V: IGD and SP metrics for CEC09 test functions.
Benchmark Metric R2-RLMOEA R2-EO R2-WOA R2-TLBO R2-ES MOMBI-II Random a
UF1 IGD mean 0.10538 0.11867 0.16070 0.10723 0.77287 0.11075 0.09446
IGD min 0.08390 0.10397 0.12422 0.09922 0.56487 0.06794 0.06729
IGD std 0.01579 0.01750 0.02035 0.00900 0.07568 0.03586 0.00954
SP mean 0.01238 0.00749 0.01392 0.03036 0.08208 0.00318 0.00814
SP min 0.00251 0.00429 0.00538 0.01225 0.04518 0.00113 0.00210
SP std 0.01092 0.00236 0.00945 0.01103 0.02436 0.00176 0.00430
UF2 IGD mean 0.04918 0.08924 0.08186 0.08915 0.64830 0.05011 0.06958
IGD min 0.03825 0.07783 0.05865 0.06963 0.55292 0.03805 0.05747
IGD std 0.00782 0.00511 0.00758 0.01093 0.04676 0.01216 0.00740
SP mean 0.00934 0.02695 0.02568 0.02376 0.05485 0.01164 0.02024
SP min 0.00501 0.01422 0.01031 0.00923 0.03346 0.00484 0.01261
SP std 0.00226 0.00923 0.01147 0.01339 0.01429 0.00360 0.01141
UF3 IGD mean 0.26263 0.37566 0.26473 0.22021 1.13164 0.25866 0.26826
IGD min 0.18657 0.30857 0.21376 0.18771 0.90526 0.21038 0.24050
IGD std 0.06051 0.10245 0.02275 0.01992 0.09330 0.03278 0.01305
SP mean 0.00581 0.01070 0.01236 0.01059 0.16723 0.00839 0.00964
SP min 0.00182 0.00110 0.00611 0.00602 0.10375 0.00464 0.00506
SP std 0.00423 0.00705 0.00597 0.00454 0.03871 0.00278 0.00506
UF4 IGD mean 0.05431 0.09427 0.06804 0.06269 0.11733 0.05489 0.05619
IGD min 0.05055 0.08364 0.05767 0.05824 0.11082 0.05173 0.05382
IGD std 0.00367 0.00531 0.00700 0.00233 0.00308 0.00331 0.00191
SP mean 0.01328 0.01268 0.01882 0.00952 0.01003 0.01606 0.01602
SP min 0.00904 0.00982 0.00529 0.00696 0.00818 0.00893 0.00894
SP std 0.00572 0.00201 0.00938 0.00199 0.00133 0.00609 0.00377
UF5 IGD mean 0.47041 0.95651 0.87734 0.96416 3.52099 0.53685 0.59164
IGD min 0.24627 0.80740 0.63975 0.66813 3.07449 0.29841 0.39338
IGD std 0.15464 0.09763 0.16326 0.13353 0.21883 0.14403 0.14759
SP mean 0.03185 0.02221 0.07299 0.06861 0.18692 0.02756 0.03620
SP min 0.01121 0.00522 0.03673 0.02654 0.11942 0.01106 0.01902
SP std 0.01293 0.02508 0.04388 0.02571 0.05163 0.01351 0.01079
UF6 IGD mean 0.39789 0.54266 0.68166 0.50609 2.84649 0.22817 0.43189
IGD min 0.19638 0.45760 0.55416 0.37369 2.20107 0.12296 0.34400
IGD std 0.13448 0.08564 0.05756 0.13473 0.30047 0.08415 0.06448
SP mean 0.01232 0.02146 0.01017 0.05508 0.30391 0.01376 0.02065
SP min 0.00255 0.00615 0.00124 0.00785 0.14232 0.00661 0.00667
SP std 0.01951 0.03213 0.01064 0.07354 0.09813 0.00697 0.02332
UF7 IGD mean 0.05298 0.07339 0.10720 0.06002 0.71496 0.19040 0.05759
IGD min 0.04209 0.06393 0.08228 0.05437 0.58447 0.04075 0.05056
IGD std 0.00714 0.01273 0.01949 0.00262 0.06146 0.13883 0.00428
SP mean 0.00334 0.01354 0.01299 0.01612 0.08612 0.00432 0.00734
SP min 0.00109 0.00420 0.00752 0.00597 0.05164 0.00059 0.00423
SP std 0.00112 0.01096 0.00317 0.01044 0.02527 0.00595 0.00281
UF8 IGD mean 0.28916 0.49866 0.43428 1.04480 2.31742 0.32214 0.33271
IGD min 0.17132 0.39782 0.15952 0.72428 1.50500 0.17561 0.17740
IGD std 0.05910 0.03218 0.09175 0.25355 0.36207 0.11172 0.11418
SP mean 0.02619 0.01839 0.04698 0.51333 0.72981 0.03730 0.05175
SP min 0.01584 0.01276 0.01109 0.29071 0.41091 0.01631 0.01297
SP std 0.01469 0.00775 0.04714 0.18741 0.23122 0.02089 0.03294
UF9 IGD mean 0.20521 0.46875 0.38419 1.25015 2.46484 0.30993 0.30526
IGD min 0.09773 0.39060 0.25611 0.68833 1.63253 0.12381 0.12994
IGD std 0.11612 0.04570 0.08392 0.40548 0.36800 0.12175 0.11141
SP mean 0.02882 0.18032 0.08126 0.58679 0.73977 0.04505 0.08789
SP min 0.01053 0.05468 0.02350 0.26892 0.47925 0.00956 0.02428
SP std 0.01156 0.11952 0.06534 0.19189 0.18283 0.03479 0.07486
UF10 IGD mean 0.42335 0.66104 0.68001 7.47899 12.90669 0.59200 1.19085
IGD min 0.30726 0.39044 0.27059 4.45004 11.24410 0.32362 0.66454
IGD std 0.10891 0.14451 0.33837 1.88916 0.86784 0.09175 0.36029
SP mean 0.07480 0.15164 0.14133 1.96243 2.40635 0.14202 0.20189
SP min 0.05067 0.07663 0.05280 1.14109 1.10547 0.05534 0.10491
SP std 0.01311 0.04956 0.09954 0.75434 0.83837 0.06912 0.08863
a Random Operators of EO, WOA, TLBO, ES, and GA
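For reference, the two indicators reported in Table V can be sketched as below. These are the commonly used textbook forms (IGD with Euclidean distance, and Schott's spacing with nearest-neighbour L1 distances); the definitions given earlier in the paper may differ in details such as normalization, and the function names are assumptions.

```python
import math


def igd(reference_front, obtained_front):
    """Inverted generational distance: mean Euclidean distance from each
    reference point to its nearest obtained point (lower is better)."""
    total = 0.0
    for ref in reference_front:
        total += min(math.dist(ref, sol) for sol in obtained_front)
    return total / len(reference_front)


def spacing(obtained_front):
    """Schott's spacing: sample standard deviation of nearest-neighbour
    L1 distances within the obtained front (0 means evenly spaced)."""
    n = len(obtained_front)
    nearest = []
    for i, a in enumerate(obtained_front):
        nearest.append(min(
            sum(abs(x - y) for x, y in zip(a, b))
            for j, b in enumerate(obtained_front) if j != i
        ))
    mean_d = sum(nearest) / n
    return math.sqrt(sum((mean_d - d) ** 2 for d in nearest) / (n - 1))
```

Both functions take fronts as lists of objective vectors, so the same code covers the two-objective and three-objective test functions.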

Based on the results of the non-parametric analysis, the simple main effects were determined by comparing all algorithms with each other using a series of paired mixed ANOVAs. For these paired mixed ANOVAs, there were enough cases to provide $80\%$ power at an alpha level of $0.05$ to detect a medium effect. The results demonstrated a simple main effect for the performance of the MOMBI-II algorithm compared to all others ($p<0.001$) except the R2-RLMOEA algorithm, for the R2-TLBO algorithm compared to the R2-ES algorithm ($p<0.001$), for the R2-WOA algorithm compared to the R2-ES and R2-TLBO algorithms ($p<0.001$), and for Random Opt compared to the R2-ES, R2-TLBO, R2-WOA, and R2-EO algorithms ($p<0.03$). Overall, a simple main effect was found for the R2-RLMOEA algorithm when compared to all the other algorithms ($p<0.001$), demonstrating that the R2-RLMOEA algorithm is superior to all others when compared across all benchmarks.

Given the significantly improved performance of the R2-RLMOEA algorithm compared to all other algorithms, a series of mixed ANOVAs was run, separately for each indicator, to evaluate differences between algorithms on each benchmark. For these ANOVAs, the between-group factor was algorithm (comparing the R2-RLMOEA algorithm with each of the others, i.e., MOMBI-II, R2-ES, R2-TLBO, R2-WOA, R2-EO, and Random Opt, in a separate ANOVA), the benchmark was the repeated measure (ten levels; UF1-10), and the dependent measure was algorithm performance on either the SP measure or the IGD measure.

Based on SP, the R2-RLMOEA algorithm significantly outperformed all other algorithms on the individual benchmarks UF2, UF9, and UF10 (see Table VI for a summary of pairwise comparisons). However, the R2-RLMOEA algorithm was outperformed on benchmark UF1 by the MOMBI-II algorithm ($MD=0.008$, $SE=0.001$, $p<0.001$), on UF4 by the R2-ES and R2-TLBO algorithms ($MD=0.003$, $SE=0.009$, $p=0.018$ and $MD=0.005$, $SE=0.002$, $p=0.02$, respectively), and on UF5 by the MOMBI-II and R2-EO algorithms ($MD=0.01$, $SE=0.004$, $p=0.017$ and $MD=0.014$, $SE=0.003$, $p=0.001$, respectively).

TABLE VI: Results where the R2-RLMOEA algorithm significantly outperformed all other algorithms on benchmarks UF2, UF9, and UF10, when measured using the SP indicator.
Algorithms UF2 UF9 UF10
R2-RLMOEA vs MOMBI-II ($MD=-0.004$, $SE=0.001$, $p=0.007$) ($MD=-0.124$, $SE=0.035$, $p=0.002$) ($MD=-0.062$, $SE=0.011$, $p<0.001$)
R2-RLMOEA vs R2-ES ($MD=-0.043$, $SE=0.003$, $p<0.001$) ($MD=-2.28$, $SE=0.08$, $p<0.001$) ($MD=-2.26$, $SE=0.081$, $p<0.001$)
R2-RLMOEA vs R2-TLBO ($MD=-0.009$, $SE=0.003$, $p=0.02$) ($MD=-1.12$, $SE=0.099$, $p<0.001$) ($MD=-2.032$, $SE=0.23$, $p<0.001$)
R2-RLMOEA vs R2-WOA ($MD=-0.017$, $SE=0.002$, $p<0.001$) ($MD=-0.225$, $SE=0.031$, $p<0.001$) ($MD=-0.049$, $SE=0.015$, $p=0.004$)
R2-RLMOEA vs R2-EO ($MD=-0.16$, $SE=0.002$, $p<0.001$) ($MD=-0.297$, $SE=0.025$, $p<0.001$) ($MD=-0.083$, $SE=0.016$, $p<0.001$)
R2-RLMOEA vs Random Opt ($MD=-0.007$, $SE=0.001$, $p<0.001$) ($MD=-0.16$, $SE=0.032$, $p<0.001$) ($MD=-0.1$, $SE=0.009$, $p<0.001$)

Using the IGD indicator as the measure of performance, the R2-RLMOEA algorithm was also significantly better than all other algorithms, averaged across all ten benchmarks ($p<0.001$). For individual benchmarks, the R2-RLMOEA algorithm was significantly better than all other algorithms on the benchmarks UF7, UF9, and UF10 (Table VII). Again, however, the R2-RLMOEA algorithm was outperformed by the MOMBI-II algorithm on benchmark UF6 ($MD=0.166$, $SE=0.033$, $p<0.001$), by the R2-TLBO algorithm on benchmark UF3 ($MD=0.027$, $SE=0.006$, $p<0.001$), and by the Random Opt algorithm on benchmark UF1 ($MD=0.007$, $SE=0.002$, $p=0.007$).

TABLE VII: Results where the R2-RLMOEA algorithm significantly outperformed all other algorithms on benchmarks UF7, UF9, and UF10, when measured using the IGD indicator.
Algorithms UF7 UF9 UF10
R2-RLMOEA vs MOMBI-II ($MD=-0.131$, $SE=0.026$, $p<0.001$) ($MD=-0.125$, $SE=0.033$, $p=0.001$) ($MD=-0.177$, $SE=0.026$, $p<0.001$)
R2-RLMOEA vs R2-ES ($MD=-0.664$, $SE=0.012$, $p<0.001$) ($MD=-2.3$, $SE=0.07$, $p<0.001$) ($MD=-12.44$, $SE=0.181$, $p<0.001$)
R2-RLMOEA vs R2-TLBO ($MD=-0.008$, $SE=0.001$, $p<0.001$) ($MD=-1.127$, $SE=0.096$, $p<0.001$) ($MD=-6.94$, $SE=0.4$, $p<0.001$)
R2-RLMOEA vs R2-WOA ($MD=-0.054$, $SE=0.004$, $p<0.001$) ($MD=-0.195$, $SE=0.031$, $p<0.001$) ($MD=-0.3$, $SE=0.07$, $p<0.001$)
R2-RLMOEA vs R2-EO ($MD=-0.017$, $SE=0.001$, $p<0.001$) ($MD=-0.278$, $SE=0.016$, $p<0.001$) ($MD=-0.272$, $SE=0.045$, $p<0.001$)
R2-RLMOEA vs Random Opt ($MD=-0.006$, $SE=0.002$, $p=0.001$) ($MD=-0.114$, $SE=0.029$, $p=0.001$) ($MD=-0.797$, $SE=0.083$, $p<0.001$)

To provide a comprehensive visualization of the distribution and variability of the results among the different algorithms tested, a series of box plots is presented in Figures 4 to 13. Each box plot represents the IGD and SP distributions (for 30 independent runs) of a specific algorithm across various benchmarks. In addition, Figure 14 displays the combined box plots for the IGD and SP indicators for all applied algorithms over the CEC09 test functions.

Figure 4: Box plot of the IGD and SP metrics for applied algorithms on the UF1 test function.
Figure 5: Box plot of the IGD and SP metrics for applied algorithms on the UF2 test function.
Figure 6: Box plot of the IGD and SP metrics for applied algorithms on the UF3 test function.
Figure 7: Box plot of the IGD and SP metrics for applied algorithms on the UF4 test function.
Figure 8: Box plot of the IGD and SP metrics for applied algorithms on the UF5 test function.
Figure 9: Box plot of the IGD and SP metrics for applied algorithms on the UF6 test function.
Figure 10: Box plot of the IGD and SP metrics for applied algorithms on the UF7 test function.
Figure 11: Box plot of the IGD and SP metrics for applied algorithms on the UF8 test function.
Figure 12: Box plot of the IGD and SP metrics for applied algorithms on the UF9 test function.
Figure 13: Box plot of the IGD and SP metrics for applied algorithms on the UF10 test function.
Figure 14: Combined Box plot of the IGD and SP metrics for all algorithms on the CEC09 (UF1-UF10) test functions.

VII EA Choice Assessment

Selecting the appropriate evolutionary algorithm (EA) for a particular problem is crucial and requires professional expertise. Our framework takes this a step further by selecting a specific EA for each generation during the optimization process. By analyzing 30 independent runs, we report the percentage of runs in which each algorithm was selected at each generation. The operator selection results for UF1 to UF10 are presented in Figures 15 to 24, respectively. The results suggest that the agent prefers the ES method during the initial generations because of its exploration capabilities. As the optimization process progresses and requires more exploitation of the search space, the agent switches, in most cases, to GA and TLBO in the middle generations and to EO and WOA in the last generations. This trend was particularly noticeable for the three-objective test functions. Where optimization is suboptimal, this may be due to the agent's lack of confidence or to a poor selection of operators with respect to the exploration and exploitation criteria. In addition, Table VIII lists the percentage contribution of each EA operator for each test problem, demonstrating how the selection process adapts to the specific characteristics of each problem. Although a diverse range of operators is used across these benchmarks, the GA and TLBO operators are, on average, the most frequently selected in the CEC09 benchmark.

Figure 15: The percentage of each EA selected by the RL agent in each generation for the UF1 test function.
Figure 16: The percentage of each EA selected by the RL agent in each generation for the UF2 test function.
Figure 17: The percentage of each EA selected by the RL agent in each generation for the UF3 test function.
Figure 18: The percentage of each EA selected by the RL agent in each generation for the UF4 test function.
Figure 19: The percentage of each EA selected by the RL agent in each generation for the UF5 test function.
Figure 20: The percentage of each EA selected by the RL agent in each generation for the UF6 test function.
Figure 21: The percentage of each EA selected by the RL agent in each generation for the UF7 test function.
Figure 22: The percentage of each EA selected by the RL agent in each generation for the UF8 test function.
Figure 23: The percentage of each EA selected by the RL agent in each generation for the UF9 test function.
Figure 24: The percentage of each EA selected by the RL agent in each generation for the UF10 test function.
TABLE VIII: The percentage of EA operators selected across different benchmark problems.
Test Problem EO Operator WOA Operator TLBO Operator ES Operator GA Operator
UF1 20.87% 1.10% 21.47% 32.23% 24.33%
UF2 22.87% 0.00% 1.17% 0.00% 75.97%
UF3 5.70% 66.53% 3.40% 9.67% 14.70%
UF4 7.20% 0.70% 21.73% 0.13% 70.23%
UF5 18.60% 11.90% 14.77% 27.50% 27.23%
UF6 0.97% 16.97% 17.17% 21.30% 43.60%
UF7 14.73% 0.27% 36.77% 17.90% 30.33%
UF8 2.53% 11.47% 24.63% 13.93% 47.43%
UF9 7.80% 18.83% 24.17% 13.47% 35.73%
UF10 34.97% 0.07% 6.47% 24.00% 34.50%
Average 13.62% 12.78% 17.17% 16.01% 40.41%
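The selection percentages shown in Figures 15 to 24 and Table VIII can be derived from logs of the agent's choices. The sketch below assumes a hypothetical log format (one list of operator names per run, one entry per generation); the function name and data layout are illustrative and are not taken from the authors' code.

```python
from collections import Counter

OPERATORS = ("EO", "WOA", "TLBO", "ES", "GA")


def selection_percentages(runs):
    """Per-generation and overall percentage of times each operator was chosen.

    runs: list of runs, each a list holding the operator name chosen in every
    generation (all runs are assumed to have the same length).
    """
    n_runs = len(runs)
    n_gens = len(runs[0])
    per_generation = []
    for g in range(n_gens):
        counts = Counter(run[g] for run in runs)
        per_generation.append({op: 100.0 * counts[op] / n_runs for op in OPERATORS})
    totals = Counter(op for run in runs for op in run)
    overall = {op: 100.0 * totals[op] / (n_runs * n_gens) for op in OPERATORS}
    return per_generation, overall
```

With 30 runs of 100 generations each, `per_generation` would correspond to one of the per-benchmark figures and `overall` to one row of Table VIII.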

VIII Discussion

Selecting and optimizing a specific evolutionary algorithm for a particular optimization problem is a difficult task. To address this challenge, this paper presents an adaptive MOEA (R2-RLMOEA) in which a reinforcement learning agent guides the search process. We evaluated the performance of the R2-RLMOEA algorithm on the CEC09 benchmark problems UF1-UF10, which cover a range of test scenarios with two and three objectives. We systematically and statistically compared the performance of the R2-RLMOEA algorithm against five other R2-based MOEAs and an MOEA with random operator selection. This comparison provided valuable insights into how the RL-based agent selects among the MOEA operators during the optimization process, a crucial aspect of its design. We critically assessed the algorithm's performance to identify its strengths and weaknesses, using the IGD and SP metrics, which are commonly used to measure convergence and diversity.

Although R2-RLMOEA performs best on average across all benchmarks, and has thus been demonstrated to be the most versatile of the algorithms compared in this study, the results show that there is scope for improvement. Based on SP, MOMBI-II (UF1), R2-ES and R2-TLBO (UF4), and MOMBI-II and R2-EO (UF5) performed best; based on the IGD indicator, MOMBI-II (UF6), R2-TLBO (UF3), and Random Opt (UF1) performed best. Improvements to R2-RLMOEA would focus on additional optimization of specific EAs, perhaps by broadening hyperparameter ranges, to address challenges associated with different datasets/environments, and/or on incorporating additional EAs that are known to be better suited to specific environments/data.

The findings of the study highlight the potential of RL in optimizing algorithm selection. The RL-based agent exhibited an advanced capability to navigate through various MOEAs and made informed choices that improved the optimization process. Our analysis of EA selection showed that certain EAs perform better during the early generations of optimization, while others perform well during the final stages. After studying the results of the RL agent on the successful benchmarks, we observed and provided evidence that the ES algorithm has strong exploration capabilities and is best suited to the initial stages of optimization. The GA and TLBO algorithms, on the other hand, maintain a good balance between exploration and exploitation. Finally, EO and WOA were mostly used during the last generations, revealing their exploitation features. This adaptability is particularly important in multi-objective optimization tasks, where a one-size-fits-all approach is usually ineffective.

In the RL component of R2-RLMOEA, we employed the DDQN model, which is a model-free, off-policy method. DDQN uses experience replay, which makes it highly stable and effective in various environments. However, one disadvantage of DDQN is the costly network training time compared to EA optimization. Despite this, off-policy methods enable R2-RLMOEA to be well prepared before the actual optimization process (by training the network in advance so that it is better prepared for any state encountered later). During optimization, there is no difference between using an EA guided by an RL agent and an EA without an RL agent. Based on the results of the various tests, we conclude that the R2-RLMOEA algorithm is a highly effective tool for tackling complex optimization challenges in terms of average performance across all benchmarks. Its dynamic approach to selecting MOEAs is particularly noteworthy and highlights the benefits of incorporating reinforcement learning into EAs. These findings not only showcase the potential of this field of research but also pave the way for future advancements.
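The two DDQN ingredients mentioned above, experience replay and the double Q-learning target, can be sketched generically as follows. This is the standard Double DQN formulation, not the authors' code; the class and function names, the transition layout, and the default discount factor `gamma=0.99` are assumptions.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), batch_size)


def ddqn_target(reward, q_online_next, q_target_next, gamma=0.99, done=False):
    """Double DQN target: the online network picks the next action and the
    target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]
```

Separating action selection (online network) from action evaluation (target network) is what distinguishes Double DQN from the original DQN, and sampling past transitions from the buffer is what makes the method off-policy.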

IX Conclusion and future work

Our work introduces R2-RLMOEA, an adaptive algorithm that combines multiple optimization operators for efficient multi-objective optimization. The algorithm features a DDQN agent that selects the appropriate EA based on the state of the optimization process. The operators of five EAs (GA, ES, TLBO, WOA, and EO) are chosen adaptively to improve the optimization process, and an MOEA with randomly selected operators is included for comparison. The R2 indicator is used to convert each SOEA into an MOEA and to construct the RL reward architecture. The algorithm was evaluated using the IGD (for convergence and distribution) and SP (for distribution) criteria on the multi-objective CEC09 benchmarks, with results demonstrating that our agent-guided MOEA (R2-RLMOEA) significantly outperforms all other algorithms on average across the 10 benchmarks ($p<0.001$) and is significantly better than all other algorithms on the three-objective benchmarks UF8-UF10.

For future work, we plan to incorporate advanced multi-objective frameworks, both indicator-based and non-indicator-based, to enhance the conversion of SOEAs to MOEAs. Additionally, we will explore the use of more EAs to improve the convergence and distribution of our solutions across a wider range of test functions. We also plan to implement advanced RL algorithms, particularly policy-based methods that are suitable for optimizing continuous variables and parameters.

References

  • [1] A. Mukhopadhyay, U. Maulik, S. Bandyopadhyay, and C. A. C. Coello, “A survey of multiobjective evolutionary algorithms for data mining: Part i,” IEEE Transactions on Evolutionary Computation, vol. 18, no. 1, pp. 4–19, 2013.
  • [2] E. Zitzler and S. Künzli, “Indicator-based selection in multiobjective search,” in International conference on parallel problem solving from nature, pp. 832–842, Springer, 2004.
  • [3] D. Brockhoff, T. Wagner, and H. Trautmann, “On the properties of the R2 indicator,” in Proceedings of the 14th annual conference on genetic and evolutionary computation, pp. 465–472, 2012.
  • [4] O. Schutze, X. Esquivel, A. Lara, and C. A. C. Coello, “Using the averaged Hausdorff distance as a performance measure in evolutionary multiobjective optimization,” IEEE Transactions on Evolutionary Computation, vol. 16, no. 4, pp. 504–522, 2012.
  • [5] J. R. Schott, “Fault tolerant design using single and multicriteria genetic algorithm optimization,” tech. rep., 1995.
  • [6] C. A. C. Coello and M. R. Sierra, “A study of the parallelization of a coevolutionary multi-objective evolutionary algorithm,” in Mexican international conference on artificial intelligence, pp. 688–697, Springer, 2004.
  • [7] M. G. P. de Lacerda, L. F. de Araujo Pessoa, F. B. de Lima Neto, T. B. Ludermir, and H. Kuchen, “A systematic literature review on general parameter control for evolutionary and swarm-based algorithms,” Swarm and Evolutionary Computation, vol. 60, p. 100777, 2021.
  • [8] G. Karafotias, M. Hoogendoorn, and Á. E. Eiben, “Parameter control in evolutionary algorithms: Trends and challenges,” IEEE Transactions on Evolutionary Computation, vol. 19, no. 2, pp. 167–187, 2014.
  • [9] A. Aleti and I. Moser, “Predictive parameter control,” in Proceedings of the 13th annual conference on Genetic and evolutionary computation, pp. 561–568, 2011.
  • [10] A. Aleti, I. Moser, I. Meedeniya, and L. Grunske, “Choosing the appropriate forecasting model for predictive parameter control,” Evolutionary Computation, vol. 22, no. 2, pp. 319–349, 2014.
  • [11] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, 2018.
  • [12] M. M. Drugan, “Reinforcement learning versus evolutionary computation: A survey on hybrid algorithms,” Swarm and Evolutionary Computation, vol. 44, pp. 228–246, 2019.
  • [13] J. E. Pettinger and R. M. Everson, “Controlling genetic algorithms with reinforcement learning,” in Proceedings of the 4th annual conference on genetic and evolutionary computation, pp. 692–692, 2002.
  • [14] F. Chen, Y. Gao, Z.-q. Chen, and S.-f. Chen, “SCGA: Controlling genetic algorithms with sarsa (0),” in International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC’06), vol. 1, pp. 1177–1183, IEEE, 2005.
  • [15] A. Eiben, M. Horvath, W. Kowalczyk, and M. C. Schut, “Reinforcement learning for online control of evolutionary algorithms,” in ESOA, pp. 151–160, Springer, 2006.
  • [16] K. M. Sallam, S. M. Elsayed, R. K. Chakrabortty, and M. J. Ryan, “Evolutionary framework with reinforcement learning-based mutation adaptation,” IEEE Access, vol. 8, pp. 194045–194071, 2020.
  • [17] W. Ning, B. Guo, X. Guo, C. Li, and Y. Yan, “Reinforcement learning aided parameter control in multi-objective evolutionary algorithm based on decomposition,” Progress in Artificial Intelligence, vol. 7, pp. 385–398, 2018.
  • [18] T. Visutarrom, T.-C. Chiang, A. Konak, and S. Kulturel-Konak, “Reinforcement learning-based differential evolution for solving economic dispatch problems,” in 2020 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), pp. 913–917, IEEE, 2020.
  • [19] Z. Tan, Y. Tang, K. Li, H. Huang, and S. Luo, “Differential evolution with hybrid parameters and mutation strategies based on reinforcement learning,” Swarm and Evolutionary Computation, vol. 75, p. 101194, 2022.
  • [20] J. H. Holland, Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence. MIT Press, 1992.
  • [21] H.-G. Beyer and H.-P. Schwefel, “Evolution strategies–a comprehensive introduction,” Natural Computing, vol. 1, pp. 3–52, 2002.
  • [22] R. V. Rao, V. J. Savsani, and D. Vakharia, “Teaching–learning-based optimization: A novel method for constrained mechanical design optimization problems,” Computer-Aided Design, vol. 43, no. 3, pp. 303–315, 2011.
  • [23] S. Mirjalili and A. Lewis, “The whale optimization algorithm,” Advances in Engineering Software, vol. 95, pp. 51–67, 2016.
  • [24] A. Faramarzi, M. Heidarinejad, B. Stephens, and S. Mirjalili, “Equilibrium optimizer: A novel optimization algorithm,” Knowledge-Based Systems, vol. 191, p. 105190, 2020.
  • [25] M. A. Al-Betar, I. Abu Doush, S. N. Makhadmeh, G. Al-Naymat, O. A. Alomari, and M. A. Awadallah, “Equilibrium optimizer: a comprehensive survey,” Multimedia Tools and Applications, pp. 1–50, 2023.
  • [26] N. Rana, M. S. A. Latiff, S. M. Abdulhamid, and H. Chiroma, “Whale optimization algorithm: a systematic review of contemporary applications, modifications and developments,” Neural Computing and Applications, vol. 32, pp. 16245–16277, 2020.
  • [27] F. Tahernezhad-Javazm, D. Rankin, and D. Coyle, “R2-HMEWO: Hybrid multi-objective evolutionary algorithm based on the Equilibrium Optimizer and Whale Optimization Algorithm,” in 2022 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8, IEEE, 2022.
  • [28] N. Hansen and A. Ostermeier, “Completely derandomized self-adaptation in evolution strategies,” Evolutionary Computation, vol. 9, no. 2, pp. 159–195, 2001.
  • [29] N. Hansen, D. V. Arnold, and A. Auger, “Evolution strategies,” Springer Handbook of Computational Intelligence, pp. 871–898, 2015.
  • [30] F. Zou, D. Chen, and Q. Xu, “A survey of teaching–learning-based optimization,” Neurocomputing, vol. 335, pp. 366–383, 2019.
  • [31] F. Tahernezhad-Javazm, D. Rankin, and D. Coyle, “A Hybrid Multi-Objective Teaching Learning-Based Optimization Using Reference Points and R2 Indicator,” in 2022 6th International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence, pp. 19–23, 2022.
  • [32] F. Tahernezhad-Javazm, V. Azimirad, and M. Shoaran, “A review and experimental study on the application of classifiers and evolutionary algorithms in EEG-based brain–machine interface systems,” Journal of Neural Engineering, vol. 15, no. 2, p. 021007, 2018.
  • [33] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [34] Q. Zhang and H. Li, “MOEA/D: A multiobjective evolutionary algorithm based on decomposition,” IEEE Transactions on Evolutionary Computation, vol. 11, no. 6, pp. 712–731, 2007.
  • [35] K. Li, A. Fialho, S. Kwong, and Q. Zhang, “Adaptive operator selection with bandits for a multiobjective evolutionary algorithm based on decomposition,” IEEE Transactions on Evolutionary Computation, vol. 18, no. 1, pp. 114–130, 2013.
  • [36] E. Zitzler and L. Thiele, “Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach,” IEEE Transactions on Evolutionary Computation, vol. 3, no. 4, pp. 257–271, 1999.
  • [37] K. M. Malan and A. P. Engelbrecht, “A survey of techniques for characterising fitness landscapes and some possible ways forward,” Information Sciences, vol. 241, pp. 148–163, 2013.
  • [38] Y. Huang, W. Li, F. Tian, and X. Meng, “A fitness landscape ruggedness multiobjective differential evolution algorithm with a reinforcement learning strategy,” Applied Soft Computing, vol. 96, p. 106693, 2020.
  • [39] J. Brest, M. S. Maučec, and B. Bošković, “iL-SHADE: Improved L-SHADE algorithm for single objective real-parameter optimization,” in 2016 IEEE Congress on Evolutionary Computation (CEC), pp. 1188–1195, IEEE, 2016.
  • [40] J. Brest, M. S. Maučec, and B. Bošković, “Single objective real-parameter optimization: Algorithm jSO,” in 2017 IEEE Congress on Evolutionary Computation (CEC), pp. 1311–1318, IEEE, 2017.
  • [41] K. V. Price, “Eliminating drift bias from the differential evolution algorithm,” Advances in Differential Evolution, pp. 33–88, 2008.
  • [42] A. K. Qin and P. N. Suganthan, “Self-adaptive differential evolution algorithm for numerical optimization,” in 2005 IEEE Congress on Evolutionary Computation, vol. 2, pp. 1785–1791, IEEE, 2005.
  • [43] J. Ilonen, J.-K. Kamarainen, and J. Lampinen, “Differential evolution training algorithm for feed-forward neural networks,” Neural Processing Letters, vol. 17, pp. 93–105, 2003.
  • [44] J. Sun, X. Liu, T. Bäck, and Z. Xu, “Learning adaptive differential evolution algorithm from optimization experiences by policy gradient,” IEEE Transactions on Evolutionary Computation, vol. 25, no. 4, pp. 666–680, 2021.
  • [45] G. Zhang and Y. Shi, “Hybrid sampling evolution strategy for solving single objective bound constrained problems,” in 2018 IEEE Congress on Evolutionary Computation (CEC), pp. 1–7, IEEE, 2018.
  • [46] L. Tao, Y. Dong, W. Chen, Y. Yang, L. Su, Q. Guo, and G. Wang, “A differential evolution with reinforcement learning for multi-objective assembly line feeding problem,” Computers & Industrial Engineering, vol. 174, p. 108714, 2022.
  • [47] K. Deb and H. Jain, “An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: solving problems with box constraints,” IEEE Transactions on Evolutionary Computation, vol. 18, no. 4, pp. 577–601, 2013.
  • [48] A. J. Nebro, J. J. Durillo, J. Garcia-Nieto, C. C. Coello, F. Luna, and E. Alba, “SMPSO: A new PSO-based metaheuristic for multi-objective optimization,” in 2009 IEEE Symposium on Computational Intelligence in Multi-Criteria Decision-Making (MCDM), pp. 66–73, IEEE, 2009.
  • [49] M. Tessari and G. Iacca, “Reinforcement learning based adaptive metaheuristics,” in Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 1854–1861, 2022.
  • [50] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [51] G. Shala, A. Biedenkapp, N. Awad, S. Adriaensen, M. Lindauer, and F. Hutter, “Learning step-size adaptation in CMA-ES,” in Parallel Problem Solving from Nature–PPSN XVI: 16th International Conference, PPSN 2020, Leiden, The Netherlands, September 5-9, 2020, Proceedings, Part I, pp. 691–706, Springer, 2020.
  • [52] M. G. P. de Lacerda, F. B. de Lima Neto, H. d. A. A. Neto, H. Kuchen, and T. B. Ludermir, “On the learning properties of dueling DDQN in parameter control for evolutionary and swarm-based algorithms,” in 2019 IEEE Latin American Conference on Computational Intelligence (LA-CCI), pp. 1–6, IEEE, 2019.
  • [53] J. Kennedy and R. Eberhart, “Particle swarm optimization,” in Proceedings of ICNN’95 - International Conference on Neural Networks, vol. 4, pp. 1942–1948, IEEE, 1995.
  • [54] Y. Sakurai, K. Takada, T. Kawabe, and S. Tsuruta, “A method to control parameters of evolutionary algorithms by using reinforcement learning,” in 2010 Sixth International Conference on Signal-Image Technology and Internet Based Systems, pp. 74–79, IEEE, 2010.
  • [55] A. Dantas and A. Pozo, “Online Selection of Heuristic Operators with Deep Q-Network: A Study on the HyFlex Framework,” in Intelligent Systems: 10th Brazilian Conference, BRACIS 2021, Virtual Event, November 29–December 3, 2021, Proceedings, Part I, pp. 280–294, Springer, 2021.
  • [56] L. M. Gambardella and M. Dorigo, “Ant-Q: A Reinforcement Learning Approach to the Traveling Salesman Problem,” in Proceedings of ML-95, Twelfth International Conference on Machine Learning, pp. 252–260, 1995.
  • [57] S. D. Handoko, D. T. Nguyen, Z. Yuan, and H. C. Lau, “Reinforcement learning for adaptive operator selection in memetic search applied to quadratic assignment problem,” in Proceedings of the Companion Publication of the 2014 Annual Conference on Genetic and Evolutionary Computation, pp. 193–194, 2014.
  • [58] G. Karafotias, M. Hoogendoorn, and B. Weel, “Comparing generic parameter controllers for EAs,” in 2014 IEEE Symposium on Foundations of Computational Intelligence (FOCI), pp. 46–53, IEEE, 2014.
  • [59] E. Alba and B. Dorronsoro, “Introduction to cellular genetic algorithms,” in Cellular Genetic Algorithms, pp. 3–20, Springer, 2008.
  • [60] S. M. Elsayed, R. A. Sarker, and D. L. Essam, “GA with a new multi-parent crossover for solving IEEE-CEC2011 competition problems,” in 2011 IEEE Congress of Evolutionary Computation (CEC), pp. 1034–1040, IEEE, 2011.
  • [61] T. Liao and T. Stützle, “Bounding the population size of IPOP-CMA-ES on the noiseless BBOB testbed,” in Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation, pp. 1161–1168, 2013.
  • [62] E. Emary, H. M. Zawbaa, and C. Grosan, “Experienced gray wolf optimization through reinforcement learning and neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 3, pp. 681–694, 2017.
  • [63] S. Mirjalili, S. M. Mirjalili, and A. Lewis, “Grey wolf optimizer,” Advances in Engineering Software, vol. 69, pp. 46–61, 2014.
  • [64] A. K. Sadhu, A. Konar, T. Bhattacharjee, and S. Das, “Synergism of firefly algorithm and Q-learning for robot arm path planning,” Swarm and Evolutionary Computation, vol. 43, pp. 50–68, 2018.
  • [65] X.-S. Yang, Nature-inspired metaheuristic algorithms. Luniver Press, 2010.
  • [66] M. Sharma, A. Komninos, M. López-Ibáñez, and D. Kazakov, “Deep reinforcement learning based parameter control in differential evolution,” in Proceedings of the Genetic and Evolutionary Computation Conference, pp. 709–717, 2019.
  • [67] Z. Li, L. Shi, C. Yue, Z. Shang, and B. Qu, “Differential evolution based on reinforcement learning with fitness ranking for solving multimodal multiobjective problems,” Swarm and Evolutionary Computation, vol. 49, pp. 234–244, 2019.
  • [68] F. Zou, G. G. Yen, L. Tang, and C. Wang, “A reinforcement learning approach for dynamic multi-objective optimization,” Information Sciences, vol. 546, pp. 815–834, 2021.
  • [69] M. P. Hansen and A. Jaszkiewicz, Evaluating the quality of approximations to the non-dominated set. Citeseer, 1994.
  • [70] J. G. Falcón-Cardona and C. A. C. Coello, “Indicator-based multi-objective evolutionary algorithms: A comprehensive survey,” ACM Computing Surveys (CSUR), vol. 53, no. 2, pp. 1–35, 2020.
  • [71] R. Hernández Gómez and C. A. Coello Coello, “Improved metaheuristic based on the R2 indicator for many-objective optimization,” in Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pp. 679–686, 2015.
  • [72] B. Jang, M. Kim, G. Harerimana, and J. W. Kim, “Q-learning algorithms: A comprehensive classification and applications,” IEEE Access, vol. 7, pp. 133653–133667, 2019.
  • [73] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
  • [74] L.-J. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching,” Machine Learning, vol. 8, pp. 293–321, 1992.
  • [75] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, 2016.
  • [76] W. Yang, L. Chen, Y. Wang, and M. Zhang, “Multi/Many-Objective Particle Swarm Optimization Algorithm Based on Competition Mechanism,” Computational Intelligence and Neuroscience, vol. 2020, 2020.
  • [77] C. A. C. Coello and N. C. Cortés, “Solving multiobjective optimization problems using an artificial immune system,” Genetic Programming and Evolvable Machines, vol. 6, pp. 163–190, 2005.
  • [78] D. Hadka, “Platypus: Multiobjective optimization in Python,” Revision 0552483b, 2015.