License: arXiv.org perpetual non-exclusive license
arXiv:2312.15412v1 [cs.LG] 24 Dec 2023

CARSS: Cooperative Attention-guided Reinforcement Subpath Synthesis for Solving Traveling Salesman Problem

Yuchen Shi
Department of Mathematical Sciences
University of Chinese Academy of Sciences
Beijing 100049, China
[email protected]
Congying Han
Department of Mathematical Sciences
University of Chinese Academy of Sciences
Beijing 100049, China
[email protected]
Tiande Guo
Department of Mathematical Sciences
University of Chinese Academy of Sciences
Beijing 100049, China
[email protected]
Corresponding author
Abstract

This paper introduces CARSS (Cooperative Attention-guided Reinforcement Subpath Synthesis), a novel approach to address the Traveling Salesman Problem (TSP) by leveraging cooperative Multi-Agent Reinforcement Learning (MARL). CARSS decomposes the TSP solving process into two distinct yet synergistic steps: "subpath generation" and "subpath merging." In the former, a cooperative MARL framework is employed to iteratively generate subpaths using multiple agents. In the latter, these subpaths are progressively merged to form a complete cycle. The algorithm’s primary objective is to enhance efficiency in terms of training memory consumption, testing time, and scalability, through the adoption of a multi-agent divide and conquer paradigm. Notably, attention mechanisms play a pivotal role in feature embedding and parameterization strategies within CARSS. The training of the model is facilitated by the independent REINFORCE algorithm. Empirical experiments reveal CARSS’s superiority compared to single-agent alternatives: it demonstrates reduced GPU memory utilization, accommodates training graphs nearly 2.5 times larger, and exhibits the potential for scaling to even more extensive problem sizes. Furthermore, CARSS substantially reduces testing time and optimization gaps by approximately 50% for TSP instances of up to 1000 vertices, when compared to standard decoding methods.

1 Introduction

The Traveling Salesman Problem (TSP) stands as one of the quintessential combinatorial optimization challenges, seeking the shortest route to visit a set of cities and return to the origin. Its NP-hard nature has spurred continuous research into developing efficient algorithms capable of tackling real-world instances. Traditional methods, such as exact algorithms based on the cutting-plane method (Chvátal et al., 2009) or dynamic programming (Held and Karp, 1962; Bellman, 1962), and heuristic algorithms based on insertion (Rosenkrantz et al., 1974), local search (Helsgaun, 2000) or populations (Dorigo and Gambardella, 1997), often struggle with scalability and optimality for larger problem sizes, prompting the exploration of innovative paradigms that transcend the limitations of single-agent approaches.

In recent times, the field of Multi-Agent Reinforcement Learning (MARL) has gained prominence as a promising avenue for tackling intricate optimization problems. Notable benchmark environments include Level-Based Foraging (Albrecht and Ramamoorthy, 2013), Multi-Agent Particle Environment (Mordatch and Abbeel, 2018; Lowe et al., 2017), StarCraft Multi-Agent Challenge (Samvelyan et al., 2019), Multi-Robot Warehouse (Christianos et al., 2020; Dhamankar et al., 2020), Google Research Football (Kurach et al., 2020), and Hanabi (Bard et al., 2020). By harnessing the collaborative capabilities of multiple agents, cooperative MARL has the potential to enhance the efficiency of problem-solving processes, overcome computational bottlenecks, and advance scalability. Within this context, we introduce a new algorithm, Cooperative Attention-guided Reinforcement Subpath Synthesis (CARSS), designed to transform the approach to solving the Traveling Salesman Problem (TSP).

CARSS adopts a distinctive two-step strategy to decompose the TSP solving process. The first step, termed "subpath generation", harnesses the power of cooperative MARL to iteratively generate subpaths. Each agent contributes to constructing a subpath, collectively working towards achieving an optimal solution. The second step, "subpath merging," involves the incremental fusion of these subpaths to ultimately form a complete cycle that represents the solution to the TSP. This decomposition not only capitalizes on the strengths of MARL but also strategically divides the problem to mitigate the computational and memory burdens associated with large-scale instances.

A notable feature of CARSS lies in its incorporation of attention mechanisms, which serve a dual role in both feature embedding and parameterization strategies. These mechanisms enhance the agents’ ability to capture relevant information and learn effectively from their interactions with the environment. The training of the CARSS model is facilitated by the independent REINFORCE algorithm, a proven reinforcement learning technique.

Our contributions are threefold:

  • A novel algorithm, CARSS, is introduced for solving the TSP by leveraging cooperative MARL and attention mechanisms. The algorithm decomposes the problem into "subpath generation" and "subpath merging" steps, addressing memory consumption and scalability challenges.

  • The proposed approach demonstrates substantial improvements in terms of memory efficiency and testing times when compared to conventional single-agent algorithms. CARSS extends the capability to train on larger problem instances while maintaining solution quality.

  • Empirical results show that the CARSS algorithm reduces testing times and optimization gaps by approximately 50% for TSP instances of up to 1000 vertices, underscoring its potential to significantly enhance the efficiency of TSP-solving techniques.

2 Related Works

A considerable portion of the research in the realm of solving the TSP through supervised and reinforcement learning has been rooted in constructive modeling methodologies (Vinyals et al., 2015; Bello et al., 2016; Khalil et al., 2017; Kool et al., 2018; Bresson and Laurent, 2021). These approaches involve the stepwise selection of individual points, akin to methods driven by a singular agent. However, it is noteworthy that these methodologies tend to exhibit elevated time and space complexities when confronted with the task of addressing expansive problem scales. As a testament to this, numerous algorithms demonstrate their efficacy solely on problems of modest proportions, typically up to a size of 200, utilizing a prescribed quantum of GPU resources. For instance, Joshi et al. (2020) expound upon the challenges by affirming that "Training on large TSP200 from scratch is intractable and sample inefficient." This intrinsic computational burden consequently restricts their performance when applied to more substantial problem instances. Nevertheless, in contrast to these conventional paradigms, the CARSS algorithm introduces a pioneering approach that strategically decomposes the TSP-solving process into two distinct stages: subpath generation and subpath merging. By leveraging the principles of MARL, CARSS endeavors to surmount the limitations of memory consumption during training, mitigate testing duration, and amplify its scalability.

In the realm of TSP variations, Zhang et al. (2020) introduced a MARL-oriented framework addressing the vehicle routing problem encompassing soft time windows for a multi-vehicle scenario. This approach hinged upon predefined regulations, dictating a rotational decision-making process among vehicles. Notably, all vehicles shared a singular policy network, inadvertently rendering the framework functionally akin to single-agent control. Building upon this premise, Zong et al. (2022) advanced the paradigm by fashioning independent policy networks, eschewing the necessity for predetermined coordination rules in scenarios involving vehicle interaction. This liberation substantially expanded the exploration capacity within the collective of vehicle agents, efficiently tackling the intricacies posed by pickup and delivery problems. Extending this innovation, the CARSS algorithm extrapolates the concept into the domain of TSP, orchestrating a divide-and-conquer methodology tailored to surmount the challenges of larger-scale problem instances.

3 Method

In this section, we present the methodology of CARSS. The subsections that follow outline the cooperative Markov game formulation, algorithm specifics, policy parameterization, policy optimization, and complexity analysis.

3.1 Cooperative Markov Game Formulation for Traveling Salesman Problem

Define TSP as the tuple $(\mathcal{I},f,\mu)$, where $\mathcal{I}$ represents the set of graph instances, $f(G)$ denotes the set of all feasible solutions for the graph $G=(V(G),E(G),w(G))$ within the context of $\mathcal{I}$. Here, each graph $G$ comprises $v(G)$ vertices and $e(G)$ edges. The function $\mu(G,H)$ quantifies the value of solution $H$ within the set $f(G)$ concerning the problem's objective. In the context of TSP, $\mu(G,H)$ equates to $\sum_{e\in H}w_{e}$, where $w_{e}$ signifies the weight of edge $e$. The ultimate objective of the problem is to determine the solution $H$ that minimizes this objective value across all instances $G\in\mathcal{I}$, formally expressed as $\operatorname{argmin}_{H\in f(G)}\mu(G,H)$.
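As a small illustration of the objective $\mu(G,H)$, the following sketch sums the edge weights of a closed tour. It assumes Euclidean edge weights computed from vertex coordinates (the definition above allows general weights $w_e$); the names `tour_length`, `coords`, and `tour` are hypothetical and not taken from the paper.

```python
import math


def tour_length(coords, tour):
    """Return mu(G, H): the summed weight of the edges of a closed tour.

    coords: list of (x, y) vertex coordinates (Euclidean weights are an assumption).
    tour:   list of vertex indices; the edge from the last vertex back to the
            first one closes the cycle.
    """
    n = len(tour)
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % n]])
        for i in range(n)
    )


# Unit square: visiting the corners in order uses four edges of length 1.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(tour_length(square, [0, 1, 2, 3]))  # 4.0
```

A tour that crosses the square diagonally, such as `[0, 2, 1, 3]`, yields a larger value, which is the kind of solution the minimization is meant to rule out.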

For a multi-agent system involving $K\leq\frac{v(G)}{2}$ agents, we can establish the corresponding cooperative Markov game $(K,\mathcal{S},\{\mathcal{A}^{k}\}_{k\in\{1,\ldots,K\}},P,r)$ as follows:

  • The state space, denoted as $\mathcal{S}=\{s\mid s\subseteq H,\ H\in f(G),\ G\in\mathcal{I}\}$, encompasses all possible states. The initial state, $s_{0}$, is represented by the null graph $K_{0}$, while the state space at the final time step $T$ is $\mathcal{T}=\{H\mid H\in f(G),\ G\in\mathcal{I}\}$. Each agent shares an identical state at every time step and enjoys full access to all environmental observations.

  • The action space for agent $k$, denoted as $\mathcal{A}^{k}=V(G)\cup E(G)$, includes all vertices and edges of graph $G$. Furthermore, $\mathcal{A}_{s}^{k}=(V(G)\setminus V(s))\cup(E(G)\setminus E(s))$ characterizes the set of feasible actions for agent $k$ within state $s$.

  • The state transition probability function, $P:\mathcal{S}\times\mathcal{A}^{1}\times\cdots\times\mathcal{A}^{K}\rightarrow\Delta\mathcal{S}$, is defined as follows:

    $$P(s^{\prime}\mid s,a^{1},\ldots,a^{K})=\begin{cases}1&\text{if } s\in\mathcal{T}\text{ and }s^{\prime}=s,\\ 1&\text{if } s\in\mathcal{S}\setminus\mathcal{T}\text{ and }s^{\prime}=s+\sum_{k=1}^{K}a^{k},\\ 0&\text{otherwise},\end{cases}$$

    where $\Delta\mathcal{S}$ represents the probability simplex over the state space $\mathcal{S}$, and $s+a$ signifies the disjoint union of graph $s$ and graph $a$.

  • The reward function $r:\mathcal{S}\times\mathcal{A}^{1}\times\cdots\times\mathcal{A}^{K}\times\mathcal{S}\rightarrow\mathbb{R}$ is defined by $r(s,a^{1},\ldots,a^{K},s^{\prime})=-\mu(G,s^{\prime})$ if $s$ does not belong to $\mathcal{T}$ but $s^{\prime}$ does; otherwise, it is $0$.
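The transition and reward definitions above can be sketched as a single deterministic step function. This is a minimal sketch under simplifying assumptions: states are represented only as sets of edges (vertex actions are omitted), and the terminal test `is_terminal` and objective `mu` are passed in as hypothetical callables rather than derived from a graph.

```python
def step(state, joint_action, is_terminal, mu):
    """One deterministic transition of the cooperative Markov game.

    state:        frozenset of edges (i, j) chosen so far
    joint_action: iterable holding one edge per agent
    is_terminal:  predicate testing membership in the terminal set T
    mu:           objective value of a complete solution
    """
    if is_terminal(state):
        # Terminal states are absorbing and yield no further reward.
        return state, 0.0
    next_state = state | frozenset(joint_action)  # s' = s + sum_k a^k
    # The only nonzero reward is paid on entering a terminal state.
    reward = -mu(next_state) if is_terminal(next_state) else 0.0
    return next_state, reward


# Toy 4-cycle with unit weights and K = 2 agents (illustrative values only).
weights = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): 1.0}
is_terminal = lambda s: len(s) == 4          # toy stand-in for "s is a full tour"
mu = lambda s: sum(weights[e] for e in s)

s0 = frozenset()
s1, r1 = step(s0, [(0, 1), (2, 3)], is_terminal, mu)
s2, r2 = step(s1, [(1, 2), (3, 0)], is_terminal, mu)
s3, r3 = step(s2, [(0, 1)], is_terminal, mu)
print(r1, r2, s3 == s2, r3)  # 0.0 -4.0 True 0.0
```

The sparse reward means that all credit for the tour length arrives at the final transition, which is consistent with optimizing the return $R_T$ of the whole episode.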

Solving TSP involves the acquisition of a strategy denoted as $\pi_{\boldsymbol{\theta}}:\mathcal{S}\rightarrow\Delta(\mathcal{A}^{1}\times\cdots\times\mathcal{A}^{K})$, which is crafted to optimize the expected partial return $J(\boldsymbol{\theta})=\mathbb{E}_{\pi_{\boldsymbol{\theta}}}[R_{T}]$.

When the number of agents is $K=1$ and deep reinforcement learning is used to solve TSP, the model reduces to the classical single-agent reinforcement learning approach to the TSP (Kool et al., 2018; Bresson and Laurent, 2021). However, this conventional approach exhibits several limitations:

  • As a TSP route comprises a composition of $v(G)$ edges, invoking the policy network a minimum of $v(G)$ times diminishes the potential benefits of parallel computation within sequential models. Additionally, the substantial computational overhead hampers the training of the model for larger-scale problems.

  • Policy networks conventionally rely on the computation of the attention matrix, entailing a time and space complexity of $O(v(G)^{2})$ (Vaswani et al., 2017). This propensity for excessive memory utilization renders training infeasible for more extensive problem instances.

  • The actions generated through policy network sampling do not invariably constitute feasible solutions. Consequently, decoding necessitates the application of a mask to regulate the selection of visited vertices. However, as the termination point approaches, the number of visited vertices increases, resulting in a diminished space of viable actions. Consequently, the efficiency of attention matrix computation diminishes, leading to inefficient resource utilization.

On the other hand, in scenarios where the number of agents $K>1$, the policy network is invoked a minimum of $v(G)/K$ times. By predetermining the feasible actions for each agent, the number of viable actions per agent is averaged to $v(G)/K$, thereby reducing the time and space complexity of attention matrix computation to $O(\frac{v(G)^{2}}{K^{2}})$. This approach also indirectly enhances the efficiency of computational resource utilization, alleviating some of the constraints faced in the single-agent setting.
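To make the scaling argument concrete, the sketch below evaluates the two quantities from the paragraph above for a single-agent and a multi-agent setting. The function names and the instance size of 1000 vertices with 10 agents are illustrative assumptions; the numbers count policy invocations and attention-matrix entries only, not measured memory or runtime.

```python
import math


def policy_calls(num_vertices, num_agents=1):
    """Lower bound on policy-network invocations: about v(G) / K."""
    return math.ceil(num_vertices / num_agents)


def attention_entries(num_vertices, num_agents=1):
    """Entries of one attention matrix over about v(G) / K candidates per agent."""
    per_agent = math.ceil(num_vertices / num_agents)
    return per_agent ** 2


for k in (1, 10):
    print(k, policy_calls(1000, k), attention_entries(1000, k))
# 1 1000 1000000
# 10 100 10000
```

With $K=10$ the per-agent attention matrix shrinks by a factor of $K^2=100$, matching the $O(v(G)^2/K^2)$ expression.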

3.2 CARSS Algorithm

To address the limitations inherent in solving TSP with a single agent, we propose the CARSS algorithm. Designed for tackling the TSP within Euclidean space, CARSS effectively mitigates these limitations by strategically reducing the action and state space of the underlying Markov game, thereby approximating its optimal strategy.

The CARSS algorithm is structured around two pivotal phases: subpath generation and subpath merging. Within this context, "subpaths" represent non-circular graphs that form integral parts of the problem’s final solution tour.

During the subpath generation phase, CARSS is initialized with an empty graph, denoted as $K_{0}$. Each agent independently selects a vertex such that the choices do not overlap. Subsequently, during each time step, every agent extends its subpath by one edge from one of its endpoints. This synchronized edge addition results in the simultaneous incorporation of $K$ edges. This process continues until several subpaths of uniform lengths, devoid of intersections, are established. Here, "devoid of intersections" means that the vertex sets of any two subpaths are disjoint. It is noteworthy that this phase constitutes the majority of the algorithm's runtime due to its computationally intensive nature.

The subsequent subpath merging phase can be viewed as a single-agent approach. Within this phase, the algorithm connects $K$ subpaths and, at most, $K$ isolated points. This gradual connection is achieved by adding up to $2K$ edges, ultimately culminating in a complete cycle. This phase amounts to solving a specific TSP instance of a size not exceeding $4K$. The computation time associated with this phase is nearly negligible due to the relatively small size of the subproblem.

Subsequently, we provide the temporal range encompassing time step $t\in\{1,\ldots,T\}$ within the two phases of subpath generation and subpath merging. Furthermore, we expound upon the precise structure of the space of feasible actions $\mathcal{A}_{s}^{k}$ within state $s$ during these two phases.

3.2.1 Subpath Generation

In the subpath generation phase, we consider time steps denoted by $t\in\{1,\ldots,T^{\prime}\}$, where

$$T^{\prime}=\begin{cases}\frac{v(G)}{K}-2&\text{if } K\text{ evenly divides }v(G),\\ \left\lfloor\frac{v(G)}{K}\right\rfloor-1&\text{otherwise}.\end{cases}$$

The rationale for treating the case of $K$ evenly dividing $v(G)$ separately arises from the following consideration: when the algorithm advances to the $(v(G)/K-2)$th time step, a total of $K$ paths exist within the current state, each with a length of $v(G)/K-2$. At this point, the number of visited vertices is $v(G)-K$, and the number of isolated points is $K$. Consequently, by considering isolated points as subpaths with a length of 0, the total count of subpaths to be connected amounts to $2K$. When each agent carries out an additional action, the count of isolated points decreases to $0$, resulting in a reduction of the number of subpaths to $K$. It is evident that a scarcity of isolated points will lead to fewer optional vertices in later stages of the subpath generation phase, thereby reducing training efficiency. Hence, a balance is struck between the number of time steps in the subpath generation phase and the problem size in the subpath merging phase. This strategic compromise enhances overall algorithm performance by marginally decreasing the number of time steps in the subpath generation phase while augmenting the problem's complexity during the subpath merging phase.
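The piecewise definition of $T^{\prime}$ can be checked with a few lines of code. This is a minimal sketch; the function name `generation_steps` and the argument check for $K\leq v(G)/2$ are assumptions added for illustration.

```python
def generation_steps(num_vertices, num_agents):
    """Number of time steps T' in the subpath generation phase."""
    if not 1 <= num_agents <= num_vertices // 2:
        raise ValueError("expected 1 <= K <= v(G) / 2")
    quotient, remainder = divmod(num_vertices, num_agents)
    # One step fewer is taken when K divides v(G), which leaves K isolated
    # points for the merging phase instead of exhausting them.
    return quotient - 2 if remainder == 0 else quotient - 1


print(generation_steps(100, 10))  # 8
print(generation_steps(105, 10))  # 9
```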

At the initial time step $t=1$, the feasible action space for state $K_{0}$ is defined as $\mathcal{A}_{K_{0}}^{k}=V(G)$, where each initial action corresponds to the overlapping initial endpoints of a subpath. For each agent $k$, we denote the front and rear endpoints at its current state as $\operatorname{f}^{k}(s)$ and $\operatorname{r}^{k}(s)$, respectively. For subsequent time steps, when $s\neq K_{0}$, i.e., when $t\in\{2,\ldots,T^{\prime}\}$, $\mathcal{A}_{s}^{k}$ is determined by solving the following specialized assignment problem:

$$\begin{aligned}
\operatorname{minimize}\quad & \sum_{i=1}^{v(G)}\sum_{k=1}^{K}x_{ik}\min\{w_{i\operatorname{f}^{k}(s)},w_{i\operatorname{r}^{k}(s)}\}\\
\text{subject to}\quad & \sum_{k=1}^{K}x_{ik}=1, && i=1,\ldots,v(G)\\
& \sum_{i=1}^{v(G)}x_{ik}\geq 1, && k=1,\ldots,K\\
& x_{ik}\in\{0,1\}, && i=1,\ldots,v(G),\ k=1,\ldots,K
\end{aligned}$$

Here, $x_{ik}$ signifies whether the $i$th vertex is assigned to the $k$th agent. $\min\{w_{i\operatorname{f}^{k}(s)},w_{i\operatorname{r}^{k}(s)}\}$ represents the minimum distance from the $i$th vertex to either the first or last endpoint of the path corresponding to the $k$th agent. The objective function aims to minimize the total sum of distances from each vertex to its closest endpoint within the assigned agent's path. The first constraint mandates that each vertex must be assigned to exactly one agent. The second constraint ensures that each agent is assigned at least one vertex, thereby guaranteeing that the length of subpaths generated by each agent consistently increases over time. It's worth noting that the vertex assignment obtained from solving this problem might not necessarily lead to the optimal solution of the original problem. However, it can significantly reduce the action space for the agents, resulting in a substantial acceleration of subpath generation.

A heuristic is designed to efficiently solve the assignment problem. It first iterates over the agents and has each one select its nearest unassigned vertex, which fulfills the second constraint. Subsequently, each remaining unassigned vertex is assigned to its nearest agent. The "distance" between vertex $i$ and agent $k$ is defined as $\min\{w_{i\operatorname{f}^{k}(s)},w_{i\operatorname{r}^{k}(s)}\}$. The complete algorithm for solving this assignment problem is presented in Algorithm 1.

Having obtained an approximate solution to the problem, the feasible action space for each agent $k$ in state $s$ is characterized as $\mathcal{A}_{s}^{k}=\{(i,j)\in E\mid x_{ik}=1,\ i\notin V(s),\ j=\operatorname{argmin}_{j\in\{\operatorname{f}^{k}(s),\operatorname{r}^{k}(s)\}}w_{ij}\}$, which denotes the edges in $E$ with an unvisited vertex at one end and the front or rear vertex of the path corresponding to agent $k$ at the other end. Notably, these sets do not overlap with each other, i.e., $\{i\mid(i,j)\in\mathcal{A}_{s}^{k_{1}}\}\cap\{i\mid(i,j)\in\mathcal{A}_{s}^{k_{2}}\}=\varnothing,\ \forall k_{1},k_{2}\in\{1,\ldots,K\},\ k_{1}\neq k_{2}$. This ensures that the states at each time step consist of $K$ disjoint paths.

To enhance the model's viability in addressing large-scale problems, the feasible action space $\mathcal{A}_{s}^{k}$ for each agent in this phase is restricted to a maximum of $v(G)/K$ actions, encompassing those closest to the respective agent.
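The construction of $\mathcal{A}_{s}^{k}$ from a given vertex-agent assignment, including the cap on the number of actions, can be sketched as follows. The sketch assumes Euclidean weights and a hypothetical data layout: `assignment` maps each vertex to its agent, `endpoints` maps each agent to its (front, rear) pair, and `limit` plays the role of the $v(G)/K$ cap. None of these names come from the paper.

```python
import math


def feasible_actions(coords, assignment, endpoints, visited, agent, limit):
    """Edges (i, j) available to one agent, nearest candidates first.

    i is an unvisited vertex assigned to the agent; j is whichever of the
    agent's front and rear endpoints is closer to i.
    """
    front, rear = endpoints[agent]
    candidates = []
    for i, owner in assignment.items():
        if owner != agent or i in visited:
            continue
        j = min((front, rear), key=lambda e: math.dist(coords[i], coords[e]))
        candidates.append((math.dist(coords[i], coords[j]), i, j))
    candidates.sort()  # closest candidate vertices first
    return [(i, j) for _, i, j in candidates[:limit]]


# Six collinear vertices, two agents (illustrative values only).
coords = [(0, 0), (1, 0), (2, 0), (5, 0), (6, 0), (9, 0)]
endpoints = {0: (0, 1), 1: (4, 4)}   # agent 1 still holds a single vertex
visited = {0, 1, 4}
assignment = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

print(feasible_actions(coords, assignment, endpoints, visited, 0, 3))  # [(2, 1)]
print(feasible_actions(coords, assignment, endpoints, visited, 1, 3))  # [(3, 4), (5, 4)]
```

Because every vertex belongs to exactly one agent in `assignment`, the candidate vertices of different agents cannot overlap, which is the disjointness property stated above.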

Algorithm 1 Vertex-agent assignment heuristic algorithm
1: Input: vertex set $V$, number of agents $K$, current state $s$
2: Initialize decision variables $x_{ik}\leftarrow 0,\ i=1,\ldots,v(G),\ k=1,\ldots,K$
3: Initialize unassigned vertex list $U\leftarrow V$
4: for $k=1,\ldots,K$ do
5:     $i\leftarrow\operatorname{argmin}\{\min\{x_{i\operatorname{f}^{k}(s)},x_{i\operatorname{r}^{k}(s)}\}\mid i\in U\}$
6:     $x_{ik}\leftarrow 1$
7:     $U\leftarrow U\setminus\{i\}$
8: end for
9: while $U\neq\varnothing$ do
10:     $i\leftarrow U[0]$
11:     $k\leftarrow\operatorname{argmin}\{\min\{x_{i\operatorname{f}^{k}(s)},x_{i\operatorname{r}^{k}(s)}\}\mid k\in\{1,\ldots,K\}\}$
12:     $x_{ik}\leftarrow 1$
13:     $U\leftarrow U\setminus\{i\}$
14: end while
15: return $\{x_{ik}\}_{i=1,\ldots,v(G),\ k=1,\ldots,K}$
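
The two loops of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: it assumes that the quantity minimized in steps 5 and 11 is the Euclidean distance from vertex $i$ to the nearer of the front and rear vertices of agent $k$'s path, and it returns a vertex-to-agent dictionary instead of the binary variables $x_{ik}$. The names `assign_vertices`, `fronts`, and `rears` are hypothetical.

```python
import math

def assign_vertices(coords, vertices, fronts, rears):
    """Greedy vertex-agent assignment following the two loops of Algorithm 1."""
    num_agents = len(fronts)

    def cost(i, k):
        # Assumed cost: distance from vertex i to the nearer end of agent k's path.
        return min(math.dist(coords[i], coords[fronts[k]]),
                   math.dist(coords[i], coords[rears[k]]))

    unassigned = list(vertices)
    assignment = {}  # vertex -> agent, i.e. x_ik = 1

    # First loop: every agent receives its nearest unassigned vertex.
    for k in range(num_agents):
        if not unassigned:
            break
        i = min(unassigned, key=lambda v: cost(v, k))
        assignment[i] = k
        unassigned.remove(i)

    # Second loop: each remaining vertex goes to its nearest agent.
    while unassigned:
        i = unassigned.pop(0)
        assignment[i] = min(range(num_agents), key=lambda a: cost(i, a))

    return assignment

coords = [(0, 0), (10, 0), (1, 0), (9, 0), (2, 0), (8, 0)]
assignment = assign_vertices(coords, vertices=[2, 3, 4, 5],
                             fronts=[0, 1], rears=[0, 1])
print(assignment)
```

The first loop guarantees that no agent is left without a candidate vertex, while the second loop only needs one pass over the remaining vertices, so the heuristic costs $O(v(G)\cdot K)$ distance evaluations per call.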

3.2.2 Subpath Merging

The subpath merging stage can be conceptualized as addressing a specific variant of TSP using single-agent reinforcement learning. Upon completing the subpath generation, the state $S_{T^{\prime}}$ encompasses two distinct components: firstly, $K$ disjoint paths each of length $T^{\prime}$, and secondly, isolated vertices $\{I_{i}\}_{i\in\{1,\ldots,v(G)-K(T^{\prime}+1)\}}$, which can also be envisaged as $|I|$ paths of length $0$, creating a total of $K+|I|$ paths. To connect these paths into a complete tour, $K+|I|$ extra edges need to be incorporated. This gives rise to a graph $G^{\prime}$ of size $2(K+|I|)$ constructed as follows:

$$
\begin{aligned}
V(G^{\prime}) &= \{\operatorname{f}^{1}(S_{T^{\prime}}),\ldots,\operatorname{f}^{K}(S_{T^{\prime}}),I^{\operatorname{f}}_{1},\ldots,I^{\operatorname{f}}_{|I|},\operatorname{r}^{1}(S_{T^{\prime}}),\ldots,\operatorname{r}^{K}(S_{T^{\prime}}),I^{\operatorname{r}}_{1},\ldots,I^{\operatorname{r}}_{|I|}\},\\
E(G^{\prime}) &= \{(i,j)\mid i,j\in V(G^{\prime}),\ i\neq j\}\setminus\{(\operatorname{f}^{1}(S_{T^{\prime}}),\operatorname{r}^{1}(S_{T^{\prime}})),\ldots,(\operatorname{f}^{K}(S_{T^{\prime}}),\operatorname{r}^{K}(S_{T^{\prime}})),(I_{1}^{\operatorname{f}},I_{1}^{\operatorname{r}}),\ldots,(I_{|I|}^{\operatorname{f}},I_{|I|}^{\operatorname{r}})\},\\
w(G^{\prime}) &= w(G).
\end{aligned}
$$

Here, $I_{i}^{\operatorname{f}}=I_{i}^{\operatorname{r}}=I_{i}$ for $i\in\{1,\ldots,v(G)-K(T^{\prime}+1)\}$. $V(G^{\prime})$ consists of the front and rear vertices of each path in state $S_{T^{\prime}}$. $E(G^{\prime})$ comprises all potential edges that may be required to amalgamate subpaths into cycles. The edge weights correspond to those in the original graph $G$. Our objective is to find an algorithm that is both effective and efficient at identifying $K+|I|$ edges within the graph $G^{\prime}$, forming a tour that encompasses the edges $\{(i,j)\mid|i-j|=K+|I|\}$. This involves selecting an initial vertex and subsequently adding $K+|I|$ edges, thereby rendering $T=T^{\prime}+K+|I|$.
Given that reinforcement learning enables rapid approximate solutions in batch compared to exact algorithms, we employ single-agent reinforcement learning to address this sub-problem. The superscript $1$ indicating the single agent is omitted below for simplicity. The initial vertex $q_{T^{\prime}}$ can be selected at random from the set of all vertices in $V(G^{\prime})$. For time steps $t\in\{T^{\prime}+1,\ldots,T^{\prime}+K+|I|-1\}$, we begin by determining the other end of the current subpath from vertex $q_{t-1}$, denoted as $p_{t}$.
Subsequently, we derive the set of feasible actions $\mathcal{A}_{s}\leftarrow\{(p_{t},q_{t})\mid(p_{t},q_{t})\in E(G^{\prime}),\ q_{t}\notin\{q_{T^{\prime}},p_{T^{\prime}+1},q_{T^{\prime}+1},\ldots,p_{t-1},q_{t-1}\}\}$, and the action chosen at that time step is denoted as $(p_{t},q_{t})$. In the final time step $t=T^{\prime}+K+|I|$, a terminal edge must be selected to close the path into a complete cycle.
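
The merging procedure described above can be sketched in Python. Note the substitution: where CARSS applies a learned policy to choose $(p_{t},q_{t})$, this sketch uses a greedy nearest-endpoint rule purely as a stand-in, so it illustrates the state transitions and not the method's solution quality. Subpaths are given as vertex lists, isolated vertices as single-element lists, and the names `merge_subpaths`, `paths`, and `seed` are hypothetical.

```python
import math
import random

def merge_subpaths(coords, paths, seed=0):
    """Join disjoint subpaths into one tour; the closing edge is implicit."""
    rng = random.Random(seed)
    remaining = list(range(len(paths)))

    # Random initial vertex q_{T'}: a random end of a random subpath.
    first = rng.choice(remaining)
    remaining.remove(first)
    tour = list(paths[first])
    if rng.random() < 0.5:
        tour.reverse()

    while remaining:
        p = tour[-1]  # p_t: the other end of the subpath entered at q_{t-1}
        best = None
        for idx in remaining:
            # Feasible q_t: either end of a subpath that is not yet merged.
            for q in (paths[idx][0], paths[idx][-1]):
                d = math.dist(coords[p], coords[q])
                if best is None or d < best[0]:
                    best = (d, idx, q)
        _, idx, q = best
        segment = list(paths[idx])
        if segment[0] != q:
            segment.reverse()  # traverse the subpath starting from q_t
        tour.extend(segment)
        remaining.remove(idx)

    return tour  # the terminal edge (tour[-1], tour[0]) closes the cycle

coords = [(0, 0), (1, 0), (2, 0), (5, 5), (6, 5),
          (7, 5), (0, 9), (1, 9), (2, 9), (4, 2)]
paths = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
tour = merge_subpaths(coords, paths, seed=1)
print(tour)
```

With $K+|I|$ subpaths, the loop adds $K+|I|-1$ connecting edges and the implicit closing edge brings the count to $K+|I|$, matching the text.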

The comprehensive CARSS algorithm is depicted in Algorithm 2, while an illustrative example is showcased in Figure 1.

Algorithm 2 Cooperative Attention-guided Reinforcement Subpath Synthesis (CARSS) Algorithm
1: Input: graph $G=(V,E,w)$, number of agents $K$, starting vertices of agents $v_{1},\ldots,v_{K}\in V$
2: $s\leftarrow G[\{v_{1},\ldots,v_{K}\}]$
3: for $k=1,\ldots,K$ do
4:     $\operatorname{f}^{k}(s)\leftarrow v_{k}$
5:     $\operatorname{r}^{k}(s)\leftarrow v_{k}$
6: end for
7: $T^{\prime}\leftarrow\left\lfloor\frac{v(G)}{K}\right\rfloor-(1\text{ if }K\text{ divides }v(G)\text{ else }0)$
8: for $t=1,\ldots,T^{\prime}$ do
9:     Apply Algorithm 1 with $(V,K,s)$ to obtain $\{x_{ik}\}_{i=1,\ldots,v(G),\ k=1,\ldots,K}$
10:     $\mathcal{A}_{s}^{k}\leftarrow\{(i,j)\in E\mid x_{ik}=1,\ i\notin V(s),\ j=\operatorname{argmin}_{j\in\{\operatorname{f}^{k}(s),\operatorname{r}^{k}(s)\}}w_{ij}\},\ k=1,\ldots,K$
11:     Apply parameterized policy to obtain $A^{k}_{t}\in\mathcal{A}_{s}^{k},\ k=1,\ldots,K$
12:     $s\leftarrow s+\sum_{k=1}^{K}A^{k}_{t}$
13:     Update $\operatorname{f}^{k}(s)$ and $\operatorname{r}^{k}(s)$
14: end for
15: $I\leftarrow V(G)\setminus V(s)$  ▷ $|I|=v(G)-K(T^{\prime}+1)$
16: $\{I^{\operatorname{f}}_{1},\ldots,I^{\operatorname{f}}_{|I|}\}\leftarrow I$
17: $\{I^{\operatorname{r}}_{1},\ldots,I^{\operatorname{r}}_{|I|}\}\leftarrow I$
18: $V(G^{\prime})\leftarrow\{\operatorname{f}^{1}(S_{T^{\prime}}),\ldots,\operatorname{f}^{K}(S_{T^{\prime}}),I^{\operatorname{f}}_{1},\ldots,I^{\operatorname{f}}_{|I|},\operatorname{r}^{1}(S_{T^{\prime}}),\ldots,\operatorname{r}^{K}(S_{T^{\prime}}),I^{\operatorname{r}}_{1},\ldots,I^{\operatorname{r}}_{|I|}\}$
19: $E(G^{\prime})\leftarrow\{(i,j)\mid i,j\in V(G^{\prime}),\ i\neq j\}\setminus\{(\operatorname{f}^{1}(S_{T^{\prime}}),\operatorname{r}^{1}(S_{T^{\prime}})),\ldots,(\operatorname{f}^{K}(S_{T^{\prime}}),\operatorname{r}^{K}(S_{T^{\prime}})),(I_{1}^{\operatorname{f}},I_{1}^{\operatorname{r}}),\ldots,(I_{|I|}^{\operatorname{f}},I_{|I|}^{\operatorname{r}})\}$
20: $w(G^{\prime})\leftarrow w(G)$
21: $q_{T^{\prime}}\leftarrow\operatorname{RandomSelect}(V(G^{\prime}))$
22: for $t=T^{\prime}+1,\ldots,T^{\prime}+K+|I|-1$ do
23:     $p_{t}\leftarrow$ the other end of the subpath of vertex $q_{t-1}$
24:     $\mathcal{A}_{s}\leftarrow\{(p_{t},q_{t})\mid(p_{t},q_{t})\in E(G^{\prime}),\ q_{t}\notin\{q_{T^{\prime}},p_{T^{\prime}+1},q_{T^{\prime}+1},\ldots,p_{t-1},q_{t-1}\}\}$
25:     Apply parameterized policy to obtain $(p_{t},q_{t})=A_{t}\in\mathcal{A}_{s}$
26:     $s\leftarrow s+A_{t}$
27: end for
28: $p_{T^{\prime}+K+|I|}\leftarrow$ the other end of the subpath of vertex $q_{T^{\prime}+K+|I|-1}$
29: $A_{T^{\prime}+K+|I|}\leftarrow(p_{T^{\prime}+K+|I|},q_{T^{\prime}})$
30: $s\leftarrow s+A_{T^{\prime}+K+|I|}$
31: return $s$
Figure 1: Illustration of CARSS Algorithm for TSP Solving. In this instance, there are 10 vertices, with the number of agents set to $K=3$, termination time of subpath generation set to $T^{\prime}=2$, and the number of isolated vertices denoted as $|I|=1$. The solid gray lines and arrowed lines represent the action space and the selected actions, respectively. This example entails solving assignment subproblems during two rounds of subroute generation. Within each assignment, dashed lines illustrate the assignment relationships between agent subpaths and unconnected vertices, with corresponding labels indicating the order of assignments.
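
The counts quoted in the Figure 1 caption can be checked with a few lines of Python. The sketch takes $v(G)$, $K$, and $T^{\prime}$ as given inputs and applies only the relations stated in the text, $|I|=v(G)-K(T^{\prime}+1)$ and $K+|I|$ merging edges; the function name `merging_counts` and the dictionary keys are hypothetical.

```python
def merging_counts(num_vertices, num_agents, t_prime):
    """Vertex and edge bookkeeping for the two CARSS stages."""
    on_paths = num_agents * (t_prime + 1)    # K paths with T'+1 vertices each
    isolated = num_vertices - on_paths       # |I| = v(G) - K(T'+1)
    generated_edges = num_agents * t_prime   # K paths of length T'
    merging_edges = num_agents + isolated    # K + |I|
    return {
        "isolated": isolated,
        "merged_graph_size": 2 * (num_agents + isolated),  # |V(G')|
        "generated_edges": generated_edges,
        "merging_edges": merging_edges,
        "total_edges": generated_edges + merging_edges,
    }

counts = merging_counts(num_vertices=10, num_agents=3, t_prime=2)
print(counts)
```

For the instance in the figure this gives one isolated vertex, six edges from subpath generation, and four merging edges, so the final tour has ten edges, one per vertex, as a Hamiltonian cycle requires.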

3.3 Policy Parameterization

3.3.1 Subpath Generation Parameterization

The input data to the subpath generation stage within the CARSS algorithm comprises the 2D Euclidean spatial coordinates of the graph’s vertex set $V(G)$, represented as $X\in\mathbb{R}^{v(G)\times 2}$. The action index $U^{k}_{t}$ is determined by the probability distribution of feasible actions for agent $k$ at time step $t$ within state $S_{t}$, denoted as $\pi^{d}_{\boldsymbol{\theta}}(U^{1}_{t},\ldots,U^{K}_{t}\mid G,S_{t})$. This index is selected from the set $\{1,\ldots,v(G)/K\}$, ultimately defining the action $A^{k}_{t}$.

To efficiently capture information about neighboring vertices for each vertex, we employ a mapping from the 2D vertex coordinates $X\in\mathbb{R}^{v(G)\times 2}$ to higher-dimensional vertex features $h^{v}_{i},\ i=1,\ldots,v(G)$:

$$
\begin{aligned}
H^{v}_{l} &= \operatorname{FFN}\left(\operatorname{MHA}\left(H^{v}_{l-1},H^{v}_{l-1},H^{v}_{l-1},H^{v}_{l-1},J_{v(G)}\right)\right)\in\mathbb{R}^{v(G)\times d_{v}},\\
H^{v}_{L^{\text{enc}_{v}}} &= [h^{v}_{1},\cdots,h^{v}_{v(G)}]^{T}.
\end{aligned}
$$

Here, $H^{v}_{0}=XW^{x}+\boldsymbol{1}_{v(G)}b^{x}$, where $W^{x}\in\mathbb{R}^{2\times d_{v}}$ and $b^{x}\in\mathbb{R}^{1\times d_{v}}$ are trainable parameters. The parameter $d_{v}$ indicates the dimension of vertex features, $l\in\{1,\ldots,L^{\text{enc}_{v}}\}$ represents the layer index of the vertex encoder within the subpath generation stage, and $J_{v(G)}$ denotes the $v(G)$-order square matrix filled with $1$ entries.
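
As a small illustration of the initial embedding $H^{v}_{0}=XW^{x}+\boldsymbol{1}_{v(G)}b^{x}$, the following pure-Python sketch applies the same affine map to every coordinate row. The weights are fixed toy values chosen for readability, whereas in the model $W^{x}$ and $b^{x}$ are trained; the function name `linear_embed` is hypothetical, and the attention layers that follow the embedding are not reproduced here.

```python
def linear_embed(X, W, b):
    """Compute X W + 1 b row by row: each 2D point maps to a d_v-dim feature."""
    return [
        [sum(x[j] * W[j][c] for j in range(len(x))) + b[c] for c in range(len(b))]
        for x in X
    ]

X = [[0.0, 0.0], [1.0, 2.0]]   # v(G) = 2 vertices, 2D coordinates
W = [[1.0, 0.0, 1.0],          # toy W^x with shape 2 x d_v, d_v = 3
     [0.0, 1.0, 1.0]]
b = [0.5, 0.5, 0.5]            # toy b^x with shape 1 x d_v
H0 = linear_embed(X, W, b)
print(H0)
```

Multiplying $b^{x}$ by the all-ones column $\boldsymbol{1}_{v(G)}$ simply broadcasts the same bias to every vertex, which is why the sketch adds `b[c]` inside the per-row computation.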

Drawing inspiration from MAPDP (Zong et al., 2022), we concatenate the feature vectors of both the front and rear vertices across all agents. This strategy facilitates the sharing of positional information among agents and yields the global information feature vector for agent $k$:

$$\operatorname{comm}^{k}_{t}=\operatorname{FFN}\left(\left[h^{v}_{\operatorname{f}^{k}(S_{t})},h^{v}_{\operatorname{r}^{k}(S_{t})}\right]\right)\in\mathbb{R}^{1\times d_{v}}.$$

Subsequently, it is concatenated with the feature vectors of the agent’s starting and ending vertices (comprising the comprehensive feature representation of the visited vertices), thus forming the feature vector for agent $k$:

$$\operatorname{context}^{k}_{t}=\operatorname{FFN}\left(\left[h^{v}_{\operatorname{f}^{k}(S_{t})},h^{v}_{\operatorname{r}^{k}(S_{t})},\frac{1}{|v(G)-v(S_{t})|}\sum_{i\in V(G)\setminus V(S_{t})}h^{v}_{i},\operatorname{comm}^{k}_{t}\right]\right)\in\mathbb{R}^{1\times d_{v}}.$$
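As a rough illustration of how the input to the context FFN above could be assembled, the following sketch builds the concatenated vector in plain Python. The FFN itself is omitted, and the helper names `mean_rows` and `context_input`, as well as the toy embeddings, are illustrative assumptions rather than part of CARSS.

```python
def mean_rows(rows):
    """Column-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def context_input(h, front, rear, visited, comm):
    """Concatenate [h_front, h_rear, mean of unvisited embeddings, comm].

    h       : list of vertex embeddings (each a list of d_v floats)
    front   : index of the agent's front vertex f^k(S_t)
    rear    : index of the agent's rear vertex r^k(S_t)
    visited : set of vertex indices already contained in S_t
    comm    : communication vector comm_t^k (d_v floats)
    """
    unvisited = [h[i] for i in range(len(h)) if i not in visited]
    return h[front] + h[rear] + mean_rows(unvisited) + comm


# Toy instance: 4 vertices, d_v = 2 (all values are made up).
h = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
x = context_input(h, front=0, rear=1, visited={0, 1}, comm=[0.5, 0.5])
print(len(x))  # 4 * d_v entries, which the FFN would map back to d_v
```

The sketch assumes at least one vertex is still unvisited; otherwise the mean is undefined.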

To facilitate cooperative interactions, a multi-head attention mechanism is then employed on the feature vectors of the aforementioned $K$ agents. This mechanism enhances the exchange of vital information among the agents:

$$\begin{aligned}
H^{a}_{t,0}&=[\operatorname{context}^{1}_{t};\ldots;\operatorname{context}^{K}_{t}]\in\mathbb{R}^{K\times d_{v}},\\
H^{a}_{t,l}&=\operatorname{FFN}\left(\operatorname{MHA}\left(H^{a}_{t,l-1},H^{a}_{t,l-1},H^{a}_{t,l-1},H^{a}_{t,l-1},H^{a}_{t,l-1},J_{K}\right)\right)\in\mathbb{R}^{K\times d_{v}}.
\end{aligned}$$

Here, $l\in\{1,\ldots,L^{\text{enc}_{a}}\}$ denotes the layer index of the agent encoder, and $J_{K}$ represents a $K$-order square matrix composed entirely of $1$ entries.

Inspired by the utilization of the original Transformer model by Bresson and Laurent (2021) for addressing the TSP, we have introduced a novel concept of memory vectors with increasing lengths over time. These vectors enhance the model’s capacity to incorporate historical information progressively:

$$\operatorname{memory}_{t}=[\operatorname{memory}_{t-1};h^{v}_{i:(i,j)=A_{t-1}}]\in\mathbb{R}^{K\times t\times d_{v}}.$$
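The growth of the memory tensor along its time axis can be sketched as follows. This is a minimal illustration that assumes the memory is stored as a nested list indexed by agent and that one vertex embedding per agent is appended at every step; `extend_memory` is a hypothetical helper, and the appended values are stand-ins for the embeddings of the vertices chosen by the previous action.

```python
def extend_memory(memory, new_embeddings):
    """Append one embedding per agent and return the extended K x t x d_v list."""
    if len(memory) != len(new_embeddings):
        raise ValueError("need exactly one new embedding per agent")
    return [past + [new] for past, new in zip(memory, new_embeddings)]


K, d_v = 2, 3
memory = [[] for _ in range(K)]          # empty history for every agent
for t in range(1, 4):
    # Stand-in for the embeddings h^v_i of the vertices picked at step t - 1.
    picked = [[float(t)] * d_v for _ in range(K)]
    memory = extend_memory(memory, picked)

shape = (len(memory), len(memory[0]), len(memory[0][0]))
print(shape)   # (K, t, d_v) after t = 3 steps
```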

Subsequently, these feature vectors and memory vectors of the agents are harnessed as inputs to the multi-head attention mechanism. This integration enables the model to adeptly capture the characteristics of partial solutions:

$$\begin{aligned}
h^{d}_{t,0}&=\operatorname{Reshape}\left(H^{a}_{t,\text{enc}_{a}},(K,1,d_{v})\right),\\
h^{d}_{t,l}&=\operatorname{FFN}\left(\operatorname{MHA}\left(h^{d}_{t,l-1},\operatorname{memory}_{t},\operatorname{memory}_{t}\right)\right)\in\mathbb{R}^{K\times 1\times d_{v}}.
\end{aligned}$$

Here, the operation $\operatorname{Reshape}$ alters the dimensions of the tensor $H^{a}_{t,\text{enc}_{a}}$ from $\mathbb{R}^{K\times d_{v}}$ to $\mathbb{R}^{K\times 1\times d_{v}}$, preserving its elements while aligning its dimensions with those of the memory vector $\operatorname{memory}_{t}$ at time step $t$. The index $l\in\{1,\ldots,L^{\text{dec}_{g}}\}$ designates the decoder layer.
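To make the tensor shapes in this decoder step concrete, the sketch below lets each agent's single query vector attend over its own memory. It is a deliberately simplified stand-in for the MHA and FFN blocks: it uses one head, no learned projections, and no feed-forward layer, and the function names `attend` and `decode_step` are assumptions made for illustration.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)                            # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def decode_step(h_dec, memory):
    """h_dec is K x 1 x d_v, memory is K x t x d_v; the result is K x 1 x d_v."""
    return [[attend(q[0], mem, mem)] for q, mem in zip(h_dec, memory)]


h_dec = [[[1.0, 0.0]], [[0.0, 1.0]]]                       # K = 2, d_v = 2
memory = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]  # t = 2
out = decode_step(h_dec, memory)
print(len(out), len(out[0]), len(out[0][0]))   # K, 1, d_v
```

Because every agent contributes exactly one query, the middle dimension of the output stays at 1 regardless of how long the memory has grown.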

Subsequently, we construct feature vectors for each assigned vertex of every agent along with their corresponding masks. It should be noted that in Subsection 3.2, the solution to the assignment problem $x_{ik}$ has been approximated using a heuristic algorithm. Each non-zero element of $x_{ik}$ represents the vertex $i$ being assigned to the agent $k$. However, the number of vertices assigned to each agent, denoted as $\alpha_{k}=|\{i\mid x_{ik}=1\}|$, varies. To enable an efficient computation of the policy, we propose selecting feature vectors of $v(G)/K$ vertices that are assigned to each agent based on their proximity. If the feasible actions are fewer than $\frac{v(G)}{K}$, we supplement the feature vectors with those of arbitrarily chosen infeasible vertices. This process yields the feature vector $\operatorname{assign}_{t}$, which represents the assignments. Subsequently, we utilize the mask $M_{t}^{\operatorname{assign}}$ to prevent the model from generating these additionally introduced actions. This can be formally expressed as follows:

$$\begin{aligned}
\operatorname{assign}^{k}_{t}&=\begin{cases}\left[[h^{v}_{i}]_{i:x_{ik}=1};[h^{v}_{i}]_{i:x_{ik}=0}[:\frac{v(G)}{K}-\alpha_{k}]\right]&\text{if }\alpha_{k}<\frac{v(G)}{K},\\ [h^{v}_{i}]_{i:x_{ik}=1}[:\frac{v(G)}{K}]&\text{otherwise}\end{cases}\in\mathbb{R}^{\frac{v(G)}{K}\times d_{v}},\\
\operatorname{assign}_{t}&=[\operatorname{assign}^{1}_{t};\ldots;\operatorname{assign}^{K}_{t}]\in\mathbb{R}^{K\times\frac{v(G)}{K}\times d_{v}},\\
M_{t}^{\operatorname{assign}}&=\{M_{tik}^{\operatorname{assign}}\}_{i=1,\ldots,\frac{v(G)}{K},\,k=1,\ldots,K}=\begin{cases}1&\text{if }x_{\operatorname{map}(i)k}=1,\\ 0&\text{otherwise}\end{cases}\in\mathbb{R}^{K\times 1\times\frac{v(G)}{K}}.
\end{aligned}$$

Here, the notation $[:i]$ signifies selecting the first $i$ rows of the matrix, and the function $\operatorname{map}(\cdot)$ serves to associate the mask index back to the vertex index of the initial instance $G$ during the subpath generation stage.
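The truncate-or-pad rule and its mask can be sketched for a single agent as follows. This is a minimal illustration under stated assumptions: `build_assignment` is a hypothetical helper, the assigned vertices are assumed to arrive already sorted by proximity, and padding simply takes the first unassigned vertices. The returned list of vertex ids plays the role of $\operatorname{map}(\cdot)$, with slots indexed from 0 rather than 1.

```python
def build_assignment(assigned, unassigned, slots):
    """Return (vertex_ids, mask), both of length `slots`, for one agent.

    assigned   : vertex ids with x_ik = 1, assumed sorted by proximity
    unassigned : vertex ids with x_ik = 0, used only as padding
    slots      : v(G) / K, the fixed number of candidate slots
    """
    kept = assigned[:slots]                      # truncate if too many
    pad = unassigned[: slots - len(kept)]        # pad if too few
    if len(kept) + len(pad) < slots:
        raise ValueError("not enough vertices to fill all slots")
    ids = kept + pad
    mask = [1] * len(kept) + [0] * len(pad)      # 0 marks padded slots
    return ids, mask


ids, mask = build_assignment([7, 2], [0, 1, 3, 4], slots=3)
print(ids, mask)        # two real candidates followed by one padded slot

ids_full, mask_full = build_assignment([5, 6, 8, 9], [0, 1], slots=3)
print(ids_full, mask_full)   # truncated to the first three assigned vertices
```

Fixing the number of slots per agent keeps the candidate tensor rectangular, so all $K$ agents can be scored in one batched operation.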

Finally, we identify the index of the selected feasible action at time step $t$ for each agent $k$ within the assignment list, denoted as $U^{k}_{t}\in\{1,\ldots,\frac{v(G)}{K}\}$. Subsequently, we map $U^{k}_{t}$ back to the original graph’s vertex index and determine the edge that connects this vertex to the nearer of that agent’s front and rear vertices. This mapped vertex and edge combination serves as the action $A^{k}_{t}$ at time step $t$, with $t$ ranging from $1$ to $T^{\prime}$.
The probability distribution of $U^{1}_{t},\ldots,U^{K}_{t}$ is formulated using the input features of the agents, namely $h^{d}_{t,L^{\text{dec}_{g}}}$, the vertex feature vectors for assignment denoted as $\operatorname{assign}_{t}$, and the mask $M^{\operatorname{assign}}_{t}$. This distribution is computed according to the following equation:

$$\pi^{d}_{\boldsymbol{\theta}}(U^{1}_{t},\ldots,U^{K}_{t}\mid G,S_{t})=\operatorname{softmax}\Bigg(C\tanh\Big(M^{\operatorname{assign}}_{t}\odot(M^{\tanh})\Big)(h^{d}_{t,L^{\text{dec}_{g}}}W^{d}_{1}+b^{d}_{1})(\operatorname{assign}_{t}W^{d}_{2}+\boldsymbol{1}_{K}b^{d}_{2})^{T}/\sqrt{d}\Bigg).$$

Here, $C=10$ serves as a cropping threshold. The parameters $W^{d}_{1},W^{d}_{2}\in\mathbb{R}^{d_{v}\times d_{v}}$ and $b^{d}_{1},b^{d}_{2}\in\mathbb{R}^{1\times d_{v}}$ are trainable parameters.
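For intuition, the sketch below shows one common way to realise a tanh-clipped, masked softmax over candidate slots for a single agent. It does not reproduce the exact placement of the mask in the equation above: it assumes, as is standard in attention-based routing models, that padded slots receive a logit of minus infinity so that their probability is exactly zero. The projections by $W^{d}_{1}$ and $W^{d}_{2}$ are assumed to have been applied already, and `masked_policy` is a hypothetical name.

```python
import math


def masked_policy(query, candidates, mask, clip=10.0):
    """Softmax over candidate slots from tanh-clipped, masked scores.

    query      : projected decoder vector of one agent (d floats)
    candidates : projected candidate embeddings (slots x d)
    mask       : 1 for feasible slots, 0 for padded slots
    clip       : the threshold C; clipped logits lie in [-C, C]
    """
    d = len(query)
    logits = []
    for cand, keep in zip(candidates, mask):
        score = sum(q * c for q, c in zip(query, cand)) / math.sqrt(d)
        logits.append(clip * math.tanh(score) if keep else -math.inf)
    finite = [value for value in logits if value != -math.inf]
    if not finite:
        raise ValueError("at least one slot must be feasible")
    top = max(finite)
    exps = [math.exp(value - top) for value in logits]   # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


probs = masked_policy(
    query=[1.0, 0.0],
    candidates=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
    mask=[1, 1, 0],
)
print(probs)   # the padded third slot gets probability 0.0
```

Clipping bounds the logits before the softmax, which limits how peaked the distribution can become early in training.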

Once the index $U^{k}_{t}$ for agent $k$ within the assignment vector at time step $t$ is determined, the corresponding action can be computed as:

$$A^{k}_{t}=\left(\operatorname{map}(U^{k}_{t}),\;j=\operatorname{argmin}_{j\in\{\operatorname{f}^{k}(S_{t}),\operatorname{r}^{k}(S_{t})\}}w_{\operatorname{map}(U^{k}_{t})j}\right).$$

Here, the operation $\operatorname{map}(\cdot)$ serves to establish a correspondence between the indices within the mask and the vertices of instance graph $G$. This process results in connecting that vertex to the nearest endpoint along the subpath represented by agent $k$.
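The mapping from a sampled slot to an action can be sketched as follows, assuming Euclidean edge weights and 0-based slot indices. The helper `to_action` and the list `slot_to_vertex`, which stands in for $\operatorname{map}(\cdot)$, are illustrative names and not taken from the paper.

```python
import math


def to_action(slot, slot_to_vertex, coords, front, rear):
    """Map a chosen slot to (vertex, endpoint), picking the nearer endpoint."""
    vertex = slot_to_vertex[slot]
    endpoint = min((front, rear), key=lambda j: math.dist(coords[vertex], coords[j]))
    return vertex, endpoint


coords = [(0.0, 0.0), (10.0, 0.0), (8.0, 1.0), (1.0, 1.0)]
slot_to_vertex = [2, 3]          # slot index -> vertex index in G

action_a = to_action(0, slot_to_vertex, coords, front=0, rear=1)
action_b = to_action(1, slot_to_vertex, coords, front=0, rear=1)
print(action_a, action_b)        # vertex 2 joins the rear, vertex 3 the front
```

In case of a tie, `min` returns the first candidate, so the front vertex is preferred in this sketch.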

3.3.2 Subpath Merging Parameterization

The input to the subpath merging phase in the CARSS algorithm is represented by the graph denoted as $G^{\prime}$. This graph is utilized in conjunction with the probability distribution of feasible actions at time $t$ under the state $S_{t}$, denoted as $\pi^{c}_{\boldsymbol{\theta}}(U_{t}\mid G^{\prime},S_{t})$. The selection of the vertex index $U_{t}$ from the graph $G^{\prime}$ is a crucial step in determining the eventual action $A_{t}$, with $U_{t}$ taking values from the set $\{1,\ldots,2(K+|I|)\}$.

It is important to note that the input graph size for this phase is $2(K+|I|)$, where the vertices comprise the front and rear endpoints of the paths corresponding to each agent from the previous phase, and the total number of edges to be added is $K+|I|$. To facilitate the efficient extraction of information from the opposite end of the subpath as well as from other vertices within the neighborhood, a two-step process is employed. First, the two-dimensional coordinates of the front (rear) vertex are concatenated with the two-dimensional coordinates of the rear (front) vertex. Subsequently, this combined information is projected into a higher-dimensional vertex feature space denoted as $h^{v^{\prime}}_{i},\ i=1,\ldots,v(G^{\prime})$. This transformation can be expressed as follows:

$$\begin{aligned}
X^{\prime}&=[[X_{\operatorname{f}^{1}(S_{T^{\prime}})},X_{\operatorname{r}^{1}(S_{T^{\prime}})}];\ldots;[X_{\operatorname{f}^{K}(S_{T^{\prime}})},X_{\operatorname{r}^{K}(S_{T^{\prime}})}];[X_{I_{1}},X_{I_{1}}];\ldots;[X_{I_{v(G)-KT^{\prime}}},X_{I_{v(G)-KT^{\prime}}}]]\in\mathbb{R}^{v(G^{\prime})\times 4},\\
H^{v^{\prime}}_{0}&=X^{\prime}W^{x^{\prime}}+\boldsymbol{1}_{v(G^{\prime})}b^{x^{\prime}}\in\mathbb{R}^{v(G^{\prime})\times d_{v}},\\
H^{v^{\prime}}_{l}&=\operatorname{FFN}\left(\operatorname{MHA}\left(H^{v^{\prime}}_{l-1},H^{v^{\prime}}_{l-1},H^{v^{\prime}}_{l-1},H^{v^{\prime}}_{l-1},J_{v(G^{\prime})}\right)\right)\in\mathbb{R}^{v(G^{\prime})\times d_{v}},\\
H^{v^{\prime}}_{L^{\text{enc}_{v^{\prime}}}}&=[h^{v^{\prime}}_{1},\cdots,h^{v^{\prime}}_{v(G^{\prime})}]^{T}.
\end{aligned}$$

Here, $X_{i}\in\mathbb{R}^{1\times 2}$ signifies the coordinates of the $i$th vertex in graph $G$. The parameters $W^{x^{\prime}}\in\mathbb{R}^{4\times d_{v}}$ and $b^{x^{\prime}}\in\mathbb{R}^{1\times d_{v}}$ are trainable parameters. The dimensionality of the feature vector is denoted by $d_{v}$. The layer index of the vertex encoder at the subpath merging stage is denoted by $l\in\{1,\ldots,L^{\text{enc}_{v^{\prime}}}\}$, and $J_{v(G^{\prime})}$ represents a square matrix of order $v(G^{\prime})$ with all elements equal to $1$.
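The construction of the four-dimensional inputs and their linear projection can be sketched as below. The sketch follows the prose description and assumes one row per endpoint, so that each subpath contributes two rows (front paired with rear, and rear paired with front) and each isolated vertex is treated as a path whose front and rear coincide; this yields $2(K+|I|)$ rows. The helpers `merge_inputs` and `linear`, and the toy weights, are illustrative assumptions.

```python
def merge_inputs(coords, paths, isolated):
    """Build 4-D input rows from subpath endpoints and isolated vertices.

    coords   : list of (x, y) tuples for the original graph G
    paths    : list of (front, rear) vertex indices, one pair per agent
    isolated : indices of vertices not covered by any subpath
    """
    rows = []
    for f, r in list(paths) + [(i, i) for i in isolated]:
        rows.append(list(coords[f]) + list(coords[r]))   # endpoint f sees r
        rows.append(list(coords[r]) + list(coords[f]))   # endpoint r sees f
    return rows


def linear(rows, weight, bias):
    """Row-wise affine map rows @ weight + bias, with weight of size 4 x d_v."""
    return [
        [sum(x * w for x, w in zip(row, col)) + b
         for col, b in zip(zip(*weight), bias)]
        for row in rows
    ]


coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
rows = merge_inputs(coords, paths=[(0, 1), (2, 3)], isolated=[4])

weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # toy 4 x 2 matrix
bias = [1.0, 2.0]
H0 = linear(rows, weight, bias)
print(len(rows), len(H0[0]))    # 2 * (K + |I|) rows, d_v columns
```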

Building upon Kool et al.’s pioneering work in employing reinforcement learning for solving routing problems (Kool et al., 2018), we adopt a similar approach to formulate the feature representation of states, which captures the relevant information from the graph’s vertex features. Specifically, we construct the representation by averaging the vertex feature vectors on the graph and concatenating this average with the feature vector of the initially selected vertex and the feature vector of the most recently chosen vertex:

$$
\begin{aligned}
h_{\operatorname{graph}} &= \frac{1}{v(G')} \sum_{i=1}^{v(G')} h^{v'}_{i} \in \mathbb{R}^{1\times d_v} \\
h_{\operatorname{front}} &= h^{v'}_{U_{T'+1}} \in \mathbb{R}^{1\times d_v} \\
h_{\operatorname{rear}} &= h^{v'}_{U_{t}} \in \mathbb{R}^{1\times d_v} \\
h_{\operatorname{state}} &= [h_{\operatorname{graph}}, h_{\operatorname{front}}, h_{\operatorname{rear}}]\, W^{s} + \boldsymbol{1}_{v(G')} b^{s} \in \mathbb{R}^{v(G')\times d_v}
\end{aligned}
$$

Here, $W^{s} \in \mathbb{R}^{3d_v\times d_v}$ and $b^{s} \in \mathbb{R}^{1\times d_v}$ are trainable parameters, and $U_t$ denotes the index of the vertex selected from the graph $G'$ during the subpath merging phase at time step $t$.
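The state feature above can be illustrated with a minimal NumPy sketch. The function name `state_embedding`, the 0-based row indices, and the random inputs are assumptions made for illustration and are not part of the original method description. The bias is replicated over $v(G')$ rows, as in the displayed formula, so all rows of the result are identical.

```python
import numpy as np

def state_embedding(H, first_idx, last_idx, W_s, b_s):
    """Build the state feature from encoded vertex features (illustrative sketch).

    H         : (n, d_v) encoded features of the n = v(G') vertices of G'
    first_idx : 0-based row of the initially selected vertex (h_front)
    last_idx  : 0-based row of the most recently selected vertex (h_rear)
    W_s, b_s  : (3*d_v, d_v) weight and (1, d_v) bias
    """
    n = H.shape[0]
    h_graph = H.mean(axis=0, keepdims=True)   # (1, d_v) mean over all vertices
    h_front = H[first_idx][None, :]           # (1, d_v)
    h_rear = H[last_idx][None, :]             # (1, d_v)
    context = np.concatenate([h_graph, h_front, h_rear], axis=1)  # (1, 3*d_v)
    # 1_{v(G')} b^s replicates the bias over n rows; the projected context
    # broadcasts against it, giving an (n, d_v) array with identical rows.
    return context @ W_s + np.ones((n, 1)) @ b_s

# Hypothetical inputs, chosen only to exercise the shapes.
rng = np.random.default_rng(0)
n, d_v = 5, 4
H = rng.normal(size=(n, d_v))
W_s = rng.normal(size=(3 * d_v, d_v))
b_s = np.zeros((1, d_v))
h_state = state_embedding(H, first_idx=0, last_idx=3, W_s=W_s, b_s=b_s)
print(h_state.shape)  # (5, 4)
```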

Ultimately, we determine the index $U_t \in \{1, \ldots, K+|I|\}$ of the viable action chosen at time $t$ from the set of vertices in graph $G'$ using the parameterized strategy $\pi^{c}_{\boldsymbol{\phi}}(U_t, G, S_t)$. This index is then translated back to the vertex indices of the original graph to yield action $A_t$ at time $t$, where $t \in \{T'+1, \ldots, T'+K+|I|-1\}$. The probability distribution of $U_t$ is computed based on the input feature vector of states $h^{p'}_{t,0} = h_{\operatorname{state}}$, the vertex feature vector of graph $G'$, denoted as $H^{v'}_{L^{\text{enc}_{v'}}}$, and the mask $M_t$. The calculation follows this pattern:

$$
\begin{aligned}
h^{p'}_{t,l} &= \operatorname{MHA}\left(h^{p'}_{t,l-1}, H^{v'}_{L^{\text{enc}_{v'}}}, H^{v'}_{L^{\text{enc}_{v'}}}, M_t\right), \\
\pi^{c}_{\boldsymbol{\phi}}(U_t \mid G, S_t) &= \operatorname{softmax}\left(C \tanh\left(M_t \odot \left(h^{p'}_{t,L^{\text{dec}_c}} W^{c}_{1} + b^{c}_{1}\right)\left(H^{v'}_{L^{\text{enc}_{v'}}} W^{c}_{2} + \boldsymbol{1}_{v(G')} b^{c}_{2}\right)^{T} / \sqrt{d}\right)\right).
\end{aligned}
$$

In the equation above, $l \in \{1, \ldots, L^{\text{dec}_c}\}$ denotes the decoder layer index specific to the subpath merging stage. The parameter $C$ is set to $10$ as a clipping threshold, while $M_t$ is the mask over the vertices that have not yet been visited. $W^{c}_{1}, W^{c}_{2} \in \mathbb{R}^{d_v\times d_v}$ and $b^{c}_{1}, b^{c}_{2} \in \mathbb{R}^{1\times d_v}$ are trainable parameters. Once the vertex index $U_t$ is established on graph $G'$, chosen by agent $k$ at time $t$, the corresponding action becomes

$$
A_t = \left(V_{U_{t-1} + (K+|I|)\cdot\left(\operatorname{bool}(U_{t-1} < K+|I|)\right)}(G'),\; V_{U_t}(G')\right).
$$

Here, $V_i(G')$ denotes the $i$-th vertex in graph $G'$, and $\operatorname{bool}(\cdot)$ functions as a logic operator returning either $0$ or $1$. It serves to determine whether $U_{t-1}$ corresponds to the front or rear vertex. If it represents the front vertex, its index is increased by $K+|I|$ to ensure that the newly selected vertex at time $t$ connects to its endpoint. Conversely, if it represents the rear vertex, its index is decreased by $K+|I|$ to link it properly.

At the final time step $t = T'+K+|I|$, only a single edge can close the cycle, so it is selected deterministically and no parameterized strategy is needed.
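The output layer of the merging policy (clipped compatibilities followed by a softmax) can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions: the multi-head attention layers are omitted, so `q` stands in for the final decoder feature $h^{p'}_{t,L^{\text{dec}_c}}$; the scaling dimension $d$ is taken to be $d_v$; and masked vertices are handled by setting their logits to $-\infty$ before the softmax, a common implementation choice that is swapped in here for the elementwise product written in the formula. The function name `merge_policy_probs` is hypothetical.

```python
import numpy as np

def merge_policy_probs(q, H, mask, W1, b1, W2, b2, C=10.0):
    """Pointer-style distribution over the vertices of G' (illustrative sketch).

    q    : (1, d_v) final decoder feature
    H    : (n, d_v) encoded vertex features
    mask : (n,) boolean, True where the vertex may still be selected
    """
    assert mask.any(), "at least one vertex must be selectable"
    d = q.shape[1]                                   # assumption: d = d_v
    query = q @ W1 + b1                              # (1, d_v)
    keys = H @ W2 + b2                               # (n, d_v), bias broadcast over rows
    logits = C * np.tanh(query @ keys.T / np.sqrt(d))    # (1, n), values in [-C, C]
    logits = np.where(mask[None, :], logits, -np.inf)    # assumption: -inf masking
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return (p / p.sum(axis=1, keepdims=True))[0]         # (n,)

# Hypothetical inputs, chosen only to exercise the shapes.
rng = np.random.default_rng(0)
n, d_v = 6, 8
q = rng.normal(size=(1, d_v))
H = rng.normal(size=(n, d_v))
W1, W2 = rng.normal(size=(d_v, d_v)), rng.normal(size=(d_v, d_v))
b1, b2 = np.zeros((1, d_v)), np.zeros((1, d_v))
mask = np.array([True, True, False, True, False, True])
probs = merge_policy_probs(q, H, mask, W1, b1, W2, b2)
```

With this masking choice, vertices that are no longer selectable receive probability exactly zero, and the clipping by $C$ bounds the ratio between any two unmasked probabilities by $e^{2C}$.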

3.4 Policy Optimization

This section introduces optimization methods for the parameterized strategies involved in the subpath generation and subpath merging phases of the CARSS algorithm. In the subpath generation phase, $N$ groups of starting vertices $\{U^{n,1}_{0}, \ldots, U^{n,K}_{0}\}_{n=1}^{N}$ are first selected from the vertex set $V(G)$, with distinct vertices within each group. Subsequently, $N$ trajectories are sampled for the same instance from the probability distribution of the policy $\pi^{d}_{\boldsymbol{\theta}}$. This results in a sequence of states, assignment vector indices, and reward trajectories $\{(S_{t-1}^{n}, U_{t-1}^{n,1}, \ldots, U_{t-1}^{n,K}, R_{t}^{n})_{t=1}^{T'}\}_{n=1}^{N}$, where $K$ is the number of agents and $T'$ is the termination time of the subpath generation phase.
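As a small illustration of this setup, the sketch below draws $N$ groups of $K$ distinct starting vertices and computes the termination time $T'$ using the assignment given in Algorithm 3. The function names, the 0-based vertex labels, and the use of uniform sampling are assumptions made for illustration.

```python
import numpy as np

def generation_horizon(num_vertices, K):
    """T' as assigned in Algorithm 3: floor(v(G)/K), minus 1 when K divides v(G)."""
    return num_vertices // K - (1 if num_vertices % K == 0 else 0)

def sample_start_groups(num_vertices, K, N, rng):
    """Draw N groups of K distinct 0-based start vertices, one group per trajectory.

    Uniform sampling without replacement is an assumption of this sketch.
    """
    return np.stack(
        [rng.choice(num_vertices, size=K, replace=False) for _ in range(N)]
    )

rng = np.random.default_rng(0)
starts = sample_start_groups(num_vertices=100, K=4, N=8, rng=rng)
print(starts.shape)                # (8, 4)
print(generation_horizon(100, 4))  # 24
print(generation_horizon(10, 3))   # 3
```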

Moving to the subpath merging phase, a choice is made to establish $2(K+|I|)$ sets of initial vertex indices for time $T'$ in graph $G'$. Specifically, indices are assigned as $U_{T'+1}^{1,1} = U_{T'+1}^{2,1} = \ldots = U_{T'+1}^{N,1} = 1, \ldots, U_{T'+1}^{1,2(K+|I|)} = U_{T'+1}^{2,2(K+|I|)} = \ldots = U_{T'+1}^{N,2(K+|I|)} = 2(K+|I|)$. Then, employing the probability distribution of policy $\pi^{c}_{\boldsymbol{\phi}}$, a trajectory is sampled independently from each of the $2(K+|I|)$ initial indices. This process yields another sequence of states, vertex indices in graph $G'$, and reward trajectories $\{\{(S_{t-1}^{n,m}, U_{t-1}^{n,m}, R_{t}^{n,m})_{t=T'}^{T'+K+|I|}\}_{m=1}^{2(K+|I|)}\}_{n=1}^{N}$. Here, $I = V(G) \setminus V(S_{T'})$ signifies the set of isolated points in graph $G$ at the end of the subpath generation phase, and $V_i(G')$ represents the $i$-th vertex in graph $G'$.
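The bookkeeping in this paragraph can be sketched as follows: the isolated set $I$ is a set difference, and every trajectory $n$ receives the same start indices $1, \ldots, 2(K+|I|)$. The helper names are hypothetical, and the sketch assumes for simplicity that $|I|$ is the same for all $N$ trajectories.

```python
import numpy as np

def isolated_vertices(all_vertices, visited):
    """I = V(G) minus V(S_T'): vertices not covered by any generated subpath."""
    return sorted(set(all_vertices) - set(visited))

def merging_start_indices(K, num_isolated, N):
    """Row n holds the 1-based start indices 1, ..., 2(K+|I|) for trajectory n."""
    M = 2 * (K + num_isolated)
    return np.tile(np.arange(1, M + 1), (N, 1))   # shape (N, M)

# Hypothetical instance: 10 vertices, 8 of them covered by K = 2 subpaths.
isolated = isolated_vertices(range(10), visited=[0, 1, 2, 4, 5, 6, 8, 9])
start_idx = merging_start_indices(K=2, num_isolated=len(isolated), N=3)
print(isolated)         # [3, 7]
print(start_idx.shape)  # (3, 8)
```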

It is important to note, as defined in Section 3.1, that the reward functions are structured such that $R_{t}^{n} = 0$ for all $t \in \{1, \ldots, T'\}$ and $R_{t}^{n,m} = 0$ for all $t \in \{T'+1, \ldots, T'+K+|I|-1\}$, for all $n \in \{1, \ldots, N\}$ and $m \in \{1, \ldots, 2(K+|I|)\}$. Non-zero rewards are solely associated with $R_{T'+K+|I|}^{n,m}$, which represents the final circuit length.
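A minimal sketch of this sparse reward structure is given below. It assumes Euclidean distances between vertex coordinates and a tour given as a list of 0-based vertex indices. The function name `circuit_length` and the unit-square instance are illustrative only.

```python
import numpy as np

def circuit_length(coords, tour):
    """Length of the closed tour visiting coords[tour[0]], coords[tour[1]], ...

    Assumes Euclidean distance; the last vertex connects back to the first.
    """
    pts = coords[np.asarray(tour)]
    diffs = pts - np.roll(pts, -1, axis=0)   # each vertex minus its successor
    return float(np.linalg.norm(diffs, axis=1).sum())

# Unit square visited in order: four edges of length 1.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
length = circuit_length(coords, [0, 1, 2, 3])

# Sparse rewards: zero at every step except the last one.
rewards = np.zeros(4)
rewards[-1] = length
print(rewards)  # [0. 0. 0. 4.]
```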

When considering the policy gradient for each agent, the remaining agents are treated as part of the environment. With reference to the Policy Gradient Theorem (Sutton et al., 1999), the gradient of the expected cumulative reward can be approximated as follows:

$$
\begin{aligned}
\nabla J(\boldsymbol{\theta}) &\approx \frac{1}{N}\sum_{n=1}^{N}\frac{1}{K}\sum_{k=1}^{K}\left(\min_{m\in\{1,\ldots,2(K+|I|)\}} R_{T'+K+|I|}^{n,m} - b^{d}\right)\nabla \log \prod_{t=1}^{T'} \pi_{\boldsymbol{\theta}}\left(U_{t}^{n,k} \mid G, S_{t}^{n,k}\right), \\
\nabla J(\boldsymbol{\phi}) &\approx \frac{1}{N}\sum_{n=1}^{N}\frac{1}{2(K+|I|)}\sum_{m=1}^{2(K+|I|)}\left(R_{T'+K+|I|}^{n,m} - b^{c}\right)\nabla \log \prod_{t=T'+1}^{T'+K+|I|} \pi_{\boldsymbol{\phi}}\left(U_{t}^{n,m} \mid G', S_{t}^{n,m}\right).
\end{aligned}
$$

Here, $b^{d} = \frac{1}{N}\sum_{n=1}^{N}\min_{m\in\{1,\ldots,2(K+|I|)\}} R_{T'+K+|I|}^{n,m}$ and $b^{c} = \frac{1}{2(K+|I|)}\sum_{m=1}^{2(K+|I|)} R_{T'+K+|I|}^{n,m}$. Both are Policy Optimization with Multiple Optima (POMO) baselines (Kwon et al., 2020). The former averages, over the $N$ trajectories sampled from randomly chosen starting vertices in the subpath generation stage, the shortest circuit found among the merging rollouts of each trajectory. The latter averages, for a given trajectory $n$, the circuit lengths of the rollouts decoded from the $2(K+|I|)$ starting vertices in the subpath merging stage.
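To make the two baselines concrete, the following NumPy sketch computes the advantage terms from a matrix of final circuit lengths, together with scalar surrogate quantities whose gradients (in an automatic differentiation framework, with the advantages held constant) would correspond to the two estimators above. The function names, the array layout `R[n, m]`, and the pre-summed log-probability inputs are assumptions made for illustration.

```python
import numpy as np

def pomo_advantages(R):
    """Advantages for both policies from final circuit lengths.

    R : (N, M) array with R[n, m] the circuit length of merging rollout m
        built on generation trajectory n, where M = 2(K + |I|).
    """
    best = R.min(axis=1)                   # (N,) min over merging rollouts
    b_d = best.mean()                      # shared baseline for the generation policy
    adv_gen = best - b_d                   # (N,)
    b_c = R.mean(axis=1, keepdims=True)    # (N, 1) baseline per trajectory n
    adv_merge = R - b_c                    # (N, M)
    return adv_gen, adv_merge

def surrogate_losses(R, logp_gen, logp_merge):
    """Scalar surrogates for the two gradient estimators.

    logp_gen   : (N, K) log-probabilities summed over t for each agent k
    logp_merge : (N, M) log-probabilities summed over t for each rollout m
    """
    adv_gen, adv_merge = pomo_advantages(R)
    loss_gen = np.mean(adv_gen[:, None] * logp_gen)   # 1/N sum_n 1/K sum_k
    loss_merge = np.mean(adv_merge * logp_merge)      # 1/N sum_n 1/M sum_m
    return loss_gen, loss_merge

# Hypothetical circuit lengths for N = 2 trajectories and M = 2 rollouts.
R = np.array([[3.0, 5.0], [4.0, 8.0]])
adv_gen, adv_merge = pomo_advantages(R)
print(adv_gen)    # [-0.5  0.5]
print(adv_merge)  # [[-1.  1.] [-2.  2.]] (printed on two lines)
```

Since the rewards are circuit lengths, a negative advantage marks a rollout that is shorter than its baseline.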

We trained the model using the independent REINFORCE algorithm (Williams, 1992) with the POMO baseline, employing the Adam optimizer (Kingma and Ba, 2015) for parameter updates. The training procedure is detailed in Algorithm 3.

Algorithm 3 Independent REINFORCE algorithm with Policy Optimization with Multiple Optima (POMO) baseline
Input: number of iterations $E$, batch size $B$, trajectory samples per instance $N$
Initialize $\boldsymbol{\theta}, \boldsymbol{\phi}$
for $\text{epoch} = 1, \ldots, E$ do
    for $i = 1, \ldots, B$ do
        $G_i \leftarrow \operatorname{RandomInstance}()$
        $T' \leftarrow \left\lfloor \frac{v(G)}{K} \right\rfloor - (1 \text{ if } K \text{ divides } v(G) \text{ else } 0)$
        for $n = 1, \ldots, N$ do
            for $k = 1, \ldots, K$ do
                $U_{0}^{n,k} \leftarrow \operatorname{RandomSelect}\left(V(G_i) \setminus \{U_{0}^{n,k'}\}_{k' \in \{1,\ldots,k-1\}}\right)$
            end for
            $\{(S_{t-1}^{n}, U_{t-1}^{n,1}, \ldots, U_{t-1}^{n,K}, R_{t}^{n})\}_{t=1}^{T'} \leftarrow \operatorname{Rollout}(G_i, \pi_{\boldsymbol{\theta}})$
            $I \leftarrow V(G_i) \setminus V(S^{n}_{T'})$
            $G'_i \leftarrow \operatorname{ConstructSubgraph}(G_i, S^{n}_{T'})$
            $U_{T'+1}^{n',m'} \leftarrow m'$, $n' \in \{1, \ldots, N\}$, $m' \in \{1, \ldots, 2(K+|I|)\}$
            $\{\{(S_{t-1}^{n,m}, U_{t-1}^{n,m}, R_{t}^{n,m})\}_{t=T'+1}^{T'+K+|I|}\}_{m=1}^{2(K+|I|)} \leftarrow \operatorname{Rollout}(G'_i, \pi_{\boldsymbol{\phi}})$
16:         end for
17:         bd=1Nn=1Nminm{1,,2(K+|I|)}RT+K+|I|n,msuperscript𝑏𝑑1𝑁superscriptsubscript𝑛1𝑁subscript𝑚12𝐾𝐼superscriptsubscript𝑅superscript𝑇𝐾𝐼𝑛𝑚b^{d}=\frac{1}{N}\sum_{n=1}^{N}\min_{m\in\{1,\ldots,2(K+|I|)\}}R_{T^{\prime}+K% +|I|}^{n,m}italic_b start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_min start_POSTSUBSCRIPT italic_m ∈ { 1 , … , 2 ( italic_K + | italic_I | ) } end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + italic_K + | italic_I | end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n , italic_m end_POSTSUPERSCRIPT
18:         bc12(K+|I|)m=12(K+|I|)RT+K+|I|n,msuperscript𝑏𝑐12𝐾𝐼superscriptsubscript𝑚12𝐾𝐼superscriptsubscript𝑅superscript𝑇𝐾𝐼𝑛𝑚b^{c}\leftarrow\frac{1}{2(K+|I|)}\sum_{m=1}^{2(K+|I|)}R_{T^{\prime}+K+|I|}^{n,m}italic_b start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ← divide start_ARG 1 end_ARG start_ARG 2 ( italic_K + | italic_I | ) end_ARG ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 ( italic_K + | italic_I | ) end_POSTSUPERSCRIPT italic_R start_POSTSUBSCRIPT italic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + italic_K + | italic_I | end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n , italic_m end_POSTSUPERSCRIPT
19:         J(𝜽)1Nn=1N1Kk=1K(minm{1,,2(K+|I|)}RT+K+|I|n,mbd)logt=1Tπ𝜽(Utn,kG,Stn,k)𝐽𝜽1𝑁superscriptsubscript𝑛1𝑁1𝐾superscriptsubscript𝑘1𝐾subscript𝑚12𝐾𝐼superscriptsubscript𝑅superscript𝑇𝐾𝐼𝑛𝑚superscript𝑏𝑑superscriptsubscriptproduct𝑡1superscript𝑇subscript𝜋𝜽conditionalsuperscriptsubscript𝑈𝑡𝑛𝑘𝐺superscriptsubscript𝑆𝑡𝑛𝑘\nabla J(\boldsymbol{\theta})\leftarrow\frac{1}{N}\sum_{n=1}^{N}\frac{1}{K}% \sum_{k=1}^{K}\left(\min_{m\in\{1,\ldots,2(K+|I|)\}}R_{T^{\prime}+K+|I|}^{n,m}% -b^{d}\right)\nabla\log\prod_{t=1}^{T^{\prime}}\pi_{\boldsymbol{\theta}}(U_{t}% ^{n,k}\mid G,S_{t}^{n,k})∇ italic_J ( bold_italic_θ ) ← divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG italic_K end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( roman_min start_POSTSUBSCRIPT italic_m ∈ { 1 , … , 2 ( italic_K + | italic_I | ) } end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + italic_K + | italic_I | end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n , italic_m end_POSTSUPERSCRIPT - italic_b start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ) ∇ roman_log ∏ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_π start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( italic_U start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n , italic_k end_POSTSUPERSCRIPT ∣ italic_G , italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n , italic_k end_POSTSUPERSCRIPT )
20:         J(ϕ)1Nn=1N12(K+|I|)m=12(K+|I|)(RT+K+|I|n,mbc)logt=T+1T+K+|I|πϕ(Utn,mG,Stn,m)𝐽bold-italic-ϕ1𝑁superscriptsubscript𝑛1𝑁12𝐾𝐼superscriptsubscript𝑚12𝐾𝐼superscriptsubscript𝑅superscript𝑇𝐾𝐼𝑛𝑚superscript𝑏𝑐superscriptsubscriptproduct𝑡superscript𝑇1superscript𝑇𝐾𝐼subscript𝜋bold-italic-ϕconditionalsuperscriptsubscript𝑈𝑡𝑛𝑚superscript𝐺superscriptsubscript𝑆𝑡𝑛𝑚\nabla J(\boldsymbol{\phi})\leftarrow\frac{1}{N}\sum_{n=1}^{N}\frac{1}{2(K+|I|% )}\sum_{m=1}^{2(K+|I|)}\left(R_{T^{\prime}+K+|I|}^{n,m}-b^{c}\right)\nabla\log% \prod_{t=T^{\prime}+1}^{T^{\prime}+K+|I|}\pi_{\boldsymbol{\phi}}(U_{t}^{n,m}% \mid G^{\prime},S_{t}^{n,m})∇ italic_J ( bold_italic_ϕ ) ← divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG 2 ( italic_K + | italic_I | ) end_ARG ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 ( italic_K + | italic_I | ) end_POSTSUPERSCRIPT ( italic_R start_POSTSUBSCRIPT italic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + italic_K + | italic_I | end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n , italic_m end_POSTSUPERSCRIPT - italic_b start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) ∇ roman_log ∏ start_POSTSUBSCRIPT italic_t = italic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + italic_K + | italic_I | end_POSTSUPERSCRIPT italic_π start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( italic_U start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n , italic_m end_POSTSUPERSCRIPT ∣ italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n , italic_m end_POSTSUPERSCRIPT )
21:         𝜽Adam(𝜽,J(𝜽))𝜽Adam𝜽𝐽𝜽\boldsymbol{\theta}\leftarrow\operatorname{Adam}(\boldsymbol{\theta},\nabla J(% \boldsymbol{\theta}))bold_italic_θ ← roman_Adam ( bold_italic_θ , ∇ italic_J ( bold_italic_θ ) )
22:         ϕAdam(ϕ,J(ϕ))bold-italic-ϕAdambold-italic-ϕ𝐽bold-italic-ϕ\boldsymbol{\phi}\leftarrow\operatorname{Adam}(\boldsymbol{\phi},\nabla J(% \boldsymbol{\phi}))bold_italic_ϕ ← roman_Adam ( bold_italic_ϕ , ∇ italic_J ( bold_italic_ϕ ) )
23:     end for
24:end for
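
The baselines and advantage weights in lines 17 to 20 of the algorithm can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name and the nested-list layout `final_costs[n][m]` are hypothetical, the terminal value $R^{n,m}_{T'+K+|I|}$ is treated as a cost to be minimized (an assumption read off the $\min$ in line 17), and $b^{c}$ is computed separately for each trajectory $n$ because $n$ is a free index in line 18.

```python
from statistics import mean


def reinforce_advantages(final_costs):
    """Compute baselines and advantage weights from terminal values.

    final_costs[n][m] is the terminal value of merging rollout m that was
    started from generation trajectory n (hypothetical data layout).
    """
    # min over merging rollouts m for each generation trajectory n
    best_per_traj = [min(row) for row in final_costs]
    # line 17: baseline for the subpath generation policy
    b_d = mean(best_per_traj)
    # line 18: baseline for the subpath merging policy, one value per n
    b_c = [mean(row) for row in final_costs]
    # weight multiplying the log-probability gradient in line 19
    adv_theta = [best - b_d for best in best_per_traj]
    # weight multiplying the log-probability gradient in line 20
    adv_phi = [[r - b for r in row] for row, b in zip(final_costs, b_c)]
    return b_d, b_c, adv_theta, adv_phi


if __name__ == "__main__":
    print(reinforce_advantages([[8.0, 9.0], [10.0, 12.0]]))
```

In an actual training loop these weights would multiply the summed log-probabilities of the sampled actions before the Adam updates in lines 21 and 22.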

3.5 Complexity Analysis

In this section, we present an analysis of the overall time and space complexity of the CARSS algorithm. This analysis highlights the advantages of our approach in terms of complexity compared to classical reinforcement learning-based methods for solving TSP (Kool et al., 2018; Bresson and Laurent, 2021). Moreover, it underscores the potential for training on larger problem instances.

First, consider the algorithm's time complexity. The algorithm uses the self-attention mechanism of several Transformer models during both encoding and decoding. For a sequence of length $n$, the time complexity of self-attention is $O(n^{2}d+nd^{2})$, as discussed in Vaswani et al. (2017), where $d$ is the model dimension. For simplicity, we omit the term $nd^{2}$, since during execution $n$ is either close to $d$ or larger than $d$. We also disregard network-specific parameters such as $L^{\text{enc}_{a}}$ and $L^{\text{dec}_{g}}$. Furthermore, the analysis considers only the case where a single trajectory is sampled in both phases ($N=1$). The resulting time complexity is:

$$O\left(\underbrace{\left(\overbrace{K^{2}}^{H^{a}_{t,L^{\text{enc}_{a}}}}+\overbrace{KT'}^{h^{d}_{t,L^{\text{dec}_{g}}}}+\overbrace{K\frac{v(G)}{K}}^{\pi^{d}_{\boldsymbol{\theta}}}\right)\cdot T'd}_{\text{Subpath generation}}+\underbrace{\overbrace{(K+|I|)}^{\pi^{d}_{\boldsymbol{\phi}}}\cdot(K+|I|)d}_{\text{Subpath merging}}\right)=O\left(\left(Kv(G)+2\frac{\left(v(G)\right)^{2}}{K}+4K^{2}\right)d\right).$$
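
The simplification to the right-hand side can be reproduced step by step. The derivation below is a sketch under two assumptions that the closed form appears to rely on but that are not restated in this section: each agent takes about $T'=v(G)/K$ generation steps, and the number of isolated vertices satisfies $|I|\le K$.

```latex
\begin{align*}
\left(K^{2} + K T' + K\tfrac{v(G)}{K}\right) T' d
  &= \left(K^{2} + v(G) + v(G)\right)\tfrac{v(G)}{K}\, d
     && \text{using } T' = \tfrac{v(G)}{K} \\
  &= \left(K\,v(G) + 2\tfrac{(v(G))^{2}}{K}\right) d, \\
(K+|I|)\cdot(K+|I|)\, d
  &\le (2K)^{2} d = 4K^{2} d
     && \text{using } |I| \le K .
\end{align*}
```

Adding the two results gives $\left(Kv(G)+2\frac{(v(G))^{2}}{K}+4K^{2}\right)d$, which matches the stated bound.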

Next, we consider the algorithm's space complexity. For a sequence of length $n$, the space complexity of the self-attention module is $O(n^{2})$, as indicated by Vaswani et al. (2017). However, recent work has shown that for encoders this complexity can be reduced to $O(\sqrt{n})$ (Rabe and Staats, 2021). Using this technique, the space complexity can be expressed as follows:

$$O\left(\underbrace{\overbrace{K\sqrt{K}}^{H^{a}_{t,L^{\text{enc}_{a}}}}+\overbrace{K\left(T'\right)^{2}}^{h^{d}_{t,L^{\text{dec}_{g}}}}+\overbrace{K\left(\frac{v(G)}{K}\right)^{2}}^{\pi^{d}_{\boldsymbol{\theta}}}}_{\text{Subpath generation}}+\underbrace{\overbrace{(K+|I|)^{2}}^{\pi^{d}_{\boldsymbol{\phi}}}}_{\text{Subpath merging}}\right)=O\left(2\frac{\left(v(G)\right)^{2}}{K}+4K^{2}+K^{\frac{3}{2}}\right).$$

These calculations show that the CARSS algorithm, in which multiple agents cooperate to solve TSP, significantly reduces time and space complexity during both training and testing compared with the conventional single-agent algorithm ($K=1$). Specifically, CARSS reduces the complexity to approximately $\frac{1}{K}$ of that of the original algorithm. This reduction translates into substantially better scalability of the model under the same computational resources.
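
The approximate $\frac{1}{K}$ factor can be checked numerically by evaluating the two closed-form expressions for a chosen $K$ and for $K=1$. The snippet below is only a sketch: the function names are hypothetical, the constants hidden by the $O(\cdot)$ notation are ignored, and the expressions are taken at face value rather than measured on any implementation.

```python
def time_cost(v: int, k: int, d: int = 1) -> float:
    """Closed-form time expression (K v + 2 v^2 / K + 4 K^2) d."""
    return (k * v + 2 * v * v / k + 4 * k * k) * d


def space_cost(v: int, k: int) -> float:
    """Closed-form space expression 2 v^2 / K + 4 K^2 + K^(3/2)."""
    return 2 * v * v / k + 4 * k * k + k ** 1.5


def ratio_to_single_agent(cost, v: int, k: int) -> float:
    """Ratio of the K-agent expression to the same expression at K = 1."""
    return cost(v, k) / cost(v, 1)


if __name__ == "__main__":
    # Compare a 1000-vertex instance across several agent counts.
    for k in (2, 5, 10, 20):
        print(
            k,
            round(ratio_to_single_agent(time_cost, 1000, k), 4),
            round(ratio_to_single_agent(space_cost, 1000, k), 4),
            round(1 / k, 4),
        )
```

The ratio stays close to $\frac{1}{K}$ as long as the $2\frac{(v(G))^{2}}{K}$ term dominates; for large $K$ relative to $v(G)$, the $Kv(G)$ and $4K^{2}$ terms grow and the ratio drifts above $\frac{1}{K}$.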

4 Experiments

In this section, we outline the training process and experimental results of the CARSS algorithm. The training and test datasets are prepared following Kool et al. (2018). All instance vertices are drawn from the uniform distribution $U_{[0,1]\times[0,1]}$. The instance sizes $v(G)$ are set to $\{100,200,500,1000\}$, and the number of agents $K$ is chosen from $\{2,3,\ldots,10,20,25\}$. Training and testing are performed on GeForce RTX 3090 GPUs: training on instances with sizes less than 100 uses a single GPU and training on the rest uses two GPUs, while testing always runs on a single GPU. Decoding is greedy, selecting the action with the highest probability under the model's action distribution. The optimization gap is computed as $(\text{Obj.}/\text{BKS}-1)\times 100\%$, where Obj. is the cost of the solution returned by a given algorithm and BKS is the cost of the instance's optimal solution.
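
As a concrete illustration of this setup, the following sketch samples an instance from the unit square, measures the length of a closed tour, and evaluates the optimization gap. The function names and the seeding scheme are hypothetical and chosen only for this example; the actual data pipeline follows Kool et al. (2018) and is not reproduced here.

```python
import math
import random


def random_instance(num_vertices: int, seed: int = 0) -> list:
    """Sample vertices uniformly from the unit square [0, 1] x [0, 1]."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(num_vertices)]


def tour_length(coords, tour) -> float:
    """Length of the closed tour that visits coords in the given order."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def optimization_gap(obj: float, bks: float) -> float:
    """Optimization gap in percent: (Obj. / BKS - 1) * 100."""
    return (obj / bks - 1.0) * 100.0


if __name__ == "__main__":
    coords = random_instance(100)
    identity_tour = list(range(len(coords)))
    print(tour_length(coords, identity_tour))
    print(optimization_gap(11.0, 10.0))
```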

The model's hyperparameters are largely consistent with Kool et al. (2018). The dimension of the vertex parameters and the hidden-layer dimension of the feedforward neural networks are set to $d_{v}=256$ and $d_{f}=512$, respectively. In the subpath generation model, $H=8$ attention heads are employed. The encoder comprises $L^{\text{enc}_{v}}=3$ layers of vertex feature aggregation attention and $L^{\text{enc}_{a}}=3$ layers of agent feature aggregation attention, and the decoder has a single attention layer, $L^{\text{dec}_{g}}=1$. Here, the superscripts $\text{enc}_{v}$, $\text{enc}_{a}$, and $\text{dec}_{g}$ denote the encoder for vertex features, the encoder for agent features, and the decoder for generating policies, respectively. In the subpath merging model, $H=8$ attention heads are used, and the encoder and decoder have $L^{\text{enc}_{v'}}=3$ and $L^{\text{dec}_{c}}=1$ attention layers, where the superscripts $\text{enc}_{v'}$ and $\text{dec}_{c}$ denote the vertex encoder and the policy decoder, respectively. The model is trained for $E=100$ iterations, each comprising $B=1000$ batches of 512 instances. The learning rate is fixed at $10^{-4}$.
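
For reference, these settings can be collected in a small configuration object. This is a sketch only: the class and field names are hypothetical, while the numeric values are the ones reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarssConfig:
    # Values as reported in the text; all field names are illustrative.
    d_v: int = 256                       # vertex parameter dimension
    d_f: int = 512                       # feedforward hidden dimension
    num_heads: int = 8                   # H, in both models
    gen_vertex_encoder_layers: int = 3   # L^{enc_v}
    gen_agent_encoder_layers: int = 3    # L^{enc_a}
    gen_decoder_layers: int = 1          # L^{dec_g}
    merge_encoder_layers: int = 3        # L^{enc_v'}
    merge_decoder_layers: int = 1        # L^{dec_c}
    iterations: int = 100                # E
    batches_per_iteration: int = 1000    # B
    batch_size: int = 512
    learning_rate: float = 1e-4

    def instances_per_iteration(self) -> int:
        """Number of training instances seen in one iteration."""
        return self.batches_per_iteration * self.batch_size

    def total_training_instances(self) -> int:
        """Number of training instances seen over all iterations."""
        return self.iterations * self.instances_per_iteration()


if __name__ == "__main__":
    cfg = CarssConfig()
    print(cfg.instances_per_iteration(), cfg.total_training_instances())
```

With these values, one iteration covers 512,000 instances and a full run covers 51,200,000.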

Under these settings, training a single iteration of the model on a GeForce RTX 3090 GPU takes approximately 10 to 25 minutes for instances with 100 vertices, around 25 to 31 minutes for instances with 200 vertices, and about 49 minutes for instances with 500 vertices. Notably, for a fixed training instance size, the training time varies with the number of agents: more agents correspond to shorter training times. For the single-agent Attention Model (AM) proposed by Kool et al. (2018), training times on smaller instances are similar to those of the CARSS algorithm, potentially due to the relatively high number of sequential execution steps during training or excessive sampling, which suggests room for optimization. As for memory consumption, CARSS can be trained on instances with 100 or 200 vertices using a single GPU, consuming at most 12000 MiB of memory. For instances with 500 vertices, two GPUs are required, each consuming around 16000 MiB. The AM model, by contrast, requires two GPUs for training on instances with 200 vertices, consuming approximately 15000 MiB per card, and for larger instances it cannot be trained even with two GPUs. This highlights the substantial reduction in training memory achieved by the CARSS algorithm.

4.1 Performance on Random Instances

As shown in Table 1, we tested the CARSS algorithm on randomly generated instances with up to 1000 vertices. The Best Known Solution (BKS) was obtained using solvers such as Concorde or Gurobi. Since conventional reinforcement learning-based methods perform worse than 2-opt and insertion algorithms for instances of this size, only these few algorithms were included in the comparison. Test instances were generated randomly from $U_{[0,1]\times[0,1]}$, with each instance size consisting of $10000/v(G)$ samples. During testing, 4096 results were obtained using greedy decoding for each instance, and the best result and the solving time were reported. Averages were then computed over the set of instances of the same size. In the "Algorithm" column of Table 1, "AM (sample)" denotes the sample decoding version of the single-agent algorithm proposed by Kool et al. (2018), and $\text{CARSS}(v(G),K)$ denotes the CARSS algorithm trained with $K$ agents on graphs of size $v(G)$. As Gurobi's solving time becomes prohibitively long for larger instances, it was not used for instances with 500 and 1000 vertices, and "–" indicates untested results.
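
The reporting protocol described above (test-set size per instance size, best of several decoded results per instance, then an average over instances) can be summarized as follows. The helper names and the nested-list input are hypothetical, and the sketch covers only the aggregation step, not the decoding itself.

```python
from statistics import mean


def num_test_instances(num_vertices: int) -> int:
    """Test-set size for a given instance size: 10000 / v(G)."""
    return 10000 // num_vertices


def best_of_samples(costs_per_instance) -> float:
    """Keep the smallest tour cost per instance, then average over instances."""
    return mean(min(costs) for costs in costs_per_instance)


if __name__ == "__main__":
    for v in (100, 200, 500, 1000):
        print(v, num_test_instances(v))
    # Two instances with two decoded results each (made-up numbers).
    print(best_of_samples([[8.0, 7.5], [9.0, 9.5]]))
```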

The results in Table 1 show that for instances with 100 vertices, CARSS (100,2) outperforms the farthest insertion algorithm (a gap of 4.53% versus 7.85%) but lags slightly behind AM (sample) (2.39%). For instances with 200 vertices, CARSS (100,4) attains a gap of 12.02%, which Table 1 places behind both the farthest insertion method (9.06%) and AM (sample) (7.48%). For instances with 500 and 1000 vertices, CARSS (500,20) attains gaps of 24.35% and 34.41%: comparable to the nearest insertion algorithm at 500 vertices (24.66%), worse at 1000 vertices (25.32%), and far better than AM (sample) (36.82% and 85.96%). With more sampling iterations, the algorithm retains the potential to achieve better solutions. In terms of testing time, the CARSS algorithm consistently outperforms AM (sample).

Table 1: Results of CARSS algorithm on random instances

| Algorithm | Obj. (100) | Gap (100) | Time (100) | Obj. (200) | Gap (200) | Time (200) | Obj. (500) | Gap (500) | Time (500) | Obj. (1000) | Gap (1000) | Time (1000) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Concorde | 7.74 | 0.00% | 0.189s | 10.71 | 0.00% | 1.015s | 16.55 | 0.00% | 18.844s | 23.09 | 0.00% | 1.366m |
| Gurobi | 7.74 | 0.00% | 1.008s | 10.71 | 0.00% | 14.585s | – | – | – | – | – | – |
| 2-opt | 8.34 | 7.79% | 0.198s | 11.67 | 8.94% | 0.606s | 18.20 | 9.98% | 2.948s | 25.60 | 10.90% | 31.792s |
| FI | 8.34 | 7.85% | 0.006s | 11.68 | 9.06% | 0.022s | 18.26 | 10.37% | 0.160s | 25.74 | 11.52% | 1.014s |
| RI | 8.51 | 9.95% | 0.004s | 11.94 | 11.54% | 0.009s | 18.46 | 11.56% | 0.038s | 26.10 | 13.07% | 0.111s |
| NI | 9.45 | 22.20% | 0.006s | 13.28 | 23.97% | 0.022s | 20.63 | 24.66% | 0.153s | 28.93 | 25.32% | 0.999s |
| AM (sample) | 7.92 | 2.39% | 1.119m | 11.50 | 7.48% | 1.547m | 22.65 | 36.82% | 3.180m | 42.94 | 85.96% | 6.482m |
| CARSS (100, 2) | 8.09 | 4.53% | 7.998s | 12.11 | 13.03% | 16.577s | 21.87 | 32.15% | 44.526s | 35.28 | 52.83% | 1.671m |
| CARSS (100, 3) | 8.15 | 5.39% | 6.631s | 12.13 | 13.25% | 12.985s | 21.78 | 31.63% | 34.216s | 34.97 | 51.48% | 1.274m |
| CARSS (100, 4) | 8.12 | 4.93% | 5.589s | 12.00 | 12.02% | 11.232s | 21.24 | 28.36% | 29.356s | 33.85 | 46.62% | 1.081m |
| CARSS (100, 5) | 8.15 | 5.34% | 5.372s | 12.03 | 12.36% | 10.323s | 21.17 | 27.93% | 26.184s | 33.32 | 44.36% | 58.039s |
| CARSS (100, 6) | 8.23 | 6.44% | 5.239s | 12.26 | 14.51% | 9.595s | 21.56 | 30.30% | 24.265s | 34.03 | 47.40% | 53.644s |
| CARSS (100, 7) | 8.34 | 7.87% | 4.771s | 12.33 | 15.10% | 9.389s | 21.71 | 31.22% | 22.947s | 34.14 | 47.90% | 50.660s |
| CARSS (100, 8) | 8.34 | 7.83% | 5.345s | 12.33 | 15.10% | 10.649s | 22.04 | 33.20% | 23.163s | 34.78 | 50.67% | 50.139s |
| CARSS (100, 9) | 8.56 | 10.60% | 4.827s | 12.70 | 18.57% | 9.012s | 22.47 | 35.77% | 22.853s | 36.28 | 57.16% | 46.253s |
| CARSS (100, 10) | 8.18 | 5.76% | 7.802s | 12.13 | 13.23% | 11.628s | 21.37 | 29.12% | 23.896s | 34.08 | 47.60% | 49.034s |
| CARSS (200, 5) | 8.23 | 6.36% | 5.529s | 12.03 | 12.35% | 10.327s | 20.81 | 25.76% | 26.170s | 32.56 | 41.04% | 57.606s |
| CARSS (200, 10) | 8.23 | 6.38% | 7.801s | 12.10 | 12.98% | 11.667s | 20.84 | 25.95% | 24.418s | 32.29 | 39.88% | 49.065s |
| CARSS (500, 20) | 8.13 | 5.08% | 33.401s | 12.17 | 13.61% | 36.003s | 20.58 | 24.35% | 46.409s | 31.03 | 34.41% | 1.098m |

4.2 Sensitivity Analysis

This section examines how the training loss of the subpath generation model relates to the problem scale and the number of agents. The training runs are grouped in three ways: by instance size $v(G)$, by number of agents $K$, and by subproblem scale in the subpath merging phase $(K+|I|)$. Figure 2 shows the training loss $L(\boldsymbol{\theta})$ against the number of iterations $E$. Solid lines represent the mean loss within each group, while shaded regions show the fluctuation range in terms of standard deviation. A higher absolute value of the loss implies greater potential for improvement with more training iterations.

From the graph, we observe that the training loss of the subpath generation model exhibits similar trends across various problem scales. This suggests the stability of the CARSS algorithm’s performance when training on different problem sizes, without encountering training difficulties due to excessively large problem scales. However, as the number of agents or the subproblem scale in the subpath merging phase increases, the absolute value of the loss diminishes and its rate of reduction slows down over the training process. This phenomenon can be attributed to the multi-modal nature of the cooperative multi-agent environment.

Figure 2: Variation of training loss with number of iterations of CARSS algorithm

4.3 Example Solutions

As depicted in Figure 3, we selected an instance of size 100 and solved it with the CARSS algorithm under three configurations, with the number of agents $K$ set to $\{2,5,10\}$, using a greedy decoding strategy. In the figure, solid black dots represent vertices of the instance, solid red dots indicate the initial vertices chosen by each agent, and hollow black dots depict the $|I|$ isolated vertices not selected by any agent. Solid lines of different colors correspond to the $K$ subpaths of the different agents in the final solution, while dashed lines show the $K+|I|$ edges added during the subpath merging phase to connect all agent subpaths and isolated vertices into a cycle. The illustration shows that the CARSS algorithm captures the characteristics of optimal TSP solutions. With fewer agents, solutions are of higher quality and the route shows minimal overlapping. As the number of agents increases, however, performance can decline. This might arise from the simplicity of the mapping from the vertex index $U^{n,k}_{t}$, output by the policy network during the subpath generation phase, to the action $A^{n,k}_{t}$ at time $t$. In this mapping, the chosen vertex is connected to the nearest end of the subpath rather than "inserted" into the current subpath, as is done in algorithms like the farthest insertion method. This difference could lead to a decline in performance. The issue is visible in the solution provided by CARSS (100,10), in the loop in the upper right corner of the route, where the subpath formed by the blue agent shows such behavior.
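
The difference between the two mappings can be made concrete with a small example. The sketch below is illustrative only: the helper names are hypothetical, and a simple cheapest-position insertion stands in for the placement step of insertion heuristics. It is not the authors' implementation of either rule.

```python
import math


def path_length(coords, path) -> float:
    """Length of an open path (no closing edge)."""
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(path, path[1:]))


def attach_to_nearest_end(coords, subpath, vertex) -> list:
    """Append the vertex to whichever end of the subpath is closer to it."""
    d_head = math.dist(coords[vertex], coords[subpath[0]])
    d_tail = math.dist(coords[vertex], coords[subpath[-1]])
    return [vertex] + subpath if d_head < d_tail else subpath + [vertex]


def cheapest_insertion(coords, subpath, vertex) -> list:
    """Try every position, including both ends, and keep the shortest path."""
    candidates = [
        subpath[:i] + [vertex] + subpath[i:] for i in range(len(subpath) + 1)
    ]
    return min(candidates, key=lambda p: path_length(coords, p))


if __name__ == "__main__":
    # Vertex 2 lies between the two ends of the subpath [0, 1].
    coords = [(0.0, 0.0), (2.0, 0.0), (1.2, 0.1)]
    attached = attach_to_nearest_end(coords, [0, 1], 2)
    inserted = cheapest_insertion(coords, [0, 1], 2)
    print(attached, path_length(coords, attached))
    print(inserted, path_length(coords, inserted))
```

In this toy case, attaching to the nearest end forces the path to double back, while insertion places the vertex between the two existing endpoints and yields a shorter path.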

Figure 3: Example of CARSS algorithm’s solution on a graph with 100 vertices. Panels: (a) 2 agents; (b) 5 agents; (c) 10 agents.

5 Conclusion

In this paper, we introduced the CARSS algorithm, a novel approach for solving TSP using cooperative MARL. CARSS decomposes the TSP solving process into subpath generation and subpath merging steps, leveraging cooperative MARL to tackle the challenges posed by large-scale instances. By employing attention mechanisms for feature embedding and parameterization, CARSS enhances the agents’ ability to learn and generate high-quality solutions. The independent REINFORCE algorithm is used to train the CARSS model, contributing to its efficiency and effectiveness.

Our contributions to the field are threefold: firstly, the introduction of CARSS, an algorithmic framework that harnesses cooperative MARL for TSP solving; secondly, the integration of attention mechanisms, which significantly elevate the agents’ learning capabilities; and thirdly, the empirical demonstration of CARSS’s advantages over single-agent alternatives. Through comprehensive experiments, we showed that CARSS improves on the single-agent baseline with reduced memory consumption, improved scalability, and notable reductions in testing time and optimization gaps for large-scale TSP instances.

As the field of combinatorial optimization and reinforcement learning continues to evolve, CARSS presents a robust strategy that capitalizes on the synergy of multiple agents and attention mechanisms. While our work demonstrates remarkable advancements in tackling TSP, future research could explore the application of CARSS to other combinatorial optimization problems and delve deeper into optimizing the attention mechanisms to further enhance the agents’ learning efficiency. We anticipate that CARSS will play a pivotal role in advancing the capabilities of MARL in addressing complex real-world optimization challenges.

In conclusion, our study underscores the effectiveness of the CARSS algorithm, shedding light on its potential to revolutionize the way we approach TSP and related problems. By combining the strengths of cooperative MARL, attention mechanisms, and subpath synthesis, CARSS represents a significant stride toward efficient and scalable solutions for the TSP.

Acknowledgments

This research was supported by the National Key R&D Program of China (2021YFA1000403), the National Natural Science Foundation of China (No. 11991022), the Strategic Priority Research Program of Chinese Academy of Sciences (Grant No. XDA27000000) and the Fundamental Research Funds for the Central Universities.

References

  • Chvátal et al. [2009] Vašek Chvátal, William Cook, George B. Dantzig, Delbert R. Fulkerson, and Selmer M. Johnson. Solution of a large-scale traveling-salesman problem. In 50 Years of Integer Programming 1958-2008, pages 7–28. Springer Berlin Heidelberg, November 2009. doi:10.1007/978-3-540-68279-0_1. URL https://doi.org/10.1007/978-3-540-68279-0_1.
  • Held and Karp [1962] Michael Held and Richard M. Karp. A dynamic programming approach to sequencing problems. Journal of the Society for Industrial and Applied Mathematics, 10(1):196–210, March 1962. doi:10.1137/0110015. URL https://doi.org/10.1137/0110015.
  • Bellman [1962] Richard Bellman. Dynamic programming treatment of the travelling salesman problem. Journal of the ACM, 9(1):61–63, January 1962. doi:10.1145/321105.321111. URL https://doi.org/10.1145/321105.321111.
  • Rosenkrantz et al. [1974] D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis. Approximate algorithms for the traveling salesperson problem. In 15th Annual Symposium on Switching and Automata Theory (swat 1974). IEEE, October 1974. doi:10.1109/swat.1974.4. URL https://doi.org/10.1109/swat.1974.4.
  • Helsgaun [2000] Keld Helsgaun. An effective implementation of the Lin–Kernighan traveling salesman heuristic. European Journal of Operational Research, 126(1):106–130, October 2000. doi:10.1016/s0377-2217(99)00284-2. URL https://doi.org/10.1016/s0377-2217(99)00284-2.
  • Dorigo and Gambardella [1997] M. Dorigo and L.M. Gambardella. Ant colony system: a cooperative learning approach to the traveling salesman problem. IEEE Transactions on Evolutionary Computation, 1(1):53–66, April 1997. doi:10.1109/4235.585892. URL https://doi.org/10.1109/4235.585892.
  • Albrecht and Ramamoorthy [2013] Stefano V. Albrecht and Subramanian Ramamoorthy. A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS ’13, page 1155–1156, Richland, SC, 2013. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450319935.
  • Mordatch and Abbeel [2018] Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18. AAAI Press, 2018. ISBN 978-1-57735-800-8.
  • Lowe et al. [2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6382–6393, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
  • Samvelyan et al. [2019] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, and Shimon Whiteson. The StarCraft multi-agent challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, page 2186–2188, Richland, SC, 2019. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450363099.
  • Christianos et al. [2020] Filippos Christianos, Lukas Schäfer, and Stefano V. Albrecht. Shared experience actor-critic for multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
  • Dhamankar et al. [2020] Gauraang Dhamankar, Jose R. Vazquez-Canteli, and Zoltan Nagy. Benchmarking multi-agent deep reinforcement learning algorithms on a building energy demand coordination task. In Proceedings of the 1st International Workshop on Reinforcement Learning for Energy Management in Buildings & Cities, RLEM’20, page 15–19, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450381932. doi:10.1145/3427773.3427870. URL https://doi.org/10.1145/3427773.3427870.
  • Kurach et al. [2020] Karol Kurach, Anton Raichuk, Piotr Stanczyk, Michal Zajac, Olivier Bachem, Lasse Espeholt, Carlos Riquelme, Damien Vincent, Marcin Michalski, Olivier Bousquet, and Sylvain Gelly. Google research football: A novel reinforcement learning environment. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 4501–4510. AAAI Press, 2020. URL https://ojs.aaai.org/index.php/AAAI/article/view/5878.
  • Bard et al. [2020] Nolan Bard, Jakob N. Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H. Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, Iain Dunning, Shibl Mourad, Hugo Larochelle, Marc G. Bellemare, and Michael Bowling. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence, 280:103216, 2020. ISSN 0004-3702. doi:10.1016/j.artint.2019.103216. URL https://www.sciencedirect.com/science/article/pii/S0004370219300116.
  • Vinyals et al. [2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, 2015.
  • Bello et al. [2016] Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. ArXiv, abs/1611.09940, 2016.
  • Khalil et al. [2017] Elias Boutros Khalil, Hanjun Dai, Yuyu Zhang, Bistra N. Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In NIPS, 2017.
  • Kool et al. [2018] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems! In International Conference on Learning Representations, 2018.
  • Bresson and Laurent [2021] Xavier Bresson and Thomas Laurent. The transformer network for the traveling salesman problem. ArXiv, abs/2103.03012, 2021.
  • Joshi et al. [2020] Chaitanya K. Joshi, Quentin Cappart, Louis-Martin Rousseau, Thomas Laurent, and Xavier Bresson. Learning TSP requires rethinking generalization. ArXiv, abs/2006.07054, 2020.
  • Zhang et al. [2020] Ke Zhang, Fang He, Zhengchao Zhang, Xi Lin, and Meng Li. Multi-vehicle routing problems with soft time windows: A multi-agent reinforcement learning approach. Transportation Research Part C: Emerging Technologies, 121:102861, December 2020. doi:10.1016/j.trc.2020.102861. URL https://doi.org/10.1016/j.trc.2020.102861.
  • Zong et al. [2022] Zefang Zong, Meng Zheng, Yong Li, and Depeng Jin. MAPDP: Cooperative multi-agent reinforcement learning to solve pickup and delivery problems. In AAAI Conference on Artificial Intelligence, 2022.
  • Vaswani et al. [2017] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, abs/1706.03762, 2017.
  • Sutton et al. [1999] Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999.
  • Kwon et al. [2020] Yeong-Dae Kwon, Jinho Choo, Byoungjip Kim, Iljoo Yoon, Seungjai Min, and Youngjune Gwon. POMO: Policy optimization with multiple optima for reinforcement learning. ArXiv, abs/2010.16011, 2020.
  • Williams [1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, May 1992. doi:10.1007/bf00992696. URL https://doi.org/10.1007/bf00992696.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
  • Rabe and Staats [2021] Markus N. Rabe and Charles Staats. Self-attention does not need O(n^2) memory. ArXiv, abs/2112.05682, 2021.