CARSS: Cooperative Attention-guided Reinforcement Subpath Synthesis for Solving Traveling Salesman Problem

Yuchen Shi
Department of Mathematical Sciences
University of Chinese Academy of Sciences
Beijing 100049, China
[email protected]
Congying Han
Department of Mathematical Sciences
University of Chinese Academy of Sciences
Beijing 100049, China
[email protected]
Tiande Guo
Department of Mathematical Sciences
University of Chinese Academy of Sciences
Beijing 100049, China
[email protected]
Corresponding author

Abstract

This paper introduces CARSS (Cooperative Attention-guided Reinforcement Subpath Synthesis), a novel approach to address the Traveling Salesman Problem (TSP) by leveraging cooperative Multi-Agent Reinforcement Learning (MARL). CARSS decomposes the TSP solving process into two distinct yet synergistic steps: "subpath generation" and "subpath merging." In the former, a cooperative MARL framework is employed to iteratively generate subpaths using multiple agents. In the latter, these subpaths are progressively merged to form a complete cycle. The algorithm’s primary objective is to enhance efficiency in terms of training memory consumption, testing time, and scalability, through the adoption of a multi-agent divide and conquer paradigm. Notably, attention mechanisms play a pivotal role in feature embedding and parameterization strategies within CARSS. The training of the model is facilitated by the independent REINFORCE algorithm. Empirical experiments reveal CARSS’s superiority compared to single-agent alternatives: it demonstrates reduced GPU memory utilization, accommodates training graphs nearly 2.5 times larger, and exhibits the potential for scaling to even more extensive problem sizes. Furthermore, CARSS substantially reduces testing time and optimization gaps by approximately 50% for TSP instances of up to 1000 vertices, when compared to standard decoding methods.

1 Introduction

The Traveling Salesman Problem (TSP) stands as one of the quintessential combinatorial optimization challenges, seeking the shortest route to visit a set of cities and return to the origin. Its NP-hard nature has spurred continuous research into developing efficient algorithms capable of tackling real-world instances. Traditional methods, such as exact algorithms based on the cutting-plane method (Chvátal et al., 2009) or dynamic programming (Held and Karp, 1962; Bellman, 1962), and heuristic algorithms based on insertion (Rosenkrantz et al., 1974), local search (Helsgaun, 2000) or populations (Dorigo and Gambardella, 1997), often struggle with scalability and optimality for larger problem sizes, prompting the exploration of innovative paradigms that transcend the limitations of single-agent approaches.

In recent times, the field of Multi-Agent Reinforcement Learning (MARL) has gained prominence as a promising avenue for tackling intricate optimization problems. Notable benchmark environments include Level-Based Foraging (Albrecht and Ramamoorthy, 2013), Multi-Agent Particle Environment (Mordatch and Abbeel, 2018; Lowe et al., 2017), StarCraft Multi-Agent Challenge (Samvelyan et al., 2019), Multi-Robot Warehouse (Christianos et al., 2020; Dhamankar et al., 2020), Google Research Football (Kurach et al., 2020), and Hanabi (Bard et al., 2020). By harnessing the collaborative capabilities of multiple agents, cooperative MARL has the potential to enhance the efficiency of problem-solving processes, overcome computational bottlenecks, and advance scalability. Within this context, we introduce Cooperative Attention-guided Reinforcement Subpath Synthesis (CARSS), an algorithm crafted to transform the approach to solving the Traveling Salesman Problem (TSP).

CARSS adopts a distinctive two-step strategy to decompose the TSP solving process. The first step, termed "subpath generation", harnesses the power of cooperative MARL to iteratively generate subpaths. Each agent contributes to constructing a subpath, collectively working towards achieving an optimal solution. The second step, "subpath merging," involves the incremental fusion of these subpaths to ultimately form a complete cycle that represents the solution to the TSP. This decomposition not only capitalizes on the strengths of MARL but also strategically divides the problem to mitigate the computational and memory burdens associated with large-scale instances.

A notable feature of CARSS lies in its incorporation of attention mechanisms, which serve a dual role in both feature embedding and parameterization strategies. These mechanisms enhance the agents’ ability to capture relevant information and learn effectively from their interactions with the environment. The training of the CARSS model is facilitated by the independent REINFORCE algorithm, a proven reinforcement learning technique.

Our contributions are threefold:

• A novel algorithm, CARSS, is introduced for solving the TSP by leveraging cooperative MARL and attention mechanisms. The algorithm decomposes the problem into "subpath generation" and "subpath merging" steps, addressing memory consumption and scalability challenges.
• The proposed approach demonstrates substantial improvements in terms of memory efficiency and testing times when compared to conventional single-agent algorithms. CARSS extends the capability to train on larger problem instances while maintaining solution quality.
• Empirical results show that the CARSS algorithm reduces testing times and optimization gaps by approximately 50% for TSP instances of up to 1000 vertices, underscoring its potential to significantly enhance the efficiency of TSP-solving techniques.

2 Related Works

A considerable portion of the research in the realm of solving the TSP through supervised and reinforcement learning has been rooted in constructive modeling methodologies (Vinyals et al., 2015; Bello et al., 2016; Khalil et al., 2017; Kool et al., 2018; Bresson and Laurent, 2021). These approaches involve the stepwise selection of individual points, akin to methods driven by a singular agent. However, it is noteworthy that these methodologies tend to exhibit elevated time and space complexities when confronted with the task of addressing expansive problem scales. As a testament to this, numerous algorithms demonstrate their efficacy solely on problems of modest proportions, typically up to a size of 200, utilizing a prescribed quantum of GPU resources. For instance, Joshi et al. (2020) expound upon the challenges by affirming that "Training on large TSP200 from scratch is intractable and sample inefficient." This intrinsic computational burden consequently restricts their performance when applied to more substantial problem instances. Nevertheless, in contrast to these conventional paradigms, the CARSS algorithm introduces a pioneering approach that strategically decomposes the TSP-solving process into two distinct stages: subpath generation and subpath merging. By leveraging the principles of MARL, CARSS endeavors to surmount the limitations of memory consumption during training, mitigate testing duration, and amplify its scalability.

In the realm of TSP variations, Zhang et al. (2020) introduced a MARL-oriented framework addressing the vehicle routing problem encompassing soft time windows for a multi-vehicle scenario. This approach hinged upon predefined regulations, dictating a rotational decision-making process among vehicles. Notably, all vehicles shared a singular policy network, inadvertently rendering the framework functionally akin to single-agent control. Building upon this premise, Zong et al. (2022) advanced the paradigm by fashioning independent policy networks, eschewing the necessity for predetermined coordination rules in scenarios involving vehicle interaction. This liberation substantially expanded the exploration capacity within the collective of vehicle agents, efficiently tackling the intricacies posed by pickup and delivery problems. Extending this innovation, the CARSS algorithm extrapolates the concept into the domain of TSP, orchestrating a divide-and-conquer methodology tailored to surmount the challenges of larger-scale problem instances.

3 Method

In this section, we present the methodology of CARSS. The subsections that follow outline the cooperative Markov game formulation, algorithm specifics, policy parameterization, policy optimization, and complexity analysis.

3.1 Cooperative Markov Game Formulation for Traveling Salesman Problem

Define TSP as the tuple $(\mathcal{I},f,\mu)$ , where $\mathcal{I}$ represents the set of graph instances, $f(G)$ denotes the set of all feasible solutions for the graph $G=(V(G),E(G),w(G))$ within the context of $\mathcal{I}$ . Here, each graph $G$ comprises $v(G)$ vertices and $e(G)$ edges. The function $\mu(G,H)$ quantifies the value of solution $H$ within the set $f(G)$ concerning the problem’s objective. In the context of TSP, $\mu(G,H)$ equates to $\sum_{e\in H}w_{e}$ , where $w_{e}$ signifies the weight of edge $e$ . The ultimate objective of the problem is to determine the solution $H$ that minimizes this objective value across all instances $G\in\mathcal{I}$ , formally expressed as $\operatorname{argmin}_{H\in f(G)}\mu(G,H)$ .
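As a concrete reading of the objective $\mu(G,H)=\sum_{e\in H}w_{e}$, the following minimal sketch evaluates a closed tour. It assumes Euclidean edge weights computed from 2D coordinates; the names `tour_length`, `coords`, and `tour` are illustrative and do not come from the paper.

```python
import math


def tour_length(coords, tour):
    """Objective mu(G, H): sum of the edge weights of the closed tour H.

    Assumption: edge weights are Euclidean distances between 2D points.
    """
    n = len(tour)
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % n]])
        for i in range(n)
    )


# Four cities on the corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
perimeter = tour_length(square, [0, 1, 2, 3])  # follows the boundary
crossing = tour_length(square, [0, 2, 1, 3])   # uses both diagonals
```

The boundary tour is the minimizer here; the tour that uses both diagonals is strictly longer.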

For a multi-agent system involving $K\leq\frac{v(G)}{2}$ agents, we can establish the corresponding cooperative Markov game $(K,\mathcal{S},\{\mathcal{A}^{k}\}_{k\in\{1,\ldots,K\}},P,r)$ as follows:

• The state space $\mathcal{S}=\{s\mid s\subseteq H,H\in f(G),G\in\mathcal{I}\}$ encompasses all possible states. The initial state $s_{0}$ is the null graph $K_{0}$, while the set of states at the final time step $T$ is $\mathcal{T}=\{H\mid H\in f(G),G\in\mathcal{I}\}$. All agents share an identical state at every time step and have full access to all environmental observations.

• The action space for agent $k$, $\mathcal{A}^{k}=V(G)\cup E(G)$, includes all vertices and edges of graph $G$. Furthermore, $\mathcal{A}_{s}^{k}=(V(G)\setminus V(s))\cup(E(G)\setminus E(s))$ is the set of feasible actions for agent $k$ in state $s$.

• The state transition probability function $P:\mathcal{S}\times\mathcal{A}^{1}\times\cdots\times\mathcal{A}^{K}\rightarrow\Delta\mathcal{S}$ is defined as follows:

$$P(s^{\prime}\mid s,a^{1},\ldots,a^{K})=\begin{cases}1&\text{if }s\in\mathcal{T}\text{ and }s^{\prime}=s,\\ 1&\text{if }s\in\mathcal{S}\setminus\mathcal{T}\text{ and }s^{\prime}=s+\sum_{k=1}^{K}a^{k},\\ 0&\text{otherwise},\end{cases}$$

where $\Delta\mathcal{S}$ represents the probability simplex over the state space $\mathcal{S}$, and $s+a$ signifies the disjoint union of graph $s$ and graph $a$.

• The reward function $r:\mathcal{S}\times\mathcal{A}^{1}\times\cdots\times\mathcal{A}^{K}\times\mathcal{S}\rightarrow\mathbb{R}$ is defined by $r(s,a^{1},\ldots,a^{K},s^{\prime})=-\mu(G,s^{\prime})$ if $s$ does not belong to $\mathcal{T}$ but $s^{\prime}$ does; otherwise, it is $0$.

Solving the TSP amounts to learning a joint policy $\pi_{\boldsymbol{\theta}}:\mathcal{S}\rightarrow\Delta(\mathcal{A}^{1}\times\cdots\times\mathcal{A}^{K})$ that optimizes the expected return $J(\boldsymbol{\theta})=\mathbb{E}_{\pi_{\boldsymbol{\theta}}}[R_{T}]$.

When the number of agents is $K=1$ and deep reinforcement learning is used to solve the TSP, the model reduces to the classical single-agent reinforcement learning formulation of the TSP (Kool et al., 2018; Bresson and Laurent, 2021). However, this conventional approach exhibits several limitations:

• As a TSP route comprises a composition of $v(G)$ edges, invoking the policy network a minimum of $v(G)$ times diminishes the potential benefits of parallel computation within sequential models. Additionally, the substantial computational overhead hampers the training of the model for larger-scale problems.
• Policy networks conventionally rely on the computation of the attention matrix, entailing a time and space complexity of $O(v(G)^{2})$ (Vaswani et al., 2017). This propensity for excessive memory utilization renders training infeasible for more extensive problem instances.
• The actions generated through policy network sampling do not invariably constitute feasible solutions. Consequently, decoding necessitates the application of a mask to exclude visited vertices. However, as the termination point approaches, the number of visited vertices increases, resulting in a diminished space of viable actions. As a result, a growing share of the attention matrix computation is spent on masked entries, leading to inefficient resource utilization.

On the other hand, in scenarios where the number of agents $K>1$ , the policy network is invoked a minimum of $v(G)/K$ times. By predetermining the feasible actions for each agent, the number of viable actions per agent is averaged to $v(G)/K$ , thereby reducing the time and space complexity of attention matrix computation to $O(\frac{v(G)^{2}}{K^{2}})$ . This approach also indirectly enhances the efficiency of computational resource utilization, alleviating some of the constraints faced in the single-agent setting.
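The arithmetic behind these counts can be checked with a short sketch. It simply evaluates the per-agent quantities quoted above ($v(G)/K$ candidate actions, hence $(v(G)/K)^{2}$ attention entries); the function names and the example sizes are illustrative assumptions, and no measured runtime or memory figure is implied.

```python
def attention_entries(num_vertices, num_agents=1):
    """Entries in one agent's attention matrix over about v(G)/K candidates."""
    per_agent = num_vertices / num_agents
    return per_agent ** 2


def min_policy_calls(num_vertices, num_agents=1):
    """Lower bound on policy-network invocations quoted in the text."""
    return num_vertices / num_agents


single = attention_entries(1000)      # K = 1
multi = attention_entries(1000, 10)   # K = 10
reduction = single / multi            # equals K**2 for this per-agent count
```

For $v(G)=1000$ and $K=10$, the per-agent attention matrix shrinks by a factor of $K^{2}=100$, and the lower bound on policy calls drops from 1000 to 100.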

3.2 CARSS Algorithm

To address the limitations inherent in solving TSP with a single agent, we propose the CARSS algorithm. Designed for tackling the TSP within Euclidean space, CARSS effectively mitigates these limitations by strategically reducing the action and state space of the underlying Markov game, thereby approximating its optimal strategy.

The CARSS algorithm is structured around two pivotal phases: subpath generation and subpath merging. Within this context, "subpaths" represent non-circular graphs that form integral parts of the problem’s final solution tour.

During the subpath generation phase, CARSS starts from the empty graph $K_{0}$. Each agent first selects a vertex, with the selections constrained to be non-overlapping. Subsequently, during each time step, every agent extends its subpath by one edge, so that $K$ edges are incorporated simultaneously. This process continues until several subpaths of uniform length, devoid of intersections, are established. Here, "devoid of intersections" means that the vertex sets of any two subpaths are disjoint. It is noteworthy that this phase constitutes the majority of the algorithm’s runtime due to its computationally intensive nature.

The subsequent subpath merging phase can be viewed as a single-agent approach. Within this phase, the algorithm connects $K$ subpaths and, at most, $K$ isolated points. This gradual connection is achieved by adding up to $2K$ edges, ultimately culminating in a complete cycle. This phase therefore addresses a specific TSP instance of size not exceeding $4K$. The computation time associated with this phase is nearly negligible due to the relatively small size of the subproblem.

Subsequently, we provide the temporal range encompassing time step $t\in\{1,\ldots,T\}$ within the two phases of subpath generation and subpath merging. Furthermore, we expound upon the precise structure of the space of feasible actions $\mathcal{A}_{s}^{k}$ within state $s$ during these two phases.

3.2.1 Subpath Generation

In the subpath generation phase, we consider time steps denoted by $t\in\{1,\ldots,T^{\prime}\}$ , where

$$T^{\prime}=\begin{cases}\frac{v(G)}{K}-2&\text{if }K\text{ evenly divides }v(G),\\ \left\lfloor\frac{v(G)}{K}\right\rfloor-1&\text{otherwise}.\end{cases}$$

The rationale for treating the case of $K$ evenly dividing $v(G)$ separately arises from the following consideration: when the algorithm advances to the $(v(G)/K-2)$ th time step, a total of $K$ paths exist within the current state, each with a length of $v(G)/K-2$ . At this point, the number of visited vertices is $v(G)-K$ , and the number of isolated points is $K$ . Consequently, by considering isolated points as subpaths with a length of 0, the total count of subpaths to be connected amounts to $2K$ . When each agent carries out an additional action, the count of isolated points decreases to $0$ , resulting in a reduction of the number of subpaths to $K$ . It is evident that a scarcity of isolated points will lead to fewer optional vertices in later stages of the subpath generation phase, thereby reducing training efficiency. Hence, a balance is struck between the number of time steps in the subpath generation phase and the problem size in the subpath merging phase. This strategic compromise enhances overall algorithm performance by marginally decreasing the number of time steps in the subpath generation phase while augmenting the problem’s complexity during the subpath merging phase.
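The case distinction can be checked numerically with a small sketch. It evaluates the definition of $T^{\prime}$ given above together with the count of isolated vertices $v(G)-K(T^{\prime}+1)$ that the merging phase later relies on; the function names are illustrative.

```python
def generation_steps(num_vertices, num_agents):
    """T' from the case definition: v/K - 2 if K divides v, else floor(v/K) - 1."""
    if num_agents < 1 or 2 * num_agents > num_vertices:
        raise ValueError("expected 1 <= K <= v(G) / 2")
    quotient, remainder = divmod(num_vertices, num_agents)
    return quotient - 2 if remainder == 0 else quotient - 1


def isolated_count(num_vertices, num_agents):
    """|I| = v(G) - K (T' + 1): vertices left unvisited after generation."""
    steps = generation_steps(num_vertices, num_agents)
    return num_vertices - num_agents * (steps + 1)
```

With $v(G)=10$ and $K=3$ this gives $T^{\prime}=2$ and one isolated vertex. With $v(G)=12$ and $K=3$ it gives $T^{\prime}=2$ and $K=3$ isolated vertices, matching the divisible case discussed above.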

At the initial time step $t=1$ , the feasible action space for state $K_{0}$ is defined as $\mathcal{A}_{K_{0}}^{k}=V(G)$ , where each initial action corresponds to the overlapping initial endpoints of a subpath. For each agent $k$ , we denote the front and rear endpoints of its subpath in the current state as $\operatorname{f}^{k}(s)$ and $\operatorname{r}^{k}(s)$ , respectively. For subsequent time steps, when $s\neq K_{0}$ , i.e., when $t\in\{2,\ldots,T^{\prime}\}$ , $\mathcal{A}_{s}^{k}$ is determined by addressing the following specialized assignment problem:

$$\begin{aligned}\operatorname{minimize}\quad&\sum_{i=1}^{v(G)}\sum_{k=1}^{K}x_{ik}\min\{w_{i\operatorname{f}^{k}(s)},w_{i\operatorname{r}^{k}(s)}\}\\ \text{subject to}\quad&\sum_{k=1}^{K}x_{ik}=1,&&i=1,\ldots,v(G),\\ &\sum_{i=1}^{v(G)}x_{ik}\geq 1,&&k=1,\ldots,K,\\ &x_{ik}\in\{0,1\},&&i=1,\ldots,v(G),\ k=1,\ldots,K.\end{aligned}$$

Here, $x_{ik}$ signifies whether the $i$ th vertex is assigned to the $k$ th agent. $\min\{w_{i\operatorname{f}^{k}(s)},w_{i\operatorname{r}^{k}(s)}\}$ represents the minimum distance from the $i$ th vertex to either the first or last endpoint of the path corresponding to the $k$ th agent. The objective function aims to minimize the total sum of distances from each vertex to its closest endpoint within the assigned agent’s path. The first constraint mandates that each vertex must be assigned to exactly one agent. The second constraint ensures that each agent is assigned at least one vertex, thereby guaranteeing that the length of subpaths generated by each agent consistently increases over time. It’s worth noting that the vertex assignment obtained from solving this problem might not necessarily lead to the optimal solution of the original problem. However, it can significantly reduce the action space for the agents, resulting in a substantial acceleration of subpath generation.

A heuristic is designed to solve the assignment problem efficiently. It first iterates over the agents and has each of them select its nearest unassigned vertex, which fulfills the second constraint. Subsequently, each remaining unassigned vertex is assigned to its nearest agent, which fulfills the first constraint. The "distance" between vertex $i$ and agent $k$ is defined as $\min\{w_{i\operatorname{f}^{k}(s)},w_{i\operatorname{r}^{k}(s)}\}$ . The complete algorithm for solving this assignment problem is presented in Algorithm 1.

Having obtained an approximate solution to the problem, the feasible action space for each agent $k$ in state $s$ is characterized as $\mathcal{A}_{s}^{k}=\{(i,j)\in E\mid x_{ik}=1,i\notin V(s),j=\operatorname{argmin}_{j\in\{\operatorname{f}^{k}(s),\operatorname{r}^{k}(s)\}}w_{ij}\}$ , i.e., the edges in $E$ that join an unvisited vertex assigned to agent $k$ to the nearer of the front and rear vertices of that agent’s path. Notably, these sets do not overlap with each other, i.e., $\{i\mid(i,j)\in\mathcal{A}_{s}^{k_{1}}\}\cap\{i\mid(i,j)\in\mathcal{A}_{s}^{k_{2}}\}=\varnothing,\forall k_{1},k_{2}\in\{1,\ldots,K\},k_{1}\neq k_{2}$ . This ensures that the states at each time step consist of $K$ disjoint paths.

To enhance the model’s viability in addressing large-scale problems, the feasible action space $\mathcal{A}_{s}^{k}$ for each agent in this phase is restricted to a maximum of $v(G)/K$ actions, encompassing those closest to the respective agent.

Algorithm 1 Vertex-agent assignment heuristic algorithm

Input: vertex set $V$ , number of agents $K$ , current state $s$
Initialize decision variables $x_{ik}\leftarrow 0,\ i=1,\ldots,v(G),\ k=1,\ldots,K$
Initialize unassigned vertex list $U\leftarrow V$
for $k=1,\ldots,K$ do
  $i\leftarrow\operatorname{argmin}\{\min\{w_{i\operatorname{f}^{k}(s)},w_{i\operatorname{r}^{k}(s)}\}\mid i\in U\}$
  $x_{ik}\leftarrow 1$
  $U\leftarrow U\setminus\{i\}$
end for
while $U\neq\varnothing$ do
  $i\leftarrow U[0]$
  $k\leftarrow\operatorname{argmin}\{\min\{w_{i\operatorname{f}^{k}(s)},w_{i\operatorname{r}^{k}(s)}\}\mid k\in\{1,\ldots,K\}\}$
  $x_{ik}\leftarrow 1$
  $U\leftarrow U\setminus\{i\}$
end while
return $\{x_{ik}\}_{i=1,\ldots,v(G),\ k=1,\ldots,K}$
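The two passes of this heuristic can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes Euclidean weights, restricts the candidate pool to the vertices passed in as `unvisited` (an assumption, since only unvisited vertices enter the feasible action sets), and returns the assignment as a dictionary instead of the binary variables $x_{ik}$. All names are illustrative.

```python
import math


def assign_vertices(coords, endpoints, unvisited):
    """Greedy vertex-agent assignment in two passes.

    coords:    list of (x, y) vertex coordinates (Euclidean weights assumed)
    endpoints: list of (front, rear) vertex indices, one pair per agent
    unvisited: vertex indices still available (assumed candidate pool)
    Returns a dict mapping agent index -> list of assigned vertices.
    """
    def distance(vertex, agent):
        front, rear = endpoints[agent]
        return min(math.dist(coords[vertex], coords[front]),
                   math.dist(coords[vertex], coords[rear]))

    agents = range(len(endpoints))
    pool = list(unvisited)
    assigned = {agent: [] for agent in agents}

    # Pass 1: every agent takes its nearest unassigned vertex,
    # so that each agent receives at least one vertex.
    for agent in agents:
        if not pool:
            break
        nearest = min(pool, key=lambda v: distance(v, agent))
        assigned[agent].append(nearest)
        pool.remove(nearest)

    # Pass 2: every remaining vertex goes to its nearest agent,
    # so that each vertex is assigned to exactly one agent.
    for vertex in pool:
        agent = min(agents, key=lambda a: distance(vertex, a))
        assigned[agent].append(vertex)
    return assigned
```

For two agents sitting at opposite ends of a line segment, the first pass hands each agent its closest free vertex, and the second pass splits the remaining vertices by proximity.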

3.2.2 Subpath Merging

The subpath merging stage can be conceptualized as addressing a specific variant of TSP using single-agent reinforcement learning. Upon completing the subpath generation, the state $S_{T^{\prime}}$ encompasses two distinct components: firstly, $K$ disjoint paths each of length $T^{\prime}$ , and secondly, isolated vertices $\{I_{i}\}_{i\in\{1,\ldots,v(G)-K(T^{\prime}+1)\}}$ , which can also be regarded as $|I|$ paths of length $0$ , creating a total of $K+|I|$ paths. To connect these paths into a complete tour, $K+|I|$ extra edges need to be incorporated. This gives rise to a graph $G^{\prime}$ of size $2(K+|I|)$ constructed as follows:

$$\begin{aligned}V(G^{\prime})&=\{\operatorname{f}^{1}(S_{T^{\prime}}),\ldots,\operatorname{f}^{K}(S_{T^{\prime}}),I^{\operatorname{f}}_{1},\ldots,I^{\operatorname{f}}_{|I|},\operatorname{r}^{1}(S_{T^{\prime}}),\ldots,\operatorname{r}^{K}(S_{T^{\prime}}),I^{\operatorname{r}}_{1},\ldots,I^{\operatorname{r}}_{|I|}\},\\ E(G^{\prime})&=\{(i,j)\mid i,j\in V(G^{\prime}),i\neq j\}\setminus\{(\operatorname{f}^{1}(S_{T^{\prime}}),\operatorname{r}^{1}(S_{T^{\prime}})),\ldots,(\operatorname{f}^{K}(S_{T^{\prime}}),\operatorname{r}^{K}(S_{T^{\prime}})),(I^{\operatorname{f}}_{1},I^{\operatorname{r}}_{1}),\ldots,(I^{\operatorname{f}}_{|I|},I^{\operatorname{r}}_{|I|})\},\\ w(G^{\prime})&=w(G).\end{aligned}$$

Here, $I_{i}^{\operatorname{f}}=I_{i}^{\operatorname{r}}=I_{i}$ for $i\in\{1,\ldots,v(G)-K(T^{\prime}+1)\}$ . $V(G^{\prime})$ consists of the front and rear vertices of each path in state $S_{T^{\prime}}$ . $E(G^{\prime})$ comprises all potential edges that may be required to amalgamate subpaths into cycles. The edge weights correspond to those in the original graph $G$ . Our objective is an algorithm that is both effective and efficient at identifying $K+|I|$ edges within the graph $G^{\prime}$ that form a tour encompassing the edges $\{(i,j)\mid|i-j|=K+|I|\}$ , where $i$ and $j$ index the vertices of $V(G^{\prime})$ in the order listed above, i.e., the edges pairing the front and rear endpoints of each path. This involves selecting an initial vertex and subsequently adding $K+|I|$ edges, so that $T=T^{\prime}+K+|I|$ . Given that reinforcement learning enables rapid approximate solutions in batch compared to exact algorithms, we employ single-agent reinforcement learning to address this sub-problem. The superscript $1$ indicating the single agent is omitted below for simplicity. The initial vertex $q_{T^{\prime}}$ can be selected at random from $V(G^{\prime})$ . For time steps $t\in\{T^{\prime}+1,\ldots,T^{\prime}+K+|I|-1\}$ , we begin by determining the other end of the subpath containing vertex $q_{t-1}$ , denoted as $p_{t}$ . Subsequently, we derive the set of feasible actions $\mathcal{A}_{s}=\{(p_{t},q_{t})\mid(p_{t},q_{t})\in E(G^{\prime}),q_{t}\notin\{q_{T^{\prime}},p_{T^{\prime}+1},q_{T^{\prime}+1},\ldots,p_{t-1},q_{t-1}\}\}$ , and the action chosen at that time step is denoted as $(p_{t},q_{t})$ . In the final time step $t=T^{\prime}+K+|I|$ , the terminal edge $(p_{t},q_{T^{\prime}})$ must be selected to close the path into a complete cycle.
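The chaining procedure described above can be sketched as follows. The sketch is only an illustration of the bookkeeping: a greedy nearest-endpoint rule stands in for the learned single-agent policy (this substitution is an assumption made here, not the method of the paper), the walk starts at the front of the first path rather than at a random vertex, Euclidean weights are assumed, and all names are illustrative. Isolated vertices are passed as one-element paths.

```python
import math


def merge_subpaths(coords, paths):
    """Chain disjoint paths into one closed tour.

    paths: list of vertex lists; an isolated vertex is a one-element list.
    A greedy nearest-endpoint choice stands in for the learned policy.
    Returns the tour as a vertex order; the closing edge back to the
    first vertex is implicit.
    """
    remaining = [list(path) for path in paths]
    tour = remaining.pop(0)  # start at the front of the first path
    while remaining:
        p = tour[-1]  # other end of the subpath entered last
        # Each unused path may be entered at its front (0) or rear (-1).
        options = [
            (math.dist(coords[p], coords[path[end]]), index, end)
            for index, path in enumerate(remaining)
            for end in (0, -1)
        ]
        _, index, end = min(options)
        path = remaining.pop(index)
        tour.extend(path if end == 0 else path[::-1])
    return tour
```

With $K+|I|$ paths, the loop adds $K+|I|-1$ connecting edges and the implicit closing edge is the last one, so $K+|I|$ edges are added in total, in line with the count stated above.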

The comprehensive CARSS algorithm is depicted in Algorithm 2, while an illustrative example is showcased in Figure 1.

Algorithm 2 Cooperative Attention-guided Reinforcement Subpath Synthesis (CARSS) Algorithm

Input: graph $G=(V,E,w)$ , number of agents $K$ , starting vertices of agents $v_{1},\ldots,v_{K}\in V$
$s\leftarrow G[\{v_{1},\ldots,v_{K}\}]$
for $k=1,\ldots,K$ do
  $\operatorname{f}^{k}(s)\leftarrow v_{k}$
  $\operatorname{r}^{k}(s)\leftarrow v_{k}$
end for
$T^{\prime}\leftarrow\left\lfloor\frac{v(G)}{K}\right\rfloor-(2\text{ if }K\text{ divides }v(G)\text{ else }1)$
for $t=1,\ldots,T^{\prime}$ do
  Apply Algorithm 1 with $(V,K,s)$ to obtain $\{x_{ik}\}_{i=1,\ldots,v(G),\ k=1,\ldots,K}$
  $\mathcal{A}_{s}^{k}\leftarrow\{(i,j)\in E\mid x_{ik}=1,i\notin V(s),j=\operatorname{argmin}_{j\in\{\operatorname{f}^{k}(s),\operatorname{r}^{k}(s)\}}w_{ij}\},\ k=1,\ldots,K$
  Apply parameterized policy to obtain $A^{k}_{t}\in\mathcal{A}_{s}^{k},\ k=1,\ldots,K$
  $s\leftarrow s+\sum_{k=1}^{K}A^{k}_{t}$
  Update $\operatorname{f}^{k}(s)$ and $\operatorname{r}^{k}(s)$
end for
$I\leftarrow V(G)\setminus V(s)$   $\triangleright$ $|I|=v(G)-K(T^{\prime}+1)$
$\{I^{\operatorname{f}}_{1},\ldots,I^{\operatorname{f}}_{|I|}\}\leftarrow I$
$\{I^{\operatorname{r}}_{1},\ldots,I^{\operatorname{r}}_{|I|}\}\leftarrow I$
$V(G^{\prime})\leftarrow\{\operatorname{f}^{1}(S_{T^{\prime}}),\ldots,\operatorname{f}^{K}(S_{T^{\prime}}),I^{\operatorname{f}}_{1},\ldots,I^{\operatorname{f}}_{|I|},\operatorname{r}^{1}(S_{T^{\prime}}),\ldots,\operatorname{r}^{K}(S_{T^{\prime}}),I^{\operatorname{r}}_{1},\ldots,I^{\operatorname{r}}_{|I|}\}$
$E(G^{\prime})\leftarrow\{(i,j)\mid i,j\in V(G^{\prime}),i\neq j\}\setminus\{(\operatorname{f}^{1}(S_{T^{\prime}}),\operatorname{r}^{1}(S_{T^{\prime}})),\ldots,(\operatorname{f}^{K}(S_{T^{\prime}}),\operatorname{r}^{K}(S_{T^{\prime}})),(I^{\operatorname{f}}_{1},I^{\operatorname{r}}_{1}),\ldots,(I^{\operatorname{f}}_{|I|},I^{\operatorname{r}}_{|I|})\}$
$w(G^{\prime})\leftarrow w(G)$
$q_{T^{\prime}}\leftarrow\operatorname{RandomSelect}(V(G^{\prime}))$
for $t=T^{\prime}+1,\ldots,T^{\prime}+K+|I|-1$ do
  $p_{t}\leftarrow$ the other end of the subpath of vertex $q_{t-1}$
  $\mathcal{A}_{s}\leftarrow\{(p_{t},q_{t})\mid(p_{t},q_{t})\in E(G^{\prime}),q_{t}\notin\{q_{T^{\prime}},p_{T^{\prime}+1},q_{T^{\prime}+1},\ldots,p_{t-1},q_{t-1}\}\}$
  Apply parameterized policy to obtain $(p_{t},q_{t})=A_{t}\in\mathcal{A}_{s}$
  $s\leftarrow s+A_{t}$
end for
$p_{T^{\prime}+K+|I|}\leftarrow$ the other end of the subpath of vertex $q_{T^{\prime}+K+|I|-1}$
$A_{T^{\prime}+K+|I|}\leftarrow(p_{T^{\prime}+K+|I|},q_{T^{\prime}})$
$s\leftarrow s+A_{T^{\prime}+K+|I|}$
return $s$

Figure 1: Illustration of the CARSS algorithm for solving the TSP. In this instance, there are 10 vertices, with the number of agents set to $K=3$ , termination time of subpath generation set to $T^{\prime}=2$ , and the number of isolated vertices denoted as $|I|=1$ . The solid gray lines and arrowed lines represent the action space and the selected actions, respectively. This example entails solving assignment subproblems during two rounds of subpath generation. Within each assignment, dashed lines illustrate the assignment relationships between agent subpaths and unconnected vertices, with corresponding labels indicating the order of assignments.

3.3 Policy Parameterization

3.3.1 Subpath Generation Parameterization

The input data to the subpath generation stage within the CARSS algorithm comprises the 2D Euclidean spatial coordinates of the graph’s vertex set $V(G)$ , represented as $X\in\mathbb{R}^{v(G)\times 2}$ . The action index $U^{k}_{t}$ is determined by the probability distribution of feasible actions for agent $k$ at time step $t$ within state $S_{t}$ , denoted as $\pi^{d}_{\boldsymbol{\theta}}(U^{1}_{t},\ldots,U^{K}_{t}\mid G,S_{t})$ . This index is selected from the set $\{1,\ldots,v(G)/K\}$ , ultimately defining the action $A^{k}_{t}$ .

To efficiently capture information about neighboring vertices for each vertex, we employ a mapping from the 2D vertex coordinates $X\in\mathbb{R}^{v(G)\times 2}$ to higher-dimensional vertex features $h^{v}_{i},i=1,\ldots,v(G)$ :

$$\begin{aligned}H^{v}_{l}&=\operatorname{FFN}\left(\operatorname{MHA}\left(H^{v}_{l-1},H^{v}_{l-1},H^{v}_{l-1},J_{v(G)}\right)\right)\in\mathbb{R}^{v(G)\times d_{v}},\\ H^{v}_{L^{\text{enc}_{v}}}&=[h^{v}_{1},\cdots,h^{v}_{v(G)}]^{T}.\end{aligned}$$

Here, $H^{v}_{0}=XW^{x}+\boldsymbol{1}_{v(G)}b^{x}$ , where $W^{x}\in\mathbb{R}^{2\times d_{v}}$ and $b^{x}\in\mathbb{R}^{1\times d_{v}}$ are trainable parameters. The parameter $d_{v}$ indicates the dimension of vertex features, $l\in\{1,\ldots,L^{\text{enc}_{v}}\}$ represents the layer index of the vertex encoder within the subpath generation stage, and $J_{v(G)}$ denotes the $v(G)$ -order square matrix filled with $1$ entries, used as the attention mask.
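To make the role of the all-ones mask concrete, the following is a heavily simplified sketch of masked self-attention over a feature matrix. It is an assumption-laden stand-in for one encoder layer: it uses a single head, no learned query, key, or value projections, and no feed-forward block, and it is written with plain Python lists for readability rather than speed. The name `masked_self_attention` is illustrative.

```python
import math


def masked_self_attention(features, mask):
    """Single-head scaled dot-product self-attention.

    features: list of row vectors (one per vertex or agent)
    mask:     mask[i][j] == 1 lets row i attend to row j; every row is
              assumed to have at least one unmasked entry
    Simplification: no learned projections, heads, or FFN.
    """
    dim = len(features[0])
    output = []
    for i, query in enumerate(features):
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            if mask[i][j] else -math.inf
            for j, key in enumerate(features)
        ]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(score - top) for score in scores]
        total = sum(weights)
        output.append([
            sum(w / total * value[c] for w, value in zip(weights, features))
            for c in range(dim)
        ])
    return output
```

With an all-ones mask every row attends to every other row, which is the unrestricted case; an identity mask would make each row attend only to itself and return the input unchanged.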

Drawing inspiration from MAPDP (Zong et al., 2022), we concatenate the feature vectors of both the front and rear vertices across all agents. This strategy facilitates the sharing of positional information among agents and yields the global information feature vector for agent $k$ :

$$\operatorname{comm}^{k}_{t}=\operatorname{FFN}([h^{v}_{\operatorname{f}^{k}(S_{t})},h^{v}_{\operatorname{r}^{k}(S_{t})}])\in\mathbb{R}^{1\times d_{v}}.$$

Subsequently, it is concatenated with the feature vectors of the agent’s front and rear vertices and with the mean feature vector of the vertices not yet visited, thus forming the feature vector for agent $k$ :

$$\operatorname{context}^{k}_{t}=\operatorname{FFN}\left(\left[h^{v}_{\operatorname{f}^{k}(S_{t})},h^{v}_{\operatorname{r}^{k}(S_{t})},\frac{1}{|v(G)-v(S_{t})|}\sum_{i\in V(G)\setminus V(S_{t})}h^{v}_{i},\operatorname{comm}^{k}_{t}\right]\right)\in\mathbb{R}^{1\times d_{v}}.$$

To facilitate cooperative interactions, a multi-head attention mechanism is then employed on the feature vectors of the aforementioned $K$ agents. This mechanism enhances the exchange of vital information among the agents:

$$\begin{aligned}H^{a}_{t,0}&=[\operatorname{context}^{1}_{t};\ldots;\operatorname{context}^{K}_{t}]\in\mathbb{R}^{K\times d_{v}},\\ H^{a}_{t,l}&=\operatorname{FFN}\left(\operatorname{MHA}\left(H^{a}_{t,l-1},H^{a}_{t,l-1},H^{a}_{t,l-1},J_{K}\right)\right)\in\mathbb{R}^{K\times d_{v}}.\end{aligned}$$

Here, $l\in\{1,\ldots,L^{\text{enc}_{a}}\}$ denotes the layer index of the agent encoder, and $J_{K}$ represents a $K$ -order square matrix composed entirely of $1$ entries.

Inspired by the utilization of the original Transformer model by Bresson and Laurent (2021) for addressing the TSP, we have introduced a novel concept of memory vectors with increasing lengths over time. These vectors enhance the model’s capacity to incorporate historical information progressively:

$$\operatorname{memory}_{t}=[\operatorname{memory}_{t-1};h^{v}_{i:(i,j)=A_{t-1}}]\in\mathbb{R}^{K\times t\times d_{v}}.$$
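The growth of the memory along the time axis can be illustrated with nested lists. This is only a shape-level sketch: the feature vectors are placeholders rather than encoder outputs, and the name `extend_memory` is illustrative.

```python
def extend_memory(memory, new_features):
    """Append one feature vector per agent along the time axis.

    memory:       list of K per-agent histories, each a list of vectors
    new_features: list of K vectors, the features of the vertices that
                  the agents selected at the previous step
    Returns a new K x (t + 1) x d nested list; the input is not modified.
    """
    if len(memory) != len(new_features):
        raise ValueError("expected one new feature vector per agent")
    return [history + [vector] for history, vector in zip(memory, new_features)]


num_agents, dim = 3, 4
memory = [[] for _ in range(num_agents)]  # empty history for each agent
for step in range(2):
    # Placeholder features; a real model would use encoder outputs h^v_i.
    selected = [[float(step)] * dim for _ in range(num_agents)]
    memory = extend_memory(memory, selected)
```

After $t$ updates the structure has shape $K\times t\times d_{v}$, so the attention cost over the memory grows with the number of elapsed steps rather than with $v(G)$.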

Subsequently, these feature vectors and memory vectors of the agents are harnessed as inputs to the multi-head attention mechanism. This integration enables the model to adeptly capture the characteristics of partial solutions:

$$\begin{aligned}h^{d}_{t,0}&=\operatorname{Reshape}\left(H^{a}_{t,L^{\text{enc}_{a}}},(K,1,d_{v})\right),\\ h^{d}_{t,l}&=\operatorname{FFN}(\operatorname{MHA}(h^{d}_{t,l-1},\operatorname{memory}_{t},\operatorname{memory}_{t}))\in\mathbb{R}^{K\times 1\times d_{v}}.\end{aligned}$$

Here, the operation $\operatorname{Reshape}$ alters the dimensions of the tensor $H^{a}_{t,L^{\text{enc}_{a}}}$ , the output of the last agent encoder layer, from $\mathbb{R}^{K\times d_{v}}$ to $\mathbb{R}^{K\times 1\times d_{v}}$ , preserving its elements while aligning its dimensions with those of the memory vector $\operatorname{memory}_{t}$ at time step $t$ . $l\in\{1,\ldots,L^{\text{dec}_{g}}\}$ designates the decoder layer index.

Subsequently, we construct feature vectors for each assigned vertex of every agent along with their corresponding masks. It should be noted that in Subsection 3.2, the solution to the assignment problem $x_{ik}$ has been approximated using a heuristic algorithm. Each non-zero element of $x_{ik}$ represents the vertex $i$ being assigned to the agent $k$ . However, the number of vertices assigned to each agent, denoted as $\alpha_{k}=|\{i\mid x_{ik}=1\}|$ , varies. To enable an efficient computation of the policy, we propose selecting feature vectors of $v(G)/K$ vertices that are assigned to each agent based on their proximity. If the feasible actions are fewer than $\frac{v(G)}{K}$ , we supplement the feature vectors with those of arbitrarily chosen infeasible vertices. This process yields the feature vector $\operatorname{assign}_{t}$ , which represents the assignments. Subsequently, we utilize the mask $M_{t}^{\operatorname{assign}}$ to prevent the model from generating these additionally introduced actions. This can be formally expressed as follows:

$$
\begin{aligned}
\operatorname{assign}^{k}_{t} &= \begin{cases}\left[[h^{v}_{i}]_{i:x_{ik}=1};[h^{v}_{i}]_{i:x_{ik}=0}[:\frac{v(G)}{K}-\alpha_{k}]\right]&\text{if }\alpha_{k}<\frac{v(G)}{K},\\ [h^{v}_{i}]_{i:x_{ik}=1}[:\frac{v(G)}{K}]&\text{otherwise}\end{cases}\in\mathbb{R}^{\frac{v(G)}{K}\times d_{v}},\\
\operatorname{assign}_{t} &= [\operatorname{assign}^{1}_{t};\ldots;\operatorname{assign}^{K}_{t}]\in\mathbb{R}^{K\times\frac{v(G)}{K}\times d_{v}},\\
M_{t}^{\operatorname{assign}} &= \{M_{tik}^{\operatorname{assign}}\}_{i=1,\ldots,\frac{v(G)}{K},\,k=1,\ldots,K}\in\mathbb{R}^{K\times 1\times\frac{v(G)}{K}},\quad M_{tik}^{\operatorname{assign}}=\begin{cases}1&\text{if }x_{\operatorname{map}(i)k}=1,\\ 0&\text{otherwise}.\end{cases}
\end{aligned}
$$

Here, the notation $[:i]$ signifies selecting the first $i$ rows of the matrix, and the function $\operatorname{map}(\cdot)$ serves to associate the mask index back to the vertex index of the initial instance $G$ during the subpath generation stage.
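The padding and masking step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `build_assignment`, the proximity score `dist_to_agent`, and the choice to pad with the closest infeasible vertices (the text only says "arbitrarily chosen") are assumptions made for illustration, and indices are 0-based.

```python
import numpy as np

def build_assignment(h_v, x, dist_to_agent):
    """Pad or truncate each agent's assigned vertices to m = n // K slots.

    h_v:           (n, d) vertex feature vectors
    x:             (n, K) 0/1 matrix, x[i, k] = 1 if vertex i is assigned to agent k
    dist_to_agent: (n, K) hypothetical proximity score (smaller means closer)
    Returns assign (K, m, d), mask (K, 1, m) and index_map (K, m);
    index_map plays the role of map(.) from slot index to vertex index.
    """
    n, K = x.shape
    m = n // K
    assign = np.zeros((K, m, h_v.shape[1]))
    mask = np.zeros((K, 1, m))
    index_map = np.zeros((K, m), dtype=int)
    for k in range(K):
        order = np.argsort(dist_to_agent[:, k], kind="stable")
        feasible = [i for i in order if x[i, k] == 1]
        infeasible = [i for i in order if x[i, k] == 0]
        # Feasible vertices first; infeasible ones only fill the remaining slots.
        chosen = (feasible + infeasible)[:m]
        index_map[k] = chosen
        assign[k] = h_v[chosen]
        mask[k, 0] = x[chosen, k]  # 1 for assigned vertices, 0 for padding
    return assign, mask, index_map
```

Keeping every agent at exactly $\frac{v(G)}{K}$ slots is what allows the $K$ policies to be evaluated as one batched tensor operation.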

Finally, for each agent $k$ we identify the index of the selected feasible action at time step $t$ within the assignment list, denoted as $U^{k}_{t}\in\{1,\ldots,\frac{v(G)}{K}\}$. We then map $U^{k}_{t}$ back to the vertex index of the original graph and select the edge that joins this vertex to the closer of the front and rear endpoints of that agent’s subpath. This vertex and edge combination serves as the action $A^{k}_{t}$ at time step $t$, with $t$ ranging from $1$ to $T^{\prime}$. The probability distribution of $U^{1}_{t},\ldots,U^{K}_{t}$ is formulated using the input features of the agents, namely $h^{d}_{t,L^{\text{dec}_{g}}}$, the vertex feature vectors for assignment, denoted as $\operatorname{assign}_{t}$, and the mask $M^{\operatorname{assign}}_{t}$. This distribution is computed according to the following equation:

$$
\pi^{d}_{\boldsymbol{\theta}}(U^{1}_{t},\ldots,U^{K}_{t}\mid G,S_{t})=\operatorname{softmax}\Bigg(C\operatorname{tanh}\Big(M^{\operatorname{assign}}_{t}\odot\big((h^{d}_{t,L^{\text{dec}_{g}}}W^{d}_{1}+b^{d}_{1})(\operatorname{assign}_{t}W^{d}_{2}+\boldsymbol{1}_{K}b^{d}_{2})^{T}/\sqrt{d}\big)\Big)\Bigg).
$$

Here, $C=10$ serves as a clipping threshold. The parameters $W^{d}_{1},W^{d}_{2}\in\mathbb{R}^{d_{v}\times d_{v}}$ and $b^{d}_{1},b^{d}_{2}\in\mathbb{R}^{1\times d_{v}}$ are trainable.
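As a rough illustration of how such a masked, clipped compatibility score can be turned into one distribution per agent, consider the NumPy sketch below. The function name `agent_policy` is hypothetical. The sketch also makes one assumption that departs from the equation as written: instead of multiplying by the mask, it sets the logits of masked slots to $-\infty$, which is the usual way to guarantee zero probability for padded actions.

```python
import numpy as np

def agent_policy(h_dec, assign, mask, W1, b1, W2, b2, C=10.0):
    """Per-agent action distribution over assignment slots.

    h_dec:  (K, 1, d) decoder output for each agent
    assign: (K, m, d) padded assignment features
    mask:   (K, 1, m) with 1 = feasible slot, 0 = padding
    Assumes every agent has at least one feasible slot.
    Returns probabilities of shape (K, m).
    """
    d = h_dec.shape[-1]
    q = h_dec @ W1 + b1                               # (K, 1, d) queries
    k = assign @ W2 + b2                              # (K, m, d) keys
    scores = q @ np.swapaxes(k, 1, 2) / np.sqrt(d)    # (K, 1, m)
    logits = C * np.tanh(scores)                      # clip to [-C, C]
    # Assumption: masked slots receive -inf rather than a multiplicative mask.
    logits = np.where(mask > 0, logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return (p / p.sum(axis=-1, keepdims=True))[:, 0, :]
```

Because the $\tanh$ output is scaled by $C$, feasible logits always lie in $[-C, C]$, which bounds how peaked the resulting distribution can become.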

Once the index $U^{k}_{t}$ for agent $k$ within the assignment vector at time step $t$ is determined, the corresponding action can be computed as:

$$
A^{k}_{t}=\left(\operatorname{map}(U^{k}_{t}),\ j=\operatorname{argmin}_{j\in\{\operatorname{f}^{k}(S_{t}),\operatorname{r}^{k}(S_{t})\}}w_{\operatorname{map}(U^{k}_{t})j}\right).
$$

Here, the operation $\operatorname{map}(\cdot)$ serves to establish a correspondence between the indices within the mask and the vertices of instance graph $G$ . This process results in connecting that vertex to the nearest endpoint along the subpath represented by agent $k$ .
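The mapping from a slot index to an action amounts to a lookup followed by a two-way comparison. The sketch below is illustrative only: `to_action` and its arguments are hypothetical names, indices are 0-based, and the edge weight $w_{ij}$ is assumed to be the Euclidean distance between vertex coordinates.

```python
import numpy as np

def to_action(u, index_map_k, front, rear, coords):
    """Map slot index u of one agent to the action (vertex, endpoint).

    index_map_k: slot-to-vertex lookup for this agent (the map(.) of the text)
    front, rear: vertex indices of the agent's current subpath endpoints
    coords:      (n, 2) vertex coordinates; w_ij is assumed Euclidean
    """
    i = int(index_map_k[u])
    w_front = np.linalg.norm(coords[i] - coords[front])
    w_rear = np.linalg.norm(coords[i] - coords[rear])
    # Attach the vertex to whichever endpoint is closer.
    j = front if w_front <= w_rear else rear
    return i, j
```

For example, with endpoints at $(0,0)$ and $(1,0)$, a selected vertex at $(0.9,0.1)$ is attached to the rear endpoint, while one at $(0.2,0)$ is attached to the front.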

3.3.2 Subpath Merging Parameterization

The input to the subpath merging phase of the CARSS algorithm is the graph $G^{\prime}$. It is used together with the probability distribution over feasible actions at time $t$ under the state $S_{t}$, denoted as $\pi^{c}_{\boldsymbol{\phi}}(U_{t}\mid G^{\prime},S_{t})$. The vertex index $U_{t}$ selected from the graph $G^{\prime}$ determines the eventual action $A_{t}$, with $U_{t}$ taking values in the set $\{1,\ldots,2(K+|I|)\}$.

Note that the input graph for this phase has $2(K+|I|)$ vertices, comprising the front and rear endpoints of the paths corresponding to each agent from the previous phase, and that the total number of edges to be added is $K+|I|$. To extract information efficiently from the opposite end of the same subpath as well as from other vertices in the neighborhood, a two-step process is employed. First, the two-dimensional coordinates of the front (rear) vertex are concatenated with the two-dimensional coordinates of the rear (front) vertex. Subsequently, the concatenated coordinates are projected into a higher-dimensional vertex feature space, yielding $h^{v^{\prime}}_{i},i=1,\ldots,v(G^{\prime})$. This transformation can be expressed as follows:

$$
\begin{aligned}
X^{\prime} &= [[X_{\operatorname{f}^{1}(S_{T^{\prime}})},X_{\operatorname{r}^{1}(S_{T^{\prime}})}];\ldots;[X_{\operatorname{f}^{K}(S_{T^{\prime}})},X_{\operatorname{r}^{K}(S_{T^{\prime}})}];[X_{I_{1}},X_{I_{1}}];\ldots;[X_{I_{v(G)-KT^{\prime}}},X_{I_{v(G)-KT^{\prime}}}]]\in\mathbb{R}^{v(G^{\prime})\times 4},\\
H^{v^{\prime}}_{0} &= X^{\prime}W^{x^{\prime}}+\boldsymbol{1}_{v(G^{\prime})}b^{x^{\prime}}\in\mathbb{R}^{v(G^{\prime})\times d_{v}},\\
H^{v^{\prime}}_{l} &= \operatorname{FFN}\left(\operatorname{MHA}\left(H^{v^{\prime}}_{l-1},H^{v^{\prime}}_{l-1},H^{v^{\prime}}_{l-1},H^{v^{\prime}}_{l-1},J_{v(G^{\prime})}\right)\right)\in\mathbb{R}^{v(G^{\prime})\times d_{v}},\\
H^{v^{\prime}}_{L^{\text{enc}_{v^{\prime}}}} &= [h^{v^{\prime}}_{1},\cdots,h^{v^{\prime}}_{v(G^{\prime})}]^{T}.
\end{aligned}
$$

Here, $X_{i}\in\mathbb{R}^{1\times 2}$ signifies the coordinates of the $i$th vertex in graph $G$. The parameters $W^{x^{\prime}}\in\mathbb{R}^{4\times d_{v}}$ and $b^{x^{\prime}}\in\mathbb{R}^{1\times d_{v}}$ are trainable. The dimensionality of the feature vector is denoted by $d_{v}$. The layer index of the vertex encoder at the subpath merging stage is denoted by $l\in\{1,\ldots,L^{\text{enc}_{v^{\prime}}}\}$, and $J_{v(G^{\prime})}$ represents a square matrix of order $v(G^{\prime})$ with all elements equal to $1$.

Building upon Kool et al.’s pioneering work on reinforcement learning for routing problems (Kool et al., 2018), we adopt a similar approach to formulate the feature representation of states, so that the relevant information in the graph’s vertex features is incorporated into the state representation. Specifically, the state representation concatenates the average of the vertex feature vectors of the graph, the feature vector of the initially selected vertex, and the feature vector of the most recently selected vertex:

$$
\begin{aligned}
h_{\operatorname{graph}} &= \frac{1}{v(G^{\prime})}\sum_{i=1}^{v(G^{\prime})}h^{v^{\prime}}_{i}\in\mathbb{R}^{1\times d_{v}},\\
h_{\operatorname{front}} &= h^{v^{\prime}}_{U_{T^{\prime}+1}}\in\mathbb{R}^{1\times d_{v}},\\
h_{\operatorname{rear}} &= h^{v^{\prime}}_{U_{t}}\in\mathbb{R}^{1\times d_{v}},\\
h_{\operatorname{state}} &= [h_{\operatorname{graph}},h_{\operatorname{front}},h_{\operatorname{rear}}]W^{s}+\boldsymbol{1}_{v(G^{\prime})}b^{s}\in\mathbb{R}^{v(G^{\prime})\times d_{v}}.
\end{aligned}
$$

Here, $W^{s}\in\mathbb{R}^{3d_{v}\times d_{v}}$ and $b^{s}\in\mathbb{R}^{1\times d_{v}}$ are trainable parameters. The parameter $U_{t}$ corresponds to the vertex indices selected from the graph $G^{\prime}$ during the subpath merging phase at time step $t$ .

Ultimately, we determine the index $U_{t}\in\{1,\ldots,2(K+|I|)\}$ of the feasible action chosen at time $t$ from the set of vertices in graph $G^{\prime}$ using the parameterized policy $\pi^{c}_{\boldsymbol{\phi}}(U_{t}\mid G^{\prime},S_{t})$. This index is then translated back to the vertex indices of the original graph to yield action $A_{t}$ at time $t$, where $t\in\{T^{\prime}+1,\ldots,T^{\prime}+K+|I|-1\}$. The probability distribution of $U_{t}$ is computed from the input state feature vector $h^{p^{\prime}}_{t,0}=h_{\operatorname{state}}$, the vertex feature vectors of graph $G^{\prime}$, denoted as $H^{v^{\prime}}_{L^{\text{enc}_{v^{\prime}}}}$, and the mask $M_{t}$. The calculation is as follows:

$$
\begin{aligned}
h^{p^{\prime}}_{t,l} &= \operatorname{MHA}(h^{p^{\prime}}_{t,l-1},H^{v^{\prime}}_{L^{\text{enc}_{v^{\prime}}}},H^{v^{\prime}}_{L^{\text{enc}_{v^{\prime}}}},M_{t}),\\
\pi^{c}_{\boldsymbol{\phi}}(U_{t}\mid G^{\prime},S_{t}) &= \operatorname{softmax}\Bigg(C\operatorname{tanh}\Big(M_{t}\odot(h^{p^{\prime}}_{t,L^{\text{dec}_{c}}}W^{c}_{1}+b^{c}_{1})(H^{v^{\prime}}_{L^{\text{enc}_{v^{\prime}}}}W^{c}_{2}+\boldsymbol{1}_{v(G^{\prime})}b^{c}_{2})^{T}/\sqrt{d}\Big)\Bigg).
\end{aligned}
$$

In the equations above, $l\in\{1,\ldots,L^{\text{dec}_{c}}\}$ denotes the decoder layer index specific to the subpath merging stage. The parameter $C$ is set to $10$ as a clipping threshold, while $M_{t}$ is the mask over the vertices that have not yet been visited. $W^{c}_{1},W^{c}_{2}\in\mathbb{R}^{d_{v}\times d_{v}}$ and $b^{c}_{1},b^{c}_{2}\in\mathbb{R}^{1\times d_{v}}$ are trainable parameters. Once the vertex index $U_{t}$ is established on graph $G^{\prime}$, chosen by agent $k$ at time $t$, the corresponding action becomes

$$
A_{t}=\left(V_{U_{t-1}+(K+|I|)\cdot(\operatorname{bool}(U_{t-1}<K+|I|))}(G^{\prime}),V_{U_{t}}(G^{\prime})\right).
$$

Here, $V_{i}(G^{\prime})$ denotes the $i$ -th vertex in graph $G^{\prime}$ , and $\operatorname{bool}(\cdot)$ functions as a logic operator returning either $0$ or $1$ . It serves to determine whether $U_{t-1}$ corresponds to the front or rear vertex. If it represents the front vertex, its index is increased by $K+|I|$ to ensure that the newly selected vertex at time $t$ connects to its endpoint. Conversely, if it represents the rear vertex, its index is decreased by $K+|I|$ to link it properly.

At the final time step $t=T^{\prime}+K+|I|$ , a decisive selection of a singular edge ensures the formation of a cycle. As a result, the need for parameterized strategies is obviated.

3.4 Policy Optimization

This section introduces optimization methods for the parameterized policies used in the subpath generation and subpath merging phases of the CARSS algorithm. In the subpath generation phase, $N$ groups of starting vertices $\{U^{n,1}_{0},\ldots,U^{n,K}_{0}\}_{n=1}^{N}$ are first selected from the vertex set $V(G)$, with distinct vertices within each group. Subsequently, $N$ trajectories are sampled for the same instance from the policy $\pi^{d}_{\boldsymbol{\theta}}$. This results in trajectories of states, assignment vector indices, and rewards $\{(S_{t-1}^{n},U_{t-1}^{n,1},\ldots,U_{t-1}^{n,K},R_{t}^{n})_{t=1}^{T^{\prime}}\}_{n=1}^{N}$, where $K$ is the number of agents and $T^{\prime}$ is the termination time of the subpath generation phase.

In the subpath merging phase, $2(K+|I|)$ sets of initial vertex indices are established at time $T^{\prime}+1$ in graph $G^{\prime}$. Specifically, indices are assigned as $U_{T^{\prime}+1}^{1,1}=U_{T^{\prime}+1}^{2,1}=\ldots=U_{T^{\prime}+1}^{N,1}=1,\ldots,U_{T^{\prime}+1}^{1,2(K+|I|)}=U_{T^{\prime}+1}^{2,2(K+|I|)}=\ldots=U_{T^{\prime}+1}^{N,2(K+|I|)}=2(K+|I|)$. Then, using the policy $\pi^{c}_{\boldsymbol{\phi}}$, a trajectory is sampled independently from each of the $2(K+|I|)$ initial indices. This process yields another set of trajectories of states, vertex indices in graph $G^{\prime}$, and rewards $\{\{(S_{t-1}^{n,m},U_{t-1}^{n,m},R_{t}^{n,m})_{t=T^{\prime}+1}^{T^{\prime}+K+|I|}\}_{m=1}^{2(K+|I|)}\}_{n=1}^{N}$. Here, $I=V(G)\setminus V(S_{T^{\prime}})$ signifies the set of isolated vertices in graph $G$ at the end of the subpath generation phase, and $V_{i}(G^{\prime})$ represents the $i$-th vertex in graph $G^{\prime}$.

As defined in Section 3.1, the reward functions are structured such that $R_{t}^{n}=0$ for all $t\in\{1,\ldots,T^{\prime}\}$ and $R_{t}^{n,m}=0$ for all $t\in\{T^{\prime}+1,\ldots,T^{\prime}+K+|I|-1\}$, for all $n\in\{1,\ldots,N\}$ and $m\in\{1,\ldots,2(K+|I|)\}$. Non-zero rewards are solely associated with the final step, $R_{T^{\prime}+K+|I|}^{n,m}$, for all $n\in\{1,\ldots,N\}$ and $m\in\{1,\ldots,2(K+|I|)\}$, and represent the final circuit length.

When considering the policy gradient for each agent, the remaining agents are treated as part of the environment. With reference to the Policy Gradient Theorem (Sutton et al., 1999), the gradient of the expected cumulative reward can be approximated as follows:

$$
\begin{aligned}
\nabla J(\boldsymbol{\theta}) &\approx \frac{1}{N}\sum_{n=1}^{N}\frac{1}{K}\sum_{k=1}^{K}\left(\min_{m\in\{1,\ldots,2(K+|I|)\}}R_{T^{\prime}+K+|I|}^{n,m}-b^{d}\right)\nabla\log\prod_{t=1}^{T^{\prime}}\pi_{\boldsymbol{\theta}}(U_{t}^{n,k}\mid G,S_{t}^{n,k}),\\
\nabla J(\boldsymbol{\phi}) &\approx \frac{1}{N}\sum_{n=1}^{N}\frac{1}{2(K+|I|)}\sum_{m=1}^{2(K+|I|)}\left(R_{T^{\prime}+K+|I|}^{n,m}-b^{c}\right)\nabla\log\prod_{t=T^{\prime}+1}^{T^{\prime}+K+|I|}\pi_{\boldsymbol{\phi}}(U_{t}^{n,m}\mid G^{\prime},S_{t}^{n,m}).
\end{aligned}
$$

Here, $b^{d}=\frac{1}{N}\sum_{n=1}^{N}\min_{m\in\{1,\ldots,2(K+|I|)\}}R_{T^{\prime}+K+|I|}^{n,m}$ and $b^{c}=\frac{1}{2(K+|I|)}\sum_{m=1}^{2(K+|I|)}R_{T^{\prime}+K+|I|}^{n,m}$. Both are instances of the Policy Optimization with Multiple Optima (POMO) baseline (Kwon et al., 2020). The former averages, over the $N$ trajectories sampled from randomly chosen starting vertices in the subpath generation stage, the best final circuit length obtained for each trajectory. The latter averages, for a given $n$, the final circuit lengths of the $2(K+|I|)$ decodings started from the different vertices of $G^{\prime}$ in the subpath merging stage.
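To make the two baselines concrete, the following NumPy sketch computes the advantage terms that multiply the log-probability gradients. It is a minimal illustration under stated assumptions: the function name `pomo_advantages` is hypothetical, and the final rewards are assumed to be stored in a dense array `R` of shape $(N, M)$ with $M=2(K+|I|)$, which presumes the same $|I|$ for every sampled trajectory.

```python
import numpy as np

def pomo_advantages(R):
    """Advantage terms for the two policy gradients.

    R: (N, M) array, R[n, m] = final reward of generation sample n
       decoded from merging start vertex m.
    Returns (adv_gen, adv_merge) with shapes (N,) and (N, M).
    """
    best = R.min(axis=1)                  # min over m for each n
    b_d = best.mean()                     # baseline for the generation policy
    b_c = R.mean(axis=1, keepdims=True)   # per-n baseline for the merging policy
    return best - b_d, R - b_c
```

Both advantage arrays sum to zero by construction (over $n$ for the generation policy, and over $m$ within each $n$ for the merging policy), which is what centers the gradient estimate without any learned critic.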

We trained the model using the independent REINFORCE algorithm (Williams, 1992) with the POMO baseline, employing the Adam optimizer (Kingma and Ba, 2015) for parameter updates. The training procedure is detailed in Algorithm 3.

Algorithm 3 Independent REINFORCE algorithm with Policy Optimization with Multiple Optima baseline

1: Input: number of iterations $E$, batch size $B$, trajectory samples per instance $N$

2: Initialize $\boldsymbol{\theta},\boldsymbol{\phi}$

3: for $\text{epoch}=1,\ldots,E$ do

4: for $i=1,\ldots,B$ do

5: $G_{i}\leftarrow\operatorname{RandomInstance}()$

6: $T^{\prime}\leftarrow\left\lfloor\frac{v(G)}{K}\right\rfloor-(1\text{ if }K\text{ divides }v(G)\text{ else }0)$

7: for $n=1,\ldots,N$ do

8: for $k=1,\ldots,K$ do

9: $U_{0}^{n,k}\leftarrow\operatorname{RandomSelect}\left(V(G_{i})\setminus\{U_{0}^{n,k^{\prime}}\}_{k^{\prime}\in\{1,\ldots,k-1\}}\right)$

10: end for

11: $\{(S_{t-1}^{n},U_{t-1}^{n,1},\ldots,U_{t-1}^{n,K},R_{t}^{n})\}_{t=1}^{T^{\prime}}\leftarrow\operatorname{Rollout}(G_{i},\pi_{\boldsymbol{\theta}})$

12: $I\leftarrow V(G_{i})\setminus V(S^{n}_{T^{\prime}})$

13: $G^{\prime}_{i}\leftarrow\operatorname{ConstructSubgraph}(G_{i},S^{n}_{T^{\prime}})$

14: $U_{T^{\prime}+1}^{n^{\prime},m^{\prime}}\leftarrow m^{\prime},\ n^{\prime}\in\{1,\ldots,N\},\ m^{\prime}\in\{1,\ldots,2(K+|I|)\}$

15: $\{\{(S_{t-1}^{n,m},U_{t-1}^{n,m},R_{t}^{n,m})\}_{t=T^{\prime}+1}^{T^{\prime}+K+|I|}\}_{m=1}^{2(K+|I|)}\leftarrow\operatorname{Rollout}(G^{\prime}_{i},\pi_{\boldsymbol{\phi}})$

16: end for

17: $b^{d}\leftarrow\frac{1}{N}\sum_{n=1}^{N}\min_{m\in\{1,\ldots,2(K+|I|)\}}R_{T^{\prime}+K+|I|}^{n,m}$

18: $b^{c}\leftarrow\frac{1}{2(K+|I|)}\sum_{m=1}^{2(K+|I|)}R_{T^{\prime}+K+|I|}^{n,m}$

19: $\nabla J(\boldsymbol{\theta})\leftarrow\frac{1}{N}\sum_{n=1}^{N}\frac{1}{K}\sum_{k=1}^{K}\left(\min_{m\in\{1,\ldots,2(K+|I|)\}}R_{T^{\prime}+K+|I|}^{n,m}-b^{d}\right)\nabla\log\prod_{t=1}^{T^{\prime}}\pi_{\boldsymbol{\theta}}(U_{t}^{n,k}\mid G,S_{t}^{n,k})$

20: $\nabla J(\boldsymbol{\phi})\leftarrow\frac{1}{N}\sum_{n=1}^{N}\frac{1}{2(K+|I|)}\sum_{m=1}^{2(K+|I|)}\left(R_{T^{\prime}+K+|I|}^{n,m}-b^{c}\right)\nabla\log\prod_{t=T^{\prime}+1}^{T^{\prime}+K+|I|}\pi_{\boldsymbol{\phi}}(U_{t}^{n,m}\mid G^{\prime},S_{t}^{n,m})$

21: $\boldsymbol{\theta}\leftarrow\operatorname{Adam}(\boldsymbol{\theta},\nabla J(\boldsymbol{\theta}))$

22: $\boldsymbol{\phi}\leftarrow\operatorname{Adam}(\boldsymbol{\phi},\nabla J(\boldsymbol{\phi}))$

23: end for

24: end for

3.5 Complexity Analysis

In this section, we present an analysis of the overall time and space complexity of the CARSS algorithm. This analysis highlights the advantages of our approach in terms of complexity compared to classical reinforcement learning-based methods for solving TSP (Kool et al., 2018; Bresson and Laurent, 2021). Moreover, it underscores the potential for training on larger problem instances.

We first consider the algorithm’s time complexity. The algorithm uses the self-attention mechanism of several Transformer models during both encoding and decoding. For a sequence of length $n$, the time complexity of self-attention is $O(n^{2}d+nd^{2})$, as discussed in Vaswani et al. (2017), where $d$ represents the model’s dimension. For simplicity, we omit the term $nd^{2}$, since during execution $n$ is either comparable to $d$ or larger than $d$. We also disregard network-specific constants such as $L^{\text{enc}_{a}}$, $L^{\text{dec}_{g}}$, etc. Furthermore, the analysis focuses solely on the case where a single trajectory is sampled in both phases ($N=1$). The resulting complexity is:

$$
O\left(\underbrace{\left(\overbrace{K^{2}}^{H^{a}_{t,L^{\text{enc}_{a}}}}+\overbrace{KT^{\prime}}^{h^{d}_{t,L^{\text{dec}_{g}}}}+\overbrace{K\frac{v(G)}{K}}^{\pi^{d}_{\boldsymbol{\theta}}}\right)\cdot T^{\prime}d}_{\text{Subpath generation}}+\underbrace{\overbrace{(K+|I|)}^{\pi^{c}_{\boldsymbol{\phi}}}\cdot(K+|I|)d}_{\text{Subpath merging}}\right)=O\left(\left(Kv(G)+2\frac{\left(v(G)\right)^{2}}{K}+4K^{2}\right)d\right).
$$

Next, we consider the algorithm’s space complexity. For a sequence of length $n$, the space complexity of the self-attention module is $O(n^{2})$, as indicated by Vaswani et al. (2017). However, recent work has shown that for encoders this complexity can be reduced to $O(\sqrt{n})$ (Rabe and Staats, 2021). By employing this technique, the space complexity can be expressed as follows:

$$
O\left(\underbrace{\overbrace{K\sqrt{K}}^{H^{a}_{t,L^{\text{enc}_{a}}}}+\overbrace{K\left(T^{\prime}\right)^{2}}^{h^{d}_{t,L^{\text{dec}_{g}}}}+\overbrace{K\left(\frac{v(G)}{K}\right)^{2}}^{\pi^{d}_{\boldsymbol{\theta}}}}_{\text{Subpath generation}}+\underbrace{\overbrace{(K+|I|)^{2}}^{\pi^{c}_{\boldsymbol{\phi}}}}_{\text{Subpath merging}}\right)=O\left(2\frac{\left(v(G)\right)^{2}}{K}+4K^{2}+K^{\frac{3}{2}}\right).
$$

These expressions show that, compared with the conventional single-agent algorithm ($K=1$), having multiple agents cooperate to solve TSP reduces the time and space complexity of CARSS during both training and testing. Specifically, the dominant term $\left(v(G)\right)^{2}/K$ is approximately $\frac{1}{K}$ of its single-agent counterpart. This reduction translates into a substantial improvement in the scalability of the model under the same computational resources.
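The effect of $K$ on these bounds can be checked numerically. The sketch below simply evaluates the left-hand sides of the two expressions, with constants dropped; the function names are hypothetical, and it assumes $T^{\prime}=v(G)/K$ and $|I|=K$, which are the substitutions that reproduce the $2\left(v(G)\right)^{2}/K$ and $4K^{2}$ terms on the right-hand sides.

```python
def time_cost(v, K, d):
    """Leading-order attention time cost for graph size v, K agents, dimension d."""
    T = v / K                                  # assumed T' = v(G)/K
    generation = (K**2 + K * T + v) * T * d    # agent encoder + decoder + policy
    merging = (2 * K) ** 2 * d                 # (K + |I|)^2 d with |I| = K assumed
    return generation + merging

def space_cost(v, K):
    """Leading-order attention memory cost under the same assumptions."""
    T = v / K
    return K**1.5 + K * T**2 + K * (v / K) ** 2 + (2 * K) ** 2

# Ratio of single-agent to 10-agent cost for a 1000-vertex instance.
speedup = time_cost(1000, 1, 128) / time_cost(1000, 10, 128)
memory_ratio = space_cost(1000, 1) / space_cost(1000, 10)
```

Under these assumptions both ratios come out slightly below $K=10$, because the $Kv(G)$ and $4K^{2}$ terms grow with $K$ and partially offset the $1/K$ reduction of the quadratic term.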

4 Experiments

In this section, we outline the training process and experimental results of the CARSS algorithm. The training and test datasets are prepared following Kool et al. (2018). All instance vertices are drawn from the uniform distribution $U_{[0,1]\times[0,1]}$. The instance sizes, $v(G)$, are taken from $\{100,200,500,1000\}$, and the number of agents, $K$, is taken from $\{2,3,\ldots,10,20,25\}$. Both training and testing are performed on GeForce RTX 3090 GPUs: training uses a single GPU for instances with sizes less than 100 and two GPUs for the rest, whereas testing is always executed on a single GPU. The decoding strategy is greedy, selecting the action with the highest probability from the model’s action distribution. The optimization gap is computed as $(\text{Obj.}/\text{BKS}-1)\times 100\%$, with Obj. representing the cost of the solution found by a specific algorithm, and BKS denoting the cost of the best known solution of the instance.
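For reference, the optimization gap is a one-line computation. The helper name `optimization_gap` below is an illustrative assumption, not part of any released code.

```python
def optimization_gap(obj, bks):
    """Optimization gap in percent: (Obj. / BKS - 1) * 100."""
    return (obj / bks - 1.0) * 100.0

# Rounded averages for 100-vertex instances as reported in Table 1:
# CARSS (100, 2) Obj. = 8.09, Concorde BKS = 7.74.
gap = optimization_gap(8.09, 7.74)
```

With these rounded inputs the gap is about 4.52%, close to the 4.53% listed in Table 1; the small difference is consistent with the table being computed from unrounded per-instance values.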

The model’s hyperparameters are largely consistent with Kool et al. (2018). The vertex feature dimension and the hidden-layer dimension of the feedforward neural networks are set to $d_{v}=256$ and $d_{f}=512$, respectively. Within the subpath generation model, $H=8$ attention heads are employed. The encoder comprises $L^{\text{enc}_{v}}=3$ layers of vertex feature aggregation attention and $L^{\text{enc}_{a}}=3$ layers of agent feature aggregation attention, and the decoder has a single attention layer, $L^{\text{dec}_{g}}=1$. Here, the superscripts $\text{enc}_{v}$, $\text{enc}_{a}$, and $\text{dec}_{g}$ correspond to the encoder for vertex features, the encoder for agent features, and the decoder for generating policies, respectively. In the subpath merging model, $H=8$ attention heads are used, and the encoder and decoder have $L^{\text{enc}_{v^{\prime}}}=3$ and $L^{\text{dec}_{c}}=1$ attention layers, respectively, where the superscripts $\text{enc}_{v^{\prime}}$ and $\text{dec}_{c}$ pertain to the vertex encoder and policy decoder. The model undergoes $E=100$ iterations, with each iteration comprising $B=1000$ batches, and each batch containing 512 instances. The learning rate remains fixed at $10^{-4}$.

Under the aforementioned settings, training a single iteration of this model on a GeForce RTX 3090 GPU takes approximately 10 to 25 minutes for instances with 100 vertices, around 25 to 31 minutes for instances with 200 vertices, and about 49 minutes for instances with 500 vertices. It’s noteworthy that the training time for models with the same training set size varies based on the number of agents; larger numbers of agents correspond to shorter training times. For the single-agent Attention Model (AM) proposed by Kool et al. (2018), its training times on smaller instances align with those of the CARSS algorithm, potentially due to the relatively high number of sequential execution steps during training or excessive sampling, which suggests room for optimization. As for memory consumption, the model can be trained on instances with 100 or 200 vertices using a single GPU, consuming up to a maximum of 12000 MiB of memory. However, for instances with 500 vertices, two GPUs are required, with each consuming around 16000 MiB of memory. On the other hand, the AM model requires two GPUs for training on instances with 200 vertices, consuming approximately 15000 MiB of memory per card. For larger instances, dual-GPU training is infeasible. This highlights the substantial memory optimization improvements achieved by the CARSS algorithm during training.

4.1 Performance on Random Instances

As shown in Table 1, we tested the CARSS algorithm on randomly generated instances with up to 1000 vertices. The Best Known Solution (BKS) was obtained using solvers such as Concorde or Gurobi. Since conventional reinforcement learning-based methods perform worse than 2-opt and insertion algorithms on instances of this size, only these algorithms were included in the comparison. Test instances were generated randomly within the domain $U_{[0,1]\times[0,1]}$, with each instance size comprising $10000/v(G)$ samples. During testing, 4096 results were obtained using greedy decoding for each instance, and the best result and the solving time are reported. The average values were then computed over the instances of the same size. In the "Algorithm" column, "AM (sample)" represents the sample decoding version of the single-agent algorithm proposed by Kool et al. (2018), while $\text{CARSS}(v(G),K)$ indicates the CARSS algorithm trained with $K$ agents on graphs of size $v(G)$. As Gurobi’s solving time becomes prohibitively long for larger instances, it was not used to solve instances with 500 and 1000 vertices, and "–" indicates these untested results.

For instances with 100 vertices, CARSS (100,2) outperforms the farthest insertion algorithm (a gap of 4.53% versus 7.85%) but lags behind AM (sample) (2.39%). For instances with 200 vertices, CARSS (100,4) attains a gap of 12.02%, compared with 9.06% for farthest insertion and 7.48% for AM (sample). As the instance size increases to 500 and 1000 vertices, CARSS (500,20) reaches gaps of 24.35% and 34.41%, respectively; this is comparable to the nearest insertion algorithm at 500 vertices (24.66%) and behind it at 1000 vertices (25.32%), but far better than AM (sample) (36.82% and 85.96%). With more sampling iterations, the algorithm retains the potential to achieve better solutions. In terms of testing time, the CARSS algorithm is consistently faster than AM (sample).

Table 1: Results of CARSS algorithm on random instances

| Algorithm | Obj. (100) | Gap (100) | Time (100) | Obj. (200) | Gap (200) | Time (200) | Obj. (500) | Gap (500) | Time (500) | Obj. (1000) | Gap (1000) | Time (1000) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Concorde | 7.74 | 0.00% | 0.189s | 10.71 | 0.00% | 1.015s | 16.55 | 0.00% | 18.844s | 23.09 | 0.00% | 1.366m |
| Gurobi | 7.74 | 0.00% | 1.008s | 10.71 | 0.00% | 14.585s | – | – | – | – | – | – |
| 2-opt | 8.34 | 7.79% | 0.198s | 11.67 | 8.94% | 0.606s | 18.20 | 9.98% | 2.948s | 25.60 | 10.90% | 31.792s |
| FI | 8.34 | 7.85% | 0.006s | 11.68 | 9.06% | 0.022s | 18.26 | 10.37% | 0.160s | 25.74 | 11.52% | 1.014s |
| RI | 8.51 | 9.95% | 0.004s | 11.94 | 11.54% | 0.009s | 18.46 | 11.56% | 0.038s | 26.10 | 13.07% | 0.111s |
| NI | 9.45 | 22.20% | 0.006s | 13.28 | 23.97% | 0.022s | 20.63 | 24.66% | 0.153s | 28.93 | 25.32% | 0.999s |
| AM (sample) | 7.92 | 2.39% | 1.119m | 11.50 | 7.48% | 1.547m | 22.65 | 36.82% | 3.180m | 42.94 | 85.96% | 6.482m |
| CARSS (100, 2) | 8.09 | 4.53% | 7.998s | 12.11 | 13.03% | 16.577s | 21.87 | 32.15% | 44.526s | 35.28 | 52.83% | 1.671m |
| CARSS (100, 3) | 8.15 | 5.39% | 6.631s | 12.13 | 13.25% | 12.985s | 21.78 | 31.63% | 34.216s | 34.97 | 51.48% | 1.274m |
| CARSS (100, 4) | 8.12 | 4.93% | 5.589s | 12.00 | 12.02% | 11.232s | 21.24 | 28.36% | 29.356s | 33.85 | 46.62% | 1.081m |
| CARSS (100, 5) | 8.15 | 5.34% | 5.372s | 12.03 | 12.36% | 10.323s | 21.17 | 27.93% | 26.184s | 33.32 | 44.36% | 58.039s |
| CARSS (100, 6) | 8.23 | 6.44% | 5.239s | 12.26 | 14.51% | 9.595s | 21.56 | 30.30% | 24.265s | 34.03 | 47.40% | 53.644s |
| CARSS (100, 7) | 8.34 | 7.87% | 4.771s | 12.33 | 15.10% | 9.389s | 21.71 | 31.22% | 22.947s | 34.14 | 47.90% | 50.660s |
| CARSS (100, 8) | 8.34 | 7.83% | 5.345s | 12.33 | 15.10% | 10.649s | 22.04 | 33.20% | 23.163s | 34.78 | 50.67% | 50.139s |
| CARSS (100, 9) | 8.56 | 10.60% | 4.827s | 12.70 | 18.57% | 9.012s | 22.47 | 35.77% | 22.853s | 36.28 | 57.16% | 46.253s |
| CARSS (100, 10) | 8.18 | 5.76% | 7.802s | 12.13 | 13.23% | 11.628s | 21.37 | 29.12% | 23.896s | 34.08 | 47.60% | 49.034s |
| CARSS (200, 5) | 8.23 | 6.36% | 5.529s | 12.03 | 12.35% | 10.327s | 20.81 | 25.76% | 26.170s | 32.56 | 41.04% | 57.606s |
| CARSS (200, 10) | 8.23 | 6.38% | 7.801s | 12.10 | 12.98% | 11.667s | 20.84 | 25.95% | 24.418s | 32.29 | 39.88% | 49.065s |
| CARSS (500, 20) | 8.13 | 5.08% | 33.401s | 12.17 | 13.61% | 36.003s | 20.58 | 24.35% | 46.409s | 31.03 | 34.41% | 1.098m |

4.2 Sensitivity Analysis

This segment delves into the relationship between the training loss of the subpath generation model and the problem’s scale and the number of agents involved. The training process is categorized into three groups based on the instance size $v(G)$ , the number of agents $K$ , and the subproblem scale in the subpath merging phase $(K+|I|)$ . Figure 2 illustrates the relationship between the training loss $L(\boldsymbol{\theta})$ and the number of iterations $E$ . Solid lines represent the mean of the loss within each category, while shaded regions depict the fluctuation range in terms of standard deviation. A higher absolute value of the loss implies a greater potential for improvement with more training iterations.

From the graph, we observe that the training loss of the subpath generation model exhibits similar trends across various problem scales. This suggests the stability of the CARSS algorithm’s performance when training on different problem sizes, without encountering training difficulties due to excessively large problem scales. However, as the number of agents or the subproblem scale in the subpath merging phase increases, the absolute value of the loss diminishes and its rate of reduction slows down over the training process. This phenomenon can be attributed to the multi-modal nature of the cooperative multi-agent environment.

4.3 Example Solutions

As depicted in Figure 3, we selected an instance of size 100 and solved it with the CARSS algorithm under three configurations, with the number of agents $K$ set to $2$, $5$, and $10$, using a greedy decoding strategy. In the figure, solid black dots represent vertices in the instance, red solid dots indicate the initial vertices chosen by each agent, and hollow black dots depict isolated vertices not selected by any agent, totaling $|I|$ in number. Solid lines of different colors correspond to the subpaths of different agents in the final solution, amounting to a total of $K$ subpaths, while dashed lines denote the $K+|I|$ edges added to connect all agent subpaths and isolated vertices into a cycle during the subpath merging phase. The illustration shows that the CARSS algorithm captures the characteristics of optimal solutions for TSP. Solutions are of higher quality with fewer agents, and the route exhibits few overlapping segments. However, as the number of agents increases, a decline in performance can be observed. This might arise from the simplicity of the mapping from the vertex index $U^{n,k}_{t}$, output by the policy network during the subpath generation phase, to the action $A^{n,k}_{t}$ at time $t$. In this mapping, the chosen vertex is connected to the nearest end of the subpath, rather than being "inserted" into the current subpath as in algorithms like the farthest insertion method. This issue is evident in the solution provided by CARSS (100,10) for the loop in the upper right corner of the route, where the subpath formed by the blue agent shows such behavior.

5 Conclusion

In this paper, we introduced the CARSS algorithm, an approach for solving TSP using cooperative MARL. CARSS decomposes the TSP solving process into subpath generation and subpath merging steps, leveraging cooperative MARL to tackle the challenges posed by large-scale instances. By employing attention mechanisms for feature embedding and parameterization, CARSS enhances the agents’ ability to learn and generate high-quality solutions. The CARSS model is trained with the independent REINFORCE algorithm, which contributes to its efficiency and effectiveness.

Our contributions to the field are threefold: firstly, the introduction of CARSS, an innovative algorithmic framework that harnesses cooperative MARL for TSP solving; secondly, the integration of attention mechanisms, which significantly elevate the agents’ learning capabilities; and thirdly, the empirical demonstration of CARSS’s superiority over single-agent alternatives. Through comprehensive experiments, we showed that CARSS outperforms conventional approaches in terms of delivering reduced memory consumption, improved scalability, and notable reductions in testing time and optimization gaps for large-scale TSP instances.

As the fields of combinatorial optimization and reinforcement learning continue to evolve, CARSS offers a strategy that capitalizes on the synergy of multiple agents and attention mechanisms. While our work demonstrates clear gains on the TSP, future research could apply CARSS to other combinatorial optimization problems and further optimize the attention mechanisms to improve the agents’ learning efficiency. We anticipate that CARSS will help advance the capabilities of MARL in addressing complex real-world optimization challenges.

In conclusion, our study underscores the effectiveness of the CARSS algorithm, shedding light on its potential to revolutionize the way we approach TSP and related problems. By combining the strengths of cooperative MARL, attention mechanisms, and subpath synthesis, CARSS represents a significant stride toward efficient and scalable solutions for the TSP.

Acknowledgments

This research was supported by the National Key R&D Program of China (2021YFA1000403), the National Natural Science Foundation of China (No. 11991022), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA27000000), and the Fundamental Research Funds for the Central Universities.

References

Chvátal et al. [2009] Vašek Chvátal, William Cook, George B. Dantzig, Delbert R. Fulkerson, and Selmer M. Johnson. Solution of a large-scale traveling-salesman problem. In 50 Years of Integer Programming 1958-2008, pages 7–28. Springer Berlin Heidelberg, November 2009. doi:10.1007/978-3-540-68279-0_1. URL https://doi.org/10.1007/978-3-540-68279-0_1.
Held and Karp [1962] Michael Held and Richard M. Karp. A dynamic programming approach to sequencing problems. Journal of the Society for Industrial and Applied Mathematics, 10(1):196–210, March 1962. doi:10.1137/0110015. URL https://doi.org/10.1137/0110015.
Bellman [1962] Richard Bellman. Dynamic programming treatment of the travelling salesman problem. Journal of the ACM, 9(1):61–63, January 1962. doi:10.1145/321105.321111. URL https://doi.org/10.1145/321105.321111.
Rosenkrantz et al. [1974] D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis. Approximate algorithms for the traveling salesperson problem. In 15th Annual Symposium on Switching and Automata Theory (SWAT 1974). IEEE, October 1974. doi:10.1109/swat.1974.4. URL https://doi.org/10.1109/swat.1974.4.
Helsgaun [2000] Keld Helsgaun. An effective implementation of the Lin–Kernighan traveling salesman heuristic. European Journal of Operational Research, 126(1):106–130, October 2000. doi:10.1016/s0377-2217(99)00284-2. URL https://doi.org/10.1016/s0377-2217(99)00284-2.
Dorigo and Gambardella [1997] M. Dorigo and L.M. Gambardella. Ant colony system: a cooperative learning approach to the traveling salesman problem. IEEE Transactions on Evolutionary Computation, 1(1):53–66, April 1997. doi:10.1109/4235.585892. URL https://doi.org/10.1109/4235.585892.
Albrecht and Ramamoorthy [2013] Stefano V. Albrecht and Subramanian Ramamoorthy. A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS ’13, page 1155–1156, Richland, SC, 2013. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450319935.
Mordatch and Abbeel [2018] Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18. AAAI Press, 2018. ISBN 978-1-57735-800-8.
Lowe et al. [2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6382–6393, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
Samvelyan et al. [2019] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, and Shimon Whiteson. The StarCraft multi-agent challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, pages 2186–2188, Richland, SC, 2019. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450363099.
Christianos et al. [2020] Filippos Christianos, Lukas Schäfer, and Stefano V. Albrecht. Shared experience actor-critic for multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
Dhamankar et al. [2020] Gauraang Dhamankar, Jose R. Vazquez-Canteli, and Zoltan Nagy. Benchmarking multi-agent deep reinforcement learning algorithms on a building energy demand coordination task. In Proceedings of the 1st International Workshop on Reinforcement Learning for Energy Management in Buildings & Cities, RLEM’20, page 15–19, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450381932. doi:10.1145/3427773.3427870. URL https://doi.org/10.1145/3427773.3427870.
Kurach et al. [2020] Karol Kurach, Anton Raichuk, Piotr Stanczyk, Michal Zajac, Olivier Bachem, Lasse Espeholt, Carlos Riquelme, Damien Vincent, Marcin Michalski, Olivier Bousquet, and Sylvain Gelly. Google Research Football: A novel reinforcement learning environment. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 4501–4510. AAAI Press, 2020. URL https://ojs.aaai.org/index.php/AAAI/article/view/5878.
Bard et al. [2020] Nolan Bard, Jakob N. Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H. Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, Iain Dunning, Shibl Mourad, Hugo Larochelle, Marc G. Bellemare, and Michael Bowling. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence, 280:103216, 2020. ISSN 0004-3702. doi:10.1016/j.artint.2019.103216. URL https://www.sciencedirect.com/science/article/pii/S0004370219300116.
Vinyals et al. [2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, 2015.
Bello et al. [2016] Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. ArXiv, abs/1611.09940, 2016.
Khalil et al. [2017] Elias Boutros Khalil, Hanjun Dai, Yuyu Zhang, Bistra N. Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In NIPS, 2017.
Kool et al. [2018] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems! In International Conference on Learning Representations, 2018.
Bresson and Laurent [2021] Xavier Bresson and Thomas Laurent. The transformer network for the traveling salesman problem. ArXiv, abs/2103.03012, 2021.
Joshi et al. [2020] Chaitanya K. Joshi, Quentin Cappart, Louis-Martin Rousseau, Thomas Laurent, and Xavier Bresson. Learning TSP requires rethinking generalization. ArXiv, abs/2006.07054, 2020.
Zhang et al. [2020] Ke Zhang, Fang He, Zhengchao Zhang, Xi Lin, and Meng Li. Multi-vehicle routing problems with soft time windows: A multi-agent reinforcement learning approach. Transportation Research Part C: Emerging Technologies, 121:102861, December 2020. doi:10.1016/j.trc.2020.102861. URL https://doi.org/10.1016/j.trc.2020.102861.
Zong et al. [2022] Zefang Zong, Meng Zheng, Yong Li, and Depeng Jin. MAPDP: Cooperative multi-agent reinforcement learning to solve pickup and delivery problems. In AAAI Conference on Artificial Intelligence, 2022.
Vaswani et al. [2017] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, abs/1706.03762, 2017.
Sutton et al. [1999] Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999.
Kwon et al. [2020] Yeong-Dae Kwon, Jinho Choo, Byoungjip Kim, Iljoo Yoon, Seungjai Min, and Youngjune Gwon. POMO: Policy optimization with multiple optima for reinforcement learning. ArXiv, abs/2010.16011, 2020.
Williams [1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, May 1992. doi:10.1007/bf00992696. URL https://doi.org/10.1007/bf00992696.
Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
Rabe and Staats [2021] Markus N. Rabe and Charles Staats. Self-attention does not need $O(n^2)$ memory. ArXiv, abs/2112.05682, 2021.