Combinatorial Optimization with Policy Adaptation using Latent Space Search

Felix Chalumeau^∗
InstaDeep
[email protected]
Shikha Surana^∗
InstaDeep
[email protected]
Clément Bonnet
InstaDeep
[email protected]
Nathan Grinsztajn
InstaDeep
[email protected]
Arnu Pretorius
InstaDeep
[email protected]
Alexandre Laterre
InstaDeep
[email protected]
Thomas D. Barrett
InstaDeep
[email protected]

Abstract

Combinatorial Optimization underpins many real-world applications and yet, designing performant algorithms to solve these complex, typically NP-hard, problems remains a significant research challenge. Reinforcement Learning (RL) provides a versatile framework for designing heuristics across a broad spectrum of problem domains. However, despite notable progress, RL has not yet supplanted industrial solvers as the go-to solution. Current approaches emphasize pre-training heuristics that construct solutions but often rely on search procedures with limited variance, such as stochastically sampling numerous solutions from a single policy or employing computationally expensive fine-tuning of the policy on individual problem instances. Building on the intuition that performant search at inference time should be anticipated during pre-training, we propose COMPASS, a novel RL approach that parameterizes a distribution of diverse and specialized policies conditioned on a continuous latent space. We evaluate COMPASS across three canonical problems - Travelling Salesman, Capacitated Vehicle Routing, and Job-Shop Scheduling - and demonstrate that our search strategy (i) outperforms state-of-the-art approaches on 11 standard benchmarking tasks and (ii) generalizes better, surpassing all other approaches on a set of 18 procedurally transformed instance distributions.

^∗ Equal contribution


1 Introduction

Combinatorial Optimization (CO) has a wide range of real-world applications, from transportation (Contardo et al., 2012) and logistics (Laterre et al., 2018), to energy (Froger et al., 2016). Solving a CO problem consists of finding an ordering, labelling or subset of elements from a finite, discrete set that maximizes (or minimizes) a given objective function. As the number of feasible solutions typically grows exponentially with the problem size, CO problems are challenging (often NP-hard) to solve. As such, significant work goes into designing problem-specific heuristic approaches that, whilst not guaranteeing the optimal answer, can often work well in practice. Reinforcement Learning (RL) offers a domain-agnostic framework to learn heuristics and has been successfully applied across a range of CO tasks (Vinyals et al., 2015; Deudon et al., 2018; Mazyavkina et al., 2021).

Concretely, leading RL methods typically train a policy to incrementally construct a solution one element at a time. However, whilst most efforts have focused on improving the one-shot quality of these construction heuristics (Kool et al., 2019; Kwon et al., 2020; Grinsztajn et al., 2022), it intuitively appears impractical to reliably produce the optimal solution to NP-hard problems within a single construction attempt. Consequently, competitive performance has to rely on combining a pre-trained policy with an additional search procedure. Nevertheless, this crucial aspect is often implemented using simple procedures such as stochastic sampling (Kool et al., 2019; Kwon et al., 2020; Grinsztajn et al., 2022), beam search (Steinbiss et al., 1994) or Monte Carlo Tree Search (MCTS) (Browne et al., 2012). An alternative approach, representing the current state-of-the-art for search-based RL (Bello et al., 2016; Hottung et al., 2022), is to actively re-train the heuristic on each new problem instance; however, this comes with clear computational and practical limitations. Strikingly, none of these approaches pre-trains the policy in a way that could enable fast and efficient inference-time search; rather, current approaches typically decouple training and search entirely. The absence of an efficient search strategy is even more detrimental when the test instances are out-of-distribution (OOD) with respect to the instances used to train the policy, as this may cause a large difference between the learned policy and the policy leading to the optimal solution.

In this work, we aim to overcome the current limitations of search strategies used in RL when applied to CO problems. Our approach is to learn a latent space of diverse policies that can be explored at inference time in order to find the best-performing strategies for a given instance. This updates the current paradigm by enabling sampling from a policy space at inference time rather than constantly sampling the same policy (or set of policies) with stochasticity. We introduce COMPASS – COMbinatorial optimization with Policy Adaptation using Latent Space Search. COMPASS effectively creates an infinite set of diverse solvers by using a single conditioned policy and sampling the conditions from a continuous latent space. The training process encourages subareas of the latent space to specialize to sub-distributions of instances and this diversity is used at inference time to solve newly encountered instances.

We evaluate COMPASS on three popular CO problems: Travelling Salesman Problem (TSP), Capacitated Vehicle Routing Problem (CVRP) and Job-Shop Scheduling Problem (JSSP). After training on a distribution of fixed-sized instances for each problem, we evaluate our method on both in- and out-of-distribution test sets. We find that simple search strategies requiring no re-training provide both rapid and sustained improvement of the instance-specific policy, with COMPASS establishing a new state-of-the-art across all problems in this setting. Thanks to the diversity provided by its latent space, COMPASS achieves high performance even without a search budget and achieves comparable or better results than current leading few-shot methods.

Concretely, our work makes the following contributions: (i) We introduce COMPASS which leverages a latent space of diverse and specialized policies to effectively solve CO problems. (ii) We show that COMPASS allows for the efficient adaptation of instance-specific policies without re-training or sacrificing zero-shot performance. (iii) Experimentally, our approach is found to represent a new state-of-the-art for RL-based CO methods across all our considered problem types, achieving superior performance on all 29 tasks. (iv) We release fast and performant implementations of our method and its main competitors, written in JAX. We also provide all of our test sets including our procedurally transformed problem instances for easier comparison in future work.

2 Related work

Construction methods for CO

Construction approaches in RL for CO incrementally build a solution by selecting one element at a time. After Hopfield and Tank (1985) first applied neural networks to TSP, Bello et al. (2016) extended these efforts by proposing to learn heuristics with RL using a Pointer Network (Vinyals et al., 2015) combined with an actor-critic framework. This approach was extended by Deudon et al. (2018), who added an attention-based city encoder, and subsequently by Kool et al. (2019), who used a general transformer architecture (Vaswani et al., 2017). The transformer has since become the standard model for a range of CO problems and is also used in this work. Kim et al. (2022) build on Kool et al. (2019) by leveraging symmetries of routing problems during training. Even though the majority of these construction approaches have focused on routing problems, numerous works have also tackled other classes of CO problems, especially on graphs, like Maximum Cut (Dai et al., 2017; Barrett et al., 2020), or the Job Shop Scheduling Problem (JSSP), for which Zhang et al. (2020) proposed a Graph Neural Network (GNN) approach. A broader scope of (non-construction) approaches can be found in Appendix K.

Improving solutions at inference time

As it is unlikely that the first solution generated by a construction heuristic is optimal, a popular approach consists of sampling various trajectories during inference for the same CO problem. POMO (Kwon et al., 2020) uses one policy rolled out on several versions of the same problem, while considering different starting points or symmetries, to create diverse trajectories and select the best one. Choo et al. (2022) propose an efficient search guided by simulations, but it cannot take advantage of a large inference budget by itself. EAS (Hottung et al., 2022) builds on POMO by fine-tuning a subset of the model parameters at inference time using gradient descent. However, the new solutions are biased toward the underlying pre-trained policy and can easily get stuck in local optima. Instead, MDAM (Xin et al., 2021) and Poppy (Grinsztajn et al., 2022) employ a population of agents, all of which are simultaneously rolled out at inference time. MDAM trains these policies to select different initial actions, whereas Poppy utilizes a loss function designed to specialize each policy on specific subsets of the problem distribution. Despite demonstrating promising performance, these approaches are constrained by the number of policies used during training, which remains fixed. Such a limitation quickly diminishes the benefits of additional solution candidates sampled from the population. Our method COMPASS uses the same loss as Poppy, but, unlike Poppy and MDAM, COMPASS is not bound to a specific number of specialized policies. Moreover, its latent space makes it possible to add further search mechanisms over the policy space, ensuring better solutions over time. CVAE-Opt (Hottung et al., 2021), akin to our method, uses a latent space for solving routing problems, but differs from it in several respects. First, COMPASS is trained end-to-end with RL and hence does not require pre-solved instances. Second, CVAE-Opt requires training an additional recurrent encoder for (instance, solution) pairs, whereas COMPASS uses the latent space to encode a distribution of complementary policies and can be easily applied to pre-trained models. Overall, COMPASS significantly outperforms CVAE-Opt while having shorter runtime.

3 Methods

3.1 Preliminaries

Formulation

The goal of a CO problem is to find the optimal labeling of a set of discrete variables that satisfies the problem’s constraints. In RL, a CO problem can be formulated as a Markov Decision Process (MDP) defined by $M=(S,A,R,T,\gamma,H)$. This includes the state space $S$ with states $s_{t}\in S$, action space $A$ with actions $a_{t}\in A$, reward function $R:S\times A\rightarrow\mathbb{R}$, transition function $T(s_{t+1}|s_{t},a_{t})$, discount factor $\gamma\in[0,1]$, and horizon $H$, which denotes the episode duration. The state of a problem instance is represented as the (partial) trajectory or set of actions taken in the instance, and the next state $s_{t+1}$ is determined by applying the chosen action $a_{t}$ to the current state $s_{t}$. An agent is introduced in the MDP to interact with the CO problem and find solutions by learning a policy $\pi:S\rightarrow A$. The policy is trained to maximize the expected sum of discounted rewards to find the optimal solution, and this is formalized as the following learning objective: $\pi^{*}=\underset{\pi}{\mathrm{argmax}}\ \mathbb{E}[\sum_{t=0}^{H}\gamma^{t}R(s_{t},a_{t})]$.
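To make the formulation concrete, the following is a minimal sketch of TSP viewed as a construction MDP. It rests on assumptions that are not taken from the text: the reward is sparse and equal to the negative tour length at the end of the episode, $\gamma=1$, and `nearest_neighbour_policy` is a hand-written stand-in for a learned policy. All function names are illustrative.

```python
import math
import random


def tour_length(coords, tour):
    """Length of the closed tour that visits `coords` in the order given by `tour`."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def rollout(coords, policy, start=0):
    """Run one episode: the state is the partial tour, an action appends one city."""
    tour = [start]
    unvisited = set(range(len(coords))) - {start}
    while unvisited:
        action = policy(coords, tour, unvisited)  # a_t chosen from state s_t
        tour.append(action)                       # s_{t+1} is s_t extended by a_t
        unvisited.remove(action)
    # ASSUMPTION: sparse terminal reward (negative tour length), gamma = 1.
    return tour, -tour_length(coords, tour)


def nearest_neighbour_policy(coords, tour, unvisited):
    """Hand-written stand-in for a learned policy pi: S -> A."""
    last = coords[tour[-1]]
    return min(unvisited, key=lambda c: math.dist(last, coords[c]))


rng = random.Random(0)
cities = [(rng.random(), rng.random()) for _ in range(20)]
tour, reward = rollout(cities, nearest_neighbour_policy)
```

Because the episode ends once every city is visited, the horizon $H$ here equals the number of cities minus one, and maximizing the return is the same as minimizing the tour length.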

3.2 COMPASS

Recall our intuition that no single policy will reliably be able to solve all instances of an NP-hard CO problem in a single inference pass. Two primary approaches to address this are the inclusion of inference time search and the deployment of a diverse set of policies to increase the chance of a near-optimal strategy being deployed. This work aims to unify and extend these approaches by training an infinitely large set of diverse and specialized policies that can subsequently be searched at inference time.

To achieve this, we propose that a single set of policy parameters condition not just on the current observation, but also on samples drawn from a continuous latent space. The training objective then encourages this latent space of policies to be diverse (generate a wide range of behaviors) and specialized (these behaviors are optimized for different types of problem instances from the training distribution). This latent space can then be efficiently searched during inference to find the most performant policy for a given problem instance. In this section, we describe in detail the realization of this approach, which we call COMPASS (COMbinatorial optimization with Policy Adaptation using Latent Space Search). In Fig. 1, we provide an illustrated overview of COMPASS.

Our approach offers several key advantages over traditional techniques. Compared to methods that directly train multiple, uniquely parameterized policies (Xin et al., 2021; Grinsztajn et al., 2022), training a single conditional policy can, in principle, provide a continuous distribution of an infinite number of policies. Moreover, our approach mitigates the significant training and memory overheads associated with training a population of agents. Compared to methods that rely on brute-force sampling (Kool et al., 2019; Kwon et al., 2020; Grinsztajn et al., 2022) or expensive fine-tuning (Hottung et al., 2022), our training process produces a structured latent space (where similar policies are found near to each other) that permits principled search during inference.

Figure 1: Our method COMPASS is composed of the following two phases. A. Training - the latent space is sampled to generate vectors that the policy can condition upon. The conditioned policies are then evaluated and only the best one is trained to create specialization within the latent space. B. Inference - at inference time the latent space is searched through an evolution strategy to exploit regions with high-performing policies for each instance.

Latent space

The latent space defines the set of policies that our model can condition itself upon. Importantly, we do not learn the distribution of this space, but rather select a prior distribution over the space from which we sample during training. In practice, we use a latent space with 16 dimensions bounded between -1 and 1, and use a uniform sampling prior.

Architecture

COMPASS is agnostic to the network architecture used, so long as the resulting policy is, in some way, conditioned on the vector sampled from the latent space. This can be achieved in numerous ways, from directly concatenating the vector to the input observation to conditioning keys, queries, and values in the self-attention models commonly used for CO. We refer to Appendix D for further details about the architectures used in this work and how the latent vector is used to condition them. Whilst it is possible to train COMPASS from scratch, we found that it was simple and efficient to adapt pre-trained single-policy models. To do this, we zero-initialize any additional weights corresponding to the sample latent vector such that it has no impact at the start of training. In practice, we adapt a single-agent architecture designed for few-shot inference in all of our problem settings; POMO (Kwon et al., 2020) for TSP and CVRP, and a similar architecture taken from Jumanji (Bonnet et al., 2023) for JSSP (we also considered the current SOTA model L2D (Zhang et al., 2020), however, we found that the model from Jumanji already outperformed this approach). Full network details can be found in Appendices D.1 (TSP & CVRP) and D.2 (JSSP).
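The zero-initialization argument can be checked on a toy model. The sketch below assumes the simplest form of conditioning, an additive linear contribution of the latent vector to the policy logits; the architectures actually used are described in Appendix D and may differ. The names `sample_latent`, `linear` and `conditioned_logits` are hypothetical.

```python
import random

LATENT_DIM = 16  # latent dimension reported in the text


def sample_latent(rng, dim=LATENT_DIM):
    """Draw one latent vector from the uniform prior over [-1, 1]^dim."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def linear(weights, x):
    """Dense layer without bias; `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def conditioned_logits(w_obs, w_latent, obs, z):
    """Logits from the observation plus an additive contribution from z (ASSUMPTION)."""
    base = linear(w_obs, obs)
    extra = linear(w_latent, z)
    return [b + e for b, e in zip(base, extra)]


rng = random.Random(0)
obs_dim, n_actions = 4, 3
# Stand-in for pre-trained single-policy weights.
w_obs = [[rng.gauss(0, 1) for _ in range(obs_dim)] for _ in range(n_actions)]
# New weights attached to the latent vector, initialized to zero.
w_latent = [[0.0] * LATENT_DIM for _ in range(n_actions)]

obs = [rng.random() for _ in range(obs_dim)]
z = sample_latent(rng)
unchanged = conditioned_logits(w_obs, w_latent, obs, z) == linear(w_obs, obs)
```

With the latent weights at zero, every latent vector yields the pre-trained policy's logits, so training starts from the pre-trained behavior and the latent vector only gains influence as those weights are updated.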

Training

The training procedure aims to specialize subareas of the latent space to sub-distributions of problems by training the policy solely on latent vectors that achieve the best performance on a given problem. At each training step, we uniformly sample a set of $N$ vectors from the latent space and condition the policy on each vector resulting in $N$ conditioned policies. After evaluating each policy on the problem instance, we train the best policy (i.e. the policy conditioned on the best-performing latent vector) on the instance. The model is updated using the gradient of our objective as given by

$$\nabla_{\theta}J_{\text{compass}}=\mathbb{E}_{\rho\sim\mathcal{D}}\,\mathbb{E}_{z_{1},\dots,z_{N}\sim\mathcal{P}_{z}}\,\mathbb{E}_{\tau_{i}\sim\pi_{\theta}(\cdot|z_{i})}\left[\nabla_{\theta}\log\pi_{\theta}(\tau_{i^{\star}}|z_{i^{\star}})\,(R_{i^{\star}}-\mathcal{B}_{\rho,\theta})\right],\qquad(1)$$

where $\mathcal{D}$ is the data distribution, $\mathcal{P}_{z}$ the latent space, $z_{i}$ a latent vector, $\pi_{\theta}$ the conditioned policy, $\tau_{i}$ the trajectory generated by policy $\pi_{\theta}$ conditioned on vector $z_{i}$, with corresponding reward $R_{i}=R(\tau_{i})$, $i^{\star}=\arg\max_{i\in[1,N]}R(\tau_{i})$ the index of the best-performing latent vector in the sampled set, and $\mathcal{B}_{\rho,\theta}$ the baseline, inspired by Kwon et al. (2020). Full details of the algorithmic procedure can be found in Appendix F. Notably, our work is the first to create a specialized and diverse set of policies represented by a continuous latent space by only training the best-performing vector for each problem instance.
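For a single instance, the winner-takes-all structure of Eq. (1) can be written as a surrogate loss whose gradient with respect to $\theta$ is the bracketed term. The sketch below is illustrative only: the baseline is taken to be the mean reward over the $N$ sampled latent vectors, which is an assumption (the exact baseline is specified in Appendix F), and in an autodiff framework `log_probs` would be differentiable outputs of the policy rather than plain floats.

```python
def compass_surrogate_loss(rewards, log_probs):
    """Surrogate loss for one instance, following the structure of Eq. (1).

    rewards[i]   : return R_i of the trajectory sampled with latent vector z_i
    log_probs[i] : log pi_theta(tau_i | z_i) of that trajectory
    Returns (loss, index of the best latent vector).
    """
    best = max(range(len(rewards)), key=lambda i: rewards[i])  # i* = argmax_i R_i
    baseline = sum(rewards) / len(rewards)  # ASSUMPTION: mean over the N samples
    advantage = rewards[best] - baseline    # treated as a constant w.r.t. theta
    # Only the trajectory of the winning latent vector contributes to the loss.
    return -advantage * log_probs[best], best


# Three latent vectors evaluated on one instance (rewards are negative tour lengths).
loss, best_index = compass_surrogate_loss(
    rewards=[-8.1, -7.9, -8.4],
    log_probs=[-12.0, -11.0, -13.0],
)
```

Minimizing this loss increases the log-probability of the best trajectory under its own latent vector, while the other $N-1$ latent vectors receive no gradient for this instance; this is what drives different regions of the latent space to specialize.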

A key training hyperparameter is the number of condition vectors sampled during evaluation. Sampling more conditioned policies increases the certainty that the best-performing vector in the sampled set is close to the best-performing vector in the whole latent space. Therefore, increasing the number of sampled conditions increases the likelihood of training the true best latent vector for the given problem instance, rather than a potentially suboptimal vector. More details (including training times and environment steps) are reported in Appendix F.
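A back-of-the-envelope calculation illustrates this effect. Suppose, purely as an assumption for illustration, that the latent vectors that are near-optimal for a given instance occupy a fraction $q$ of the latent space. Under the uniform prior, the probability that at least one of $N$ independent samples lands in that region is $1-(1-q)^{N}$.

```python
def prob_hit_top_fraction(n_samples, q):
    """Probability that at least one of n_samples uniform draws falls in a region
    covering a fraction q of the latent space."""
    return 1.0 - (1.0 - q) ** n_samples


# How coverage of a hypothetical 5% region grows with the number of sampled vectors.
coverage = {n: prob_hit_top_fraction(n, 0.05) for n in (1, 4, 16, 64)}
```

The probability grows quickly at first and then saturates, which is consistent with treating the number of sampled latent vectors as a trade-off between per-step compute and the quality of the selected vector.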

Inference-time search

Given the latent space of diverse, specialized policies obtained by training COMPASS, at inference time, we apply a principled search procedure to find the most performant strategies. Our desired properties for a search procedure are that it should be simple, capable of rapid adaptation and robust to local optima. As such, evolutionary strategies are an appropriate approach. Specifically, we use the Covariance Matrix Adaptation Evolution Strategy (CMA-ES; Hansen and Ostermeier, 2001). CMA-ES uses a multivariate normal distribution to sample vectors and iteratively updates the distribution’s mean to increase the expected performance of sampled vectors (i.e. the quality of the solution found by the policy corresponding to each vector). The covariance is also adapted over time, either for exploration (high values, broad sampling) or exploitation (small values, focused sampling).
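The loop described above (sample from a Gaussian, evaluate, move the mean, adapt the spread) can be sketched with a much simpler evolution strategy. The code below is not CMA-ES: it is a cross-entropy-style stand-in with a diagonal covariance that is re-estimated from the elite samples, chosen only because it fits in a few lines. The `fitness` callable stands for "roll out the policy conditioned on $z$ and return the reward"; here it is replaced by a toy quadratic, and all names and default values are illustrative.

```python
import random


def simple_es(fitness, mean, sigma=0.5, pop_size=16, n_elite=4, n_iters=30, seed=0):
    """Simplified Gaussian evolution strategy (a stand-in, NOT full CMA-ES)."""
    rng = random.Random(seed)
    dim = len(mean)
    mean = list(mean)
    std = [sigma] * dim
    best_z, best_f = list(mean), fitness(mean)
    for _ in range(n_iters):
        # Sample a population around the current mean, clipped to the latent bounds.
        pop = [
            [min(1.0, max(-1.0, rng.gauss(m, s))) for m, s in zip(mean, std)]
            for _ in range(pop_size)
        ]
        elite = sorted(pop, key=fitness, reverse=True)[:n_elite]
        top_f = fitness(elite[0])
        if top_f > best_f:
            best_z, best_f = elite[0], top_f
        # Move the mean toward the elites and re-estimate the per-dimension spread.
        mean = [sum(z[d] for z in elite) / n_elite for d in range(dim)]
        std = [
            max(1e-3, (sum((z[d] - mean[d]) ** 2 for z in elite) / n_elite) ** 0.5)
            for d in range(dim)
        ]
    return best_z, best_f


# Toy fitness standing in for "reward of the policy conditioned on z".
target = [0.3, -0.2, 0.5, 0.0]


def toy_fitness(z):
    return -sum((a - b) ** 2 for a, b in zip(z, target))


start = [0.0, 0.0, 0.0, 0.0]
best_z, best_f = simple_es(toy_fitness, start)
```

A large spread early on corresponds to the exploration regime and a shrinking spread to exploitation; CMA-ES additionally adapts a full covariance matrix and step size, which this sketch omits.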

For a given problem instance, there may be multiple high-performance policies; we therefore use several independent CMA-ES components in parallel. To ensure that those components explore distinct areas of the space (or at least, take different paths), we compute a Voronoi Tessellation (Du et al., 1999) of the latent space and use the corresponding centroids to initialize the means of the CMA-ES components. This method proves to be robust, easy to tune, and fast, and requires a low memory and computation budget, making it well suited to efficient adaptation at inference time. In our experimental section (4.3), we present an analysis of our latent space and how it is explored by CMA-ES. Details and considered alternatives can be found in Appendix E.4.
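One common way to approximate the centroids of a centroidal Voronoi tessellation is Lloyd's algorithm applied to points sampled uniformly from the space. The sketch below uses that approach as an assumption; the text does not state how the tessellation is computed, and the function name and defaults are illustrative. A 2D space is used to keep the example fast.

```python
import random


def cvt_centroids(n_centroids, dim, n_points=1000, n_iters=20, seed=0):
    """Approximate centroidal Voronoi tessellation of [-1, 1]^dim via Lloyd's algorithm."""
    rng = random.Random(seed)
    points = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_points)]
    centroids = [list(p) for p in rng.sample(points, n_centroids)]
    for _ in range(n_iters):
        # Assign every sampled point to the cell of its nearest centroid.
        cells = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(n_centroids),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
            cells[nearest].append(p)
        # Move each centroid to the mean of its cell (keep it if the cell is empty).
        for k, cell in enumerate(cells):
            if cell:
                centroids[k] = [sum(col) / len(cell) for col in zip(*cell)]
    return centroids


# Initial means for four parallel search components in a 2D latent space.
initial_means = cvt_centroids(n_centroids=4, dim=2)
```

Each returned centroid would serve as the starting mean of one search component, so that the components begin in well-separated regions instead of all starting from the origin.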

4 Experiments

We evaluate our method on three problems – Travelling Salesman (TSP), Capacitated Vehicle Routing (CVRP), and Job Shop Scheduling (JSSP) – widely used to assess RL-based methods for CO (Deudon et al., 2018; Kool et al., 2019; Grinsztajn et al., 2022; Hottung et al., 2022). In Section 4.1, we evaluate COMPASS in the standard setting used by other methods from the literature and report results on each problem type. In Section 4.2, we assess the robustness of methods by evaluating them on instances of TSP and CVRP that are procedurally transformed using the approach developed by Bossek et al. (2019). In Section 4.3, we analyze the methods’ search strategies; in particular, we provide insights about COMPASS’ latent space and how it is navigated by CMA-ES at inference time. Figure 2 provides a radar plot overview of our aggregated experimental results across six performance categories of interest: (1) in distribution instances, OOD instances with different levels of distribution shift (2) small, (3) medium and (4) large, (5) large instance sizes as well as (6) few-shot performance. Our results highlight the strengths and weaknesses of each approach and in particular, the versatility and superiority of COMPASS.

Baselines

We compare COMPASS to a suite of leading RL methods and industrial solvers. Across all problems we provide baselines for EAS (Hottung et al., 2022), the current SOTA active-search RL method that fine-tunes the policy on each problem instance, and Poppy (Grinsztajn et al., 2022), the current SOTA active-search RL method that stochastically samples from a fixed population of pre-trained solvers. For routing problems (TSP and CVRP), we also provide results for POMO (Kwon et al., 2020), the leading single-agent, one-shot architecture on which EAS and Poppy are built, and LKH (Helsgaun, 2017), a leading industrial solver. We also report results of the TSP-specific industrial solver Concorde (Applegate et al., 2006). For JSSP, we provide results for L2D (Zhang et al., 2020), the leading single-agent, one-shot architecture. We also provide results for the attention-based model proposed in Jumanji (Bonnet et al., 2023), which proved to outperform L2D. Finally, we report the results of Google OR-Tools (Perron and Furnon, 2019), the reference industrial solver for JSSP.

Training

As our method is capable of adopting initial parameters from a pre-trained model, we re-use publicly available checkpoints of POMO (details in Appendix H) as the starting point for COMPASS on TSP and CVRP. For JSSP, we found that the attention-based model from Bonnet et al. (2023) outperforms L2D and hence chose it as the reference single-agent architecture. We train this model and use the same trained checkpoints for all methods. We then train COMPASS until convergence on the same training distribution as that used to train the initial checkpoint. For TSP and CVRP, these are problem instances with 100 locations uniformly sampled within a unit square. For JSSP, we use the same training distribution as EAS, consisting of instances with 10 jobs and 10 machines, and a maximum possible duration of 98. A single set of hyperparameters is used across all problems, with full training details provided in Appendix G.

Inference

When evaluating active-search performance, each method is given a fixed budget of 1600 attempts, similar to Hottung et al. (2022) and Grinsztajn et al. (2022), where each attempt consists of one trajectory per possible starting point. This approach is used to enable direct comparison to POMO and EAS, which use rollouts from all starting points at each step. For the main results on TSP and CVRP, we do not use the “augmentation trick”, where the same problem is solved multiple times by rotating the coordinate frame to make it appear different and thus generate additional diverse trajectories. This trick was used in a few baselines from prior work; however, we refrain from using it in the main results of this work for two reasons: (1) it is a domain-specific trick mainly applicable to routing problems and (2) it significantly increases the required computational budget. We nevertheless provide some results in both settings to ease comparison with previous work. Overall, the trajectory budget is exactly the same as the one used in Grinsztajn et al. (2022) and Hottung et al. (2022). Note that expressing the budget in terms of trajectories gives an advantage to EAS, which uses more time, memory, and computation due to the backpropagation used to update the policy during the search.
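As a rough worked example of what this budget means in trajectories, the snippet below assumes one starting point per node on a 100-node routing instance, as in POMO-style rollouts. That count is an assumption for illustration; the number of starting points depends on the problem and instance size.

```python
def trajectory_budget(n_attempts, n_starting_points):
    """Total trajectories when each attempt rolls out once per starting point."""
    return n_attempts * n_starting_points


# ASSUMPTION: 100 starting points, i.e. one per node of a 100-node instance.
total_trajectories = trajectory_budget(n_attempts=1600, n_starting_points=100)
```

Under that assumption, the 1600 attempts correspond to 160,000 sampled trajectories per instance, and the count scales linearly with the number of starting points on larger instances.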

Code availability

We release the code used to train our method and to run all baselines (code, checkpoints and evaluation sets are available at https://github.com/instadeepai/compass). We also make our checkpoints available for all three problems, along with the datasets necessary to reproduce the results. To ensure fair comparison and extend our evaluation to new settings, we reimplemented all baselines within the same codebase. For the three problems, we used the JAX (Bradbury et al., 2018) implementations from Jumanji (Bonnet et al., 2023) to leverage hardware accelerators (e.g. TPU). Our code is optimized for TPU v3-8, which is the hardware used for our experiments.

4.1 Standard benchmarking on TSP, CVRP, and JSSP

We evaluate our method on benchmark sets frequently used in the literature (Kool et al., 2019; Kwon et al., 2020; Grinsztajn et al., 2022; Hottung et al., 2022). Specifically, for TSP and CVRP, we use datasets of $10\,000$ instances drawn from the training distribution, with the positions of $100$ cities/customers uniformly sampled within the unit square, and three datasets not seen during training, each containing $1000$ problem instances but with larger sizes: $125$ , $150$ and $200$ , also generated from a uniform distribution over the unit square. We use the exact same datasets as in the literature.

Table 1: Results of COMPASS against the baseline algorithms for (a) TSP, (b) CVRP, and (c) JSSP problems. The methods are evaluated on instances from training distribution as well as on larger instance sizes to test generalization.

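(a) TSP (n=100: training distribution; n=125, 150, 200: generalization)

| Method | n=100 Obj. | n=100 Gap | n=100 Time | n=125 Obj. | n=125 Gap | n=125 Time | n=150 Obj. | n=150 Gap | n=150 Time | n=200 Obj. | n=200 Gap | n=200 Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Concorde | 7.765 | 0.000% | 82M | 8.583 | 0.000% | 12M | 9.346 | 0.000% | 17M | 10.687 | 0.000% | 31M |
| LKH3 | 7.765 | 0.000% | | 8.583 | 0.000% | 73M | 9.346 | 0.000% | 99M | 10.687 | 0.000% | |
| POMO (greedy) | 7.796 | 0.404% | | 8.635 | 0.607% | | 9.440 | 1.001% | | 10.933 | 2.300% | |
| POMO (sampling) | 7.779 | 0.185% | | 8.609 | 0.299% | | 9.401 | 0.585% | | 10.956 | 2.513% | |
| Poppy 16 | 7.766 | 0.013% | | 8.587 | 0.050% | | 9.359 | 0.141% | | 10.795 | 1.007% | |
| EAS | 7.778 | 0.161% | | 8.604 | 0.238% | | 9.380 | 0.363% | | 10.759 | 0.672% | |
| COMPASS (ours) | 7.765 | 0.002% | | 8.586 | 0.036% | | 9.354 | 0.083% | | 10.724 | 0.348% | |

Time entries for the RL methods are only partially available and are not aligned to individual rows: n=100: 37S; n=125: 20M, 38M, 20M; n=150: 10S, 32M; n=200: 21S, 70M, 101M, 70M.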

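(b) CVRP (n=100: training distribution; n=125, 150, 200: generalization)

| Method | n=100 Obj. | n=100 Gap | n=100 Time | n=125 Obj. | n=125 Gap | n=125 Time | n=150 Obj. | n=150 Gap | n=150 Time | n=200 Obj. | n=200 Gap | n=200 Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LKH3 | 15.65 | 0.000% | | 17.50 | 0.000% | | 19.22 | 0.000% | | 22.00 | 0.000% | |
| POMO (greedy) | 15.874 | 1.430% | | 17.818 | 1.818% | <1M | 19.750 | 2.757% | | 23.318 | 5.992% | |
| POMO (sampling) | 15.713 | 0.399% | | 17.612 | 0.642% | 43M | 19.488 | 1.393% | | 23.378 | 6.264% | |
| Poppy 32 | 15.663 | 0.084% | | 17.548 | 0.276% | 42M | 19.421 | 1.044% | | 23.352 | 6.144% | |
| EAS | | 0.081% | | 17.536 | 0.146% | 81M | 19.321 | 0.528% | | 22.541 | 2.460% | |
| COMPASS (ours) | 15.594 | -0.361% | | 17.511 | 0.064% | 42M | 19.313 | 0.485% | | 22.462 | 2.098% | |

The only other Time entry available is 100M for n=200, which is not aligned to an individual row.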

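(c) JSSP (10×10: training distribution; 15×15, 20×15: generalization)

| Method | 10×10 Obj. | 10×10 Gap | 10×10 Time | 15×15 Obj. | 15×15 Gap | 15×15 Time | 20×15 Obj. | 20×15 Gap | 20×15 Time |
|---|---|---|---|---|---|---|---|---|---|
| OR-Tools | 807.6 | 0.0% | 37S | 1188.0 | 0.0% | | 1345.5 | 0.0% | 80H |
| L2D (sampling) | 871.7 | 8.0% | | 1378.3 | 16.0% | | 1624.6 | 20.8% | |
| Single | 862.1 | 6.7% | | 1302.6 | 9.6% | | 1503.0 | 11.7% | |
| Poppy 16 | 849.7 | 5.2% | | 1290.4 | 8.6% | | 1495.7 | 11.2% | |
| EAS | 858.4 | 6.3% | | 1295.2 | 9.0% | | 1498.0 | 11.3% | |
| COMPASS (ours) | 845.5 | 4.7% | | 1282.8 | 8.0% | | 1485.6 | 10.4% | |

Time entries for the RL methods are only partially available and are not aligned to individual rows: 15×15: 25H; 20×15: 40H, 11H.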

Results

The average performance of each method across all problem settings is presented in Table 1. We find that COMPASS demonstrates superior performance on all of the 11 test sets considered. Moreover, the degree of improvement is significant across all problem types. On TSP and JSSP, COMPASS reduces the optimality gap on the training distribution by a factor of 6.5 and 1.3, respectively. On CVRP, COMPASS is the only RL method able to outperform the industrial solver LKH. Finally, COMPASS is also found to generalize well to larger problem instances unseen during training. COMPASS obtains the best solutions in all TSP, CVRP and JSSP sets.
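For readers who want to relate the objective values and gaps in Table 1, the sketch below uses the conventional relative-gap definition with respect to the reference solver. This definition is an assumption, since the text does not spell it out, and recomputing gaps from the rounded mean objectives only approximately reproduces the reported numbers, which are presumably averaged per instance.

```python
def optimality_gap(objective, reference):
    """Relative gap in percent for a minimization problem.

    A negative value means the method beats the reference solver.
    """
    return 100.0 * (objective - reference) / reference


# JSSP 10x10: COMPASS (845.5) against OR-Tools (807.6); Table 1 reports 4.7%.
jssp_gap = optimality_gap(845.5, 807.6)

# CVRP n=100: COMPASS (15.594) against LKH3 (15.65); Table 1 reports -0.361%.
cvrp_gap = optimality_gap(15.594, 15.65)

# TSP n=100: gap reduction of COMPASS (0.002%) relative to Poppy 16 (0.013%).
tsp_gap_reduction = 0.013 / 0.002
```

The negative CVRP value corresponds to COMPASS finding shorter routes on average than LKH3 on that test set, and the TSP ratio matches the factor of 6.5 quoted above.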

The same benchmark is also reported with the “augmentation trick” in Table 2 for TSP and CVRP. This trick can only be used for routing problems and is not applicable to JSSP. Interestingly, COMPASS is the only method that performs on par or better without the “augmentation trick”, showing its ability to adapt and find diversity in its latent space rather than through a problem-specific augmentation. In this setting, COMPASS is outperformed by EAS on two instance sizes of CVRP. Nevertheless, EAS is 50% slower and more computationally expensive as it requires updating an entire subset of its network’s weights, as opposed to simply navigating the 16-dimensional latent space of policies as is done in COMPASS. Overall, COMPASS remains the leading method and the conclusions drawn above remain unchanged.

4.2 Robustness to generalization: solving mutated instances

To further study the generalization ability of our method, we consider the mutation operators introduced by Bossek et al. (2019) to procedurally transform instances drawn from the training distribution. By progressively increasing the power of the applied mutations we construct new datasets that are increasingly far from the training distribution whilst not modifying the overall size of the problem.

We use 9 different mutation operators (explosion, implosion, cluster, rotation, linear projection, axis projection, expansion, compression and grid). One can find an illustration of the entire set of mutations along with their mathematical definition in Appendix C. Interestingly, it enables us to evaluate the methods on instances that look closer to real-life situations. For instance, the operator that gathers nodes in a cluster can mimic a dense city surrounded by its nearby suburbs. In practice, each mutation operator is parameterized by a factor that controls the probability of mutating each node of the instance - referred to as mutation power - this factor directly impacts the shift between the training distribution and the new distribution. We use 10 values, going from 0 (no change) to 0.9 (highly mutated instances).
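The role of the mutation power can be illustrated with a toy operator. The function below is a hypothetical stand-in loosely modelled on the cluster operator; it does not reproduce the definitions of Bossek et al. (2019) or Appendix C. The Gaussian `spread`, the choice of the cluster centre and the clipping to the unit square are all assumptions made for illustration.

```python
import random


def cluster_mutation(coords, power, spread=0.05, seed=0):
    """Toy cluster-style mutation (hypothetical, NOT the operator from Appendix C).

    Each node is moved with probability `power` to a Gaussian-distributed
    position around one randomly chosen cluster centre.
    """
    rng = random.Random(seed)
    centre = (rng.random(), rng.random())
    mutated = []
    for x, y in coords:
        if rng.random() < power:  # the mutation power is a per-node probability
            x = min(1.0, max(0.0, rng.gauss(centre[0], spread)))
            y = min(1.0, max(0.0, rng.gauss(centre[1], spread)))
        mutated.append((x, y))
    return mutated


rng = random.Random(1)
instance = [(rng.random(), rng.random()) for _ in range(100)]

# The ten mutation powers used in the text: 0.0, 0.1, ..., 0.9.
powers = [i / 10 for i in range(10)]
datasets = {p: cluster_mutation(instance, p) for p in powers}
```

At power 0 the instance is returned unchanged, and as the power grows an increasing share of nodes collapses into the cluster, moving the instance further from the uniform training distribution while keeping the number of nodes fixed.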

Results

We plot the relative performance of the baselines compared to COMPASS in Fig. 3. A negative performance ratio indicates that a method does not provide as good a solution as COMPASS, and we observe that this is the case for all baseline methods, at all mutation strengths, on both TSP and CVRP. Moreover, COMPASS is seen to generalize significantly better than the methods that only rely on stochastic sampling for their search, namely POMO and Poppy. This validates our intuition that adaptive policies are especially important for handling out-of-distribution data, where the optimal policy may be significantly different to that needed during pre-training. Even compared to EAS, which fine-tunes the policy to the target problem instance, we find that COMPASS maintains a significant performance gap across all mutation strengths. This result is particularly noteworthy as our approach only modifies 16 parameters (the conditioning vector sampled from our latent space), compared to EAS, which updates more than $10^{4}$ parameters (the embeddings of the instance’s nodes).

It is interesting to note that the relative generalization performance of COMPASS compared to EAS is stronger on these mutated instances than on the larger instances considered in Section 4.1. We hypothesize that this is because EAS actively fine-tunes the embeddings of every location in a given problem instance. Therefore, as the problem size increases, so does the number of free parameters available to adapt the policy (albeit with commensurately increasing computational overhead). This suggests that further improvements to COMPASS could be possible by increasing the number of adapted parameters (i.e. the latent space dimension); however, we defer further investigation to future work.

4.3 Analysis of the search strategies

In this section, we analyze the structure of the latent space and the behavior of the search procedure both empirically and visually.

Figure 4 details the performance of our considered methods as a function of the overall search budget. The left panel reports the quality of the best solution found so far (i.e. the cumulative performance), whereas the right plot reports the mean and standard deviations of the latest batch of solutions (i.e. the current performance) during the search process. From this, we would highlight three main conclusions. (i) Adaptive methods (COMPASS, EAS) perform well as they are able to improve the mean performance of their solution over time, in contrast to stochastic sampling methods (Poppy, POMO). This also highlights that the latent space of COMPASS has been able to diversify and can be exploited. (ii) Highly-focused (low-variance) search does not always outperform stochastic exploration. Concretely, whilst EAS quickly adapts a policy with better average performance than Poppy (right panel), the additional variance of Poppy’s multiple diverse policies means it produces better overall solutions (left panel). (iii) COMPASS is able to combine both of the previously discussed aspects for a highly performant search procedure. By using an adaptive covariance mechanism as well as its multiple components to navigate several regions of the latent policy space, it focuses its search on promising strategies (better average performance) whilst maintaining a broad beam (higher variance).
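The two quantities reported in Figure 4 can be illustrated with a short sketch. It assumes a minimisation objective (e.g. tour length) and a search that produces solutions in batches; the function name `track_search` and the toy numbers are illustrative and not taken from the paper's code.

```python
import statistics
from typing import Iterable, List, Tuple


def track_search(batches: Iterable[List[float]]) -> List[Tuple[float, float, float]]:
    """For each batch of objective values, return (best so far, batch mean, batch std)."""
    best_so_far = float("inf")
    history = []
    for batch in batches:
        # Cumulative performance: best solution found since the start of the search.
        best_so_far = min(best_so_far, min(batch))
        # Current performance: statistics of the latest batch only.
        mean = statistics.fmean(batch)
        std = statistics.pstdev(batch)
        history.append((best_so_far, mean, std))
    return history


# Toy data, lower is better.
for row in track_search([[8.1, 7.9, 8.4], [8.0, 7.8, 8.3], [7.9, 7.85, 8.0]]):
    print(row)
```

A method can improve its batch mean while its best-so-far value stagnates if the batch variance collapses, which is the trade-off discussed in point (ii).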

To better understand how COMPASS’s latent space is structured and explored, Fig. 5 presents the trajectory of a single CMA-ES component during the search of a 2D latent space on a randomly chosen problem instance. We can first observe that, even for a specific problem instance, there are several high-performing areas of interest, which highlights the advantage of having multiple search components. Furthermore, it shows how the evolution strategy explores the space. The search variance is initially high to improve exploration until the search center moves into a high-performing area, whereupon the variance is gradually decreased to better exploit these promising strategies. We provide additional plots and explanation in Section E.2 for other problem instances, demonstrating how the spread of the specialised areas depends on the problem instance.

Lastly, it is worth noting that the adaptation mechanism of COMPASS (CMA-ES search) comes with negligible time cost (e.g. three orders of magnitude smaller than the time needed for the environment rollout), which is a strength compared to the costly backpropagation-based updates performed in EAS. We provide additional time analysis in Appendix L.
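The exploration pattern described above (broad sampling at first, then a shrinking search radius around a promising centre) can be sketched with a deliberately simplified evolution strategy. This is not CMA-ES: the covariance is isotropic and the step size simply decays, whereas CMA-ES adapts a full covariance matrix. The names `es_search` and `toy_score`, and all hyperparameters, are assumptions made for illustration; in COMPASS the score of a latent vector would come from rolling out the conditioned policy.

```python
import random
from typing import Callable, List, Sequence


def es_search(score: Callable[[Sequence[float]], float], dim: int = 2,
              iterations: int = 30, pop_size: int = 16, sigma: float = 0.5,
              seed: int = 0) -> List[float]:
    """Simplified evolution strategy over the latent box [-1, 1]^dim (higher score is better)."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    best, best_score = list(mean), score(mean)
    for _ in range(iterations):
        # Sample candidate latent vectors around the current search centre.
        pop = [[max(-1.0, min(1.0, m + sigma * rng.gauss(0.0, 1.0))) for m in mean]
               for _ in range(pop_size)]
        scored = sorted(((score(z), z) for z in pop), key=lambda t: t[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best = scored[0][0], scored[0][1]
        # Move the centre to the average of the top quarter of the candidates.
        elite = [z for _, z in scored[: max(1, pop_size // 4)]]
        mean = [sum(z[i] for z in elite) / len(elite) for i in range(dim)]
        sigma *= 0.95  # shrink the search radius to exploit the promising region
    return best


def toy_score(z: Sequence[float]) -> float:
    """Hypothetical stand-in for a rollout return, peaked at (0.6, -0.3)."""
    return -((z[0] - 0.6) ** 2 + (z[1] + 0.3) ** 2)


print(es_search(toy_score))
```

Running several such searches from different starting centres would correspond to the multiple components mentioned above.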

Table 2: Results of COMPASS and the baseline algorithms with instance augmentation for (a) TSP and (b) CVRP. We also report COMPASS with no augmentation (no aug.).

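(a) TSP. n=100 is the training distribution; n=125, 150 and 200 test generalization.

| Method | n=100 Obj. | n=100 Gap | n=125 Obj. | n=125 Gap | n=150 Obj. | n=150 Gap | n=200 Obj. | n=200 Gap |
|---|---|---|---|---|---|---|---|---|
| CVAE-Opt | - | 0.343% | 8.646 | 0.736% | 9.482 | 1.45% | - | - |
| SGBS | 7.769 | 0.058% | - | - | 9.367 | 0.220% | 10.753 | 0.619% |
| SGBS+EAS-Lay | 7.769 | 0.058% | - | - | 9.359 | 0.174% | 10.727 | 0.40% |
| POMO (sampling) | 7.767 | 0.026% | 8.594 | 0.128% | 9.376 | 0.321% | 10.916 | 2.14% |
| Poppy 16 | 7.765 | 0.002% | 8.584 | 0.009% | 9.351 | 0.141% | 10.802 | 1.08% |
| EAS | 7.768 | 0.038% | 8.590 | 0.080% | 9.361 | 0.159% | 10.730 | 0.403% |
| COMPASS (aug.) | 7.765 | 0.002% | 8.584 | 0.009% | 9.350 | 0.043% | 10.723 | 0.337% |
| COMPASS (no aug.) | 7.765 | 0.002% | 8.586 | 0.036% | 9.354 | 0.083% | 10.724 | 0.348% |

Time entries as listed per column (the per-method assignment is not recoverable from the source layout): n=125: 21H, 20M, 31M, 20M; n=150: 30H, 32M, 50M, 32M; n=200: 14M, 70M, 85M, 70M.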

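(b) CVRP. n=100 is the training distribution; n=125, 150 and 200 test generalization.

| Method | n=100 Obj. | n=100 Gap | n=125 Obj. | n=125 Gap | n=150 Obj. | n=150 Gap | n=200 Obj. | n=200 Gap |
|---|---|---|---|---|---|---|---|---|
| CVAE-Opt | - | 1.36% | 17.87 | 2.08% | 19.84 | 3.24% | - | - |
| SGBS | 15.66 | 0.01% | - | - | 19.43 | 1.08% | 22.57 | 2.59% |
| SGBS+EAS-Lay | 15.594 | -0.36% | - | - | 19.168 | -0.27% | 21.988 | -0.05% |
| POMO (sampling) | 15.67 | 0.18% | 17.56 | 0.33% | 19.43 | 1.08% | 23.24 | 5.64% |
| Poppy 32 | 15.62 | -0.14% | 17.49 | -0.10% | 19.32 | 0.50% | 22.94 | 4.27% |
| EAS | 15.623 | -0.175% | 17.473 | -0.153% | 19.261 | 0.213% | 22.556 | 2.527% |
| COMPASS (aug.) | 15.65 | -0.00% | 17.52 | 0.09% | 19.33 | 0.56% | 22.55 | 2.49% |
| COMPASS (no aug.) | 15.594 | -0.36% | 17.511 | 0.06% | 19.313 | 0.49% | 22.462 | 2.10% |

Time entries as listed per column (the per-method assignment is not recoverable from the source layout): n=100: 11D, 10M; n=125: 36H, 43M, 42M, 67M, 42M; n=150: 46H; n=200: 100M.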

5 Conclusion

We present COMPASS, a novel approach to solving CO problems using RL. Our approach is motivated by the observation that active search is a key component to finding high-quality solutions to NP-hard problems. Finding one-shot solutions that are near-optimal is believed to be impossible. Instead, COMPASS is trained to create a distribution of diverse and specialized policies, conditioned on a structured latent space. This diversification is achieved by using an objective that specializes areas of the space on sub-distributions of problem instances. By navigating this latent space at inference time COMPASS is able to find the most performant policy for a given instance. Empirical results show that COMPASS achieves state-of-the-art performance on all 11 standard benchmark tasks across three distinct CO problems, TSP, CVRP and JSSP, outperforming prior RL methods based on either stochastic sampling or fine-tuning. We extend the canonical evaluation sets with instances that are procedurally transformed using mutation operators introduced in prior work. This additional set of tasks enables us to assess the generalization ability of COMPASS. We show that COMPASS is particularly robust to out-of-distribution instances, achieving superior performance in all 18 tasks considered. To better understand the benefits of our search procedure, we provide an empirical analysis of the latent space’s structure, along with evidence of how it is explored at inference time. We show that, despite having no explicit regularization during training, the latent space exhibits clear regions of interest, and our search procedure is able to explore this space using an evolution strategy to produce high-performing policies. Overall, COMPASS proves to be performant, robust, and versatile on many types of CO problems and is able to provide solutions quickly at a reasonable computational cost.

Limitations and future work. The diversity of the policies contained in the latent space is closely linked to the specialisation that can be obtained from the training distribution, and is hence potentially limited. We would like to inspect whether a broader range of policies could be obtained by using an additional unsupervised diversity reward, or by procedurally diversifying the distribution used. Another limitation of our method is the lack of structure in our latent space. Although we showed empirically that it is structured enough to be successfully explored by an evolution strategy, we hypothesize that a better-defined space could be searched faster. We would like to inspect the use of regularization terms during the training phase to achieve this.

Acknowledgements

Research supported with Cloud TPUs from Google’s TPU Research Cloud (TRC). We thank anonymous reviewers for comments and helpful discussions that helped improve the paper.

References

Applegate et al. [2006] D. Applegate, R. Bixby, V. Chvatal, and W. Cook. Concorde TSP solver, 2006.
Barrett et al. [2020] T. D. Barrett, W. R. Clements, J. N. Foerster, and A. I. Lvovsky. Exploratory combinatorial optimization with reinforcement learning. In Proceedings of the 34th National Conference on Artificial Intelligence, AAAI, 2020.
Bello et al. [2016] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
Bonnet et al. [2023] C. Bonnet, D. Luo, D. Byrne, S. Abramowitz, V. Coyette, P. Duckworth, D. Furelos-Blanco, N. Grinsztajn, T. Kalloniatis, V. Le, O. Mahjoub, L. Midgley, S. Surana, C. Waters, and A. Laterre. Jumanji: a suite of diverse and challenging reinforcement learning environments in jax, 2023. URL https://github.com/instadeepai/jumanji.
Bossek et al. [2019] J. Bossek, P. Kerschke, A. Neumann, M. Wagner, F. Neumann, and H. Trautmann. Evolving diverse tsp instances by means of novel and creative mutation operators. In Proceedings of the 15th ACM/SIGEVO Conference on Foundations of Genetic Algorithms, FOGA ’19, page 58–71, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450362542. doi: 10.1145/3299904.3340307. URL https://doi.org/10.1145/3299904.3340307.
Bradbury et al. [2018] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
Browne et al. [2012] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012.
Chalumeau et al. [2023a] F. Chalumeau, R. Boige, B. Lim, V. Macé, M. Allard, A. Flajolet, A. Cully, and T. Pierrot. Neuroevolution is a competitive alternative to reinforcement learning for skill discovery. In International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=6BHlZgyPOZY.
Chalumeau et al. [2023b] F. Chalumeau, B. Lim, R. Boige, M. Allard, L. Grillotti, M. Flageat, V. Macé, A. Flajolet, T. Pierrot, and A. Cully. Qdax: A library for quality-diversity and population-based algorithms with hardware acceleration, 2023b.
Chen and Tian [2019] X. Chen and Y. Tian. Learning to perform local rewriting for combinatorial optimization. In Advances in Neural Information Processing Systems, 2019.
Choo et al. [2022] J. Choo, Y.-D. Kwon, J. Kim, J. Jae, A. Hottung, K. Tierney, and Y. Gwon. Simulation-guided beam search for neural combinatorial optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/2207.06190.
Contardo et al. [2012] C. Contardo, C. Morency, and L.-M. Rousseau. Balancing a dynamic public bike-sharing system, volume 4. 2012.
Cully [2021] A. Cully. Multi-emitter map-elites: Improving quality, diversity and data efficiency with heterogeneous sets of emitters. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’21, page 84–92, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383509. doi: 10.1145/3449639.3459326. URL https://doi.org/10.1145/3449639.3459326.
Dai et al. [2017] H. Dai, E. B. Khalil, Y. Zhang, B. Dilkina, and L. Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, 2017.
de O. da Costa et al. [2020] P. R. de O. da Costa, J. Rhuggenaath, Y. Zhang, and A. Akcay. Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning. In Asian Conference on Machine Learning, 2020.
Deudon et al. [2018] M. Deudon, P. Cournut, A. Lacoste, Y. Adulyasak, and L.-M. Rousseau. Learning heuristics for the tsp by policy gradient. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 170–181. Springer International Publishing, 2018.
Du et al. [1999] Q. Du, V. Faber, and M. Gunzburger. Centroidal voronoi tessellations: Applications and algorithms. SIAM Review, 41(4):637–676, 1999. doi: 10.1137/S0036144599352836. URL https://doi.org/10.1137/S0036144599352836.
Eysenbach et al. [2019] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2019.
Fontaine et al. [2020] M. C. Fontaine, J. Togelius, S. Nikolaidis, and A. K. Hoover. Covariance matrix adaptation for the rapid illumination of behavior space. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, GECCO ’20, page 94–102, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450371285. doi: 10.1145/3377930.3390232. URL https://doi.org/10.1145/3377930.3390232.
Froger et al. [2016] A. Froger, M. Gendreau, J. E. Mendoza, Éric Pinson, and L.-M. Rousseau. Maintenance scheduling in the electricity industry: A literature review. European Journal of Operational Research, 251(3):695–706, 2016. ISSN 0377-2217. doi: https://doi.org/10.1016/j.ejor.2015.08.045. URL https://www.sciencedirect.com/science/article/pii/S0377221715008012.
Grinsztajn et al. [2022] N. Grinsztajn, D. Furelos-Blanco, and T. D. Barrett. Population-based reinforcement learning for combinatorial optimization. arXiv preprint arXiv:2210.03475, 2022.
Hansen and Ostermeier [2001] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001. doi: 10.1162/106365601750190398.
Helsgaun [2017] K. Helsgaun. An extension of the lin-kernighan-helsgaun tsp solver for constrained traveling salesman and vehicle routing problems. Roskilde University, Tech. Rep., 2017.
Hopfield and Tank [1985] J. J. Hopfield and W. D. Tank. “neural” computation of decisions in optimization problems. Biological cybernetics, 52(3):141–152, 1985.
Hottung and Tierney [2020] A. Hottung and K. Tierney. Neural large neighborhood search for the capacitated vehicle routing problem. In 24th European Conference on Artificial Intelligence (ECAI 2020), 2020.
Hottung et al. [2021] A. Hottung, B. Bhandari, and K. Tierney. Learning a latent search space for routing problems using variational autoencoders. In International Conference on Learning Representations, 2021.
Hottung et al. [2022] A. Hottung, Y.-D. Kwon, and K. Tierney. Efficient active search for combinatorial optimization problems. In International Conference on Learning Representations, 2022.
Kim et al. [2021] M. Kim, J. Park, and J. Kim. Learning collaborative policies to solve np-hard routing problems. In Advances in Neural Information Processing Systems, 2021.
Kim et al. [2022] M. Kim, J. Park, and J. Park. Sym-nco: Leveraging symmetricity for neural combinatorial optimization. In Advances in Neural Information Processing Systems, 2022. doi: 10.48550/arXiv.2205.13209.
Kool et al. [2019] W. Kool, H. van Hoof, and M. Welling. Attention, learn to solve routing problems! In International Conference on Learning Representations, 2019.
Kwon et al. [2020] Y.-D. Kwon, J. Choo, B. Kim, I. Yoon, Y. Gwon, and S. Min. Pomo: Policy optimization with multiple optima for reinforcement learning. In Advances in Neural Information Processing Systems, 2020.
Laterre et al. [2018] A. Laterre, Y. Fu, M. K. Jabri, A.-S. Cohen, D. Kas, K. Hajjar, T. S. Dahl, A. Kerkeni, and K. Beguir. Ranked reward: Enabling self-play reinforcement learning for combinatorial optimization. arXiv preprint arXiv:1807.01672, 2018.
Lim et al. [2022] B. Lim, M. Allard, L. Grillotti, and A. Cully. Accelerated quality-diversity for robotics through massive parallelism. arXiv preprint arXiv:2202.01258, 2022.
Ma et al. [2021] Y. Ma, J. Li, Z. Cao, W. Song, L. Zhang, Z. Chen, and J. Tang. Learning to iteratively solve routing problems with dual-aspect collaborative transformer. In Advances in Neural Information Processing Systems, volume 34, pages 11096–11107, 2021.
Mazyavkina et al. [2021] N. Mazyavkina, S. Sviridov, S. Ivanov, and E. Burnaev. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134:105400, 2021. ISSN 0305-0548. doi: https://doi.org/10.1016/j.cor.2021.105400. URL https://www.sciencedirect.com/science/article/pii/S0305054821001660.
Perron and Furnon [2019] L. Perron and V. Furnon. OR-Tools, 2019. URL https://developers.google.com/optimization/.
Sharma et al. [2019] A. Sharma, S. Gu, S. Levine, V. Kumar, and K. Hausman. Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657, 2019.
Son et al. [2023] J. Son, M. Kim, H. Kim, and J. Park. Meta-SAGE: Scale meta-learning scheduled adaptation with guided exploration for mitigating scale shift on combinatorial optimization. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 32194–32210. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/son23a.html.
Steinbiss et al. [1994] V. Steinbiss, B.-H. Tran, and H. Ney. Improvements in beam search. In Third international conference on spoken language processing, 1994.
Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
Vinyals et al. [2015] O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, 2015.
Wu et al. [2022] Y. Wu, W. Song, Z. Cao, J. Zhang, and A. Lim. Learning improvement heuristics for solving routing problems. IEEE Transactions on Neural Networks and Learning Systems, 33(9):5057–5069, 2022.
Xin et al. [2021] L. Xin, W. Song, Z. Cao, and J. Zhang. Multi-decoder attention model with embedding glimpse for solving vehicle routing problems. In Proceedings of the 35th National Conference on Artificial Intelligence, AAAI, 2021.
Zhang et al. [2020] C. Zhang, W. Song, Z. Cao, J. Zhang, P. S. Tan, and C. Xu. Learning to dispatch for job shop scheduling via deep reinforcement learning. In Advances in Neural Information Processing Systems, 2020.

Appendix


Appendix A Extended Results

In this section, we compare the results of our method to an extensive list of baselines for the TSP, CVRP, and JSSP problems. Subsection A.1 focuses on the results of the standard benchmark, corresponding to instances from the training distribution as well as larger instances. Subsection A.2 presents the results for procedurally transformed (and hence out-of-distribution) instances.

A.1 Extended results on standard benchmark

In Section 4 of the main paper, we report the performances of the main competitors on a standard benchmark for TSP, CVRP, and JSSP. We report extended results, with other competitors (2-Opt-DL [Wu et al., 2022] and LIH [de O. da Costa et al., 2020]) and inference time. We also report the results of an alternative architecture (L2D) for JSSP. Tables 3(a), 3(b), and 3(c) show the results of all methods for TSP, CVRP, and JSSP problems, respectively. The first column in the tables shows the average performance across the validation set, which is the mean tour length for TSP and CVRP, and the mean schedule duration for JSSP, and the second and third columns show the optimality gap and total run-time, respectively.
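As a reading aid for the gap columns, the sketch below assumes that the optimality gap is the relative difference between a method's mean objective and that of the reference solver, for a minimisation problem. The function name is illustrative, and a gap recomputed from the rounded objectives in the tables can differ from the tabulated gap in the last digit.

```python
def optimality_gap(objective: float, reference: float) -> float:
    """Relative gap, in percent, of a minimisation objective to a reference value."""
    if reference <= 0:
        raise ValueError("reference objective must be positive")
    return 100.0 * (objective - reference) / reference


# Rounded objectives for TSP with n=200: COMPASS (10.724) against Concorde (10.687).
print(f"{optimality_gap(10.724, 10.687):.3f}%")

# A negative gap means the method found shorter tours than the reference solver did,
# as for COMPASS (15.594) against LKH3 (15.65) on CVRP with n=100.
print(f"{optimality_gap(15.594, 15.65):.3f}%")
```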

Examining the inference time provides valuable insights for comparison, especially considering that time constraints are often a crucial factor in industrial applications. On the whole benchmark, our method is as fast as the baselines (POMO, Poppy) and almost three times faster than EAS.

Results from Table 3(c) confirm our choice to use the attention-based model from Jumanji [Bonnet et al., 2023]: the attention-based model performs better on all sets, while being 5 times faster. We can also compare EAS used with both architectures (L2D or the attention-based model). EAS performs better with the attention-based architecture on 2 datasets out of 3. Interestingly, EAS (w/ L2D) performs very well on the training distribution, but generalizes less well than EAS (w/ attention-based model).

Table 3: Results of COMPASS against the baseline algorithms for (a) TSP, (b) CVRP, and (c) JSSP problems. The methods are evaluated on instances from training distribution as well as on larger instance sizes to test generalization. Tables report the best solutions found, gaps to best industrial solvers, and inference times.

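(a) TSP. n=100 is the training distribution; n=125, 150 and 200 test generalization.

| Method | n=100 Obj. | n=100 Gap | n=125 Obj. | n=125 Gap | n=150 Obj. | n=150 Gap | n=200 Obj. | n=200 Gap |
|---|---|---|---|---|---|---|---|---|
| Concorde | 7.765 | 0.000% | 8.583 | 0.000% | 9.346 | 0.000% | 10.687 | 0.000% |
| LKH3 | 7.765 | 0.000% | 8.583 | 0.000% | 9.346 | 0.000% | 10.687 | 0.000% |
| 2-Opt-DL | 7.83 | 0.87% | - | - | - | - | - | - |
| LIH | 7.87 | 1.42% | - | - | - | - | - | - |
| POMO (greedy) | 7.796 | 0.404% | 8.635 | 0.607% | 9.440 | 1.001% | 10.933 | 2.300% |
| POMO | 7.779 | 0.185% | 8.609 | 0.299% | 9.401 | 0.585% | 10.956 | 2.513% |
| Poppy 16 | 7.766 | 0.013% | 8.587 | 0.050% | 9.359 | 0.141% | 10.795 | 1.007% |
| EAS | 7.778 | 0.161% | 8.604 | 0.238% | 9.380 | 0.363% | 10.759 | 0.672% |
| COMPASS (ours) | 7.765 | 0.002% | 8.586 | 0.036% | 9.354 | 0.083% | 10.724 | 0.348% |

Time entries recoverable by position: Concorde 82M (n=100), 12M (n=125), 17M (n=150), 31M (n=200); LKH3 73M (n=125), 99M (n=150). Other time entries as listed per column (the per-method assignment is not recoverable from the source layout): n=100: 41M; n=125: 20M, 38M, 20M; n=150: 10S, 32M; n=200: 21S, 70M, 101M, 70M.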

(b) CVRP

Training distr.

Generalization

n=100

n=125

n=150

n=200

Method

Obj.

Gap

Time

Obj.

Gap

Time

Obj.

Gap

Time

Obj.

Gap

Time

LKH3

15.65

0.000\%

17.50

0.000\%

19.22

0.000\%

22.00

0.000\%

LIH

POMO (greedy)

POMO

Poppy 32

EAS

COMPASS (ours)

16.03

15.874

15.713

15.663

15.594

2.47%

1.430%

0.399%

0.084%

0.081%

-0.361%

17.818

17.612

17.548

17.536

17.511

1.818%

0.642%

0.276%

0.146%

0.064%

<1M

43M

42M

81M

42M

19.750

19.488

19.421

19.321

19.313

2.757%

1.393%

1.044%

0.528%

0.485%

23.318

23.378

23.352

22.541

22.462

5.992%

6.264%

6.144%

2.460%

2.098%

100M

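(c) JSSP. 10×10 is the training distribution; 15×15 and 20×15 test generalization.

| Method | 10×10 Obj. | 10×10 Gap | 15×15 Obj. | 15×15 Gap | 20×15 Obj. | 20×15 Gap |
|---|---|---|---|---|---|---|
| OR-Tools | 807.6 | 0.0% | 1188.0 | 0.0% | 1345.5 | 0.0% |
| L2D (Greedy) | 988.6 | 22.3% | 1528.3 | 28.6% | 1738.0 | 29.2% |
| L2D (Sampling) | 871.7 | 8.0% | 1378.3 | 16.0% | 1624.6 | 20.8% |
| EAS (w/ L2D) | 837.0 | 3.7% | 1326.4 | 11.7% | 1570.8 | 16.8% |
| Single | 862.1 | 6.7% | 1302.6 | 9.6% | 1503.0 | 11.7% |
| Poppy 16 | 849.7 | 5.2% | 1290.4 | 8.6% | 1495.7 | 11.2% |
| EAS | 858.4 | 6.3% | 1295.2 | 9.0% | 1498.0 | 11.3% |
| COMPASS (ours) | 845.5 | 4.7% | 1282.8 | 8.0% | 1485.6 | 10.4% |

Time entries recoverable by position: OR-Tools 37S (10×10), 80H (20×15). Other time entries as listed per column (the per-method assignment is not recoverable from the source layout): 10×10: 20S; 15×15: 44S, 25H, 22H; 20×15: 60S, 40H, 37H, 11H.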

A.2 Results on the procedurally transformed instances

In Section 4.2, we present the performance of the methods under study on procedurally transformed instances. In particular, Figure 3 reports the relative performance of the baselines compared to our method COMPASS. In this section, we report the numerical results corresponding to this plot. Tables 4(a) and 4(b) show the results of all methods on TSP and CVRP, respectively. Each row corresponds to a mutation power used to procedurally transform instances. Mutations are used to create 1800 new instances. The columns report the average tour length and the optimality gap. Details about the way those datasets are created can be found in Appendix C.

Our method outperforms the baselines on the entire benchmark. Nevertheless, the industrial solver LKH3 still performs better than COMPASS on all mutated datasets, showing that there is room for improvement for RL-based approaches.

Table 4: Results of the methods on procedurally transformed instances, obtained by applying mutations with increasing mutation powers (referred to as ‘Mu’). Our method COMPASS outperforms other baselines on all mutated instances.

(a) TSP

| Mu | LKH3 Obj. | LKH3 Gap | POMO Obj. | POMO Gap | Poppy 16 Obj. | Poppy 16 Gap | EAS Obj. | EAS Gap | COMPASS (ours) Obj. | COMPASS (ours) Gap |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.768 | 0.000% | 7.782 | 0.188% | 7.769 | 0.02% | 7.78 | 0.163% | 7.769 | 0.013% |
| 1 | 7.609 | 0.000% | 7.624 | 0.195% | 7.611 | 0.022% | 7.622 | 0.164% | 7.611 | 0.019% |
| 2 | 7.562 | 0.000% | 7.577 | 0.199% | 7.563 | 0.022% | 7.574 | 0.163% | 7.563 | 0.014% |
| 3 | 7.485 | 0.000% | 7.502 | 0.217% | 7.488 | 0.029% | 7.499 | 0.18% | 7.487 | 0.019% |
| 4 | 7.386 | 0.000% | 7.402 | 0.219% | 7.389 | 0.034% | 7.398 | 0.166% | 7.388 | 0.021% |
| 5 | 7.308 | 0.000% | 7.326 | 0.251% | 7.311 | 0.042% | 7.322 | 0.199% | 7.310 | 0.026% |
| 6 | 7.182 | 0.000% | 7.201 | 0.272% | 7.186 | 0.06% | 7.196 | 0.199% | 7.184 | 0.036% |
| 7 | 7.063 | 0.000% | 7.085 | 0.311% | 7.069 | 0.078% | 7.079 | 0.22% | 7.066 | 0.044% |
| 8 | 6.910 | 0.000% | 6.938 | 0.402% | 6.918 | 0.118% | 6.928 | 0.256% | 6.914 | 0.067% |
| 9 | 6.732 | 0.000% | 6.770 | 0.557% | 6.744 | 0.176% | 6.753 | 0.308% | 6.738 | 0.091% |

(b) CVRP

| Mu | LKH3 Obj. | LKH3 Gap | POMO Obj. | POMO Gap | Poppy 16 Obj. | Poppy 16 Gap | EAS Obj. | EAS Gap | COMPASS (ours) Obj. | COMPASS (ours) Gap |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.595 | 0.000% | 15.661 | 0.426% | 15.613 | 0.119% | 15.611 | 0.103% | 15.594 | -0.009% |
| 1 | 15.337 | 0.000% | 15.41 | 0.471% | 15.36 | 0.148% | 15.356 | 0.121% | 15.339 | 0.012% |
| 2 | 15.271 | 0.000% | 15.345 | 0.482% | 15.294 | 0.15% | 15.29 | 0.121% | 15.274 | 0.018% |
| 3 | 15.156 | 0.000% | 15.23 | 0.488% | 15.181 | 0.164% | 15.174 | 0.123% | 15.16 | 0.027% |
| 4 | 15.032 | 0.000% | 15.112 | 0.531% | 15.061 | 0.188% | 15.058 | 0.169% | 15.039 | 0.043% |
| 5 | 14.805 | 0.000% | 14.885 | 0.545% | 14.835 | 0.206% | 14.828 | 0.154% | 14.813 | 0.055% |
| 6 | 14.561 | 0.000% | 14.646 | 0.581% | 14.594 | 0.229% | 14.587 | 0.182% | 14.568 | 0.048% |
| 7 | 14.306 | 0.000% | 14.398 | 0.641% | 14.348 | 0.292% | 14.339 | 0.227% | 14.322 | 0.11% |
| 8 | 13.886 | 0.000% | 13.989 | 0.741% | 13.936 | 0.363% | 13.922 | 0.263% | 13.907 | 0.15% |
| 9 | 13.507 | 0.000% | 13.612 | 0.773% | 13.566 | 0.431% | 13.548 | 0.298% | 13.534 | 0.196% |

Appendix B Analysis of the Performance during the Search Process

In this section, we examine how the performance of COMPASS, along with the baseline methods, evolves during the search process. The evaluation procedure consists of 160,000 rollouts per problem for TSP and CVRP, and 8,000 rollouts per problem for JSSP, distributed over the population for all problems. Figures 6, 7 and 8 show the performance on TSP, CVRP, and JSSP, respectively, of our method COMPASS and the three main baselines: POMO (single-agent for JSSP), Poppy, and EAS. Each figure showcases the overall performance and the latest performance achieved by the methods for various instance sizes, including the training distribution size (left column), medium size (middle column), and large size (right column). The first row of plots illustrates the performance of the best solution discovered thus far in the search process, while the second row of plots presents the best performance of the last batch of solutions found at the current timestep of the search process.

Note that the ‘Latest’ performance metric reported in these plots is different from the one reported in Section 4.3 of the paper, as the latter reports the mean and standard deviation.

We can draw three main conclusions from these plots. (i) On all instance sizes of the TSP, COMPASS clearly outperforms the baselines, with a search that consistently finds better solutions on average. (ii) We can clearly see the difference between principled search (COMPASS, EAS) and stochastic sampling (POMO, Poppy 16), as the former have an improving ‘latest batch performance’, whereas the latter do not. (iii) Interestingly, on several tasks, EAS has a higher maximum value on its latest batch (averaged over the 1000 problem instances), but the wider search of COMPASS enables it to find better solutions for each problem instance on average. This can be observed on all JSSP sets and on the first two CVRP sets. For larger instances of CVRP, we can see that EAS is able to outperform our method: EAS updates more parameters than COMPASS, and this difference in the number of modified parameters grows with the instance size (because it is the product of the embedding size and the number of nodes in the instance). This larger number of updated parameters enables better adaptation but also comes with a computational cost.
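The scaling argument can be made concrete with a small calculation. The sketch assumes an embedding size of 128, which is not stated in this section and is used here only for illustration; the function name is likewise hypothetical.

```python
LATENT_DIM = 16  # number of values COMPASS adapts at inference time


def eas_adapted_parameters(num_nodes: int, embedding_dim: int = 128) -> int:
    """Number of values adapted when one embedding per node is fine-tuned.

    embedding_dim=128 is an assumed value used for illustration.
    """
    return num_nodes * embedding_dim


for n in (100, 125, 150, 200):
    print(n, eas_adapted_parameters(n), LATENT_DIM)
```

Under this assumption, the number of adapted embedding values grows linearly with the instance size, while the latent vector searched by COMPASS stays at 16 values.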

Appendix C Mutation Operators Used in the Generalization Tasks

In this section, we outline the mutation operators employed to generate the out-of-distribution (OOD) instances. The training distribution comprises random uniform Euclidean (RUE) instances, which are created by uniformly sampling the city coordinates from a unit square. To diversify the dataset and assess the methods’ generalization abilities, these RUE instances can be mutated to exhibit different underlying distributions. We utilize nine mutation operators (taken from Bossek et al. [2019]) to construct the OOD dataset for TSP and CVRP by mutating the RUE instances. These mutated instances not only closely resemble real-world geographical data but also serve as valuable benchmarks for evaluating the methods’ generalization capabilities. Figure 9 presents a visual representation of a RUE instance as well as the same instance after applying each mutation operator, and the list below defines each operator.

• Explosion: this operator simulates a random explosion that creates a gap or hole in the point cloud. It randomly selects a center of explosion, and then displaces all cities within a specified radius of this center to locations outside the radius.

• Implosion: this operator serves as the inverse of the explosion operator by bringing cities closer together towards a central point. It involves randomly selecting an implosion center and radius, and subsequently shifting all cities located within the implosion radius towards the center.

• Cluster: this operator generates a concentrated cluster of cities by randomly selecting a cluster center and mutating cities within a specified radius around the center. Specifically, cities are randomly chosen and their locations are modified to be within the selected radius of the cluster center.

• Rotation: this operator applies a rotation transformation to the cities around a specified pivot point. This mutation operator introduces angular displacement and rearranges the spatial arrangement of the cities in the TSP instance.

• Linear Projection: this operator performs a linear projection of the cities onto a randomly generated line. A random subset of cities is selected and repositioned along the line according to their original distances.

• Expansion: this operator merges the concepts of the explosion and linear projection mutations by displacing cities farther away from a randomly generated line in an orthogonal direction.

• Compression: this operator, conversely to the expansion operator, combines the implosion and linear projection operators by displacing the cities closer to the randomly generated line in an orthogonal direction.

• Axis Projection: this operator is a special case of the linear projection operator in which the randomly generated line is parallel to either the x- or y-axis.

• Grid: this operator maps randomly chosen cities onto a grid-like structure. Specifically, the width, height, and proximity of the cities within the grid are randomly selected. Next, several cities are chosen from the instance and displaced into one of the grid locations.

It is worth noting that all operators are parameterized by a probability argument, referred to as the mutation power, which denotes the likelihood of mutating each city within the instance. As a result, the expected number of mutated cities is proportional to the mutation power, which is hence positively correlated with the shift between the distribution of the mutated instances and the distribution of the RUE instances. Consequently, we use this mutation power as a reference to define the scale when studying the robustness of the baselines to out-of-distribution instances.
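To illustrate how the mutation power enters an operator, the sketch below generates a RUE instance and applies a simplified implosion. It is an assumption-laden illustration rather than the operator of Bossek et al. [2019]: the fixed radius, the uniform shrink factor and the function names are all hypothetical choices.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]


def rue_instance(num_cities: int, seed: int = 0) -> List[Point]:
    """Random uniform Euclidean instance: cities sampled uniformly in the unit square."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(num_cities)]


def implosion(cities: List[Point], power: float, radius: float = 0.3,
              seed: int = 0) -> List[Point]:
    """Pull cities lying within `radius` of a random centre towards that centre.

    Each such city is mutated with probability `power` (the mutation power).
    """
    rng = random.Random(seed)
    cx, cy = rng.random(), rng.random()  # implosion centre
    mutated = []
    for x, y in cities:
        inside = math.hypot(x - cx, y - cy) <= radius
        if inside and rng.random() < power:
            shrink = rng.random()  # fraction of the distance to the centre that is kept
            x, y = cx + shrink * (x - cx), cy + shrink * (y - cy)
        mutated.append((x, y))
    return mutated


cities = rue_instance(100)
for power in (0.0, 0.5, 0.9):
    moved = sum(a != b for a, b in zip(cities, implosion(cities, power)))
    print(power, moved)
```

With a power of 0 the instance is left unchanged, and larger powers move more of the eligible cities, which is the sense in which the mutation power sets the scale of the distribution shift.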

Appendix D Architecture details

D.1 TSP and CVRP Networks

The architecture of our model for the TSP and CVRP problems consists of several key components. First, we have a single encoder responsible for encoding problem instances into a matrix of embeddings. This encoder follows a similar approach to other reinforcement learning methods for combinatorial optimization, such as POMO [Kwon et al., 2020] and Poppy [Grinsztajn et al., 2022].

Next, we have a single conditioned decoder that takes in the embeddings and the current state of the environment and outputs the next action. In contrast to the encoder, which is called once at the beginning of the episode, the decoder is called at each step of the episode and is conditioned on the same latent vector throughout a given episode. The conditioned decoder architecture is quite similar to the one utilized in POMO and Poppy. It incorporates a multi-head attention mechanism to compute cross-attention between the embeddings and a local context. This local context includes the embedding of the starting point, the embedding of the current node, and the mean embedding of all nodes. The notable difference between our method COMPASS and prior works (POMO and Poppy) is that our decoder is conditioned on a vector sampled from a 16-dimensional latent space. This latent vector is concatenated with the key, query, and value inputs of the multi-head attention decoder module. This conditioning allows us to create distinct policies while processing the same observation from the environment. Each latent vector corresponds to a unique policy, and thus, sampling the latent space to obtain vectors that our model can condition upon gives us an infinite set of policies.
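The concatenation step can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's implementation: the embedding size, the random weights and the helper names are assumptions, and only the key projection is shown (query and value inputs would be treated the same way).

```python
import random
from typing import List

Vector = List[float]


def condition_on_latent(embeddings: List[Vector], latent: Vector) -> List[Vector]:
    """Append the same latent vector to every row of an embedding matrix."""
    return [row + latent for row in embeddings]


def linear(inputs: List[Vector], weights: List[Vector]) -> List[Vector]:
    """Bias-free linear layer; `weights` has shape (in_dim, out_dim)."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
            for row in inputs]


rng = random.Random(0)
num_nodes, embed_dim, latent_dim = 5, 8, 16  # embed_dim is illustrative
node_embeddings = [[rng.gauss(0, 1) for _ in range(embed_dim)] for _ in range(num_nodes)]
latent = [rng.uniform(-1, 1) for _ in range(latent_dim)]

# The key projection takes (embed_dim + latent_dim) inputs and maps back to embed_dim.
w_key = [[rng.gauss(0, 0.1) for _ in range(embed_dim)]
         for _ in range(embed_dim + latent_dim)]
keys = linear(condition_on_latent(node_embeddings, latent), w_key)
print(len(keys), len(keys[0]))
```

Changing `latent` while keeping `node_embeddings` fixed changes the keys, which is how a single set of weights can represent many policies for the same observation.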

D.2 JSSP Network

The model architecture used for the Job-Shop Scheduling Problem (JSSP) is different from the networks used for TSP and CVRP (where the latter were taken from POMO [Kwon et al., 2020]). Prior work that achieved state-of-the-art results for JSSP [Hottung et al., 2022] uses the L2D model [Zhang et al., 2020], a Graph Neural Network, on a different yet equivalent environment. We implemented an attention-based model and observed that it outperformed L2D; we therefore decided to use our transformer-based architecture (similar to the TSP and CVRP models) to tackle JSSP.

We utilize the actor-critic transformer architecture, as implemented in Jumanji [Bonnet et al., 2023]. This architecture consists of an encoder and decoder network for both the actor and critic components. The encoder network incorporates attention layers for the machines’ status, operation durations (with positional encoding), and joint sequence of jobs and machines. During each step of the episode, the encoder network is called and produces joint embeddings of the jobs and machines. These embeddings, along with the latent vector, are then fed into the decoder network. The decoder network receives the encoder embeddings and the latent vector as inputs and concatenates each dimension of the embedding with the latent vector before passing it through a multi-layer perceptron. At each step, the decoder network generates N marginal categorical distributions for each machine and provides a value generated by the critic. It is important to note that the actor and critic networks are separate entities with distinct sets of weights, ensuring that they do not share any parameters.

Appendix E Latent Policy Space

E.1 Design

The latent policy space defines a set of vectors that the single model can condition itself upon, and thus, this latent space provides us with an infinitely large set of policies (which become specialized and diverse through our training procedure). There are several ways to define the latent space, and the simplest approach is to use a set of $N$ one-hot encoded vectors. This is very similar to the definition of the skill space in the RL Skill-Discovery literature [Eysenbach et al., 2019, Sharma et al., 2019], and the difference between a set of independent policies and a single conditioned policy is reminiscent of the opposition between RL-based methods and Quality-Diversity methods for Skill Discovery [Chalumeau et al., 2023a]. Through preliminary experiments, we achieved a similar performance to Poppy 16 [Grinsztajn et al., 2022] (current SOTA) with a set of 16 one-hot encoded vectors. However, the drawback of using a discrete latent space is that we cannot have an infinite set of policies and cannot interpolate within the latent space to adapt our model. Therefore, we define our latent space as a continuous $n$-dimensional hypercube. This gives us an infinite number of policies that can be uniformly sampled during training and strategically searched during inference. During the experimentation phase, we investigated several different distributions and space sizes and settled on a 16-dimensional space constrained to $[-1,1]^{16}$ . In practice, we multiply the latent vector by a factor of 100, which is equivalent to sampling in $[-100,100]^{16}$ .
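
As a minimal sketch of the sampling step described above, assuming a plain Python setting (the function name `sample_latent` and the seed are illustrative, not taken from the COMPASS codebase), a latent vector is drawn uniformly from $[-1,1]^{16}$ and then multiplied by 100:

```python
import random

LATENT_DIM = 16       # dimension reported in the text
LATENT_SCALE = 100.0  # multiplicative factor reported in the text


def sample_latent(rng, dim=LATENT_DIM, scale=LATENT_SCALE):
    """Draw one latent vector uniformly from [-1, 1]^dim, then scale it."""
    return [scale * rng.uniform(-1.0, 1.0) for _ in range(dim)]


rng = random.Random(0)
z = sample_latent(rng)
print(len(z), min(z), max(z))
```

Every coordinate of `z` lies in $[-100,100]$, which matches sampling directly in $[-100,100]^{16}$.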

E.2 Visualisation

In section 4.3, we train COMPASS with a two-dimensional latent space and report the visualization of its performance landscape on a randomly selected instance of TSP $150$ . To obtain this visualization, COMPASS was trained on a distribution of TSP $100$ , and then we evaluated $32\,000$ latent vectors on a randomly sampled TSP $150$ instance. This enables us to create a precise heat-map (contour plot) of the performance landscape of the latent space on this instance. In Figure 10, we report $7$ additional visualizations for $7$ new instances (the first one is the one reported in the main paper).

We can draw three main observations from this plot: (i) The landscape is instance-dependent: we can see that the whole landscape changes and, in particular, the performant regions are different for each of the 8 problems studied. This shows that the whole latent space is used at inference time, and there does not seem to be a subpart dominating the others. (ii) There are usually several performant areas in the latent space for a given instance. This motivates the use of several independent search components at inference time, which avoids spending the whole search budget on a suboptimal area. If there is a single clear area of interest, the components are likely to all converge there, so no performance is lost. This observation is insightful as it illustrates that several distinct solving strategies can lead to solutions of similar quality (even if those solutions are distinct). (iii) Interestingly, there are clear discontinuities in the performance landscape. We can observe those in all problems visualized, but Problem 5 is a particularly relevant example. We can see several clear frontiers in the landscape, which shows that there can be a clear discontinuity in the mapping between solving strategy and solution quality. We hypothesize that there is one important decision that differs at the frontier, leading to a completely different performance.

Note that the deeper analysis of these latent spaces can provide very interesting insights about the solving process (e.g. the important decisions taken during the solving process), which could help to improve it (e.g. focusing the search on the important decision nodes). We leave this for future work.

E.3 Exploring the Latent Policy Space at Inference Time

In this section, we present a comparison of various search methods applied to our latent policy space. Specifically, we analyze the performance of four strategies to illustrate choices made in the design of our method: (1) A naive strategy that samples 16 random vectors from the latent space and then samples stochastically from the resulting policies. This strategy is referred to as ‘fixed policies’ because we do not re-sample in the latent space. (2) A strategy that samples uniformly from the latent space and rolls out the resulting policy, referred to as ‘uniform-sampling’. This strategy already illustrates the benefit of having a latent space of diverse and specialized policies compared to sampling from a fixed set of policies. (3) CMA-ES search to navigate the latent space. (4) Our strategy, CMA-ES with several components (three), which illustrates the benefit of being able to focus on distinct areas of the latent space.

Figure 11 illustrates the performance of the different search strategies on a set of TSP150 instances. We report the global performance as well as the latest batch performance. We highlight three observations: (i) Sampling from the latent space brings a significant improvement compared to sampling stochastically from a fixed set of policies. This shows the benefit of having access to a space of diverse and specialized policies. (ii) Using an Evolution Strategy (like CMA-ES) to focus the search helps to make better use of the budget. We can see that the latest sampled solutions get better as the budget is used and that the global performance improves faster compared to uniform sampling. (iii) The final performance of the search is better with three independent CMA-ES components than with one. Being able to focus on several areas of the latent space enables us to avoid local optima and helps exploration. Interestingly, we can see that using all the budget for one component gives faster improvement but is surpassed by the end of the budget. The choice of the appropriate number of components is a trade-off between the risk of staying stuck in local optima and not having enough time to converge. In most of our tasks, using two or three components proved to work best.
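
The idea of splitting a fixed budget across several independent search components can be illustrated with a minimal sketch. It is deliberately not CMA-ES: each component is a simplified isotropic Gaussian search that moves its mean to the average of its elites and shrinks its step size. The objective `toy_score`, the two-dimensional space, the elite fraction, and the decay factor are all assumptions made for illustration; in COMPASS the score would come from rolling out the policy conditioned on the latent vector.

```python
import random


def toy_score(z):
    """Hypothetical stand-in for the reward of the policy conditioned on z."""
    peaks = [((0.6, 0.6), 1.0), ((-0.5, -0.4), 0.8)]  # (centre, height)
    return max(h - ((z[0] - c[0]) ** 2 + (z[1] - c[1]) ** 2) for c, h in peaks)


def search(score_fn, num_components, budget, batch_size=16, dim=2, seed=0):
    """Simplified multi-component Gaussian search (an assumption, not full CMA-ES)."""
    rng = random.Random(seed)
    # Each component keeps its own mean and isotropic step size.
    components = [
        {"mean": [rng.uniform(-1.0, 1.0) for _ in range(dim)], "sigma": 0.5}
        for _ in range(num_components)
    ]
    num_elites = max(1, batch_size // 4)
    best_z, best_score = None, float("-inf")
    used, step = 0, 0
    while used + batch_size <= budget:
        comp = components[step % num_components]  # round-robin over components
        batch = [
            [m + comp["sigma"] * rng.gauss(0.0, 1.0) for m in comp["mean"]]
            for _ in range(batch_size)
        ]
        scores = [score_fn(z) for z in batch]
        order = sorted(range(batch_size), key=lambda i: scores[i], reverse=True)
        elites = [batch[i] for i in order[:num_elites]]
        # Move the mean to the elite average and shrink the step size.
        comp["mean"] = [sum(z[d] for z in elites) / num_elites for d in range(dim)]
        comp["sigma"] *= 0.9
        if scores[order[0]] > best_score:
            best_score, best_z = scores[order[0]], batch[order[0]]
        used += batch_size
        step += 1
    return best_z, best_score


best_z, best_score = search(toy_score, num_components=3, budget=480)
print(best_z, best_score)
```

With more components, each one receives fewer updates for the same budget, which mirrors the trade-off discussed above between escaping local optima and having enough time to converge.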

E.4 Alternative Strategies

At inference time, our goal is to explore the latent space efficiently and find its promising areas, in order to obtain high-quality solutions within the given computational budget. As explained in Section 3.2, we decide to use multiple components of an evolution strategy to navigate our search space, because this approach is able to focus the search while being robust to local optima. Additionally, its parameters are easy to tune and work well on a wide range of tasks. Multiple independent CMA-ES components are also used in other works from the literature [Fontaine et al., 2020, Cully, 2021], outside the scope of CO. In practice, our implementation of CMA-ES is inspired by the GitHub package QDax [Chalumeau et al., 2023b, Lim et al., 2022].

In Appendix E.3, we provide a comparison of our search strategy (CMA-ES with multiple components) with alternatives. In particular, we compare CMA-ES with a single component to uniform sampling in the latent space. We also considered other alternatives for this search. First, we explored Bayesian Search, as it is a data-efficient search and hence can be appropriate when the budget is limited, which is our case. However, this approach was never able to do better than random sampling, although we tried multiple sets of hyper-parameters. A potential limitation of Bayesian Search is that it needs to model the space with a Gaussian Process, which can be very difficult if the space is noisy. Our main hypothesis is that the latent space of COMPASS is too noisy to be correctly modeled under the budget constraint. We see two possible mitigations of this effect. The first is to add a regularization term during training to create a ‘better-defined’ latent space. The second is to use a distribution to sample the space and use a Bayesian Search to optimize the parameters of this distribution. We leave these for future work.

Another alternative to Evolution Strategy is Gradient Descent, since we do have access to the derivatives. Nevertheless, gradient descent can easily get stuck in local optima, which is what we observe when analyzing the search of EAS in Section 4.3. That being said, note that EAS and COMPASS could be combined, which could be expected to provide very good results, particularly in CVRP. We also defer this to future work.

Appendix F Training Procedure

In this section, we describe our model’s training process. We first have a pre-training phase where we either reuse an existing model or train a single model with an encoder and a non-conditioned decoder using the REINFORCE algorithm.

Next, we begin the training procedure which aims to create a diverse set of specialized policies. In this process, we designate our set of policies with a 16-dimensional latent space that contains vectors that can be used to condition our decoder model (i.e., a policy is parameterized by the decoder model parameters and the latent vector). Specifically, the vector is concatenated with the key, query, and value inputs of the decoder model for each step of the episode. In order to use the learned decoder from the pre-training phase, we initialize the extra weights with zeros such that they do not initially affect the output.
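
The effect of the zero initialization can be checked on a toy example. The sketch below assumes a plain dense layer written in pure Python; the helpers `linear` and `augment` and all numeric values are hypothetical and are not the authors' implementation. Because the weights attached to the latent inputs start at zero, the augmented layer returns the same output as the pre-trained one, whatever the latent vector is.

```python
def linear(weights, bias, x):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def augment(weights, latent_dim):
    """Append latent_dim zero-valued input weights to every output row."""
    return [row + [0.0] * latent_dim for row in weights]


# Toy "pre-trained" layer with 3 inputs and 2 outputs (illustrative values).
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
b = [0.1, -0.2]
x = [1.0, 2.0, 3.0]
z = [42.0, -17.0]  # arbitrary latent vector

before = linear(W, b, x)
after = linear(augment(W, len(z)), b, x + z)  # input concatenated with the latent
print(before, after)
```

Once training starts, the extra weights move away from zero and the output begins to depend on the latent vector.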

In the training procedure, the policy is trained to use the latent space to specialize to subareas of the problem distribution. This is achieved by training only the policy conditioned on the latent variable that yields the highest reward for each instance. The details of the COMPASS training procedure are presented in Algorithm 1 and can be understood as follows. At each iteration, we sample $N$ vectors from the latent space $\mathcal{Z}$ and a batch $\mathcal{B}$ of instances from the problem distribution $\mathcal{D}$ . Then, for each instance $\rho_{i}\text{ where }i\in 1,\dots,\mathcal{B}$ and sampled vector $z_{k}\text{ where }k\in 1,\dots,N$ , we roll out the conditioned policy $\pi_{\theta}(\cdot|z_{k})$ on the problem instance (i.e., generate a trajectory which represents a solution to the instance). Next, for each instance, we determine the best-performing latent vector by identifying which conditioned policy obtained the highest reward on the instance. Finally, we only train the best latent vector for each problem instance and use the REINFORCE loss to perform backpropagation through the network parameters of our model (including both the encoder and decoder networks).

Note, we only train on instances that have a conditioned policy performing strictly better than the remaining policies. For example, if two policies have the exact same performance which is the maximum amongst the set of sampled policies, neither of the policies is trained on the instance. This approach enhances specialization and promotes a more balanced distribution of instances solved by each policy, resulting in a better spread of contributions within the latent space.
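
The selection rule, including the strict-improvement condition, can be sketched as follows. The reward matrix is made-up data (negative tour lengths are assumed as rewards), and `select_best_latents` is a hypothetical helper, not the authors' code.

```python
def select_best_latents(rewards):
    """rewards[i][k] is the reward of latent k on instance i.

    Returns, for each instance, the index of the strictly best latent,
    or None when several latents tie for the highest reward.
    """
    selected = []
    for row in rewards:
        best = max(row)
        winners = [k for k, r in enumerate(row) if r == best]
        selected.append(winners[0] if len(winners) == 1 else None)
    return selected


rewards = [
    [-7.9, -7.8, -8.1],  # latent 1 is strictly best
    [-8.0, -8.0, -8.3],  # tie for first place: instance is skipped
    [-9.2, -9.5, -9.1],  # latent 2 is strictly best
]
print(select_best_latents(rewards))  # [1, None, 2]
```

Only the trajectories of the selected (instance, latent) pairs would then contribute to the REINFORCE update.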

Lastly, it is also important to note that the number of vectors sampled during training (i.e., the number of conditioned policies competing for each instance) plays a significant role in the training process. The number of sampled latent vectors is potentially unbounded; the only constraints are hardware and runtime limitations. Increasing the number of samples $N$ in COMPASS promotes greater specialization and competition among policies in the latent space, which can be leveraged during inference when tackling new instances.

The final COMPASS neural solvers are trained until convergence, on a TPU v3-8. For each problem the training time and environment steps are: 4.5 days (110M steps) for TSP, 5.5 days (76.5M steps) for CVRP and 4.5 days (4.2M steps) for JSSP.

Algorithm 1 COMPASS Training

1: Input: problem distribution $\mathcal{D}$ , latent space $\mathcal{Z}$ , number of samples $N$ , batch size $\mathcal{B}$ , number of training steps $K$ , policy $\pi_{\theta}$ with pre-trained parameters $\theta$
2: $\pi_{\theta}\leftarrow\textnormal{Augment}(\pi_{\theta})$ {Augment the pre-trained policy to take as input the latent variable.}
3: for step 1 to $K$ do
4:     $\rho_{i}\leftarrow\textnormal{Sample}(\mathcal{D})~\forall i\in\{1,\dots,\mathcal{B}\}$
5:     $z_{k}\leftarrow\textnormal{Sample}(\mathcal{Z})~\forall k\in\{1,\dots,N\}$
6:     $\tau^{k}_{i}\leftarrow\textnormal{Rollout}(\rho_{i},\pi_{\theta}(\cdot|z_{k}))~\forall i\in\{1,\dots,\mathcal{B}\},\forall k\in\{1,\dots,N\}$
7:     $k^{*}_{i}\leftarrow\operatorname*{arg\,max}_{k\leq N}\mathcal{R}(\tau^{k}_{i})~\forall i\in\{1,\dots,\mathcal{B}\}$ {Select the best vector for each problem $\rho_{i}$ .}
8:     $\nabla L(\theta)\leftarrow\frac{1}{\mathcal{B}}\sum_{i\leq\mathcal{B}}\textnormal{REINFORCE}(\tau^{k^{*}_{i}}_{i})$ {Backpropagate through these only.}
9:     $\theta\leftarrow\theta-\alpha\nabla L(\theta)$

Appendix G Hyper-parameters

In Table 5, we report all the hyper-parameters of our method. Interestingly, our method is quite robust to these parameters, and we use almost the same values for all types of tasks. Note that for TSP sizes 125, 150, and 200, we report the hyper-parameters used during inference for the multi-component CMA-ES algorithm, but there are no training hyper-parameters to report, as the model used was trained on instances of size 100.

Table 5: The hyper-parameters used in COMPASS.

Phase	Hyper-parameters	TSP100	TSP(125, 150)	TSP200	CVRP	JSSP
Train time	latent space dimension	16	-	-	16	16
Train time	training sample size	128	-	-	128	128
Train time	instances batch size	8	-	-	8	8
Inference time	policy noise	1	0.1	0.1	0.1	0.1
Inference time	num. CMAES components	3	3	2	2	3
Inference time	CMAES init. sigma	100	100	100	100	100
Inference time	sampling batch size	16	16	16	16	16

Appendix H Model Checkpoints

We compare our method COMPASS to three main baselines, POMO [Kwon et al., 2020], Poppy [Grinsztajn et al., 2022], and EAS [Hottung et al., 2022], on three CO problems: TSP, CVRP, and JSSP. The checkpoints used to run the POMO and Poppy models for TSP and CVRP are taken from Grinsztajn et al. [2022], and the EAS baseline is executed using the same POMO checkpoint. Those checkpoints are publicly available at https://github.com/instadeepai/poppy. For JSSP, we trained the single agent and Poppy models ourselves as they were not available. Lastly, we provide the COMPASS checkpoints used to obtain the results reported in this work. All of these are available at https://github.com/instadeepai/compass.

Appendix I Performance with a Small Evaluation Budget

In Table 6, we provide the performance of COMPASS and the baseline methods on the standard benchmark dataset with a smaller budget (typically 10% of the usual budget). It can be seen that COMPASS outperforms the baselines on a majority of the tasks (specifically, 3/4 for TSP, 2/4 for CVRP, and all 3/3 for JSSP). In practice, a solver is particularly useful if it performs well for a wide range of budgets, in other words, if it is able to provide good solutions fast but also able to keep improving when additional budget is given. COMPASS provides an efficient way to improve with additional budget while remaining competitive with (or outperforming) state-of-the-art methods at low budget.

Table 6: Results of COMPASS against the baseline algorithms for few-shot task on (a) TSP, (b) CVRP, and (c) JSSP problems. The methods are evaluated on instances from training distribution as well as on larger instance sizes to test generalization. The methods are only given 10% of the standard budget.

(a) TSP

	Training distr.		Generalization					
	n=100		n=125		n=150		n=200	
Method	Obj.	Gap	Obj.	Gap	Obj.	Gap	Obj.	Gap
Concorde	7.765	0.000%	8.583	0.000%	9.346	0.000%	10.687	0.000%
LKH3	7.765	0.000%	8.583	0.000%	9.346	0.000%	10.687	0.000%
POMO (greedy)	7.796	0.404%	8.635	0.607%	9.440	1.001%	10.933	2.300%
POMO	7.785	0.263%	8.619	0.424%	9.424	0.835%	11.066	3.545%
Poppy 16	7.769	0.045%	8.593	0.115%	9.373	0.293%	10.867	1.683%
EAS	7.785	0.254%	8.618	0.403%	9.419	0.776%	11.037	3.276%
COMPASS (ours)	7.769	0.044%	8.595	0.138%	9.373	0.283%	10.787	0.933%

(b) CVRP

	Training distr.		Generalization					
	n=100		n=125		n=150		n=200	
Method	Obj.	Gap	Obj.	Gap	Obj.	Gap	Obj.	Gap
LKH3	15.65	0.000%	17.50	0.000%	19.22	0.000%	22.00	0.000%
POMO (greedy)	15.874	1.430%	17.818	1.818%	19.750	2.757%	23.318	5.992%
POMO	15.767	0.75%	17.690	1.085%	19.610	2.031%	23.817	8.258%
Poppy 32	15.722	0.459%	17.633	0.758%	19.558	1.757%	23.907	8.666%
EAS	15.743	0.592%	17.646	0.834%	19.553	1.730%	23.612	7.329%
COMPASS (ours)	15.681	0.201%	17.648	0.846%	19.532	1.626%	22.946	4.298%

(c) JSSP

	Training distr.		Generalization			
	$10\times 10$		$15\times 15$		$20\times 15$	
Method	Obj.	Gap	Obj.	Gap	Obj.	Gap
OR-Tools	807.6	0.0%	1188.0	0.0%	1345.5	0.0%
L2D (Sampling)	871.7	8.0%	1378.3	16.0%	1624.6	20.8%
Single	869.7	7.684%	1312.4	10.47%	1517.6	12.787%
Poppy 16	857.1	6.126%	1306.1	9.94%	1516.2	12.686%
EAS	868.3	7.516%	1313.6	10.572%	1519.3	12.918%
COMPASS (ours)	851.9	5.48%	1301.1	9.518%	1504.6	11.827%

Appendix J Limitations

In this section, we address three potential limitations of our method, COMPASS. First, our generalization capacity is directly linked to and potentially limited by the diversity created by specializing on the training distribution. In particular, our objective function does not explicitly incorporate any term to promote further diversity within the latent space. The specialization we observe during training is a result of training only the top-performing conditioned policy on each instance. Although this approach is straightforward and effective, it can be argued that adding an unsupervised term during training could lead to a more diverse and ultimately higher-performing set of policies.

Another limitation lies in our training process: when sampling the latent space uniformly, the latent vectors evaluated may not include the true best vector, resulting in training a policy conditioned on a sub-optimal vector. This is both data-inefficient (evaluating several conditioned policies to only train on one trajectory) and sub-optimal (as specialization decreases by training a vector that does not correspond to the best conditioned policy).

To address this issue, we can consider two approaches. First, we can incorporate a search in the latent space during training to increase the likelihood of sampling the best vector. This way, we explore a wider range of latent vectors and improve the chances of discovering the optimal conditioning for each instance. Alternatively, we can train a prior distribution that, given an instance, provides information about the promising regions in the search space. By utilizing this prior distribution, we can sample latent vectors that are more likely to lead to better-performing conditioned policies. By leveraging prior knowledge or conducting a targeted search, we can enhance the efficiency and effectiveness of our training process.

Lastly, we believe there is potential for improving the efficiency of our search process with a better-defined latent space. We have observed that the latent space is noisy, which poses a challenge when employing a Bayesian search method for exploration. Moreover, we anticipate that our CMA-ES search could achieve faster convergence if the space exhibits smoother characteristics. One possible approach to address this is by incorporating a regularization term into our training objective, which would promote a smoother and more well-behaved latent space.

Appendix K Extended related work

We focus the literature review of the main paper (Section 2) on construction methods trained with reinforcement learning. In this section, we present several improvement methods, i.e. methods that start with an existing solution and learn to directly modify this solution to create a new one, close to the concept of local search. Those are interesting alternatives to construction methods, although their limitation is that they are usually more problem-specific and are also highly biased by the choice of the initial solution.

NeuRewriter [Chen and Tian, 2019] learns a policy that updates the solution, factorized in two steps: choosing the part of the solution to update, and then choosing the rule used to create a new solution. Note that this assumes the existence of a set of existing update rules to pick from. Concurrently, Hottung and Tierney [2020] builds upon the Large Neighborhood Search framework, which consists of destroying and repairing solutions. This method relies on heuristics to destroy the solution and a neural policy for the reconstruction mechanism, and this is learned end-to-end with reinforcement learning. Their approach is solely assessed on vehicle routing problems. Kim et al. [2021] reduces the dependence on the initial solution by learning a constructive model that generates diverse solutions, called seeds, that are used as starting points for an improvement neural policy. de O. da Costa et al. [2020] learns a neural policy that selects 2-opt operators to improve solutions of the TSP, and Wu et al. [2022] extends it to CVRP.

Ma et al. [2021] improves the performance of transformer-based improvement methods (in VRP) by ensuring that the embedding learned to encode the instance being solved better captures the structure of vehicle routing problems.

Overall, once a first solution has been created, using improvement approaches can be seen as a concurrent approach to policy adaptation to make the best use of a given budget to reach the best possible solution.

Another related method that helps fine-tune pre-trained checkpoints for larger-scale tasks is Meta-SAGE [Son et al., 2023]. This method meta-learns how to scale embeddings with respect to the instance size, improving the performance of policy-gradient fine-tuning. This method requires using a mixed-size distribution at training time.

Appendix L Analysis of time consumption in COMPASS

We provide an analysis of the time consumption of COMPASS at inference time, by looking at its time performance on TSP100 on a TPU v3-8. The inference procedure can be decomposed into four steps: (1) encoding an instance (once per episode), (2) decoding steps (99 times per episode), (3) environment steps (99 times per episode), and (4) CMA-ES steps (namely the sampling of the vectors and the update of the distribution parameters, once per episode). All values are estimated by averaging 500 evaluations and are reported in milliseconds (ms).

These results show that the adaptation mechanism of COMPASS (CMA-ES sampling and updates) is negligible compared to the remaining steps in the inference procedure, with a difference of three orders of magnitude. This explains the similarity in runtimes reported in Table 1 between COMPASS, Poppy and POMO: the encoding and decoding are mostly identical, and the additional adaptation mechanism is not significant.

Interestingly, the encoding step is done only once for the whole budget. Hence, the bigger the budget, the more negligible the encoding step becomes compared to the decoding and environment steps. Furthermore, the time difference between the CMA-ES steps and all the other steps (aggregated over an episode) is only expected to grow with the size of the instance. First, those individual steps depend on the instance size (larger matrix to encode or decode, more computations to be carried out in the environment), whereas the CMA-ES adaptation mechanism does not. Second, the number of decoding and environment steps increases with the instance size, which is not the case for the CMA-ES steps.

On the contrary, the adaptation mechanism used in EAS is time-consuming, making the method much slower than POMO, Poppy and COMPASS. Additionally, EAS’ adaptation time increases as the instance size increases.

Table 7: The time taken in a full episode rollout of COMPASS for TSP100. All values are averaged over 500 evaluations. CMA-ES (sampling and update) takes three orders of magnitude less time than the decoding steps.

Phase	Encoding	Decoding	Env. step	CMAES (sample & update)
Time for single event (ms)	32.27	3.01	0.69	0.28
Occurrence in an episode	1	99	99	1
Time over an episode (ms)	32.27	297.99	68.31	0.28
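
The per-episode figures in Table 7 follow from multiplying each per-event time by its number of occurrences. As a short worked check (the dictionary layout and variable names are only illustrative), the sketch below recomputes the episode totals, which sum to about 398.85 ms, and the ratio between the decoding time and the CMA-ES time.

```python
# Per-event time (ms) and occurrences per episode, copied from Table 7.
steps = {
    "encoding": (32.27, 1),
    "decoding": (3.01, 99),
    "env_step": (0.69, 99),
    "cmaes": (0.28, 1),
}

# Time spent in each phase over one episode.
totals = {name: time_ms * count for name, (time_ms, count) in steps.items()}
episode_ms = sum(totals.values())

# Share of the episode spent in CMA-ES, and decoding-to-CMA-ES ratio.
cmaes_share = totals["cmaes"] / episode_ms
decoding_to_cmaes = totals["decoding"] / totals["cmaes"]

print(totals)
print(round(episode_ms, 2), round(100 * cmaes_share, 3), round(decoding_to_cmaes))
```

The CMA-ES step accounts for well under 0.1% of the episode time, and decoding takes roughly a thousand times longer, which is the three-orders-of-magnitude gap mentioned in the caption.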

Appendix M Impacts of neural solver and search procedure in overall performance

M.1 Base solver vs adaptation mechanism

Our method COMPASS has two important aspects: a conditioned neural solver, which enables it to capture specialised policies in a latent space; and an adaptation mechanism that searches the latent space at inference time. In this subsection, we provide more insight into the impact of both aspects on the overall performance observed.

Our experimental results provide evidence that both aspects – (1) a well-trained conditional neural solver and (2) an efficient search algorithm – are critical for strong performance.

Point (1) is illustrated by Fig. 5 and Fig. 10, which show high-performing regions for a given instance. Point (2) is demonstrated by Fig. 11, which shows that the principled search method significantly outperforms random search. Additionally, we see that random search outperforms POMO and Poppy, confirming that the latent space “contains” high-performing and diverse policies.

To further illustrate the importance of those combined aspects, we provide two additional experiments. First, we under-train a conditioned neural solver by stopping the training procedure well before convergence and compare two COMPASS models (fully- vs. under-trained) solving TSP150 instances with two search methods (CMA-ES and uniform sampling). Those results are reported in Fig. 12. The results demonstrate that: (i) both search methods for the fully-trained model outperform those for the under-trained model, showing the importance of our training procedure; (ii) uniform search on the fully-trained solver outperforms CMA-ES search on the under-trained model, showing that the search alone is not sufficient.

Second, we present the evolution of the latent space during training on a TSP150 instance in Fig. 13. It can be seen that initially, the space is uniform (no specialized regions exist). However, as training progresses, high-performing regions emerge (shown in red), which indicates the specialization of policies within the latent space, and we also see the improved performance of the best conditioned policy.

M.2 Adaptation mechanism vs beam search

Several strategies can be considered to optimally leverage a neural solver within the budget constraints to find the best possible solutions. To evaluate the efficiency and performance of our adaptation mechanism, CMA-ES search, we can compare it to an alternative approach, such as beam search.

These two approaches are different paradigms: our approach searches the policy space and uses the policy to find a solution; Beam search explores directly the solution space, armed with a fixed policy. This can be approximated by comparing COMPASS with SGBS [Choo et al., 2022], a heuristic approach to improve the performance of POMO [Kwon et al., 2020]. A naive application of this heuristic on COMPASS (sampling a random policy with no latent space search) is equivalent to POMO+SGBS. We report the results of POMO+SGBS in Table 3 and show that COMPASS outperforms POMO+SGBS on the whole benchmark. This validates that it is worth searching for a good latent condition with the budget rather than fixing a random policy and using a beam search. Nevertheless, there may be a trade-off between search in latent space and heuristic solution search, which we leave for future work.

Appendix N Performance of our implementation of EAS

In order to report the performance of EAS on our whole set of experiments and in similar conditions as POMO, Poppy and COMPASS, we have re-implemented EAS in our codebase, in JAX. This implementation is open-sourced and available at https://github.com/instadeepai/compass. In this section, we report the results stated in the paper introducing EAS [Hottung et al., 2022] alongside the results of our implementation in the same conditions.

Our implementation is faster on all TSP problem sets and has similar speed on the CVRP sets. Our reported performance is better on all TSP sets and on half the CVRP sets. We cannot explain the difference observed on CVRP 150 and 200 with certainty; it may be linked to differences between JAX and PyTorch, or to a divergence in architectures that we missed. Note that our implementation is available online.

The results of EAS reported in this arXiv version are slightly different from those reported in the OpenReview version because we changed the implementation to better match the original one introduced in EAS [Hottung et al., 2022]. We previously backpropagated the gradients through the whole decoder to update the embeddings instead of backpropagating only through the last attention layer, which made the method significantly slower than expected. We have fixed this in our codebase and updated our results and paper accordingly in this arXiv version.

Table 8: Results of EAS (paper results vs. our implementation) with instance augmentation for (a) TSP and (b) CVRP.

(a) TSP

Training distr.

Generalization

n=100

n=125

n=150

n=200

Method

Obj.

Gap

Time

Obj.

Gap

Time

Obj.

Gap

Time

Obj.

Gap

Time

EAS (paper)

EAS (our implem.)

7.769

7.768

0.052%

0.038%

8.591

8.590

0.093%

0.080%

57M

31M

9.363

9.361

0.182%

0.159%

50M

10.730

0.402%

0.403%

85M

(b) CVRP

Training distr.

Generalization

n=100

n=125

n=150

n=200

Method

Obj.

Gap

Time

Obj.

Gap

Time

Obj.

Gap

Time

Obj.

Gap

Time

EAS (paper)

EAS (our implem.)

15.63

15.62

-0.13%

-0.175%

17.47

-0.17%

-0.153%

93M

67M

19.22

19.26

0.00%

0.213%

108M

22.19

22.556

0.86%

2.527%
