Learning to Play Pursuit-Evasion with Dynamic and Sensor Constraints

Burak M. Gonultas (ORCID 0000-0002-7966-7929) and Volkan Isler (ORCID 0000-0002-0868-5441)

The authors are with the Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455, USA ({gonul004, isler}@umn.edu). This work was supported by the MN LCCMR program.

Abstract

We present a multi-agent reinforcement learning approach to solve a pursuit-evasion game between two players with car-like dynamics and sensing limitations. We develop a curriculum for an existing multi-agent deterministic policy gradient algorithm to simultaneously obtain strategies for both players, and deploy the learned strategies on real robots moving as fast as 2 m/s in indoor environments. Through experiments we show that the learned strategies improve over existing baselines by up to 30% in terms of capture rate for the pursuer. The learned evader model achieves an escape rate up to 5% higher than the baselines, even against our competitive pursuer model. We also present experimental results which show how the pursuit-evasion game and its results evolve as the player dynamics and sensor constraints are varied. Finally, we deploy learned policies on physical robots for a game between the F1TENTH and JetRacer platforms and show that the learned strategies can be executed on real robots. Our code and supplementary material including videos from experiments are available at https://gonultasbu.github.io/pursuit-evasion/.

I Introduction

Pursuit-evasion games are optimization problems in which one or more pursuers try to “capture” an evader. The notion of capture is usually formulated as being co-located or getting close. However, there are other formulations, such as visibility-based pursuit-evasion [1, 2], where capture is achieved when the evader is located. In robotics, pursuit-evasion games are studied to model motion planning problems in dynamic scenarios and to investigate the coupling between player dynamics, observation models and environment complexity.

As an example, in the classical homicidal chauffeur game [3], the pursuer is a car-like vehicle whose goal is to hit a pedestrian who is not subject to any turning constraints. In this game, the pursuer has three state variables (position and orientation) and the evader has two (only position). One might view the solution of this game as a subspace (manifold) of this five-dimensional joint state space. The tangent space of this solution manifold corresponds to the optimal control actions of the players. In some cases, the differential equations for the evolution of the game when the players play optimally can be derived in closed form. The strategies for a given instance are usually obtained by numerical integration. Alternatively, the state space can be discretized and the solution can be obtained using dynamic programming techniques.

Figure 1: Real-world experiments: Execution of the learned pursuit-evasion policies on two robotic cars with varying wheelbase, velocity and steering angle constraints.

However, such methods quickly become intractable as the dimensionality of the game increases. For example, in his book, Isaacs presents only a partial solution for the “game of two cars” where both players have car-like dynamics. The dimensionality issue becomes more severe when sensing is involved: in the standard formulations, it is assumed that the players can fully observe each other's state at each time step. In practice, this may not always be true due to sensor and observability limitations. The resulting “partial observability games” have even higher (sometimes infinite) dimensionality, as the players now need to reason about all feasible states that are consistent with their observation histories.

Reinforcement Learning (RL) [4] methods provide an alternative approach to construct the solution manifold by learning it through game play. Modern RL methods often represent the solution as a neural network which takes the observations as input and returns a distribution over the actions. In this paper, we investigate whether recently developed multi-agent reinforcement learning (MARL) techniques can solve a challenging game where the pursuer is a car with five state variables including velocities. We investigate two versions of the evader: it is either a car with the same kinematic model or a point mass. Hence, the joint-state dimensionality can be as high as 10. The players are subject to field-of-view sensing restrictions for both the viewing angle and viewing distance (Section III). We cast the game as a competitive Partially Observable Stochastic Game (POSG) and use the MARL algorithm multi-agent deep deterministic policy gradient (MADDPG) [5] to simultaneously train pursuer and evader policies. Our problem formulation and proposed approach do not require any pretrained or designed expert evader policy and generalize over varying agent constraints.

Our contributions are as follows:

  • We develop a curriculum strategy for MADDPG to solve a pursuit-evasion game with complex dynamics and sensing constraints. We show that our method is able to train competitive pursuer and evader policies without the need for expert policies.

  • We provide insights on how the pursuit-evasion game and its results evolve as the player dynamic and sensor constraints are varied.

  • In simulation experiments we individually compare pursuer and evader policies against existing baselines.

  • We deploy learned policies on physical robots for a game between the F1TENTH [6] and JetRacer [7] platforms moving at high speed and showcase a successful policy transfer to the real world.

Overall, our results show the potential of multi-agent reinforcement learning in navigating complex pursuit-evasion scenarios within dynamic environments and demonstrate that simulation-trained policies are directly transferable to real autonomous agents.

II Related Work

In this section, we review relevant pursuit-evasion and MARL literature.

II-A Pursuit-Evasion

The mathematical origins of the pursuit-evasion problem can be traced back to the 1930s as the Lion and Man problem. In the original formulation, a lion and a human are in a circular area where the lion tries to catch the human as quickly as possible while the human tries to avoid the lion for as long as possible. Since then, many variants of this game have been studied in the robotics literature [8]. In his seminal work [3], Isaacs presented a differential-game-based formulation of the pursuit-evasion problem, which has served as the basis of game-theoretic approaches.

In control-theoretic formulations, both the pursuer and evader in a pursuit-evasion problem have constraints on maximum velocities. One prominent example is the Homicidal chauffeur problem formulated by Isaacs [3], which is a special case of the pursuit-evasion problem where the pursuer has nonholonomic constraints. Since then, numerous works have studied pursuit-evasion games with dynamic constraints. In [9], the authors study the “Suicidal Pedestrian problem” inspired by Isaacs. Ruiz and Murrieta-Cid [10] analyze a scenario with a differential drive, faster evader against an omnidirectional but slower pursuer. In [11] Scott and Leonard derive and analyze optimal control strategies for speed, turning rate, lateral acceleration and movement direction-constrained evaders. In [12] the authors propose a single-agent reinforcement learning approach to the multi-pursuer with unicycle constraints scenario against a single, omnidirectional evader. Similarly, in [13] Yang et al. study a multi-pursuer, multi-evader version of pursuit-evasion problem with unicycle constraints and propose a MARL state processing method to improve the scalability.

If the players cannot observe each other at all times, for example, if the game is played in a complex environment under sensing limitations, the pursuit-evasion game has been defined as a combination of two subcomponents: 1) locating the evader (the search problem) and 2) capturing the evader. From the pursuer's perspective, the search problem appears once sensor constraints are introduced to the pursuit-evasion problem. From the evader's perspective, optimal strategies depend on the environment as well as the sensor constraints of the agents in the pursuit-evasion game. The problem of planning paths to reach a state with an unobstructed view of the evader was first presented in [14]. In [1], it was shown that in a simply connected polygon of $n$ vertices, $\Theta(\log n)$ pursuers are necessary and sufficient to detect an unpredictable evader. Numerous studies since then have formulated different variations of the pursuit-evasion problem [15, 16, 17], with [15] studying the case of a pursuer with a limited field of view and [16] the case of unreliable sensor data. In [2], Isler et al. propose a randomized algorithm to locate the evader with a single pursuer and extend this strategy to two pursuers with line-of-sight visibility to capture the evader in simply connected polygons. In later work [18], Noori and Isler showed that a single pursuer can locate and capture the evader in monotone polygons. In [19], Engin et al. proposed a single-agent reinforcement learning method with a compact belief state representation for pursuit-evasion games with visibility constraints in simply connected polygons. In [20], Bajcsy et al. studied the pursuit-evasion problem for quadrupedal robots under real-world physical and visibility constraints and proposed a teacher-student network pair to deal with the issues caused by partial observability in the problem definition.

In this work, we investigate a game between players with complex dynamics and with sensing limitations.

II-B Dynamic Games & MARL

Isaacs’ [3] and other game-theoretic formulations [21, 22] have long influenced robotics, controls and reinforcement learning research [23, 24, 25, 26]. Significant progress has been made using MARL algorithms [5, 27, 28, 29] in settings where large-scale simulations are feasible, such as StarCraft, Dota 2, Go and Chess [30, 31, 32, 33]. Although recent works demonstrate successful deployment of reinforcement learning methods on physical systems [20, 34], joint learning in game settings followed by deployment on physical systems remains a challenge for the robotics community.

III Problem Formulation

In this section, we introduce the relevant terminology and models used in our problem definition.

III-A Agent and Environment Models

Figure 2: Car kinematic bicycle model and sensor representations: For simulated experiments, the pursuer uses the kinematic bicycle model presented in (a). For physical experiments, both agents use the kinematic bicycle model with different parameters. All agents use the same sensor model presented in (b).

The pursuer kinematic model, which is represented by the state-space model [35], is given in Eq. (1). It consists of the following states: $x_1 = s_x$, $x_2 = s_y$, $x_3 = \delta$, $x_4 = v$, $x_5 = \Psi$.

$$
\begin{aligned}
\dot{x}_1 &= x_4 \cos(x_5) \\
\dot{x}_2 &= x_4 \sin(x_5) \\
\dot{x}_3 &= f_{steer}(x_3, u_1) \\
\dot{x}_4 &= f_{acc}(x_4, u_2) \\
\dot{x}_5 &= \frac{x_4}{l_f + l_r} \tan(x_3)
\end{aligned}
\tag{1}
$$

where $s_x$ and $s_y$ are the coordinates in meters in the global frame, $\delta$ is the steering angle in radians, $v$ is the longitudinal velocity in m/s, and $\Psi$ is the global yaw angle at the vehicle center in radians. The inputs are $u_1$, representing the steering velocity $\dot{\delta}$ in rad/s, and $u_2$, representing the longitudinal acceleration $\dot{v}$ in m/s$^2$. System steering and acceleration input constraints are represented by $f_{steer}$ and $f_{acc}$, respectively. The car model parameter $l_f + l_r$ is the vehicle wheelbase, where $l_f$ is the distance of the front axle from the center of gravity and $l_r$ is the distance of the rear axle from the center of gravity, both in meters. See Fig. 2a.
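
As a rough illustration of Eq. (1), the following Python sketch advances the pursuer state by one explicit-Euler step. The integration scheme, the function name `bicycle_step`, the step size, all numeric limits, and the treatment of $f_{steer}$ and $f_{acc}$ as simple clamps are assumptions made for this example; they are not taken from the paper.

```python
import math


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def bicycle_step(state, u, wheelbase, dt=0.1,
                 max_steer=0.4, max_speed=2.0,
                 max_steer_rate=3.0, max_acc=5.0):
    """Advance the state (s_x, s_y, delta, v, psi) of Eq. (1) by one Euler step.

    u = (u1, u2) holds the steering velocity and longitudinal acceleration.
    All limit values are hypothetical placeholders standing in for f_steer
    and f_acc.
    """
    s_x, s_y, delta, v, psi = state
    # Assumed stand-in for f_steer and f_acc: clamp the raw inputs.
    u1 = clamp(u[0], -max_steer_rate, max_steer_rate)
    u2 = clamp(u[1], -max_acc, max_acc)

    # Right-hand side of Eq. (1), evaluated at the current state.
    new_s_x = s_x + dt * v * math.cos(psi)
    new_s_y = s_y + dt * v * math.sin(psi)
    new_delta = clamp(delta + dt * u1, -max_steer, max_steer)
    new_v = clamp(v + dt * u2, -max_speed, max_speed)
    new_psi = psi + dt * v / wheelbase * math.tan(delta)
    return (new_s_x, new_s_y, new_delta, new_v, new_psi)
```

With zero inputs and a zero steering angle, a car heading along the x-axis simply advances by `dt * v` meters and keeps its heading; a nonzero steering angle changes the yaw at a rate that grows as the wheelbase shrinks.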

The evader is based on the point-mass state-space model [35] defined in Eq. (2), consisting of the following states: $x_1 = s_x$, $x_2 = s_y$, $x_3 = \dot{s}_x$, $x_4 = \dot{s}_y$. It uses the same global frame of reference, with the inputs $u_1$ and $u_2$ representing the accelerations $\ddot{s}_x$ and $\ddot{s}_y$ along the global x- and y-axes in m/s$^2$.

$$
\begin{aligned}
\dot{x}_1 &= x_3 \\
\dot{x}_2 &= x_4 \\
\dot{x}_3 &= u_1 \\
\dot{x}_4 &= u_2
\end{aligned}
\tag{2}
$$
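
The point-mass dynamics of Eq. (2) can be sketched in the same spirit. The explicit-Euler discretization, the function name `point_mass_step` and the default step size are assumptions of this example, and no velocity or acceleration limits are applied here.

```python
def point_mass_step(state, u, dt=0.1):
    """Advance the state (s_x, s_y, v_x, v_y) of Eq. (2) by one Euler step.

    u = (u1, u2) holds the accelerations along the global x- and y-axes.
    """
    s_x, s_y, v_x, v_y = state
    u1, u2 = u
    # Positions integrate the velocities; velocities integrate the inputs.
    return (s_x + dt * v_x, s_y + dt * v_y, v_x + dt * u1, v_y + dt * u2)
```

Unlike the car-like model, the point mass can change its direction of motion without any turning constraint, which is what makes it a more agile evader.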

Both the pursuer and the evader use the same sensor model, represented by a wedge (a slice of a circle) centered at the agent's center of mass. The model has two parameters, which may differ between the agents: the field-of-view angle $V^i_\alpha$ (rad) and the range $V^i_r$ (m), as shown in Fig. 2b. An agent sees its opponent if and only if the opponent is inside the agent's sensor footprint. The sensor footprint of an agent $i$ is denoted by $V^i \subseteq \mathcal{E}$ with $i \in \{p, e\}$. Therefore, the pursuer $p$ sees the evader $e$ if $e \in V^p$, and vice versa. The sensor footprint of an agent is also presented in Fig. 2b.
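
A minimal sketch of the wedge-shaped visibility test is given below. It assumes that the wedge is symmetric about a heading direction supplied by the caller and that the opponent is treated as a point; the function name `sees` and its signature are hypothetical.

```python
import math


def sees(observer_xy, observer_heading, target_xy, fov_angle, fov_range):
    """Return True if target_xy lies inside the observer's sensor wedge.

    fov_angle is the full opening angle of the wedge in radians and
    fov_range is its radius in meters. The wedge is assumed to be centered
    on observer_heading.
    """
    dx = target_xy[0] - observer_xy[0]
    dy = target_xy[1] - observer_xy[1]
    distance = math.hypot(dx, dy)
    if distance > fov_range:
        return False
    if distance == 0.0:
        return True  # co-located points are trivially visible
    # Bearing of the target relative to the heading, wrapped to [-pi, pi].
    bearing = math.atan2(dy, dx) - observer_heading
    bearing = math.atan2(math.sin(bearing), math.cos(bearing))
    return abs(bearing) <= fov_angle / 2.0
```

Setting `fov_angle` to $2\pi$ reduces the test to a pure range check, which corresponds to an omnidirectional sensor.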

We denote the rectangular workspace by $\mathcal{E}$, a subset of the 2D Cartesian coordinate system defined by the minimum and maximum values along each axis, $\{X_{low}, X_{high}, Y_{low}, Y_{high}\} \subset \mathbb{R}$. $\mathcal{E}$ is assumed to be obstacle-free and fully traversable. The Euclidean distance between two points in a workspace $\mathcal{E}$ of arbitrary size is denoted by $d(z_1, z_2)$ with $z_1, z_2 \in \mathbb{R}^2$. All distances are defined in meters.

III-B Game Formulation

Within the scope of this paper, the pursuit-evasion game is defined as a POSG. However, it must be noted that, without added stochastic noise, the agent dynamics described in Eq. (1) and Eq. (2) are fully deterministic. The formal definition of a POSG is given in Definition 1, as defined by Terry et al. [36] and inspired by Shapley [21].

Definition 1

A Partially Observable Stochastic Game (POSG) is a tuple $\langle \mathcal{S}, s_0, N, \mathcal{A}^i, P, R^i, \Omega^i, O^i \rangle$, where:

  • $\mathcal{S}$ is the set of possible states.

  • $s_0$ is the initial state.

  • $N$ is the number of agents. The set of agents is denoted by $[N]$.

  • $\mathcal{A}^i$ is the set of possible actions for agent $i$ with $i \in [N]$.

  • $P : \mathcal{S} \times \prod_i \mathcal{A}^i \times \mathcal{S} \rightarrow [0, 1]$ is the transition function. It has the property that for all $s \in \mathcal{S}$ and for all $(a^1, a^2, \ldots, a^N) \in \prod_i \mathcal{A}^i$, $\sum_{s' \in \mathcal{S}} P(s, a^1, a^2, \ldots, a^N, s') = 1$.

  • $R^i : \mathcal{S} \times \prod_i \mathcal{A}^i \times \mathcal{S} \rightarrow \mathbb{R}$ is the reward function for agent $i$.

  • $\Omega^i$ is the set of possible observations for agent $i$.

  • $O^i : \mathcal{A}^i \times \mathcal{S} \times \Omega^i \rightarrow [0, 1]$ is the observation function. It has the property that $\sum_{\omega \in \Omega^i} O^i(a, s, \omega) = 1$ for all $a \in \mathcal{A}^i$ and $s \in \mathcal{S}$.

We are given a rectangular workspace $\mathcal{E}$. The positions of the agents at timestep $t$ are defined as $p_t$ and $e_t$ for the pursuer and the evader, respectively. At each timestep, both agents make their moves simultaneously. The pursuer captures the evader if $d(p_t, e_t) \leq 2c$, with $c$ being each agent's radius. The pursuer wins the game if a capture occurs within an allocated timestep limit $\mathcal{T}$. Timeouts are considered evader victories.

IV Method

The objective of each agent is to maximize its total expected discounted reward $G^i = \sum_{t=0}^{\mathcal{T}} \gamma^t r^i_t$ over the time horizon $\mathcal{T}$, where $r^i_t$ is provided by the environment at timestep $t$ and $\gamma \in [0, 1]$ is the discount factor. Per-agent actions are sampled from the agent policy networks, $a^i_t \sim \pi^i_\theta(s^i_t)$.
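
For concreteness, the return of a single finished episode can be computed as in the sketch below. It assumes the standard discounted form in which the reward at timestep $t$ is weighted by the discount factor raised to the power $t$; the function name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for one episode."""
    total = 0.0
    # Accumulate from the last reward backwards (Horner's scheme), so that
    # each earlier reward is discounted one step less than the later ones.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

For example, three unit rewards with `gamma = 0.5` give 1 + 0.5 + 0.25 = 1.75, and `gamma = 1` recovers the undiscounted sum.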

IV-A Multi-Agent Deep Reinforcement Learning

We use the MADDPG algorithm [5], which follows the Centralized Training, Decentralized Execution (CTDE) principle with a centralized critic, to train the agents from scratch by making them play against each other.

IV-B Space Representations

The hidden full state of the pursuit-evasion game between a kinematic car-like pursuer and a point-mass evader is a 9-vector $s \coloneqq [x^p_1, x^p_2, x^p_3, x^p_4, x^p_5, x^e_1, x^e_2, x^e_3, x^e_4]^{\mathsf{T}}$. Without loss of generality, the state vector comprises the state variables presented in Section III-A. Time indices are dropped for brevity.
Each agent in the pursuit-evasion game creates its own observation vector $o^i$ through partial observations from the environment. Observations are represented as 11-vectors of floating-point numbers, formed by appending a binary observation flag (stored as a floating-point number) indicating the visibility of the opponent and a time index. If an agent cannot observe its opponent at a timestep, the corresponding values in $o^i$ are zeroed and the binary flag is set to negative. Observation vectors are normalized to contain values in the interval $[-1, 1]$. Input/action representations follow the same convention described in Section III-A and are represented as 2-vectors normalized to the interval $[-1, 1]$ using the acceleration, velocity, steering angle velocity and steering angle limits. Both the state and the action spaces are continuous.
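
One plausible way to assemble such an observation is sketched below. The sketch assumes that the 11 entries are the 9 state variables followed by the visibility flag and the time index, that the states passed in are already normalized to $[-1, 1]$, that the flag is encoded as $\pm 1$, and that the time index is scaled linearly over the episode. None of these layout details is confirmed by the text, and the function name `build_observation` is hypothetical.

```python
def build_observation(pursuer_state, evader_state, agent,
                      opponent_visible, t, max_steps):
    """Build an 11-entry observation for agent 'p' (pursuer) or 'e' (evader).

    pursuer_state has 5 entries and evader_state has 4 entries, both assumed
    to be normalized to [-1, 1] by the caller.
    """
    p = list(pursuer_state)
    e = list(evader_state)
    if not opponent_visible:
        # Zero the entries that belong to the unobserved opponent.
        if agent == "p":
            e = [0.0] * len(e)
        else:
            p = [0.0] * len(p)
    flag = 1.0 if opponent_visible else -1.0  # assumed +/-1 encoding
    time_index = 2.0 * t / max_steps - 1.0    # assumed linear scaling
    return p + e + [flag, time_index]
```

Zeroing alone would be ambiguous, because a normalized value of zero is also a valid observation; the flag lets the policy distinguish "opponent at the origin" from "opponent not visible".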

IV-C Reward Structure

We formulate the pursuit-evasion problem as a competitive zero-sum POSG. Players have opposite rewards at each timestep, that is, $r^p_t + r^e_t = 0$. At each timestep, the pursuer receives the following reward:

$$
r^p_t =
\begin{cases}
-\lambda_{capture}, & \text{if } t \geq \mathcal{T} \\
\lambda_{capture}, & \text{if } d \leq 2c \text{ and } t < \mathcal{T} \\
-(\lambda_t + \lambda_d \cdot d), & \text{otherwise}
\end{cases}
$$

where $\lambda_{capture}$, $\lambda_t$ and $\lambda_d$ represent the coefficients for the capture reward, the per-timestep reward and the distance reward, respectively. Each episode has two independent reset conditions: the capture condition (pursuer wins) or the timeout (evader wins). $d$ represents the Euclidean distance between the agents as described in Section III.
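
The piecewise reward translates directly into code. In the sketch below the function names and the agent radius passed by the caller are hypothetical, while the default coefficients mirror the values reported for training ($\lambda_{capture} = 1000$, $\lambda_t = 1$, $\lambda_d = 1$).

```python
def pursuer_reward(d, t, max_steps, radius,
                   lam_capture=1000.0, lam_t=1.0, lam_d=1.0):
    """Pursuer reward at timestep t for inter-agent distance d (meters)."""
    if t >= max_steps:
        return -lam_capture         # timeout: the evader wins
    if d <= 2.0 * radius:
        return lam_capture          # capture before the timeout
    return -(lam_t + lam_d * d)     # per-step time and distance penalty


def evader_reward(d, t, max_steps, radius, **coefficients):
    """Zero-sum counterpart: the evader receives the negated pursuer reward."""
    return -pursuer_reward(d, t, max_steps, radius, **coefficients)
```

The dense term penalizes the pursuer for every elapsed timestep and for the remaining distance, so the evader is implicitly rewarded both for surviving longer and for staying far away.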

V Simulation Experiments

Figure 3: Learning regime for both agents: Pursuer initially starts out by losing almost every episode, after a few thousand episodes the performance catches up and pursuer starts to dominate. In response, the evader develops its own counter strategies; resulting in an oscillatory reward profile with evader being slightly better on average.

In this section, we analyze our method by evaluating it against pursuer and evader baselines in a series of experiments. We have designed our experiments to answer the following questions: Question 1: How well do trained pursuer and evader agents perform against baseline methods? Question 2: Qualitatively, are there any interesting emerging behaviors? Question 3: How do the model parameters affect agents’ performances?

Figure 4: 250 initial positions sampled uniformly at random. Initial positions are transformed with respect to the pursuer frame of reference. Colors indicate the normalized Time-to-Capture. Evader starting points that are farther away from the pursuer have a higher Time-to-Capture.

As the pursuer baseline, we use a pure pursuit controller [37], slightly modified to handle observations where the evader is not visible. If the pursuer loses sight of the evader after observing it for at least one timestep, the maximum steering velocity action is applied in an arbitrary direction (positive or negative) for 25 timesteps. If the evader is still not detected, the pursuer conducts a random walk, changing the input steering velocity value every 8th timestep to a value sampled uniformly at random from the normalization interval. The velocity is held at the maximum forward velocity. For the evader, we use three different baselines: a random walk algorithm for the point-mass evader, which samples linear acceleration action values uniformly at random from the normalization interval every 25 steps; a greedy evader that moves away from the pursuer upon detection and stands still otherwise; and the Rash evader from Quattrini Li et al. [38], which hides in an arbitrary corner until it detects the pursuer and then attempts to move to another corner.
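The fallback behavior of the pursuer baseline when the evader is out of view can be sketched as a small state machine. This is an illustrative reading of the description above, not the authors' code: the class name `LostTargetSteering` is hypothetical, the normalization interval is assumed to be $[-1, 1]$, and the pure pursuit steering used while the evader is visible is deliberately left out (the method returns `None` in that case).

```python
import random
from typing import Optional


class LostTargetSteering:
    """Steering-velocity fallback (normalized to [-1, 1]) when the evader is not visible."""

    def __init__(self, sweep_steps: int = 25, resample_every: int = 8,
                 seed: Optional[int] = None) -> None:
        self.sweep_steps = sweep_steps        # duration of the max-steering sweep
        self.resample_every = resample_every  # random-walk resampling period
        self.rng = random.Random(seed)
        self.was_visible = False
        self.sweep_left = 0
        self.sweep_cmd = 0.0
        self.walk_cmd = 0.0
        self.walk_count = 0

    def step(self, evader_visible: bool) -> Optional[float]:
        if evader_visible:
            # Pure pursuit takes over; cancel any ongoing sweep.
            self.was_visible = True
            self.sweep_left = 0
            return None
        if self.was_visible:
            # Sight was just lost: sweep at maximum steering velocity
            # in an arbitrary direction for a fixed number of timesteps.
            self.was_visible = False
            self.sweep_left = self.sweep_steps
            self.sweep_cmd = self.rng.choice((-1.0, 1.0))
        if self.sweep_left > 0:
            self.sweep_left -= 1
            return self.sweep_cmd
        # Random walk: resample the steering velocity every few timesteps.
        if self.walk_count % self.resample_every == 0:
            self.walk_cmd = self.rng.uniform(-1.0, 1.0)
        self.walk_count += 1
        return self.walk_cmd


# Usage: one sighting followed by 25 sweep steps and 16 random-walk steps.
ctrl = LostTargetSteering(seed=0)
first = ctrl.step(True)
sweep = [ctrl.step(False) for _ in range(25)]
walk = [ctrl.step(False) for _ in range(16)]
```

The forward velocity command is not modeled here, since the text states that it is simply held at its maximum.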

V-A Multi-Agent Training

During training, both agents start from scratch and improve their performance by playing against each other over 15000 episodes. We set $\Delta t=0.1\ s$ and the maximum number of timesteps to $\mathcal{T}=500$, with 2-frame stacking and skipping. We spawn 8 parallel vectorized environments to speed up training, using the MADDPG implementation and environment parallelization of the AgileRL deep reinforcement learning Python library [39]. The actor and critic networks are both multilayer perceptrons consisting of 2 hidden layers of size 128 each with ReLU activations. The workspace $\mathcal{E}$ is of size $(20,20)$ in meters along the x and y axes and is centered at $(0,0)$. Starting positions of the agents are sampled uniformly at random within $\mathcal{E}$. Reward coefficients are set as $\lambda_{capture}=1000$, $\lambda_{d}=1$, $\lambda_{t}=1$. We use a simple curriculum learning method which starts training with stronger sensor coverage for the pursuer, $V^{p}_{\alpha}=2\pi\ rad$ and $V^{p}_{r}=10\ m$, and linearly decays these values to the training model parameters over $\frac{1}{4}$ of the total number of training episodes.
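The sensor curriculum amounts to a clamped linear interpolation. The sketch below is a hedged illustration: the function names are hypothetical, the starting values come from the paragraph above, and the target values ($\frac{\pi}{2}\ rad$ and $7.5\ m$) are the training sensor parameters listed in the next paragraph. Whether the schedule is updated per episode or per environment step is not stated, so a per-episode update is assumed.

```python
import math
from typing import Tuple


def curriculum_value(episode: int, total_episodes: int, start: float,
                     target: float, decay_fraction: float = 0.25) -> float:
    """Linearly move from `start` to `target` over the first
    `decay_fraction` of training, then hold `target`."""
    decay_episodes = decay_fraction * total_episodes
    progress = min(max(episode / decay_episodes, 0.0), 1.0)
    return start + progress * (target - start)


def pursuer_sensor_schedule(episode: int,
                            total_episodes: int = 15000) -> Tuple[float, float]:
    """Return (field-of-view angle in rad, sensing range in m) for an episode."""
    fov = curriculum_value(episode, total_episodes, 2.0 * math.pi, math.pi / 2.0)
    sensing_range = curriculum_value(episode, total_episodes, 10.0, 7.5)
    return fov, sensing_range


# Usage: full coverage at the start, training parameters from episode 3750 on.
fov_start, range_start = pursuer_sensor_schedule(0)
fov_end, range_end = pursuer_sensor_schedule(3750)
```

With 15000 episodes, the decay ends after 3750 episodes, and the pursuer trains with its nominal sensor footprint for the remaining three quarters of training.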

For training, pursuer dynamic model parameters are set as follows: rear and front axle distances from the center-of-mass $l_{f},l_{r}=0.15\ m$, steering angle $\theta$ joint limits are $\pm 0.34\ rad$, steering angle velocity $\dot{\theta}$ limits are $\pm 3.2\ \frac{rad}{s}$. Minimum and maximum longitudinal velocities $v$ are $-1\ \frac{m}{s}$ and $2.5\ \frac{m}{s}$, respectively, with $\pm 2\ \frac{m}{s^{2}}$ maximum acceleration $\dot{v}$ value in both directions. Sensor parameters are set as $V^{p}_{\alpha}=\frac{\pi}{2}\ rad$ and $V^{p}_{r}=7.5\ m$.
Evader dynamic model parameters are: minimum and maximum velocities along the axes are $\pm 2\ \frac{m}{s}$ with $\pm 9.81\ \frac{m}{s^{2}}$ maximum acceleration $\dot{v}$ value in both directions.

Fig. 3 shows a sample plot of the averaged rewards of the agents during training. We observe a learning regime where the pursuer initially loses almost every episode. Eventually the pursuer's performance catches up. At that point, the evader develops its own counter-strategies, resulting in an oscillatory reward profile. The maximum reward achievable by the pursuer in an individual environment before a reset is $\lambda_{capture}=1000$ (instant capture). However, for plotting, the rewards are accumulated every 500 timesteps from 8 parallel environments, so each parallel environment can reset multiple times (indicating pursuer domination) before the evaluation timestep, resulting in a higher cumulative reward. The curriculum described earlier matters most during the pursuer's early low-reward regime: without it, the pursuer may never encounter the evader and may get stuck with very sparse rewards, unable to learn.

V-B Performance evaluation against baselines

In our first set of experiments we investigate the performance of our learned pursuer and evader models against baseline algorithms. In a multi-agent training scenario, it is not straightforward to find the best performing model by comparing rewards or similar metrics at the end of an episode, due to the oscillatory nature of the training regime. For example, a mediocre pursuer may achieve a deceptively high score after training against a relatively bad evader, or vice versa. To mitigate this problem, we create model checkpoints every 250 episodes and, after training, evaluate these checkpoints by setting them up against each other (one vs. 64 random samples). The best performing models are selected by their average success rates, with ties broken by average capture times.
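The selection rule (average success rate first, average capture time as the tie-breaker) can be sketched as a sort with a composite key. This is a simplified illustration for the pursuer side under stated assumptions: the function name, the checkpoint labels, and the episode format are hypothetical, and the sampling of 64 opposing checkpoints is assumed to have already produced the per-episode results.

```python
from statistics import mean
from typing import Dict, List, Tuple

# One evaluation episode: (captured?, normalized time-to-capture in [0, 1]).
Episode = Tuple[bool, float]


def rank_pursuer_checkpoints(results: Dict[str, List[Episode]]) -> List[str]:
    """Order checkpoint names from best to worst for the pursuer:
    higher capture rate first, ties broken by lower mean capture time."""

    def sort_key(name: str) -> Tuple[float, float]:
        episodes = results[name]
        capture_rate = mean(1.0 if captured else 0.0 for captured, _ in episodes)
        avg_time = mean(time for _, time in episodes)
        # Negate the rate so that an ascending sort puts the best first.
        return (-capture_rate, avg_time)

    return sorted(results, key=sort_key)


# Usage with made-up evaluation outcomes for three hypothetical checkpoints.
results = {
    "ep250": [(True, 0.4), (False, 1.0)],
    "ep500": [(True, 0.2), (False, 1.0)],
    "ep750": [(False, 1.0), (False, 1.0)],
}
ranking = rank_pursuer_checkpoints(results)
```

For evader checkpoints the same idea applies with the criteria reversed, since a lower capture rate and a longer time-to-capture favor the evader.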

| Evader Type | Time-to-Capture (Pure Pursuit) | Time-to-Capture (Learned) | Capture Rate (Pure Pursuit) | Capture Rate (Learned) |
|---|---|---|---|---|
| Random Walk | 0.53 ($\sigma$ = 0.35) | 0.35 ($\sigma$ = 0.28) | 0.77 | 0.96 |
| Greedy | 0.84 ($\sigma$ = 0.28) | 0.54 ($\sigma$ = 0.37) | 0.30 | 0.63 |
| Rash | 0.81 ($\sigma$ = 0.33) | 0.55 ($\sigma$ = 0.39) | 0.33 | 0.62 |
| Learned | 0.99 ($\sigma$ = 0.09) | 0.69 ($\sigma$ = 0.31) | 0.01 | 0.58 |

TABLE I: PURSUER CAPTURE TIMES AND RATES
Figure 5: Normalized Time-to-Capture values with respect to the pursuer and evader parameters (panels a to c). Pursuer policies are evaluated against the greedy evader policy, whereas evader policies are evaluated against the learned pursuer policy. The pursuer is able to make good use of its improved sensing capabilities; the evader makes rapid gains with velocity until it surpasses the default pursuer velocity.

We tested the pursuer and evader strategies learned with MADDPG against the baseline pursuer and evader algorithms, as well as against each other, over 250 episodes. We report the results of the best performing strategies in Table I. The results are normalized between 0 and 1. For the learned pursuer strategy, a lower Time-to-Capture value and a higher capture rate are better. For the learned evader strategy, the opposite holds. From Table I we see that the learned strategies outperform the baselines in each case, and that the learned evader performs better against the learned pursuer than any of the baseline evaders do (capture rate of 0.58 versus 0.62 to 0.96). Fig. 4 visualizes these results over 250 initial positions, sampled uniformly at random from $\mathcal{E}$. The top row visualizes the performance of the baseline pure pursuit controller, whereas the bottom row uses the learned pursuer strategy. Initial positions are transformed with respect to the pursuer frame of reference (therefore the axis ranges are doubled). Colors indicate the normalized Time-to-Capture: blue indicates a quicker capture and a stronger pursuer performance; red indicates a slower capture and a stronger evader performance.
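The two metrics in Table I can be computed from per-episode outcomes as sketched below. The exact normalization is not spelled out in the text, so this sketch makes two assumptions: capture times are divided by the maximum number of timesteps $\mathcal{T}=500$, and episodes that end in a timeout count as a normalized Time-to-Capture of 1. The function name `summarize_episodes` is hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple


def summarize_episodes(capture_steps: List[Optional[int]],
                       horizon: int = 500) -> Tuple[float, float, float]:
    """Return (mean normalized time-to-capture, its standard deviation,
    capture rate). `None` marks an episode that ended in a timeout."""
    # Assumption: a timeout is scored as the worst case for the pursuer, 1.0.
    normalized = [1.0 if step is None else step / horizon
                  for step in capture_steps]
    captures = sum(1 for step in capture_steps if step is not None)
    capture_rate = captures / len(capture_steps)
    return mean(normalized), pstdev(normalized), capture_rate


# Usage with made-up outcomes: two captures and two timeouts.
avg_ttc, std_ttc, rate = summarize_episodes([100, 250, None, None])
```

Under these assumptions a capture rate near zero forces the mean Time-to-Capture toward 1, which matches the pattern of the pure pursuit versus learned evader entry in Table I.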

V-C Effect of model parameters

Next, we evaluate our approach in the following scenario: pursuer policies are evaluated against the greedy evader policy, whereas evader policies are evaluated against the learned pursuer policy. We evaluate only the generalized inference performance, i.e., we do not retrain the agents for the specified parameters, and we keep all parameters other than the independent model parameter at the default settings described in Section V. Normalized Time-to-Capture values are shown in Fig. 5 with respect to the pursuer and evader parameters. Figures 5a and 5b demonstrate that the learned pursuer policy is able to make good use of its improved sensing capabilities, while its ability to exploit a higher forward velocity ends up being limited (Fig. 5c). We attribute this to the relatively limited workspace area: beyond $4\ \frac{m}{s}$ it takes less than 5 seconds to traverse an axis, and higher speeds reduce agility due to braking limits and finite sensor range. The learned evader policy, on the other hand, does not make good use of its improved sensor range in Fig. 5b, but makes rapid gains with velocity in Fig. 5c until it surpasses the default pursuer velocity. Qualitative analysis indicates that the learned evader strategy prefers hiding (close to corners) over outranging the pursuer's sensor, only moving away if the pursuer decides to check the corner the evader is currently in. This, however, comes with a significant drawback: the evader corners itself even with clear dynamic superiority and gets captured. This indicates that the qualitative nature of the learned strategies is strongly correlated with the model parameters.
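The workspace argument can be made concrete with a short calculation, using the $20\ m$ workspace side and the pursuer's $2\ \frac{m}{s^{2}}$ acceleration limit from Section V-A. The stopping figures assume constant deceleration at that limit, which is a simplification introduced here for illustration.

```latex
% Time to traverse one workspace axis at v = 4 m/s
t_{\text{traverse}} = \frac{20\ \mathrm{m}}{4\ \mathrm{m/s}} = 5\ \mathrm{s}

% Stopping time and distance from v = 4 m/s at a = 2 m/s^2 (assumed constant)
t_{\text{stop}} = \frac{v}{a} = \frac{4}{2} = 2\ \mathrm{s}, \qquad
d_{\text{stop}} = \frac{v^{2}}{2a} = \frac{16}{4} = 4\ \mathrm{m}
```

At $4\ \frac{m}{s}$ the pursuer would therefore need about a fifth of the workspace side just to come to a stop, which is consistent with the observation that higher forward velocities bring limited benefit in this arena.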

VI Real-World Results

For the real-world experiments, a learned pursuer policy was deployed against a human-controlled evader. The reinforcement learning model for the pursuer was deployed on an F1TENTH autonomous vehicle [6] with the system parameters identified in our previous work [40]. The human-controlled evader was an Nvidia JetRacer [7], a smaller and more nimble car-like vehicle. Due to real-world physical and system constraints, the state-space representation of the reinforcement learning model was slightly modified: linear velocity and steering angle commands were used (as opposed to their first-order derivatives), full observability was assumed (the area being too small to properly make use of the sensor footprint), and steering angles were removed from the observation state (since they are actions). Gaussian noise with zero mean and 5% standard deviation was added to the normalized (between -1 and 1) state and action vectors for robustness. For 6-DoF real-world state estimation, Phasespace X2E LED motion capture markers were attached to both vehicles, and finite differencing was used to calculate velocities at a frequency of 10 Hz. ROS2 was used as the middleware for interfacing between sensors, agents and the reinforcement learning models [41]. We demonstrate that trained reinforcement learning models are directly transferable to the real system and present two sample pursuit-evasion trajectories in Fig. 6. For further details on our physical experimental results, please refer to the submitted video attachment. These experiments demonstrate that the learned strategy can be executed on real systems.
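Two of the processing steps described above, velocity estimation by finite differencing and Gaussian noise injection, can be sketched as follows. This is a minimal illustration, not the deployed code: the function names are hypothetical, a backward difference between consecutive motion capture samples is assumed, and "5% standard deviation" is interpreted as $\sigma = 0.05$ in normalized units.

```python
import random
from typing import List, Optional, Sequence


def finite_difference_velocity(prev_pos: Sequence[float],
                               curr_pos: Sequence[float],
                               rate_hz: float = 10.0) -> List[float]:
    """Backward-difference velocity estimate from two consecutive positions."""
    dt = 1.0 / rate_hz
    return [(curr - prev) / dt for prev, curr in zip(prev_pos, curr_pos)]


def add_gaussian_noise(vector: Sequence[float], std: float = 0.05,
                       rng: Optional[random.Random] = None) -> List[float]:
    """Add zero-mean Gaussian noise to a normalized state or action vector."""
    generator = rng if rng is not None else random.Random()
    return [x + generator.gauss(0.0, std) for x in vector]


# Usage: a 0.1 m displacement along x between two 10 Hz samples is 1 m/s.
velocity = finite_difference_velocity([0.0, 0.0], [0.1, -0.2])
noisy_state = add_gaussian_noise([0.5, -0.25, 0.0], rng=random.Random(0))
```

The sketch does not clip the noisy vectors back to $[-1, 1]$, since the text does not say whether clipping was applied.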

Figure 6: Real system trajectories. We demonstrate that trained reinforcement learning models are directly transferable. Lighter colors have earlier timestamps; darker colors are later. For further details on our physical experimental results, please refer to the submitted video attachment.

VII Conclusion

In this paper, we explored a pursuit-evasion game where the pursuer, a car with five state variables, engages with either a similar car or a point mass evader with constraints on the sensing capabilities for both agents. We formulated the game as a zero-sum Partially Observable Stochastic Game (POSG) and developed a curriculum for the MADDPG algorithm to simultaneously obtain the pursuer and evader policies. By training the players against each other, our approach sidesteps the necessity for pretrained or expert-designed evader policies. We also demonstrated that simulation-trained policies are directly transferable to real autonomous agents. Through this exploration, we show the potential of multi-agent reinforcement learning in navigating complex pursuit-evasion scenarios within dynamic environments.

There are multiple avenues for future research. In our current formulation, we did not model uncertainty in the players' own states. Preliminary experiments indicate that the learned strategies are robust to small errors in state. As the arena gets bigger, and if the players do not have perfect global sensors, the uncertainty in state estimation may need to be addressed explicitly. Furthermore, if the game takes place in complex environments, the search component of the game becomes prominent. It is unclear whether current RL methods will be able to handle such scenarios, in which there will be long sequences with no reward. We will explore these exciting, yet challenging, issues in our future work.

References

  • [1] L. J. Guibas, J.-C. Latombe, S. M. LaValle, D. Lin, and R. Motwani, “A visibility-based pursuit-evasion problem,” International Journal of Computational Geometry & Applications, vol. 9, no. 04n05, pp. 471–493, 1999.
  • [2] V. Isler, S. Kannan, and S. Khanna, “Randomized pursuit-evasion in a polygonal environment,” IEEE Transactions on Robotics, vol. 21, no. 5, pp. 875–884, 2005.
  • [3] R. Isaacs, Differential games: a mathematical theory with applications to warfare and pursuit, control and optimization.   Courier Corporation, 1999, originally published in 1965.
  • [4] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [5] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” Advances in neural information processing systems, vol. 30, 2017.
  • [6] M. O’Kelly, H. Zheng, D. Karthik, and R. Mangharam, “F1tenth: An open-source evaluation environment for continuous control and reinforcement learning,” Proceedings of Machine Learning Research, vol. 123, 2020.
  • [7] NVIDIA-AI-IOT, “jetracer.” [Online]. Available: https://github.com/NVIDIA-AI-IOT/jetracer
  • [8] T. H. Chung, G. A. Hollinger, and V. Isler, “Search and pursuit-evasion in mobile robotics: A survey,” Autonomous robots, vol. 31, pp. 299–316, 2011.
  • [9] I. Exarchos, P. Tsiotras, and M. Pachter, “On the suicidal pedestrian differential game,” Dynamic Games and Applications, vol. 5, pp. 297–317, 2015.
  • [10] U. Ruiz and R. Murrieta-Cid, “A differential pursuit/evasion game of capture between an omnidirectional agent and a differential drive robot, and their winning roles,” International Journal of Control, vol. 89, no. 11, pp. 2169–2184, 2016.
  • [11] W. L. Scott and N. E. Leonard, “Optimal evasive strategies for multiple interacting agents with motion constraints,” Automatica, vol. 94, pp. 26–34, 2018.
  • [12] C. De Souza, R. Newbury, A. Cosgun, P. Castillo, B. Vidolov, and D. Kulić, “Decentralized multi-agent pursuit using deep reinforcement learning,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 4552–4559, 2021.
  • [13] H. Yang, P. Ge, J. Cao, Y. Yang, and Y. Liu, “Large scale pursuit-evasion under collision avoidance using deep reinforcement learning,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2023, pp. 2232–2239.
  • [14] I. Suzuki and M. Yamashita, “Searching for a mobile intruder in a polygonal region,” SIAM Journal on computing, vol. 21, no. 5, pp. 863–888, 1992.
  • [15] B. P. Gerkey, S. Thrun, and G. Gordon, “Visibility-based pursuit-evasion with limited field of view,” The International Journal of Robotics Research, vol. 25, no. 4, pp. 299–315, 2006.
  • [16] N. M. Stiffler, A. Kolling, and J. M. O’Kane, “Persistent pursuit-evasion: The case of the preoccupied pursuer,” in 2017 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2017, pp. 5027–5034.
  • [17] D. Shishika and V. Kumar, “Local-game decomposition for multiplayer perimeter-defense problem,” in 2018 IEEE conference on decision and control (CDC).   IEEE, 2018, pp. 2093–2100.
  • [18] N. Noori and V. Isler, “The lion and man game on polyhedral surfaces with boundary,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2014, pp. 1769–1774.
  • [19] S. Engin, Q. Jiang, and V. Isler, “Learning to play pursuit-evasion with visibility constraints,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2021, pp. 3858–3863.
  • [20] A. Bajcsy, A. Loquercio, A. Kumar, and J. Malik, “Learning vision-based pursuit-evasion robot policies,” arXiv preprint arXiv:2308.16185, 2023.
  • [21] L. S. Shapley, “Stochastic games,” Proceedings of the national academy of sciences, vol. 39, no. 10, pp. 1095–1100, 1953.
  • [22] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine learning proceedings 1994.   Elsevier, 1994, pp. 157–163.
  • [23] I. M. Mitchell, A. M. Bayen, and C. J. Tomlin, “A time-dependent hamilton-jacobi formulation of reachable sets for continuous dynamic games,” IEEE Transactions on automatic control, vol. 50, no. 7, pp. 947–957, 2005.
  • [24] K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” Handbook of reinforcement learning and control, pp. 321–384, 2021.
  • [25] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta, “Robust adversarial reinforcement learning,” in International Conference on Machine Learning.   PMLR, 2017, pp. 2817–2826.
  • [26] L. Buşoniu, R. Babuška, and B. De Schutter, “Multi-agent reinforcement learning: An overview,” Innovations in multi-agent systems and applications-1, pp. 183–221, 2010.
  • [27] J. Ackermann, V. Gabler, T. Osa, and M. Sugiyama, “Reducing overestimation bias in multi-agent domains using double centralized critics,” arXiv preprint arXiv:1910.01465, 2019.
  • [28] C. S. de Witt, T. Gupta, D. Makoviichuk, V. Makoviychuk, P. H. Torr, M. Sun, and S. Whiteson, “Is independent learning all you need in the starcraft multi-agent challenge?” arXiv preprint arXiv:2011.09533, 2020.
  • [29] T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, “Monotonic value function factorisation for deep multi-agent reinforcement learning,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 7234–7284, 2020.
  • [30] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al., “Grandmaster level in starcraft ii using multi-agent reinforcement learning,” Nature, vol. 575, no. 7782, pp. 350–354, 2019.
  • [31] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al., “Dota 2 with large scale deep reinforcement learning,” arXiv preprint arXiv:1912.06680, 2019.
  • [32] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al., “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
  • [33] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., “Mastering the game of go with deep neural networks and tree search,” nature, vol. 529, no. 7587, pp. 484–489, 2016.
  • [34] A. Brunnbauer, L. Berducci, A. Brandstätter, M. Lechner, R. Hasani, D. Rus, and R. Grosu, “Latent imagination facilitates zero-shot transfer in autonomous racing,” in 2022 international conference on robotics and automation (ICRA).   IEEE, 2022, pp. 7513–7520.
  • [35] M. Althoff, M. Koschi, and S. Manzinger, “Commonroad: Composable benchmarks for motion planning on roads,” in 2017 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2017, pp. 719–726.
  • [36] J. Terry, B. Black, N. Grammel, M. Jayakumar, A. Hari, R. Sullivan, L. S. Santos, C. Dieffendahl, C. Horsch, R. Perez-Vicente, et al., “Pettingzoo: Gym for multi-agent reinforcement learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 15 032–15 043, 2021.
  • [37] R. C. Coulter et al., Implementation of the pure pursuit path tracking algorithm.   Carnegie Mellon University, The Robotics Institute, 1992.
  • [38] A. Quattrini Li, R. Fioratto, F. Amigoni, and V. Isler, “A search-based approach to solve pursuit-evasion games with limited visibility in polygonal environments,” in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 2018, pp. 1693–1701.
  • [39] N. Ustaran-Anderegg and M. Pratt, “AgileRL.” [Online]. Available: https://github.com/AgileRL/AgileRL
  • [40] B. M. Gonultas, P. Mukherjee, O. G. Poyrazoglu, and V. Isler, “System identification and control of front-steered ackermann vehicles through differentiable physics,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2023, pp. 4347–4353.
  • [41] S. Macenski, T. Foote, B. Gerkey, C. Lalancette, and W. Woodall, “Robot operating system 2: Design, architecture, and uses in the wild,” Science Robotics, vol. 7, no. 66, p. eabm6074, 2022. [Online]. Available: https://www.science.org/doi/abs/10.1126/scirobotics.abm6074