Networked Communication for Decentralised Agents in Mean-Field Games

Patrick Benjamin
Department of Computer Science
University of Oxford
Oxford, United Kingdom
Alessandro Abate
Department of Computer Science
University of Oxford
Oxford, United Kingdom
Abstract

We introduce networked communication to the mean-field game framework, in particular to oracle-free settings where $N$ decentralised agents learn along a single, non-episodic run of the empirical system. We prove that our architecture, with only a few reasonable assumptions about network structure, has sample guarantees bounded between those of the centralised- and independent-learning cases. We discuss how the sample guarantees of the three theoretical algorithms do not actually result in practical convergence. We therefore show that in practical settings where the theoretical parameters are not observed (leading to poor estimation of the Q-function), our communication scheme significantly accelerates convergence over the independent case (and often even the centralised case), without relying on the assumption of a centralised learner. We contribute further practical enhancements to all three theoretical algorithms, allowing us to present their first empirical demonstrations. Our experiments confirm that we can remove several of the theoretical assumptions of the algorithms, and display the empirical convergence benefits brought by our new networked communication. We additionally show that the networked approach has significant advantages, over both the centralised and independent alternatives, in terms of robustness to unexpected learning failures and to changes in population size.

1 Introduction

The mean-field game (MFG) framework [1, 2] seeks to tackle the difficulty faced by multi-agent reinforcement learning (MARL) regarding computational scalability as the number of agents increases. It models a representative agent as interacting not with the other individuals in the population on a per-agent basis, but instead with a distribution of other agents, known as the mean field. The MFG framework analyses the limiting case when the population consists of an infinite number of symmetric and anonymous agents, that is, they have identical reward and transition functions which depend on the mean-field distribution rather than on the actions of specific other players. In this work we focus on MFGs with stationary population distributions (‘stationary MFGs’) [3, 4, 5, 6], for which the solution concept is the MFG-Nash equilibrium (MFG-NE), which reflects the situation when each agent responds optimally to the population distribution that arises when all other agents follow that same optimal behaviour. The MFG-NE can be used as an approximation for the Nash equilibrium (NE) in a finite-agent game, with the error in the solution reducing as the number of agents $N$ tends to infinity [4, 7, 8, 9, 10]. MFGs have been applied to a wide variety of real-world problems [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]; see Appx. F for further details.

For large, complex many-agent systems in the real world (e.g. swarm robotics or autonomous vehicle traffic), it may not be feasible to find MFG-NEs analytically or via oracles/simulations of an infinite population (as has traditionally been done), such that learning must instead be conducted directly by the finite population in its deployed environment. In such settings, in contrast to many previous approaches, desirable qualities for MFG algorithms include: learning from the empirical distribution of $N$ agents (without generation or manipulation of this distribution by the algorithm itself or by an external oracle); learning from a single continued system run that is not arbitrarily reset as in episodic learning; model-free learning; decentralisation; fast practical convergence; and robustness to unexpected failures of decentralised learners or changes in the size of the population [29].

Conversely, MFG frameworks have traditionally been largely theoretical, and methods for finding equilibria have often relied on assumptions that are too strong for real-world applications [2, 4, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]; see Appx. F for an extended discussion of this related work. In particular, almost all prior work relies on a centralised controller to orchestrate the learning of all agents [3, 4, 5, 32, 42]. However, outside of MFGs, the multi-agent systems community has recognised that the existence of a central controller is a very strong assumption, as well as one that can both restrict scalability by constituting a bottleneck for computation and communication, and reveal a single point of failure for the whole system [43, 44, 45, 46, 47, 48]. For example, if the single server coordinating all the autonomous vehicles in a smart city were to crash, the entire road network would cease to operate. As an alternative, recent work has explored MFG algorithms for independent learning [6, 49, 50, 51, 52, 53, 54, 55]. However, prior works focus on theoretical sample guarantees rather than on practical convergence speed, and have largely not considered robustness in the senses we address, despite fault-tolerance being one of the original motivations behind many-agent systems.

We address all of these desiderata by novelly introducing a communication network to the MFG setting. Communication networks have had success in other multi-agent settings, removing the reliance on inflexible, centralised structures [43, 45, 46, 47, 56, 57, 58, 59, 60]. We focus on ‘coordination games’, i.e. where agents can increase their individual rewards by following the same strategy as others and therefore have an incentive to communicate policies, even if the MFG setting itself is technically non-cooperative. Thus our work can be applied to real-world problems in e.g. traffic signal control, formation control in swarm robotics, and consensus and synchronisation e.g. for sensor networks.

In this work, we show that when the agents’ state-action value functions (Q-functions) can be only roughly estimated due to fewer samples/updates, possibly leading to high variance in policy updates, then propagating policies that are estimated to be better through the population via the communication network leads to faster convergence than that achieved by agents learning entirely independently. This is crucial in large complex environments that may be encountered in real applications, where the idealised hyperparameter choices (such as learning rates and numbers of iterations) required in previous works for theoretical convergence guarantees will be infeasible in practice. [Footnote 1: This work therefore also serves as a bridge with our ongoing work extending our algorithm with neural networks, for settings requiring (non-linear) function approximation.] We compare our networked architecture with modified versions of earlier theoretical algorithms for the centralised and independent settings; we extend the original algorithms with experience replay buffers, without which we found them unable to converge in practical time. While the use of buffers means that the original theoretical sample guarantees no longer apply, we argue that this is preferable since these guarantees were in any case impractical. On this basis, we conduct numerical comparisons of the three architectures, demonstrating the benefits of communication for both convergence speed and system robustness. For further discussion of how networked communication can benefit robustness in large multi-agent systems, see Appx. D. In summary, our key contributions include the following:

  • We prove that a theoretical version of our new networked algorithm (Alg. 1) has sample guarantees bounded between those of the centralised and independent settings for learning with a single, non-episodic run of the empirical system (Sec. 3.3).

  • None of the three theoretical algorithms permits convergence in practical time; we show that in practical settings our communication scheme can significantly benefit convergence speed over the independent case, and often even the centralised case (Sec. 3.4.1).

  • We novelly modify all three theoretical algorithms (Alg. 2) to make their practical convergence feasible, most notably by including an experience replay buffer, allowing us to contribute the first empirical demonstrations of all three algorithms (Sec. 3.4.2).

  • Our experiments demonstrate the convergence benefits brought by our networked communication, and show we can remove several of the algorithms’ theoretical assumptions (a goal shared by other work on the practicality of MFG algorithms [61]) (Sec. 4.1).

  • We further demonstrate that our decentralised communication architecture brings significant benefits over both the centralised and independent alternatives in terms of robustness to unexpected learning failures and changes in population size (Sec. 4.1).

The main paper is structured as follows: notation and preliminaries are given in Sec. 2; the theoretical algorithms and results are given in Sections 3.1-3.3; practical enhancements to the algorithms are given in Sec. 3.4; experiments and discussion in Sec. 4. Limitations are found in Appx. G.

2 Preliminaries

We use the following notation. $N$ is the number of agents in a population, with $\mathcal{S}$ and $\mathcal{A}$ representing the finite state and common action spaces, respectively. The sets $\mathcal{S}$ and $\mathcal{A}$ are equipped with the discrete metric $d(x,y)=\mathbb{1}_{x\neq y}$. The set of probability measures on a finite set $\mathcal{X}$ is denoted $\Delta_{\mathcal{X}}$, and $\mathbf{e}_{x}\in\Delta_{\mathcal{X}}$ for $x\in\mathcal{X}$ is a one-hot vector with only the entry corresponding to $x$ set to 1, and all others set to 0. For time $t\geq 0$, $\hat{\mu}_{t} = \frac{1}{N}\sum^{N}_{i=1}\sum_{s\in\mathcal{S}}\mathbb{1}_{s^{i}_{t}=s}\mathbf{e}_{s} \in \Delta_{\mathcal{S}}$ is a vector denoting the empirical state distribution of the $N$ agents at time $t$.
The set of policies is $\Pi = \{\pi : \mathcal{S}\rightarrow\Delta_{\mathcal{A}}\}$, and the set of Q-functions is denoted $\mathcal{Q}=\{q:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}\}$. For $\pi,\pi^{\prime}\in\Pi$ and $q,q^{\prime}\in\mathcal{Q}$, we have the norms $||\pi-\pi^{\prime}||_{1} := \sup_{s\in\mathcal{S}}||\pi(s)-\pi^{\prime}(s)||_{1}$ and $||q-q^{\prime}||_{\infty} := \sup_{s\in\mathcal{S},a\in\mathcal{A}}|q(s,a)-q^{\prime}(s,a)|$.
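As a concrete illustration of this notation, the following minimal sketch computes the empirical distribution and the two norms for tabular objects. It assumes, purely for illustration, that states and actions are encoded as integers starting from 0, that a policy is a nested list indexed as pi[s][a], and that a Q-function is a nested list indexed as q[s][a]; the function names are hypothetical and are not taken from the paper.

```python
from collections import Counter


def empirical_distribution(agent_states, num_states):
    """Return mu_hat: the fraction of the N agents occupying each state."""
    counts = Counter(agent_states)
    n = len(agent_states)
    return [counts.get(s, 0) / n for s in range(num_states)]


def policy_distance(pi, pi_prime):
    """Sup over states of the L1 distance between the action distributions."""
    return max(
        sum(abs(p - q) for p, q in zip(pi[s], pi_prime[s]))
        for s in range(len(pi))
    )


def q_distance(q, q_prime):
    """Sup over state-action pairs of |q(s, a) - q'(s, a)|."""
    return max(
        abs(x - y)
        for row, row_prime in zip(q, q_prime)
        for x, y in zip(row, row_prime)
    )


# Four agents over three states: two in state 0, one each in states 1 and 2.
mu_hat = empirical_distribution([0, 2, 0, 1], num_states=3)
```

Because the state space is finite, each supremum in the norms reduces to a maximum over a finite set, which is what the sketch computes.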

Function $h:\Delta_{\mathcal{A}}\rightarrow\mathbb{R}_{\geq 0}$ denotes a strongly concave function, which we implement as the scaled entropy regulariser $\lambda h_{ent}(u)=-\lambda\sum_{a}u(a)\log u(a)$, for $a\in\mathcal{A}$, $u\in\Delta_{\mathcal{A}}$ and $\lambda>0$. As in some earlier works [4, 6, 40, 62, 63, 64], regularisation is theoretically required to ensure the contractivity of operators and continued exploration, and hence algorithmic convergence. However, it has been recognised that modifying the RL objective in this way can bias the NE [6, 10, 30, 65]. We show in our experiments that we are able to reduce $\lambda$ to 0 with no detriment to convergence.
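The scaled entropy regulariser can be written in a few lines. The sketch below is illustrative only: the function name is hypothetical, an action distribution is assumed to be a list of probabilities, and the usual convention that 0 log 0 = 0 is adopted.

```python
import math


def scaled_entropy(u, lam):
    """lam * h_ent(u) = -lam * sum_a u(a) log u(a), taking 0 log 0 as 0."""
    return -lam * sum(p * math.log(p) for p in u if p > 0.0)


# A deterministic action distribution earns no bonus; the uniform one earns the most.
deterministic_bonus = scaled_entropy([1.0, 0.0], lam=0.5)
uniform_bonus = scaled_entropy([0.5, 0.5], lam=0.5)
```

The bonus is largest for the uniform distribution and vanishes for a deterministic one, which is why it encourages continued exploration; setting lam to 0 removes the term from the objective entirely.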

Definition 1 (N-player symmetric anonymous games).

An $N$-player stochastic game with symmetric, anonymous agents is given by the tuple $\langle N, \mathcal{S}, \mathcal{A}, P, R, \gamma\rangle$, where $\mathcal{A}$ is the action space, identical for each agent; $\mathcal{S}$ is the identical state space of each agent, such that their initial states are $\{s^{i}_{0}\}_{i=1}^{N}\in\mathcal{S}^{N}$ and their policies are $\{\pi^{i}\}_{i=1}^{N}\in\Pi^{N}$. $P : \mathcal{S}\times\mathcal{A}\times\Delta_{\mathcal{S}}\rightarrow\Delta_{\mathcal{S}}$ is the transition function and $R : \mathcal{S}\times\mathcal{A}\times\Delta_{\mathcal{S}}\rightarrow[0,1]$ is the reward function, which map each agent’s local state and action and the population’s empirical distribution to transition probabilities and bounded rewards, respectively, i.e.

$$s^{i}_{t+1}\sim P(\cdot|s^{i}_{t},a^{i}_{t},\hat{\mu}_{t}),\;\;\;r^{i}_{t}=R(s^{i}_{t},a^{i}_{t},\hat{\mu}_{t}),\;\;\;\forall i=1,\dots,N.$$ ∎

The policy of an agent is given by $a^{i}_{t}\sim\pi^{i}(s^{i}_{t})$, that is, each agent only observes its own state, and not the joint state or empirical distribution of the population.
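One time step of the game in Definition 1 can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: states and actions are integers, P and R are passed in as plain functions returning a next-state distribution and a reward in [0, 1], and toy_P and toy_R are hypothetical dynamics invented for the example.

```python
import random


def step(states, policies, P, R, num_states, rng):
    """Advance all N agents by one time step of the symmetric anonymous game."""
    n = len(states)
    # Empirical state distribution of the population at the current time.
    mu_hat = [states.count(s) / n for s in range(num_states)]
    next_states, rewards = [], []
    for s, pi in zip(states, policies):
        # Each agent samples its action from its own local state only.
        a = rng.choices(range(len(pi[s])), weights=pi[s])[0]
        rewards.append(R(s, a, mu_hat))
        # Transitions and rewards depend on the population only through mu_hat.
        next_states.append(rng.choices(range(num_states), weights=P(s, a, mu_hat))[0])
    return next_states, rewards, mu_hat


def toy_P(s, a, mu):
    """Hypothetical dynamics: action a moves the agent to state a."""
    return [1.0 if s_next == a else 0.0 for s_next in range(len(mu))]


def toy_R(s, a, mu):
    """Hypothetical coordination-style reward: share of agents in the same state."""
    return mu[s]


rng = random.Random(0)
uniform = [[0.5, 0.5], [0.5, 0.5]]  # pi[s][a]
next_states, rewards, mu_hat = step([0, 0, 1, 0], [uniform] * 4, toy_P, toy_R, 2, rng)
```

Note that the policies never receive mu_hat as input; the empirical distribution enters only through the environment's transition and reward functions, matching the observation model described above.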

Definition 2 (N-player discounted regularised return).

With joint policies $\boldsymbol{\pi} := (\pi^{1},\dots,\pi^{N})\in\Pi^{N}$, initial states sampled from a distribution $\upsilon_{0}\in\Delta_{\mathcal{S}}$ and $\gamma\in[0,1)$ as a discount factor, the expected discounted regularised returns of each agent $i$ in the symmetric anonymous game are given by

$$\Theta^{i}_{h}(\boldsymbol{\pi},\upsilon_{0})=\mathbb{E}\left[\sum^{\infty}_{t=0}\gamma^{t}(R(s^{i}_{t},a^{i}_{t},\hat{\mu}_{t})+h(\pi^{i}(s^{i}_{t})))\bigg{|}\begin{subarray}{c}s^{j}_{0}\sim\upsilon_{0}\\ a^{j}_{t}\sim\pi^{j}(s^{j}_{t})\\ s^{j}_{t+1}\sim P(\cdot|s^{j}_{t},a^{j}_{t},\hat{\mu}_{t})\end{subarray},\forall t\geq 0,j\in\{1,\dots,N\}\right].$$ ∎

Definition 3 ($\delta$-NE).

Say $\delta>0$ and $(\pi,\boldsymbol{\pi}^{-i}) := (\pi^{1},\dots,\pi^{i-1},\pi,\pi^{i+1},\dots,\pi^{N})\in\Pi^{N}$. An initial distribution $\upsilon_{0}\in\Delta_{\mathcal{S}}$ and an $N$-tuple of policies $\boldsymbol{\pi} := (\pi^{1},\dots,\pi^{N})\in\Pi^{N}$ form a $\delta$-NE $(\boldsymbol{\pi},\upsilon_{0})$ if, $\forall i=1,\dots,N$,

$$\Theta^{i}_{h}(\boldsymbol{\pi},\upsilon_{0})\geq\max_{\pi\in\Pi}\Theta^{i}_{h}((\pi,\boldsymbol{\pi}^{-i}),\upsilon_{0})-\delta.$$ ∎

At the limit as $N\rightarrow\infty$, the population of infinitely many agents can be characterised as a limit distribution $\mu\in\Delta_{\mathcal{S}}$. We denote the expected discounted return of the representative agent in the infinite-agent game - termed an MFG - as $V$, rather than $\Theta$ as in the finite $N$-agent case.

Definition 4 (Mean-field discounted regularised return).

For a policy-population pair $(\pi,\mu)\in\Pi\times\Delta_{\mathcal{S}}$,

$$V_{h}(\pi,\mu)=\mathbb{E}\left[\sum^{\infty}_{t=0}\gamma^{t}(R(s_{t},a_{t},\mu)+h(\pi(s_{t})))\bigg{|}\begin{subarray}{c}s_{0}\sim\mu\\ a_{t}\sim\pi(s_{t})\\ s_{t+1}\sim P(\cdot|s_{t},a_{t},\mu)\end{subarray}\right].$$ ∎
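Because the population distribution is held fixed in Definition 4, the mean-field return for a small tabular problem can be approximated by standard iterative policy evaluation. The sketch below is one possible way to do this, assuming integer-coded states and actions, a policy indexed as pi[s][a], functions P and R with the signatures used in Definition 1, and the scaled entropy regulariser for h; the function names, the sweeps parameter and the toy dynamics are all hypothetical.

```python
import math


def mean_field_return(pi, mu, P, R, gamma, lam, sweeps=2000):
    """Approximate V_h(pi, mu) by repeated Bellman backups with mu held fixed."""
    num_states = len(mu)

    def h(u):
        # Scaled entropy regulariser, taking 0 log 0 as 0.
        return -lam * sum(p * math.log(p) for p in u if p > 0.0)

    v = [0.0] * num_states
    for _ in range(sweeps):
        v = [
            h(pi[s]) + sum(
                pi[s][a] * (R(s, a, mu) + gamma * sum(
                    p_next * v[s_next]
                    for s_next, p_next in enumerate(P(s, a, mu))))
                for a in range(len(pi[s])))
            for s in range(num_states)
        ]
    # The initial state is drawn from mu itself.
    return sum(mu[s] * v[s] for s in range(num_states))


def toy_P(s, a, mu):
    """Hypothetical dynamics: next state is uniform over two states."""
    return [0.5, 0.5]


def toy_R(s, a, mu):
    """Hypothetical constant reward of 1."""
    return 1.0


deterministic = [[1.0, 0.0], [0.0, 1.0]]
value = mean_field_return(deterministic, [0.3, 0.7], toy_P, toy_R, gamma=0.5, lam=0.0)
```

With a constant reward of 1, no regularisation and a discount factor of 0.5, the return is the geometric series 1/(1 - 0.5) = 2, which the iteration recovers up to truncation error.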

A stationary MFG is one in which a given policy induces a unique stable population distribution, and in which the agents’ policies are not time-dependent.

Definition 5 (NE of stationary MFG).

For a policy $\pi^{*}\in\Pi$ and a population distribution $\mu^{*}\in\Delta_{\mathcal{S}}$, the pair $(\pi^{*},\mu^{*})$ is a stationary MFG-NE if the following optimality and stability conditions hold:

$$\textrm{optimality:}\quad V_{h}(\pi^{*},\mu^{*})=\max_{\pi}V_{h}(\pi,\mu^{*}),$$
$$\textrm{stability:}\quad\mu^{*}(s)=\sum_{s^{\prime},a^{\prime}}\mu^{*}(s^{\prime})\pi^{*}(a^{\prime}|s^{\prime})P(s|s^{\prime},a^{\prime},\mu^{*}).$$

If the optimality condition is only satisfied with $V_{h}(\pi^{*}_{\delta},\mu^{*}_{\delta})\geq\max_{\pi}V_{h}(\pi,\mu^{*}_{\delta})-\delta$, then $(\pi^{*}_{\delta},\mu^{*}_{\delta})$ is a $\delta$-NE of the MFG, where $\mu^{*}_{\delta}$ is obtained from the stability equation and $\pi^{*}_{\delta}$. ∎
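The stability condition can be checked numerically for a candidate pair. The sketch below applies the right-hand side of the stability equation once (push_forward) and measures how far a distribution is from being a fixed point of it. It checks stability only, not optimality, and simple repeated application is not guaranteed to converge for arbitrary dynamics; the function names and the two-state dynamics toy_P are hypothetical.

```python
def push_forward(mu, pi, P):
    """One application of the right-hand side of the stability condition."""
    num_states = len(mu)
    new_mu = [0.0] * num_states
    for s_prev in range(num_states):
        for a_prev, p_action in enumerate(pi[s_prev]):
            next_probs = P(s_prev, a_prev, mu)
            for s in range(num_states):
                new_mu[s] += mu[s_prev] * p_action * next_probs[s]
    return new_mu


def stability_residual(mu, pi, P):
    """L1 gap between mu and its push-forward; zero iff mu is stable under pi."""
    return sum(abs(x - y) for x, y in zip(mu, push_forward(mu, pi, P)))


def toy_P(s, a, mu):
    """Hypothetical dynamics: action 0 stays put, action 1 switches state."""
    target = s if a == 0 else 1 - s
    return [1.0 if s_next == target else 0.0 for s_next in range(2)]


pi = [[0.9, 0.1], [0.6, 0.4]]  # pi[s][a]
mu = [0.5, 0.5]
for _ in range(200):
    mu = push_forward(mu, pi, toy_P)
```

In this toy example the induced chain leaves state 0 with probability 0.1 and state 1 with probability 0.4, so balancing the two flows gives the stable distribution (0.8, 0.2), which the iteration approaches.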

The MFG-NE is an approximate NE of the $N$-player game, which is difficult to solve in itself [6, 30]:

Proposition 1 (N-player NE and MFG-NE (Thm. 1, [4])).

If $(\pi^{*},\mu^{*})$ is an MFG-NE, then, under certain Lipschitz conditions [4], for any $\delta>0$, there exists $N(\delta)\in\mathbb{N}_{>0}$ such that, for all $N\geq N(\delta)$, the joint policy $\boldsymbol{\pi}=\{\pi^{*},\pi^{*},\dots,\pi^{*}\}\in\Pi^{N}$ is a $\delta$-NE of the $N$-player game. ∎

Remark 1.

It can be shown that $\delta$ can be characterised further in terms of $N$, with $(\pi^{*},\mu^{*})$ being an $\mathcal{O}(\frac{1}{\sqrt{N}})$-NE of the $N$-player symmetric anonymous game [6]. ∎

For our new, networked learning algorithm, we also introduce the concept of a time-varying communication network, where the links between agents that make up the network may change at each time step $t$. Most commonly we might think of such a network as depending on the spatial locations of decentralised agents, such as physical robots, which can communicate with neighbours that fall within a given broadcast radius. When the agents move in the environment, their neighbours and therefore communication links may change. However, the dynamic network can also depend on other factors that may or may not depend on each agent’s state $s^{i}_{t}$. For example, even a network of fixed-location agents can change depending on which agents are active and broadcasting at a given time $t$, or if their broadcast radius changes, perhaps in relation to signal or battery strength.

Definition 6 (Time-varying communication network).

The time-varying communication network $\{\mathcal{G}_{t}\}_{t\geq 0}$ is given by $\mathcal{G}_{t} = (\mathcal{N},\mathcal{E}_{t})$, where $\mathcal{N}$ is the set of vertices each representing an agent $i=1,\dots,N$, and the edge set $\mathcal{E}_{t}\subseteq\{(i,j) : i,j\in\mathcal{N}, i\neq j\}$ is the set of undirected communication links by which information can be shared at time $t$. ∎

We say a network is connected if there is a sequence of distinct edges that form a path between each distinct pair of vertices. The union of a collection of graphs $\{\mathcal{G}_t,\mathcal{G}_{t+1},\cdots,\mathcal{G}_{t+\omega}\}$ (for $\omega\in\mathbb{N}$) is the graph with vertices and edge set equalling the union of the vertices and edge sets of the graphs in the collection [66]. The collection is jointly connected if the union of its members is connected.
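The distinction between connectedness and joint connectedness can be sketched as follows. The helper names `is_connected` and `is_jointly_connected` and the example edge sets are hypothetical; the check itself is a plain breadth-first search on the union graph.

```python
from collections import deque


def is_connected(vertices, edges):
    """Breadth-first search: True if every vertex is reachable from the first."""
    vertices = list(vertices)
    if not vertices:
        return True
    neighbours = {v: set() for v in vertices}
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, queue = {vertices[0]}, deque([vertices[0]])
    while queue:
        v = queue.popleft()
        for w in neighbours[v] - seen:
            seen.add(w)
            queue.append(w)
    return len(seen) == len(vertices)


def is_jointly_connected(vertices, edge_sets):
    """True if the union of the edge sets yields a connected graph."""
    union = set().union(*edge_sets)
    return is_connected(vertices, union)


agents = [1, 2, 3, 4]
# Hypothetical edge sets for G_t, G_{t+1}, G_{t+2}; none is connected alone.
collection = [{(1, 2)}, {(2, 3)}, {(3, 4)}]
each_connected = [is_connected(agents, e) for e in collection]
jointly = is_jointly_connected(agents, collection)
```

In this example no single graph is connected, yet the collection is jointly connected because the union of the three edge sets forms a path through all four agents.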

3 Learning with networked, decentralised agents

Summary We first discuss theoretical versions of our operators and algorithm (Sections 3.1, 3.2) to show that our networked framework has sample guarantees bounded between those of the centralised- and independent-learning cases (Sec. 3.3). We then show that our novel incorporation of an experience replay buffer (Sec. 3.4.2), along with networked communication, means that empirically we can remove many of the theoretical assumptions and practically infeasible hyperparameter choices that are required by the sample guarantees of the theoretical algorithms. In these practical settings we demonstrate that our networked algorithm can significantly outperform the independent algorithm, and sometimes even the centralised one (Sec. 4).

3.1 Learning with $N$ agents from a single run

We begin by outlining the basic procedure for solving the MFG using the $N$-agent empirical distribution and a single, continuous system run. The two underlying operators are the same for the centralised, independent and networked architectures; in the latter two cases all agents apply the operators individually, while in the centralised setting a single central agent estimates the Q-function and computes an updated policy that is pushed to all the other agents.

We define, for $h_{max}>0$ and $h:\Delta_{\mathcal{A}}\rightarrow[0,h_{max}]$, $u_{max}\in\Delta_{\mathcal{A}}$ such that $h(u_{max})=h_{max}$. We further define $q_{max}:=\frac{1+h_{max}}{1-\gamma}$, and set $\pi_{max}\in\Pi$ such that $\pi_{max}(s)=u_{max},\forall s\in\mathcal{S}$. For any $\Delta h\in\mathbb{R}_{>0}$, we also define the convex set $\mathcal{U}_{\Delta h}:=\{u\in\Delta_{\mathcal{A}}:h(u)\geq h_{max}-\Delta h\}$.
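As a worked illustration of these quantities, suppose that the regulariser is the Shannon entropy scaled by some $\lambda>0$. This specific form of $h$ and the numbers used afterwards are assumptions made only for this example.

```latex
h(u) = -\lambda \sum_{a\in\mathcal{A}} u(a)\log u(a), \qquad
u_{max}(a) = \frac{1}{|\mathcal{A}|}\;\;\forall a\in\mathcal{A}, \qquad
h_{max} = \lambda\log|\mathcal{A}|, \qquad
q_{max} = \frac{1+\lambda\log|\mathcal{A}|}{1-\gamma}.
```

For instance, with $|\mathcal{A}|=4$, $\lambda=0.1$ and $\gamma=0.9$ we would obtain $h_{max}\approx 0.139$ and $q_{max}\approx 11.39$, and $\pi_{max}$ would be the policy that acts uniformly at random in every state.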

Learning agents use the stochastic temporal difference (TD)-learning operator to repeatedly update an estimate of the Q-function of their current policy with respect to the current empirical distribution, i.e. to approximate the operator $\Gamma_q$ (Def. 12, Appx. A):

Definition 7 (Stochastic TD-learning operator, simplified from Definition 4.1 in Yardim et al. [6]).

We define $\mathcal{Z}:=\mathcal{S}\times\mathcal{A}\times[0,1]\times\mathcal{S}\times\mathcal{A}$, and say that $\zeta^i_t$ is the transition observed by agent $i$ at time $t$, given by $\zeta^i_t=(s^i_t,a^i_t,r^i_t,s^i_{t+1},a^i_{t+1})$. The TD-learning operator $\tilde{F}^{\pi}_{\beta}:\mathcal{Q}\times\mathcal{Z}\rightarrow\mathcal{Q}$ is defined, for any $Q\in\mathcal{Q},\zeta_t\in\mathcal{Z},\beta\in\mathbb{R}$, as

$$\tilde{F}^{\pi}_{\beta}(Q,\zeta_t)=Q(s_t,a_t)-\beta\left(Q(s_t,a_t)-r_t-h(\pi(s_t))-\gamma\left(Q(s_{t+1},a_{t+1})\right)\right).$$ ∎
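A tabular reading of Def. 7 can be sketched in Python as below. The sketch assumes, as is usual for tabular TD-learning, that only the entry for $(s_t,a_t)$ changes and all other entries of $Q$ are left untouched; the function name `td_update`, the dictionary representation and the toy numbers are hypothetical.

```python
def td_update(Q, transition, beta, gamma, h_of_policy_at_state):
    """Apply one stochastic TD step to a tabular Q (dict keyed by (s, a)).

    Only the entry (s_t, a_t) changes; all other entries are left as they
    were. `h_of_policy_at_state` is the scalar h(pi(s_t)), supplied by the
    caller (an assumption of this sketch).
    """
    s, a, r, s_next, a_next = transition
    td_error = Q[(s, a)] - r - h_of_policy_at_state - gamma * Q[(s_next, a_next)]
    new_Q = dict(Q)  # copy so the operator is side-effect free
    new_Q[(s, a)] = Q[(s, a)] - beta * td_error
    return new_Q


# Toy example with two states and two actions, all values initialised to 1.0.
Q0 = {(s, a): 1.0 for s in (0, 1) for a in (0, 1)}
zeta = (0, 1, 0.5, 1, 0)  # (s_t, a_t, r_t, s_{t+1}, a_{t+1})
Q1 = td_update(Q0, zeta, beta=0.1, gamma=0.9, h_of_policy_at_state=0.2)
```

In this toy call the bracketed term equals $1.0-0.5-0.2-0.9\cdot 1.0=-0.6$, so the updated entry is $1.0-0.1\cdot(-0.6)=1.06$.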

Having estimated the Q-function of their current policy, agents update this policy by selecting, for each state, a probability distribution over their actions that maximises the combination of three terms (Def. 8): 1. the value of the given state with respect to the estimated Q-function; 2. a regulariser over the action probability distribution (in practice, we maximise the scaled entropy of the distribution); 3. a metric of similarity between the new action probabilities for the given state and those of the previous policy, given by the squared two-norm of the difference between the two distributions. We can alter the importance of the similarity metric relative to the other two terms by varying a parameter $\eta$, which is equivalent to changing the learning rate of the policy update. The three terms in the maximisation function can be seen in the policy mirror ascent (PMA) operator:

Definition 8 (Policy mirror ascent operator (Definition 3.5, [6])).

For a learning rate $\eta>0$ and $L_h:=L_a+\gamma\frac{L_sK_a}{2-\gamma K_s}$ (where these constants are defined in Assumption 1 in Appx. A), the PMA update operator $\Gamma^{md}_{\eta}:\mathcal{Q}\times\Pi\rightarrow\Pi$ is defined as

$$\Gamma^{md}_{\eta}(Q,\pi)(s):=\underset{u\in\mathcal{U}_{L_h}}{\arg\max}\left(\langle u,q(s,\cdot)\rangle+h(u)-\frac{1}{2\eta}\|u-\pi(s)\|^2_2\right),\quad\forall s\in\mathcal{S},\forall Q\in\mathcal{Q},\forall\pi\in\Pi.$$ ∎
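To build intuition for the role of $\eta$, the following sketch implements a deliberately simplified special case of Def. 8: the regulariser $h$ is dropped and the maximisation runs over the whole simplex rather than $\mathcal{U}_{L_h}$. Under those assumptions the objective equals $-\frac{1}{2\eta}\|u-(\pi(s)+\eta\,q(s,\cdot))\|_2^2$ up to a constant, so the update is the Euclidean projection of $\pi(s)+\eta\,q(s,\cdot)$ onto the simplex. This is not the full operator used in the paper, and the function names and numbers are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of vector v onto the probability simplex."""
    sorted_v = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, value in enumerate(sorted_v, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / k
        if value - candidate > 0:
            theta = candidate  # the largest k satisfying the condition wins
    return [max(x - theta, 0.0) for x in v]


def pma_update_no_regulariser(q_row, pi_row, eta):
    """PMA update for one state when h is identically zero (simplification).

    Maximising <u, q> - ||u - pi||^2 / (2 * eta) over the simplex is the same
    as projecting pi + eta * q onto the simplex.
    """
    shifted = [p + eta * q for p, q in zip(pi_row, q_row)]
    return project_to_simplex(shifted)


q_s = [1.0, 0.0, 0.0]          # hypothetical Q-values for one state
pi_s = [1 / 3, 1 / 3, 1 / 3]   # previous policy at that state
small_step = pma_update_no_regulariser(q_s, pi_s, eta=0.1)
large_step = pma_update_no_regulariser(q_s, pi_s, eta=10.0)
```

With the small learning rate the policy moves only slightly towards the best action (to roughly 0.4, 0.3, 0.3), whereas with the large learning rate the proximity term is almost irrelevant and the update jumps to the greedy action.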

The theoretical learning algorithm has three nested loops (see Lines 4, 6 and 7 of Alg. 1). The policy update is applied $K$ times. Before the policy update in each of the $K$ loops, agents update their estimate of the Q-function by applying the stochastic TD-learning operator $M_{pg}$ times. Prior to the TD update in each of the $M_{pg}$ loops, agents take $M_{td}$ steps in the environment without updating. The $M_{td}$ loops exist to create a delay between each TD update to reduce bias when using the empirical distribution to approximate the mean field [67]. However, we find in our experiments that we are able to essentially remove the inner $M_{td}$ loops (Sec. 4.1).
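The cost of this loop structure in environment steps can be made concrete with a small counting sketch. It mirrors only the control flow of Alg. 1 (including the one environment step per communication round on Line 19) and performs no learning; the function name and parameter values are hypothetical.

```python
def count_environment_steps(K, M_pg, M_td, C):
    """Walk the loop structure of Alg. 1 and count environment steps and updates."""
    env_steps = td_updates = policy_updates = 0
    for _k in range(K):
        for _m in range(M_pg):
            for _ in range(M_td):
                env_steps += 1      # Line 8: act without updating
            td_updates += 1         # Line 10: one TD update per waiting period
        policy_updates += 1         # Line 12: one PMA step per outer iteration
        for _ in range(C):
            env_steps += 1          # Line 19: step taken in each communication round
    return env_steps, td_updates, policy_updates


steps, tds, pmas = count_environment_steps(K=3, M_pg=4, M_td=5, C=1)
```

The count is $K(M_{pg}M_{td}+C)$ environment steps for $K M_{pg}$ TD updates and $K$ policy updates, which shows why shrinking $M_{td}$ reduces the length of the run so sharply.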

3.2 Decentralised communication between agents

In our novel algorithm Alg. 1, agents compute policy updates in a decentralised manner as in the independent case (Lines 5-12), before broadcasting their updated policy $\pi^i_{k+1}$ to their neighbours (Line 15). Agents also broadcast a value $\sigma^i_{k+1}$ that they generate associated with the policy. Agents have a certain broadcast radius, which defines the structure of the possibly time-varying communication network. Of the policies and associated values received by a given agent (including its own) (Line 16), the agent selects a $\sigma^i_{k+1}$ with a probability defined by a softmax function over the received values, and adopts the policy associated with this $\sigma^i_{k+1}$ (Lines 17, 18). This process continues for $C$ communication rounds, before the Q-function estimation steps begin again. After each round, the agents take a step in the environment (Line 19), such that if the communication network is affected by the agents’ states, then agents that are unconnected from any others in a given communication round might become connected in the next. (In our experiments we set $C$ as 1 to show the benefits to convergence speed brought by even a single communication round.) We assume that the softmax function is subject to a possibly time-varying temperature parameter $\tau_k$. We discuss the effects of the values of $C$ and $\tau_k$, as well as the mechanism for generating $\sigma^i_{k+1}$, in subsequent sections.
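The softmax adoption step for a single agent can be sketched as follows. The function names, the representation of a message as a `(sigma, policy)` pair and the example values are hypothetical; subtracting the maximum before exponentiating is a standard numerical-stability device and does not change the probabilities.

```python
import math
import random


def adoption_probabilities(sigmas, tau):
    """Softmax over received sigma values with temperature tau > 0."""
    top = max(sigmas)
    # Subtracting the maximum leaves the softmax unchanged but avoids overflow.
    weights = [math.exp((s - top) / tau) for s in sigmas]
    total = sum(weights)
    return [w / total for w in weights]


def adopt(received, tau, rng=random):
    """Pick one (sigma, policy) pair from `received` using the softmax probabilities."""
    probs = adoption_probabilities([sigma for sigma, _ in received], tau)
    return rng.choices(received, weights=probs, k=1)[0]


# Hypothetical messages held by one agent: its own pair plus two neighbours'.
received = [(0.2, "policy_own"), (1.5, "policy_a"), (0.9, "policy_b")]
probs_warm = adoption_probabilities([s for s, _ in received], tau=100.0)
probs_cold = adoption_probabilities([s for s, _ in received], tau=0.01)
chosen = adopt(received, tau=0.01, rng=random.Random(0))
```

With a high temperature the three probabilities are nearly equal, while with a low temperature almost all probability mass sits on the message with the largest value, so the agent adopts `policy_a`.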

Remark 2.

Our networked architecture is in effect a generalisation of both the centralised and independent settings (Algorithms 2 and 3, Yardim et al. [6]). The independent setting is the special case where there are no communication rounds, i.e. $C=0$. The centralised setting is the special case when the method for generating $\sigma^i_{k+1}$ involves a unique ID for each agent, with the central learner agent being assumed to generate the highest value by default. In this case we assume $\tau_k\rightarrow 0$ (such that the softmax becomes a max function), and that the communication network becomes jointly connected repeatedly, such that the central learner’s policy is always adopted by the entire population, assuming $C$ is large enough that the number of jointly connected collections of graphs occurring within $C$ is equal to the largest diameter of the union of any collection [68, 69]. ∎
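The two limiting regimes of the temperature can be written out explicitly. Here $J$ denotes the set of senders whose values a given agent holds, and for the second limit we assume that the maximiser $j^{*}$ of $\sigma^{j}_{k+1}$ over $J$ is unique; both pieces of notation are introduced only for this illustration.

```latex
\Pr\left(\sigma^{adopted}_{k+1}=\sigma^{j}_{k+1}\right)
  = \frac{\exp(\sigma^{j}_{k+1}/\tau_k)}{\sum_{x\in J}\exp(\sigma^{x}_{k+1}/\tau_k)}
  \;\xrightarrow[\tau_k\to\infty]{}\; \frac{1}{|J|},
\qquad
\Pr\left(\sigma^{adopted}_{k+1}=\sigma^{j}_{k+1}\right)
  \;\xrightarrow[\tau_k\to 0^{+}]{}\; \mathbf{1}\left[j=j^{*}\right].
```

The first limit corresponds to adopting uniformly at random among the received policies, and the second to always adopting the policy with the highest received value, which is the regime used above to recover the centralised setting.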

Remark 3.

In practice, when referring in the following to a centralised version of the networked Alg. 1, for simplicity we assume that there is no communication and instead that the updated policy $\pi^1_{k+1}$ of the central learner $i=1$ is pushed to all other agents after Line 12, as in Alg. 2 of [6]. ∎

Algorithm 1 Networked learning with single system run
1:loop parameters $K,M_{pg},M_{td},C$, learning parameters $\eta,\{\beta_m\}_{m\in\{0,\dots,M_{pg}-1\}}$, $\lambda,\gamma$, $\{\tau_k\}_{k\in\{0,\dots,K-1\}}$
2:initial states $\{s^i_0\}_i$, $i=1,\ldots,N$
3:Set $\pi^i_0=\pi_{max},\forall i$ and $t\leftarrow 0$
4:for $k=0,\dots,K-1$ do
5:     $\forall s,a,i:\hat{Q}^i_0(s,a)=Q_{max}$
6:     for $m=0,\dots,M_{pg}-1$ do
7:         for $M_{td}$ iterations do
8:              Take step $\forall i:a^i_t\sim\pi^i_k(\cdot|s^i_t),r^i_t=R(s^i_t,a^i_t,\hat{\mu}_t),s^i_{t+1}\sim P(\cdot|s^i_t,a^i_t,\hat{\mu}_t)$; $t\leftarrow t+1$
9:         end for
10:         Compute TD update ($\forall i$): $\hat{Q}^i_{m+1}=\tilde{F}^{\pi^i_k}_{\beta_m}(\hat{Q}^i_m,\zeta^i_{t-2})$ (see Def. 7)
11:     end for
12:     PMA step $\forall i:\pi^i_{k+1}=\Gamma^{md}_{\eta}(\hat{Q}^i_{M_{pg}},\pi^i_k)$ (see Def. 8)
13:     $\forall i:$ Generate $\sigma^i_{k+1}$ associated with $\pi^i_{k+1}$
14:     for $C$ rounds do
15:         $\forall i:$ Broadcast $\sigma^i_{k+1},\pi^i_{k+1}$
16:         $\forall i:J^i_t=i\cup\{j\in\mathcal{N}:(i,j)\in\mathcal{E}_t\}$
17:         $\forall i:$ Select $\sigma^{adopted}_{k+1}\sim\Pr\left(\sigma^{adopted}_{k+1}=\sigma^j_{k+1}\right)=\frac{\exp(\sigma^j_{k+1}/\tau_k)}{\sum^{[J^i_t]}_{x=1}\exp(\sigma^x_{k+1}/\tau_k)}$ $\forall j\in J^i_t$
18:         $\forall i:\sigma^i_{k+1}=\sigma^{adopted}_{k+1},\pi^i_{k+1}=\pi^{adopted}_{k+1}$
19:         Take step $\forall i:a^i_t\sim\pi^i_{k+1}(\cdot|s^i_t),r^i_t=R(s^i_t,a^i_t,\hat{\mu}_t),s^i_{t+1}\sim P(\cdot|s^i_t,a^i_t,\hat{\mu}_t)$; $t\leftarrow t+1$
20:     end for
21:end for
22:Return policies $\{\pi^i_K\}_i$, $i=1,\ldots,N$
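The effect of repeated communication rounds (Lines 15-18) on the number of distinct policies in the population can be illustrated with a small simulation. The sketch makes several simplifying assumptions that are not part of Alg. 1: the network is static, adoption uses the $\tau_k\rightarrow 0$ limit (each agent adopts the highest value it hears), and policies are represented by string labels. All names and values are hypothetical.

```python
def communication_round(values, policies, edges):
    """One synchronous round of Lines 15-18 in the tau -> 0 limit (argmax adoption)."""
    agents = sorted(values)
    new_values, new_policies = {}, {}
    for i in agents:
        # J^i_t: the agent itself plus everyone it shares a link with.
        heard = {i} | {j for j in agents if (i, j) in edges or (j, i) in edges}
        best = max(heard, key=lambda j: values[j])
        new_values[i], new_policies[i] = values[best], policies[best]
    return new_values, new_policies


# Hypothetical static path network 1-2-3-4-5 (diameter 4), unique sigma per agent.
edges = {(1, 2), (2, 3), (3, 4), (4, 5)}
values = {i: float(i) for i in range(1, 6)}      # agent 5 holds the highest sigma
policies = {i: f"pi_{i}" for i in range(1, 6)}

history = []
for _ in range(4):
    values, policies = communication_round(values, policies, edges)
    history.append(len(set(policies.values())))  # distinct policies remaining
```

In this example the number of distinct policies falls by one per round and reaches a single shared policy after four rounds, matching the diameter of the path; with fewer rounds the population remains only partially aligned.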

3.3 Properties of policy adoption

We give two results comparing the sample guarantees of our networked case with those of the other settings; each result depends on how networked agents select which communicated policies to adopt.

Theorem 1 (Networked learning with random adoption).

Full version in Appx. B.2. Say that $\pi^*$ is the unique MFG-NE policy, and let $\varepsilon>0$ be an arbitrary value that reduces as $K$ increases. If we assume that $C>0$, with $\tau_k\rightarrow\infty$, then under certain technical conditions the random output $\{\pi^i_K\}_i$ of Alg. 1 preserves the sample guarantees of the independent-learning case given in Lemma 3, i.e. the output satisfies, for all agents $i=1,\dots,N$, $\mathbb{E}\left[\|\pi^i_K-\pi^*\|_1\right]\;\leq\;\varepsilon+\mathcal{O}\left(\frac{1}{\sqrt{N}}\right)$.

Proof sketch. Random exchange of policies learnt in a decentralised manner does not change the expectation of the random output relative to the purely independent-learning setting, i.e. the setting where this exchange does not occur. Full proof in Appx. B.3.

Moreover we can show that if $\sigma^i_{k+1}$ is generated arbitrarily and uniquely for each $i$, then for $\tau_k\in\mathbb{R}_{>0}$, the sample complexity of the networked learning algorithm is bounded between that of the centralised and independent algorithms:

Theorem 2 (Networked learning with non-random adoption).

Full version in Appx. B.5. Assume that Alg. 1 is run as in Thm. 1, except now $\tau_k\in\mathbb{R}_{>0}$. Assume also that $\sigma^i_{k+1}$ is generated uniquely for each $i$, in a manner independent of any metric related to $\pi^i_{k+1}$, e.g. $\sigma^i_{k+1}$ is random or related only to the index $i$ (so as not to bias the spread of any particular policy). Let the random output of this Algorithm be denoted as $\{\pi^{i,net}_K\}_i$. Also consider an independent-learning version of the algorithm (i.e. with the same parameters except $C=0$) and denote its random output $\{\pi^{j,ind}_K\}_j$; and a centralised version of the algorithm with the same parameters (see Rem. 3) and denote its random output as $\pi^{cent}_K$.
Then under certain technical conditions, for all agents $i=1,\dots,N$ and $j=1,\dots,N$, the random outputs $\{\pi^{i,net}_K\}_i$, $\{\pi^{j,ind}_K\}_j$ and $\pi^{cent}_K$ satisfy

$$\mathbb{E}\left[\|\pi^{cent}_K-\pi^*\|_1\right]\;\leq\;\mathbb{E}\left[\|\pi^{i,net}_K-\pi^*\|_1\right]\;\leq\;\mathbb{E}\left[\|\pi^{j,ind}_K-\pi^*\|_1\right]\;\leq\;\varepsilon+\mathcal{O}\left(\frac{1}{\sqrt{N}}\right).$$ ∎

Proof sketch. Fewer distinct policies in the population means there is less bias in each learner’s estimation of the Q function. So methods for reducing the number of policies in the population (such as non-random adoption in the networked case) lead to faster convergence. Full proof in Appx. B.6.

Remark 4.

The expected sample complexity of the centralised algorithm is known to be significantly better than that of the independent case [6]. Where the expected sample complexity of the networked case lies on the spectrum between the centralised and independent cases depends on how much closer the divergence between the networked agents’ policies is to 0, as in the centralised case, than it is to the divergence in the independent case (Appx. B.6). This in turn depends on the interaction between the parameters $\tau_{k}$ and $C$, and also on the structure of the time-varying network, i.e. how frequently it becomes jointly connected. See Rem. 6 in Appx. B.7 for further details. ∎

In Thm. 2, we assume $\sigma^{i}_{k+1}$ is generated independently of any metric related to $\pi^{i}_{k+1}$; e.g. $\sigma^{i}_{k+1}$ is random or related only to the index $i$. Next we show that an appropriate generation of $\sigma^{i}_{k+1}$ dependent on $\pi^{i}_{k+1}$ can advantageously bias the spread of particular policies in practical scenarios.

3.4 Practical running of algorithms

3.4.1 Generation of $\sigma^{i}_{k+1}$

The theoretical analysis in Sec. 3.3 requires algorithmic hyperparameters that render convergence impractically slow in all of the centralised, independent and networked cases (see Rem. 7, Appx. B.8). For practical convergence of the algorithms, we seek to drastically increase $\{\beta_{m}\}$ and reduce $M_{td}$ and $M_{pg}$, though this will naturally break the theoretical guarantees and give a poorer estimation of the Q-function $\hat{Q}^{i}_{M_{pg}}$, and hence a greater variance in the quality of the updated policies $\pi^{i}_{k+1}$. However, in such cases we found empirically that an appropriate method for generating $\sigma^{i}_{k+1}$ dependent on $\pi^{i}_{k+1}$ allows our networked algorithm to significantly outperform the independent setting, and sometimes even the centralised setting.

We do so by setting $\sigma^{i}_{k+1}$ to a finite approximation $\widehat{\Theta^{i}_{h,k+1}}(\boldsymbol{\pi}_{k+1},\upsilon_{0})$ of $\Theta^{i}_{h,k+1}(\boldsymbol{\pi}_{k+1},\upsilon_{0})$, where $\boldsymbol{\pi}_{k+1} := (\pi^{1}_{k+1},\dots,\pi^{N}_{k+1})$, by tracking the discounted return for $E$ evaluation steps. This is given by

$$\widehat{\Theta^{i}_{h,k+1}}(\boldsymbol{\pi}_{k+1},\upsilon_{0})=\left[\sum^{E}_{e=0}\gamma^{e}\left(R(s^{i}_{t},a^{i}_{t},\hat{\mu}_{t})+h(\pi^{i}(s^{i}_{t}))\right)\;\middle|\;\begin{subarray}{c}t=t+e\\ a^{j}_{t}\sim\pi^{j}_{k+1}(s^{j}_{t})\\ s^{j}_{t+1}\sim P(\cdot|s^{j}_{t},a^{j}_{t},\hat{\mu}_{t})\end{subarray},\ \forall j\in\{1,\dots,N\}\right].$$

Generating $\sigma^{i}_{k+1}$ in this way means that the policies that are more likely to be adopted and spread through the communication network are those which are estimated to receive a higher return in reality, despite being generated from poorly estimated Q-functions. This explains why our networked method can in practice outperform even the centralised case, where the updated policy of the agent with arbitrary index $i=1$ gets pushed to all other agents regardless of its quality. Naturally the quality of the finite approximation depends on the number of evaluation steps $E$, but we found empirically that $E$ can be much smaller than $M_{pg}$ and still give significant convergence benefits.
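As a rough illustration of the finite approximation above, the following sketch computes the truncated discounted sum from rewards and regulariser values that have already been recorded over the evaluation steps. It is not the authors' implementation: the function name `truncated_discounted_return` and its arguments are hypothetical, and the simulation of the population that produces the recorded values is assumed to happen elsewhere.

```python
import numpy as np

def truncated_discounted_return(rewards, entropy_bonuses, gamma):
    """Return sum_{e=0}^{E} gamma^e * (r_e + h_e) for E + 1 recorded steps.

    rewards[e] stands in for R(s_t, a_t, mu_hat_t) at evaluation step e, and
    entropy_bonuses[e] for the regulariser term h(pi(s_t)) (hypothetical inputs).
    """
    rewards = np.asarray(rewards, dtype=float)
    entropy_bonuses = np.asarray(entropy_bonuses, dtype=float)
    # Discount factors gamma^0, gamma^1, ..., gamma^E
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * (rewards + entropy_bonuses)))

# Example with E = 2 (three evaluation steps) and no regularisation bonus
sigma = truncated_discounted_return([1.0, 0.0, 2.0], [0.0, 0.0, 0.0], gamma=0.5)
```

In this sketch a larger `sigma` would mark a policy as more attractive for adoption; how that value enters the adoption step is defined by the algorithm itself rather than by this example.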

3.4.2 Algorithm acceleration by use of experience-replay buffer

Even with networked communication, the empirical convergence of our original algorithm is too slow for practical demonstration, as is also true of the centralised and independent cases. We therefore offer a further technical contribution allowing the first practical demonstrations of all three architectures for learning from a single continued system run.

The modifications made to our Alg. 1 are shown in blue in Alg. 2, Appx. C. Instead of using a transition $\zeta^{i}_{t-2}$ to compute the TD update within each $M_{pg}$ iteration and then discarding the transition, we store the transition in a buffer (Line 11) until after the $M_{pg}$ loops. Replay buffers are a common (MA)RL tool used especially with deep learning, precisely to improve data efficiency and reduce autocorrelation [70, 71, 72]. When learning does take place in our modified algorithm (Lines 13-18), it involves cycling through the buffer for $L$ iterations, randomly shuffling the buffer between each, and thus conducting the TD update on each stored transition $L$ times. This allows us to reduce the number of $M_{pg}$ loops, as well as not requiring as small a learning rate $\{\beta_{m}\}$, allowing much faster learning in practice. Moreover, by shuffling the buffer before each cycle we reduce bias resulting from the dependency of samples along the single path, which may justify being able to achieve adequate stable learning even when reducing the number of $M_{td}$ waiting steps within each $M_{pg}$ loop. Naturally the experience replay buffer means that theoretical guarantees given in Sec. 3.3 no longer apply, but we trade this off for practical convergence times. See Appx. C for further discussion.
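The buffer-based learning step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Alg. 2: it assumes a tabular Q-function, transitions stored as `(s, a, r, s_next, a_next)` tuples, a SARSA-style TD(0) update and a constant learning rate `beta`; the function name `replay_td_updates` is hypothetical.

```python
import numpy as np

def replay_td_updates(Q, buffer, num_cycles, beta, gamma, rng):
    """Apply a TD update to every stored transition, num_cycles times.

    num_cycles plays the role of L in the text; the buffer is visited in a
    fresh random order on each cycle.
    """
    buffer = list(buffer)
    for _ in range(num_cycles):
        # A new random visiting order stands in for shuffling the buffer
        for idx in rng.permutation(len(buffer)):
            s, a, r, s_next, a_next = buffer[idx]
            td_target = r + gamma * Q[s_next, a_next]
            Q[s, a] += beta * (td_target - Q[s, a])
    return Q

# Example: two states, two actions, a single stored transition, L = 2
rng = np.random.default_rng(0)
Q = np.zeros((2, 2))
buffer = [(0, 0, 1.0, 1, 1)]
Q = replay_td_updates(Q, buffer, num_cycles=2, beta=0.5, gamma=0.9, rng=rng)
```

Because each transition is reused on every cycle, the same number of collected samples yields `num_cycles` times as many TD updates, which is the data-efficiency gain the text refers to.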

Figure 1: ‘Target agreement’ game. Even with only a single round of communication, our networked case outperforms the independent case in terms of exploitability, and significantly outperforms in terms of return. The fact that the lowest broadcast radius (0.2) ends with similar exploitability to the independent case yet much higher return shows that our networked algorithm can help agents find more ‘preferable’ equilibria. Moreover, the scenarios with the largest two broadcast radii outperform even the centralised algorithm. CPU time for 5 trials = 128,228 secs.

4 Experiments

Our technical contribution of the experience replay buffer to MFG algorithms for learning from continuous system runs allows us also to contribute the first empirical demonstrations of these algorithms, not just in the networked case but also in the centralised and independent cases. The latter two serve as baselines to demonstrate the advantages of the networked architecture. We follow prior works on stationary MFGs in the types of game used in our demonstrations [5, 30, 40, 73, 74]. We focus on grid-world environments where agents can move in one of the four cardinal directions or remain in place. We present results from two tasks defined by the reward functions of the agents; see Appx. E.1 for a full technical description of our task settings.

Figure 2: ‘Cluster’ game, testing robustness to 50% probability of policy update failure. The communication network allows agents that have successfully updated their policies to spread this information to those that have not, providing redundancy. Independent learners cannot do this and hardly appear to learn at all; likewise the centralised architecture is susceptible to its single point of failure. Thus our networked architecture significantly outperforms both the centralised and independent cases. CPU time for 5 trials = 136,410 secs.
Figure 3: ‘Target agreement’ game, testing robustness to a five-times increase in population. The networked architectures are quickly able to spread the learnt policies to the newly arrived agents such that learning progress is minimally disturbed, whereas convergence is significantly impacted in the independent case. The largest broadcast radius (1.0), in particular, appears to suffer no disturbance at all, being much more robust than the centralised case, which takes a significant amount of time to return to equilibrium. CPU time for 5 trials = 72,896 secs.

Cluster. Agents are rewarded for gathering together. The agents are given no indication where they should cluster, agreeing this themselves over time.

Target agreement. The agents are rewarded for visiting any of a given number of targets, but their reward is proportional to the number of other agents co-located at that target. The agents must therefore coordinate on which single target they will all meet at to maximise their individual rewards.

As well as the standard scenario for these tasks, we conduct robustness tests in two settings, reflecting those elaborated in Appx. D. The first illustrates robustness to learning failures: at every iteration $k$ each learner (whether centralised or decentralised) fails to update its policy (i.e. Line 12 of Alg. 1 is not executed such that $\pi^{i}_{k+1}=\pi^{i}_{k}$) with a 50% probability. The second test illustrates robustness to increases in population size. Instead of having 250 agents throughout, the population begins with 50 agents learning normally, and a further 200 agents are added to the population at the marked point.
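The learning-failure test can be pictured as a coin flip applied to each policy update. The following is a minimal sketch rather than the authors' code; the function name `apply_update_with_failure` and its arguments are hypothetical.

```python
import numpy as np

def apply_update_with_failure(old_policy, new_policy, rng, fail_prob=0.5):
    """Return the new policy unless the update fails, in which case keep the old one."""
    if rng.random() < fail_prob:
        # Update failed: pi_{k+1} = pi_k
        return old_policy
    return new_policy

rng = np.random.default_rng(0)
old = np.array([0.5, 0.5])
new = np.array([0.9, 0.1])
always_fails = apply_update_with_failure(old, new, rng, fail_prob=1.0)
never_fails = apply_update_with_failure(old, new, rng, fail_prob=0.0)
```

In the networked case, an agent whose update failed may still adopt a successfully updated policy from a neighbour in the subsequent communication round, which is the source of the redundancy discussed in the caption of Figure 2.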

Experiments are evaluated via three metrics (see Appx. E.2 for a full discussion): an approximation of the exploitability of the joint policy $\boldsymbol{\pi}_{k}$; the average discounted return of the agents’ policies $\pi_{k}^{i}$; and the population’s policy divergence. Hyperparameters are discussed in Appx. E.3.

4.1 Discussion

We give here three example figures illustrating the benefits of the networked architecture; in each the decimals refer to each agent’s broadcast radius as a fraction of the maximum possible distance in the grid (i.e. the diagonal). See figure captions for details, and Appx. E.4 for further experiments and discussion. As well as allowing convergence in a practical number of iterations, even with only a single communication round, the combination of the buffer and the networked architecture allows us to remove in our experiments a number of the assumptions required for the theoretical algorithms:

  • We significantly reduce $M_{pg}$ while still converging within a reasonable $K$. With smaller values for the parameters $M_{pg}$ (the number of samples in the buffer) and $L$ (the number of loops through the buffer for updating the Q-function), and hence with worse estimation of the Q-function, the networked architecture outperforms the independent case to an even greater extent. This underlines its advantages in allowing faster convergence in practical settings.

  • We can reduce the $M_{td}$ parameter (theoretically required for the learner to wait between collecting samples when learning from a single system run) to 1, effectively removing the innermost loop of the nested learning algorithm (see Line 7 of Alg. 1).

  • We can reduce the scaling parameter $\lambda$ of the entropy regulariser to 0, i.e. we converge even without regularisation, allowing us to leave the NE unbiased, and also removing Assumption 3 (Appx. A). In general an unregularised MFG-NE is not unique [6]; the ability of the agents to coordinate on one of the multiple solutions in the centralised and networked cases may explain why they outperform the independent-learning case.

  • For the PMA operator (Def. 8), we conduct the optimisation over the set $u\in\Delta_{\mathcal{A}}$ instead of $u\in\mathcal{U}_{L_{h}}$, i.e. we can choose from all possible probability distributions over actions instead of needing to identify the Lipschitz constants given in Assumption 1 (Appx. A).
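The broadcast radii quoted in the figures are fractions of the grid diagonal. As a rough illustration of how such a radius could induce a communication graph, the sketch below links two agents whenever the Euclidean distance between their grid cells is within the radius. The function name `broadcast_adjacency`, the choice of Euclidean distance, and measuring the diagonal between opposite corner cells are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def broadcast_adjacency(positions, radius_fraction, grid_shape):
    """Boolean matrix whose (i, j) entry says whether agents i and j can communicate."""
    positions = np.asarray(positions, dtype=float)
    rows, cols = grid_shape
    # Maximum possible distance in the grid: corner cell to opposite corner cell
    diagonal = np.hypot(rows - 1, cols - 1)
    # Pairwise Euclidean distances between agent positions
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return dists <= radius_fraction * diagonal

# Example: three agents on a 5x5 grid
positions = [(0, 0), (0, 1), (3, 4)]
small_radius = broadcast_adjacency(positions, 0.2, (5, 5))
full_radius = broadcast_adjacency(positions, 1.0, (5, 5))
```

With a radius fraction of 1.0 every agent reaches every other, whereas smaller fractions give a sparser graph that changes as the agents move around the grid.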

For limitations and ongoing work, see Appx. G; for impact statement, see Appx. H.

References

  • Lasry and Lions [2007] Jean-Michel Lasry and Pierre-Louis Lions. Mean Field Games. Japanese Journal of Mathematics, 2(1):229–260, 2007.
  • Huang et al. [2006] Minyi Huang, Roland P. Malhamé, and Peter E. Caines. Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6(3):221 – 252, 2006.
  • Xie et al. [2021] Qiaomin Xie, Zhuoran Yang, Zhaoran Wang, and Andreea Minca. Learning While Playing in Mean-Field Games: Convergence and Optimality. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11436–11447. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/xie21g.html.
  • Anahtarci et al. [2023] Berkay Anahtarci, Can Deha Kariksiz, and Naci Saldi. Q-learning in regularized mean-field games. Dynamic Games and Applications, 13(1):89–117, 2023.
  • uz Zaman et al. [2023] Muhammad Aneeq uz Zaman, Alec Koppel, Sujay Bhatt, and Tamer Başar. Oracle-free Reinforcement Learning in Mean-Field Games along a Single Sample Path, 2023.
  • Yardim et al. [2023] Batuhan Yardim, Semih Cayci, Matthieu Geist, and Niao He. Policy Mirror Ascent for Efficient and Independent Learning in Mean Field Games. In International Conference on Machine Learning, pages 39722–39754. PMLR, 2023.
  • Saldi et al. [2018] Naci Saldi, Tamer Başar, and Maxim Raginsky. Markov–Nash Equilibria in Mean-Field Games with Discounted Cost. SIAM Journal on Control and Optimization, 56(6):4256–4287, 2018. doi: 10.1137/17M1112583. URL https://doi.org/10.1137/17M1112583.
  • Yardim et al. [2024] Batuhan Yardim, Artur Goldman, and Niao He. When is Mean-Field Reinforcement Learning Tractable and Relevant? arXiv preprint arXiv:2402.05757, 2024.
  • Toumi et al. [2024] Noureddine Toumi, Roland Malhame, and Jerome Le Ny. A mean field game approach for a class of linear quadratic discrete choice problems with congestion avoidance. Automatica, 160:111420, 2024. ISSN 0005-1098. doi: https://doi.org/10.1016/j.automatica.2023.111420. URL https://www.sciencedirect.com/science/article/pii/S0005109823005873.
  • Hu and Zhang [2024] Anran Hu and Junzi Zhang. MF-OML: Online Mean-Field Reinforcement Learning with Occupation Measures for Large Population Games, 2024.
  • Trimborn et al. [2018] Torsten Trimborn, Martin Frank, and Stephan Martin. Mean field limit of a behavioral financial market model. Physica A: Statistical Mechanics and its Applications, 505:613–631, 2018. ISSN 0378-4371. doi: https://doi.org/10.1016/j.physa.2018.03.079. URL https://www.sciencedirect.com/science/article/pii/S0378437118303984.
  • Li et al. [2022] Zongxi Li, A. Max Reppen, and Ronnie Sircar. A Mean Field Games Model for Cryptocurrency Mining, 2022.
  • Aggarwal et al. [2024] Shubham Aggarwal, Muhammad Aneeq uz Zaman, Melih Bastopcu, Sennur Ulukus, and Tamer Başar. A Mean Field Game Model for Timely Computation in Edge Computing Systems, 2024.
  • Shen et al. [2024] Shigen Shen, Chenpeng Cai, Yizhou Shen, ** Wu, Wenlong Ke, and Shui Yu. MFGD3QN: Enhancing Edge Intelligence Defense against DDoS with Mean-Field Games and Dueling Double Deep Q-network. IEEE Internet of Things Journal, pages 1–1, 2024. doi: 10.1109/JIOT.2024.3387090.
  • Miao et al. [2024] Li Miao, Shuai Li, Xiangjuan Wu, and Bingjie Liu. Mean-Field Stackelberg Game-Based Security Defense and Resource Optimization in Edge Computing. Applied Sciences, 14(9), 2024. ISSN 2076-3417. doi: 10.3390/app14093538. URL https://www.mdpi.com/2076-3417/14/9/3538.
  • Huang et al. [2020] Kuang Huang, Xuan Di, Qiang Du, and Xi Chen. A game-theoretic framework for autonomous vehicles velocity control: Bridging microscopic differential games and macroscopic mean field games. Discrete and Continuous Dynamical Systems - B, 25(12):4869–4903, 2020. ISSN 1531-3492. doi: 10.3934/dcdsb.2020131.
  • Hu et al. [2023] Tianfeng Hu, Zhiqun hu, Zhaoming Lu, and Xiangming Wen. Dynamic traffic signal control using mean field multi-agent reinforcement learning in large scale road-networks. IET Intelligent Transport Systems, 04 2023. doi: 10.1049/itr2.12364.
  • Mao et al. [2022] Weichao Mao, Haoran Qiu, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer, and Tamer Başar. A mean-field game approach to cloud resource management with function approximation. In Proceedings of the 36th Conference on Advances in Neural Information Processing Systems (NIPS 2022), volume 36, pages 1–12, New Orleans, LA, USA, 2022. Curran Associates, Inc.
  • Bauso and Tembine [2016] Dario Bauso and Hamidou Tembine. Crowd-Averse Cyber-Physical Systems: The Paradigm of Robust Mean-Field Games. IEEE Transactions on Automatic Control, 61(8):2312–2317, 2016. doi: 10.1109/TAC.2015.2492038.
  • Mishra et al. [2023] Rajesh Mishra, Sriram Vishwanath, and Deepanshu Vasal. Model-free Reinforcement Learning for Mean Field Games. IEEE Transactions on Control of Network Systems, pages 1–11, 2023. doi: 10.1109/TCNS.2023.3264934.
  • Yoshioka et al. [2024] Hidekazu Yoshioka, Motoh Tsujimura, and Yumi Yoshioka. Numerical analysis of an extended mean field game for harvesting common fishery resource. Computers & Mathematics with Applications, 165:88–105, 2024. ISSN 0898-1221. doi: https://doi.org/10.1016/j.camwa.2024.04.003. URL https://www.sciencedirect.com/science/article/pii/S0898122124001615.
  • Wang et al. [2024] Yao Wang, Chungang Yang, Tong Li, Xinru Mi, Lixin Li, and Zhu Han. A Survey On Mean-Field Game for Dynamic Management and Control in Space-Air-Ground Network. IEEE Communications Surveys & Tutorials, pages 1–1, 2024. doi: 10.1109/COMST.2024.3393369.
  • Le Ménec [0] Stéphane Le Ménec. Swarm Guidance Based on Mean Field Game Concepts. International Game Theory Review, 0(0):2440008, 0. doi: 10.1142/S0219198924400085. URL https://doi.org/10.1142/S0219198924400085.
  • Emami et al. [2024] Yousef Emami, Hao Gao, Kai Li, Luis Almeida, Eduardo Tovar, and Zhu Han. Age of Information Minimization using Multi-agent UAVs based on AI-Enhanced Mean Field Resource Allocation. IEEE Transactions on Vehicular Technology, pages 1–14, 2024. doi: 10.1109/TVT.2024.3394235.
  • Yang et al. [2023] Yaoqi Yang, Bangning Zhang, Daoxing Guo, Renhui Xu, Neeraj Kumar, and Weizheng Wang. Mean Field Game and Broadcast Encryption-Based Joint Data Freshness Optimization and Privacy Preservation for Mobile Crowdsensing. IEEE Transactions on Vehicular Technology, 72(11):14860–14874, 2023. doi: 10.1109/TVT.2023.3282694.
  • Benamor et al. [2022] Amani Benamor, Oussama Habachi, Inès Kammoun, and Jean-Pierre Cances. NOMA-based Power Control for Machine-Type Communications: A Mean Field Game Approach. In 2022 IEEE International Performance, Computing, and Communications Conference (IPCCC), pages 338–343, 2022. doi: 10.1109/IPCCC55026.2022.9894296.
  • Dey and Xu [2023] Shawon Dey and Hao Xu. Intelligent Distributed Charging Control for Large Scale Electric Vehicles: A Multi-Cluster Mean Field Game Approach. In Proceedings of Cyber-Physical Systems and Internet of Things Week 2023, CPS-IoT Week ’23, page 146–151, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400700491. doi: 10.1145/3576914.3587709. URL https://doi.org/10.1145/3576914.3587709.
  • Wang et al. [2020a] Ximing Wang, Yuhua Xu, ** Chen, Chunguo Li, Xin Liu, Dianxiong Liu, and Yifan Xu. Mean Field Reinforcement Learning Based Anti-Jamming Communications for Ultra-Dense Internet of Things in 6G. In 2020 International Conference on Wireless Communications and Signal Processing (WCSP), pages 195–200, 2020a. doi: 10.1109/WCSP49889.2020.9299742.
  • Korecki et al. [2023] Marcin Korecki, Damian Dailisan, and Dirk Helbing. How Well Do Reinforcement Learning Approaches Cope With Disruptions? The Case of Traffic Signal Control. IEEE Access, 11:36504–36515, 2023. doi: 10.1109/ACCESS.2023.3266644.
  • Lauriere et al. [2022] Mathieu Lauriere, Sarah Perrin, Sertan Girgin, Paul Muller, Ayush Jain, Theophile Cabannes, Georgios Piliouras, Julien Perolat, Romuald Elie, Olivier Pietquin, and Matthieu Geist. Scalable Deep Reinforcement Learning Algorithms for Mean Field Games. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 12078–12095. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/lauriere22a.html.
  • Perrin et al. [2020] Sarah Perrin, Julien Pérolat, Mathieu Laurière, Matthieu Geist, Romuald Elie, and Olivier Pietquin. Fictitious Play for Mean Field Games: Continuous Time Analysis and Applications. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
  • Laurière et al. [2022] Mathieu Laurière, Sarah Perrin, Matthieu Geist, and Olivier Pietquin. Learning Mean Field Games: A Survey, 2022. URL https://arxiv.longhoe.net/abs/2205.12944.
  • Guo et al. [2019a] Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. Learning Mean-Field Games. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019a. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/030e65da2b1c944090548d36b244b28d-Paper.pdf.
  • Perrin et al. [2021] Sarah Perrin, Mathieu Laurière, Julien Pérolat, Matthieu Geist, Romuald Élie, and Olivier Pietquin. Mean field games flock! the reinforcement learning way. In IJCAI, 2021.
  • Elie et al. [2020] Romuald Elie, Julien Pérolat, Mathieu Laurière, Matthieu Geist, and Olivier Pietquin. On the Convergence of Model Free Learning in Mean Field Games. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7143–7150, Apr. 2020. doi: 10.1609/aaai.v34i05.6203. URL https://ojs.aaai.org/index.php/AAAI/article/view/6203.
  • Carmona and Laurière [2021] René Carmona and Mathieu Laurière. Deep Learning for Mean Field Games and Mean Field Control with Applications to Finance, 2021.
  • Cao et al. [2021] Haoyang Cao, Xin Guo, and Mathieu Laurière. Connecting GANs, MFGs, and OT, 2021.
  • Germain et al. [2022] Maximilien Germain, Joseph Mikael, and Xavier Warin. Numerical resolution of McKean-Vlasov FBSDEs using neural networks, 2022.
  • Fouque and Zhang [2020] Jean-Pierre Fouque and Zhaoyu Zhang. Deep Learning Methods for Mean Field Control Problems With Delay. Frontiers in Applied Mathematics and Statistics, 6, 2020. ISSN 2297-4687. doi: 10.3389/fams.2020.00011. URL https://www.frontiersin.org/articles/10.3389/fams.2020.00011.
  • Algumaei et al. [2023] Talal Algumaei, Ruben Solozabal, Reda Alami, Hakim Hacid, Merouane Debbah, and Martin Takac. Regularization of the policy updates for stabilizing Mean Field Games, 2023.
  • Angiuli et al. [2023] Andrea Angiuli, Jean-Pierre Fouque, Mathieu Laurière, and Mengrui Zhang. Convergence of Multi-Scale Reinforcement Q-Learning Algorithms for Mean Field Game and Control Problems. arXiv preprint arXiv:2312.06659, 2023.
  • Guo et al. [2019b] Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. Learning Mean-Field Games, 2019b. URL https://arxiv.longhoe.net/abs/1901.09585.
  • Zhang et al. [2021] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms, pages 321–384. Springer International Publishing, Cham, 2021. ISBN 978-3-030-60990-0. doi: 10.1007/978-3-030-60990-0_12. URL https://doi.org/10.1007/978-3-030-60990-0_12.
  • Zhang et al. [2018] Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Basar. Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5872–5881. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/zhang18n.html.
  • Wai et al. [2018] Hoi-To Wai, Zhuoran Yang, Zhaoran Wang, and Mingyi Hong. Multi-Agent Reinforcement Learning via Double Averaging Primal-Dual Optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 9672–9683, Red Hook, NY, USA, 2018. Curran Associates Inc.
  • Zhang et al. [2019] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Decentralized Multi-Agent Reinforcement Learning with Networked Agents: Recent Advances, 2019. URL https://arxiv.longhoe.net/abs/1912.03821.
  • Chen et al. [2021] Mingzhe Chen, Deniz Gündüz, Kaibin Huang, Walid Saad, Mehdi Bennis, Aneta Vulgarakis Feljan, and H. Vincent Poor. Distributed Learning in Wireless Networks: Recent Progress and Future Challenges, 2021. URL https://arxiv.longhoe.net/abs/2104.02151.
  • Jiang et al. [2024] Jiechuan Jiang, Kefan Su, and Zongqing Lu. Fully Decentralized Cooperative Multi-Agent Reinforcement Learning: A Survey. arXiv preprint arXiv:2401.04934, 2024.
  • Mguni et al. [2018] David Mguni, Joel Jennings, and Enrique Munoz de Cote. Decentralised Learning in Systems With Many, Many Strategic Agents. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), Apr. 2018. doi: 10.1609/aaai.v32i1.11586. URL https://ojs.aaai.org/index.php/AAAI/article/view/11586.
  • Yongacoglu et al. [2022a] Bora Yongacoglu, Gürdal Arslan, and Serdar Yüksel. Independent Learning in Mean-Field Games: Satisficing Paths and Convergence to Subjective Equilibria, 2022a. URL https://arxiv.longhoe.net/abs/2209.05703.
  • Yongacoglu et al. [2022b] Bora Yongacoglu, Gürdal Arslan, and Serdar Yüksel. Independent Learning and Subjectivity in Mean-Field Games. In 2022 IEEE 61st Conference on Decision and Control (CDC), pages 2845–2850, 2022b. doi: 10.1109/CDC51059.2022.9992399.
  • Grammatico et al. [2015a] Sergio Grammatico, Basilio Gentile, Francesca Parise, and John Lygeros. A Mean Field control approach for demand side management of large populations of Thermostatically Controlled Loads. In 2015 European Control Conference (ECC), pages 3548–3553, 2015a. doi: 10.1109/ECC.2015.7331083.
  • Grammatico et al. [2015b] Sergio Grammatico, Francesca Parise, and John Lygeros. Constrained linear quadratic deterministic mean field control: Decentralized convergence to Nash equilibria in large populations of heterogeneous agents. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 4412–4417, 2015b. doi: 10.1109/CDC.2015.7402908.
  • Parise et al. [2015] Francesca Parise, Sergio Grammatico, Basilio Gentile, and John Lygeros. Network Aggregative Games and Distributed Mean Field Control via Consensus Theory, 2015.
  • Grammatico et al. [2016] Sergio Grammatico, Francesca Parise, Marcello Colombino, and John Lygeros. Decentralized Convergence to Nash Equilibria in Constrained Deterministic Mean Field Control. IEEE Transactions on Automatic Control, 61(11):3315–3329, 2016. doi: 10.1109/TAC.2015.2513368.
  • Doan et al. [2019] Thinh T. Doan, Siva Theja Maguluri, and Justin Romberg. Finite-Time Analysis of Distributed TD(0) with Linear Function Approximation for Multi-Agent Reinforcement Learning, 2019. URL https://arxiv.longhoe.net/abs/1902.07393.
  • Lin et al. [2019] Yixuan Lin, Kaiqing Zhang, Zhuoran Yang, Zhaoran Wang, Tamer Başar, Romeil Sandhu, and Ji Liu. A Communication-Efficient Multi-Agent Actor-Critic Algorithm for Distributed Reinforcement Learning. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 5562–5567, 2019. doi: 10.1109/CDC40024.2019.9029257.
  • Heredia et al. [2020] Paulo Heredia, Hasan Ghadialy, and Shaoshuai Mou. Finite-Sample Analysis of Distributed Q-learning for Multi-Agent Networks. In 2020 American Control Conference (ACC), pages 3511–3516, 2020. doi: 10.23919/ACC45564.2020.9147428.
  • Kar et al. [2013] Soummya Kar, José M. F. Moura, and H. Vincent Poor. $\mathcal{QD}$-Learning: A Collaborative Distributed Strategy for Multi-Agent Reinforcement Learning Through Consensus + Innovations. IEEE Transactions on Signal Processing, 61(7):1848–1862, 2013. doi: 10.1109/TSP.2013.2241057.
  • Suttle et al. [2019] Wesley Suttle, Zhuoran Yang, Kaiqing Zhang, Zhaoran Wang, Tamer Basar, and Ji Liu. A Multi-Agent Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning, 2019. URL https://arxiv.longhoe.net/abs/1903.06372.
  • Cui et al. [2023a] Kai Cui, Gökçe Dayanıklı, Mathieu Laurière, Matthieu Geist, Olivier Pietquin, and Heinz Koeppl. Learning Discrete-Time Major-Minor Mean Field Games. arXiv preprint arXiv:2312.10787, 2023a.
  • Cui and Koeppl [2021] Kai Cui and Heinz Koeppl. Approximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning, 2021. URL https://arxiv.longhoe.net/abs/2102.01585.
  • Guo et al. [2022] Xin Guo, Renyuan Xu, and Thaleia Zariphopoulou. Entropy Regularization for Mean Field Games with Learning. Math. Oper. Res., 47(4):3239–3260, nov 2022. ISSN 0364-765X. doi: 10.1287/moor.2021.1238. URL https://doi.org/10.1287/moor.2021.1238.
  • Yu and Yuan [2023] Xiang Yu and Fengyi Yuan. Time-inconsistent mean-field stopping problems: A regularized equilibrium approach. arXiv preprint arXiv:2311.00381, 2023.
  • Su and Lu [2022] Kefan Su and Zongqing Lu. Divergence-Regularized Multi-Agent Actor-Critic. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 20580–20603. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/su22b.html.
  • Jadbabaie et al. [2003] A. Jadbabaie, Jie Lin, and A.S. Morse. Coordination of groups of mobile autonomous agents using nearest neighbor rules. IEEE Transactions on Automatic Control, 48(6):988–1001, 2003. doi: 10.1109/TAC.2003.812781.
  • Kotsalis et al. [2022] Georgios Kotsalis, Guanghui Lan, and Tianjiao Li. Simple and Optimal Methods for Stochastic Variational Inequalities, II: Markovian Noise and Policy Evaluation in Reinforcement Learning. SIAM Journal on Optimization, 32(2):1120–1155, 2022. doi: 10.1137/20M1381691. URL https://doi.org/10.1137/20M1381691.
  • Rajagopalan and Shah [2010] Shreevatsa Rajagopalan and Devavrat Shah. Distributed Averaging in Dynamic Networks. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’10, page 369–370, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781450300384. doi: 10.1145/1811039.1811091. URL https://doi.org/10.1145/1811039.1811091.
  • Zhang et al. [2020] Kaiqing Zhang, Yang Liu, Ji Liu, Mingyan Liu, and Tamer Basar. Distributed learning of average belief over networks using sequential observations. Automatica, 115:108857, 2020. ISSN 0005-1098. doi: https://doi.org/10.1016/j.automatica.2020.108857. URL https://www.sciencedirect.com/science/article/pii/S0005109820300558.
  • Lin [1992] Long-Ji Lin. Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Mach. Learn., 8(3–4):293–321, may 1992. ISSN 0885-6125. doi: 10.1007/BF00992699. URL https://doi.org/10.1007/BF00992699.
  • Fedus et al. [2020] William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, and Will Dabney. Revisiting Fundamentals of Experience Replay. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020.
  • Xu et al. [2024] Linjie Xu, Zichuan Liu, Alexander Dockhorn, Diego Perez-Liebana, Jinyu Wang, Lei Song, and Jiang Bian. Higher Replay Ratio Empowers Sample-Efficient Multi-Agent Reinforcement Learning, 2024.
  • Laurière [2021] Mathieu Laurière. Numerical Methods for Mean Field Games and Mean Field Type Control, 2021.
  • Cui et al. [2023b] Kai Cui, Christian Fabian, and Heinz Koeppl. Multi-Agent Reinforcement Learning via Mean Field Control: Common Noise, Major Agents and Approximation Properties, 2023b.
  • Eck et al. [2023] Adam Eck, Leen-Kiat Soh, and Prashant Doshi. Decision making in open agent systems. AI Mag., 44(4):508–523, dec 2023. ISSN 0738-4602. doi: 10.1002/aaai.12131. URL https://doi.org/10.1002/aaai.12131.
  • Gao et al. [2024] Yuzhao Gao, Yiming Nie, and Hongliang Wang. A Scalable Multi-agent Reinforcement Learning Approach Based on Value Function Decomposition. In Yi Qu, Mancang Gu, Yifeng Niu, and Wenxing Fu, editors, Proceedings of 3rd 2023 International Conference on Autonomous Unmanned Systems (3rd ICAUS 2023), pages 88–96, Singapore, 2024. Springer Nature Singapore. ISBN 978-981-97-1087-4.
  • Dawood et al. [2023] Murad Dawood, Sicong Pan, Nils Dengler, Siqi Zhou, Angela P Schoellig, and Maren Bennewitz. Safe Multi-Agent Reinforcement Learning for Formation Control without Individual Reference Targets. arXiv preprint arXiv:2312.12861, 2023.
  • Wu et al. [2024a] Zida Wu, Mathieu Lauriere, Samuel Jia Cong Chua, Matthieu Geist, Olivier Pietquin, and Ankur Mehta. Population-aware Online Mirror Descent for Mean-Field Games by Deep Reinforcement Learning. arXiv preprint arXiv:2403.03552, 2024a.
  • Pérolat et al. [2022] Julien Pérolat, Sarah Perrin, Romuald Elie, Mathieu Laurière, Georgios Piliouras, Matthieu Geist, Karl Tuyls, and Olivier Pietquin. Scaling Mean Field Games by Online Mirror Descent. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’22, page 1028–1037, Richland, SC, 2022. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450392136.
  • Guo et al. [2020] Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. A General Framework for Learning Mean-Field Games, 2020. URL https://arxiv.longhoe.net/abs/2003.06069.
  • Subramanian and Mahajan [2019] Jayakumar Subramanian and Aditya Mahajan. Reinforcement Learning in Stationary Mean-Field Games. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, page 251–259, Richland, SC, 2019. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450363099.
  • Yang et al. [2018a] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean Field Multi-Agent Reinforcement Learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5571–5580. PMLR, 10–15 Jul 2018a. URL https://proceedings.mlr.press/v80/yang18d.html.
  • Subramanian et al. [2020] Sriram Ganapathi Subramanian, Matthew E. Taylor, Mark Crowley, and Pascal Poupart. Partially Observable Mean Field Reinforcement Learning, 2020. URL https://arxiv.longhoe.net/abs/2012.15791.
  • Subramanian et al. [2022] Sriram Ganapathi Subramanian, Pascal Poupart, Matthew E. Taylor, and Nidhi Hegde. Multi Type Mean Field Reinforcement Learning, 2022.
  • Subramanian et al. [2021] Sriram Ganapathi Subramanian, Matthew E. Taylor, Mark Crowley, and Pascal Poupart. Decentralized Mean Field Games, 2021. URL https://arxiv.longhoe.net/abs/2112.09099.
  • Busoniu et al. [2008] Lucian Busoniu, Robert Babuska, and Bart De Schutter. A Comprehensive Survey of Multiagent Reinforcement Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008. doi: 10.1109/TSMCC.2007.913919.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT press, 2018.
  • Leottau et al. [2018] David L. Leottau, Javier Ruiz del Solar, and Robert Babuška. Decentralized Reinforcement Learning of Robot Behaviors. Artificial Intelligence, 256:130–159, 2018. ISSN 0004-3702. doi: https://doi.org/10.1016/j.artint.2017.12.001. URL https://www.sciencedirect.com/science/article/pii/S0004370217301674.
  • Lv et al. [2023] Zefang Lv, Liang Xiao, Yousong Du, Guohang Niu, Chengwen Xing, and Wenyuan Xu. Multi-Agent Reinforcement Learning based UAV Swarm Communications Against Jamming. IEEE Transactions on Wireless Communications, pages 1–1, 2023. doi: 10.1109/TWC.2023.3268082.
  • Orr and Dutta [2023] James Orr and Ayan Dutta. Multi-Agent Deep Reinforcement Learning for Multi-Robot Applications: A Survey. Sensors, 23(7), 2023. ISSN 1424-8220. doi: 10.3390/s23073625. URL https://www.mdpi.com/1424-8220/23/7/3625.
  • Guan et al. [2024] Yue Guan, Sai Zou, Haixia Peng, Wei Ni, Yanglong Sun, and Hongfeng Gao. Cooperative UAV Trajectory Design for Disaster Area Emergency Communications: A Multiagent PPO Method. IEEE Internet of Things Journal, 11(5):8848–8859, 2024. doi: 10.1109/JIOT.2023.3320796.
  • Ali et al. [2023] Adeeba Ali, Rashid Ali, and M.F. Baig. Distributed Multi-Agent Deep Reinforcement Learning based Navigation and Control of UAV Swarm for Wildfire Monitoring. In 2023 IEEE 4th Annual Flagship India Council International Subsections Conference (INDISCON), pages 1–8, 2023. doi: 10.1109/INDISCON58499.2023.10270198.
  • Shalev-Shwartz et al. [2016] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
  • Mannion et al. [2016] Patrick Mannion, Jim Duggan, and Enda Howley. An Experimental Review of Reinforcement Learning Algorithms for Adaptive Traffic Signal Control, pages 47–66. Springer International Publishing, Cham, 2016. ISBN 978-3-319-25808-9. doi: 10.1007/978-3-319-25808-9_4. URL https://doi.org/10.1007/978-3-319-25808-9_4.
  • Samvelyan et al. [2019] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, and Shimon Whiteson. The StarCraft Multi-Agent Challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, page 2186–2188, Richland, SC, 2019. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450363099.
  • Vinyals et al. [2019a] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojtek Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, Timo Ewalds, Dan Horgan, Manuel Kroiss, Ivo Danihelka, John Agapiou, Junhyuk Oh, Valentin Dalibard, David Choi, Laurent Sifre, Yury Sulsky, Sasha Vezhnevets, James Molloy, Trevor Cai, David Budden, Tom Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Toby Pohlen, Dani Yogatama, Julia Cohen, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Chris Apps, Koray Kavukcuoglu, Demis Hassabis, and David Silver. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019a.
  • Berner et al. [2019] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique P. d. O. Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with Large Scale Deep Reinforcement Learning, 2019.
  • Rashedi et al. [2016] Navid Rashedi, Mohammad Amin Tajeddini, and Hamed Kebriaei. Markov game approach for multi-agent competitive bidding strategies in electricity market. IET Generation, Transmission & Distribution, 10:3756–3763(7), November 2016. ISSN 1751-8687. URL https://digital-library.theiet.org/content/journals/10.1049/iet-gtd.2016.0075.
  • Shavandi and Khedmati [2022] Ali Shavandi and Majid Khedmati. A multi-agent deep reinforcement learning framework for algorithmic trading in financial markets. Expert Systems with Applications, 208:118124, 2022. ISSN 0957-4174. doi: https://doi.org/10.1016/j.eswa.2022.118124. URL https://www.sciencedirect.com/science/article/pii/S0957417422013082.
  • Leibo et al. [2017] Joel Z. Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-Agent Reinforcement Learning in Sequential Social Dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’17, page 464–473, Richland, SC, 2017. International Foundation for Autonomous Agents and Multiagent Systems.
  • Cao et al. [2018] Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z. Leibo, Karl Tuyls, and Stephen Clark. Emergent Communication through Negotiation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk6WhagRW.
  • Jaques et al. [2019] Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro A. Ortega, DJ Strouse, Joel Z. Leibo, and Nando de Freitas. Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning, 2019.
  • McKee et al. [2020] Kevin R. McKee, Ian Gemp, Brian McWilliams, Edgar A. Duéñez-Guzmán, Edward Hughes, and Joel Z. Leibo. Social diversity and social preferences in mixed-motive reinforcement learning, 2020.
  • Daskalakis et al. [2006] Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The Complexity of Computing a Nash Equilibrium. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, STOC ’06, page 71–78, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595931341. doi: 10.1145/1132516.1132527. URL https://doi.org/10.1145/1132516.1132527.
  • Vinyals et al. [2019b] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, L. Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander Sasha Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom Le Paine, Caglar Gulcehre, Ziyun Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy P. Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, pages 1–5, 2019b.
  • Mcaleer et al. [2020] Stephen Mcaleer, JB Lanier, Roy Fox, and Pierre Baldi. Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 20238–20248. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/e9bcd1b063077573285ae1a41025f5dc-Paper.pdf.
  • Wang et al. [2020b] Lingxiao Wang, Zhuoran Yang, and Zhaoran Wang. Breaking the Curse of Many Agents: Provable Mean Embedding Q-Iteration for Mean-Field Reinforcement Learning. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020b.
  • Zheng et al. [2018] Lianmin Zheng, Jiacheng Yang, Han Cai, Ming Zhou, Weinan Zhang, Jun Wang, and Yong Yu. MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
  • Cui et al. [2022] Kai Cui, Anam Tahir, Gizem Ekinci, Ahmed Elshamanhory, Yannick Eich, Mengguang Li, and Heinz Koeppl. A Survey on Large-Population Systems and Scalable Multi-Agent Reinforcement Learning. arXiv preprint arXiv:2209.03859, 2022.
  • Ornia et al. [2022] Daniel Jarne Ornia, Pedro J. Zufiria, and Manuel Mazo Jr. Mean Field Behavior of Collaborative Multiagent Foragers. IEEE Transactions on Robotics, 38(4):2151–2165, 2022. doi: 10.1109/TRO.2022.3152691.
  • Shiri et al. [2019] Hamid Shiri, Jihong Park, and Mehdi Bennis. Massive Autonomous UAV Path Planning: A Neural Network Based Mean-Field Game Theoretic Approach, 2019.
  • Andréen et al. [2016] David Andréen, Petra Jenning, Nils Napp, and Kirstin Petersen. Emergent structures assembled by large swarms of simple robots. In Acadia, pages 54–61, 2016.
  • Chang et al. [2023] Lu Chang, Liang Shan, Weilong Zhang, and Yuewei Dai. Hierarchical multi-robot navigation and formation in unknown environments via deep reinforcement learning and distributed optimization. Robotics and Computer-Integrated Manufacturing, 83:102570, 2023. ISSN 0736-5845. doi: https://doi.org/10.1016/j.rcim.2023.102570. URL https://www.sciencedirect.com/science/article/pii/S0736584523000467.
  • Meigs et al. [2020] Emily Meigs, Francesca Parise, Asuman E. Ozdaglar, and Daron Acemoglu. Optimal dynamic information provision in traffic routing. CoRR, abs/2001.03232, 2020. URL https://arxiv.longhoe.net/abs/2001.03232.
  • Achdou and Capuzzo-Dolcetta [2010] Yves Achdou and Italo Capuzzo-Dolcetta. Mean Field Games: Numerical Methods. SIAM Journal on Numerical Analysis, 48(3):1136–1162, 2010. doi: 10.1137/090758477. URL https://doi.org/10.1137/090758477.
  • Carlini and Silva [2014] E. Carlini and F. J. Silva. A Fully Discrete Semi-Lagrangian Scheme for a First Order Mean Field Game Problem. SIAM Journal on Numerical Analysis, 52(1):45–67, 2014. doi: 10.1137/120902987. URL https://doi.org/10.1137/120902987.
  • Briceño-Arias et al. [2018] Luis Briceño-Arias, Dante Kalise, and Francisco Silva. Proximal methods for stationary Mean Field Games with local couplings. SIAM Journal on Control and Optimization, 56:801–, 03 2018.
  • Achdou and Laurière [2020] Yves Achdou and Mathieu Laurière. Mean Field Games and Applications: Numerical Aspects, 2020.
  • Huang et al. [2024a] Jiawei Huang, Batuhan Yardim, and Niao He. On the Statistical Efficiency of Mean-Field Reinforcement Learning with General Function Approximation. In Sanjoy Dasgupta, Stephan Mandt, and Yingzhen Li, editors, Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pages 289–297. PMLR, 02–04 May 2024a. URL https://proceedings.mlr.press/v238/huang24a.html.
  • Huang et al. [2024b] Jiawei Huang, Niao He, and Andreas Krause. Model-Based RL for Mean-Field Games is not Statistically Harder than Single-Agent RL, 2024b.
  • Angiuli et al. [2021] Andrea Angiuli, Jean-Pierre Fouque, and Mathieu Laurière. Unified Reinforcement Q-Learning for Mean Field Game and Control Problems, 2021.
  • Mishra et al. [2020] Rajesh K Mishra, Deepanshu Vasal, and Sriram Vishwanath. Model-free Reinforcement Learning for Non-stationary Mean Field Games. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 1032–1037, 2020. doi: 10.1109/CDC42340.2020.9304340.
  • Cacace et al. [2021] Simone Cacace, Fabio Camilli, and Alessandro Goffi. A policy iteration method for mean field games. ESAIM: COCV, 27:85, 2021. doi: 10.1051/cocv/2021081. URL https://doi.org/10.1051/cocv/2021081.
  • Pérolat et al. [2021] Julien Pérolat, Sarah Perrin, Romuald Elie, Mathieu Laurière, Georgios Piliouras, Matthieu Geist, Karl Tuyls, and Olivier Pietquin. Scaling up Mean Field Games with Online Mirror Descent, 2021.
  • Lee et al. [2021] Kiyeob Lee, Desik Rengarajan, Dileep Kalathil, and Srinivas Shakkottai. Reinforcement Learning for Mean Field Games with Strategic Complementarities. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 2458–2466. PMLR, 13–15 Apr 2021. URL https://proceedings.mlr.press/v130/lee21b.html.
  • Anahtarci et al. [2019] Berkay Anahtarci, Can Deha Karıksız, and Naci Saldi. Fitted Q-Learning in Mean-field Games. ArXiv, abs/1912.13309, 2019.
  • Fu et al. [2019] Zuyue Fu, Zhuoran Yang, Yongxin Chen, and Zhaoran Wang. Actor-critic provably finds Nash equilibria of linear-quadratic mean-field games. arXiv preprint arXiv:1910.07498, 2019.
  • Tembine et al. [2012] Hamidou Tembine, Raul Tempone, and Pedro Vilanova. Mean-Field Learning: a Survey, 2012.
  • Cardaliaguet and Hadikhanloo [2017] Pierre Cardaliaguet and Saeed Hadikhanloo. Learning in mean field games: The fictitious play. ESAIM: COCV, 23(2):569–591, 2017. doi: 10.1051/cocv/2016004. URL https://doi.org/10.1051/cocv/2016004.
  • Geist et al. [2022] Matthieu Geist, Julien Pérolat, Mathieu Laurière, Romuald Elie, Sarah Perrin, Olivier Bachem, Rémi Munos, and Olivier Pietquin. Concave Utility Reinforcement Learning: the Mean-Field Game Viewpoint, 2022.
  • Bonnans et al. [2021] J Frédéric Bonnans, Pierre Lavigne, and Laurent Pfeiffer. Generalized conditional gradient and learning in potential mean field games, 2021.
  • Yu et al. [2024] Hanfei Yu, Jian Li, Yang Hua, Xu Yuan, and Hao Wang. Cheaper and Faster: Distributed Deep Reinforcement Learning with Serverless Computing. 2024.
  • Wiggins et al. [2023] Samuel Wiggins, Yuan Meng, Rajgopal Kannan, and Viktor Prasanna. Characterizing Speed Performance of Multi-Agent Reinforcement Learning. arXiv preprint arXiv:2309.07108, 2023.
  • Wu et al. [2024b] Peiliang Wu, Liqiang Tian, Qian Zhang, Bingyi Mao, and Wenbai Chen. MARRGM: Learning Framework for Multi-agent Reinforcement Learning via Reinforcement Recommendation and Group Modification. IEEE Robotics and Automation Letters, pages 1–8, 2024b. doi: 10.1109/LRA.2024.3389813.
  • Patel et al. [2024] Bhrij Patel, Wesley A. Suttle, Alec Koppel, Vaneet Aggarwal, Brian M. Sadler, Amrit Singh Bedi, and Dinesh Manocha. Global Optimality without Mixing Time Oracles in Average-reward RL via Multi-level Actor-Critic, 2024.
  • Huang and Lai [2024] Han Huang and Rongjie Lai. Unsupervised Solution Operator Learning for Mean-Field Games via Sampling-Invariant Parametrizations. arXiv preprint arXiv:2401.15482, 2024.
  • Bauso et al. [2012] Dario Bauso, Hamidou Tembine, and Tamer Başar. Robust Mean Field Games with Application to Production of an Exhaustible Resource. IFAC Proceedings Volumes, 45(13):454–459, 2012. ISSN 1474-6670. doi: https://doi.org/10.3182/20120620-3-DK-2025.00135. URL https://www.sciencedirect.com/science/article/pii/S1474667015377302. 7th IFAC Symposium on Robust Control Design.
  • Bauso et al. [2016] Dario Bauso, Hamidou Tembine, and Tamer Başar. Robust mean field games. Dynamic games and applications, 6(3):277–303, 2016.
  • Moon and Başar [2017] Jun Moon and Tamer Başar. Linear Quadratic Risk-Sensitive and Robust Mean Field Games. IEEE Transactions on Automatic Control, 62(3):1062–1077, 2017. doi: 10.1109/TAC.2016.2579264.
  • Huang and Huang [2017] Jianhui Huang and Minyi Huang. Robust Mean Field Linear-Quadratic-Gaussian Games with Unknown $L^{2}$-Disturbance. SIAM Journal on Control and Optimization, 55(5):2811–2840, 2017. doi: 10.1137/15M1014437. URL https://doi.org/10.1137/15M1014437.
  • Yang et al. [2018b] Chungang Yang, Haoxiang Dai, Jiandong Li, Yue Zhang, and Zhu Han. Distributed Interference-Aware Power Control in Ultra-Dense Small Cell Networks: A Robust Mean Field Game. IEEE Access, 6:12608–12619, 2018b. doi: 10.1109/ACCESS.2018.2799138.
  • Tirumalai and Baras [2022] Amoolya Tirumalai and John S. Baras. A Robust Mean-field Game of Boltzmann-Vlasov-like Traffic Flow. In 2022 American Control Conference (ACC), pages 556–561, 2022. doi: 10.23919/ACC53348.2022.9867331.
  • Aydın and Saldi [2023] Uğur Aydın and Naci Saldi. Robustness and Approximation of Discrete-time Mean-field Games under Discounted Cost Criterion. arXiv preprint arXiv:2310.10828, 2023.
  • Eiben and Smith [2015] A. E. Eiben and J. E. Smith. What Is an Evolutionary Algorithm?, pages 25–48. Springer Berlin Heidelberg, Berlin, Heidelberg, 2015. ISBN 978-3-662-44874-8. doi: 10.1007/978-3-662-44874-8_3. URL https://doi.org/10.1007/978-3-662-44874-8_3.
  • Haasdijk et al. [2014] Evert Haasdijk, Nicolas Bredeche, and Agoston E Eiben. Combining environment-driven adaptation and task-driven optimisation in evolutionary robotics. PloS one, 9(6):e98466, 2014.
  • Trueba et al. [2015] Pedro Trueba, Abraham Prieto, Francisco Bellas, and Richard J. Duro. Embodied Evolution for Collective Indoor Surveillance and Location. In Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO Companion ’15, page 1241–1242, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450334884. doi: 10.1145/2739482.2768490. URL https://doi.org/10.1145/2739482.2768490.
  • Hart et al. [2015] Emma Hart, Andreas Steyven, and Ben Paechter. Improving Survivability in Environment-Driven Distributed Evolutionary Algorithms through Explicit Relative Fitness and Fitness Proportionate Communication. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO ’15, page 169–176, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450334723. doi: 10.1145/2739480.2754688. URL https://doi.org/10.1145/2739480.2754688.
  • Fernández Pérez et al. [2018] Iñaki Fernández Pérez, Amine Boumaza, and François Charpillet. Maintaining Diversity in Robot Swarms with Distributed Embodied Evolution. In Marco Dorigo, Mauro Birattari, Christian Blum, Anders L. Christensen, Andreagiovanni Reina, and Vito Trianni, editors, Swarm Intelligence, pages 395–402, Cham, 2018. Springer International Publishing. ISBN 978-3-030-00533-7.
  • Fernández Pérez and Sanchez [2019] Iñaki Fernández Pérez and Stéphane Sanchez. Influence of Local Selection and Robot Swarm Density on the Distributed Evolution of GRNs. In Paul Kaufmann and Pedro A. Castillo, editors, Applications of Evolutionary Computation, pages 567–582, Cham, 2019. Springer International Publishing. ISBN 978-3-030-16692-2.
  • Gomes and Christensen [2013] Jorge Gomes and Anders L. Christensen. Generic Behaviour Similarity Measures for Evolutionary Swarm Robotics. In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, GECCO ’13, page 199–206, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450319638. doi: 10.1145/2463372.2463398. URL https://doi.org/10.1145/2463372.2463398.
  • Prieto et al. [2016] Abraham Prieto, Francisco Bellas, Pedro Trueba, and Richard J Duro. Real-time optimization of dynamic problems through distributed embodied evolution. Integrated Computer-Aided Engineering, 23(3):237–253, 2016.
  • Cambier et al. [2020] Nicolas Cambier, Roman Miletitch, Vincent Fremont, Marco Dorigo, Eliseo Ferrante, and Vito Trianni. Language Evolution in Swarm Robotics: A Perspective. Frontiers in Robotics and AI, 7, 2020. ISSN 2296-9144. doi: 10.3389/frobt.2020.00012. URL https://www.frontiersin.org/articles/10.3389/frobt.2020.00012.
  • Cambier et al. [2018] Nicolas Cambier, Vincent Frémont, Vito Trianni, and Eliseo Ferrante. Embodied evolution of self-organised aggregation by cultural propagation. In Marco Dorigo, Mauro Birattari, Christian Blum, Anders L. Christensen, Andreagiovanni Reina, and Vito Trianni, editors, Swarm Intelligence, pages 351–359, Cham, 2018. Springer International Publishing.
  • Cambier et al. [2021] Nicolas Cambier, Dario Albani, Vincent Fremont, Vito Trianni, and Eliseo Ferrante. Cultural evolution of probabilistic aggregation in synthetic swarms. Applied Soft Computing, 113:108010, 2021. ISSN 1568-4946. doi: https://doi.org/10.1016/j.asoc.2021.108010. URL https://www.sciencedirect.com/science/article/pii/S1568494621009327.
  • Berner et al. [2023] Rico Berner, Thilo Gross, Christian Kuehn, Jürgen Kurths, and Serhiy Yanchuk. Adaptive Dynamical Networks, 2023.
  • Ma et al. [2024] Zhuangzhuang Ma, Lei Shi, Kai Chen, Jinliang Shao, and Yuhua Cheng. Multi-Agent Bipartite Flocking Control over Cooperation-Competition Networks with Asynchronous Communications. IEEE Transactions on Signal and Information Processing over Networks, pages 1–12, 2024. doi: 10.1109/TSIPN.2024.3384817.
  • Piazza et al. [2024] Nancirose Piazza, Vahid Behzadan, and Stefan Sarkadi. The Power in Communication: Power Regularization of Communication for Autonomy in Cooperative Multi-Agent Reinforcement Learning, 2024.
  • Tang et al. [2024] Huaze Tang, Yuanquan Hu, Fanfan Zhao, Junji Yan, Ting Dong, and Wenbo Ding. M3ARL: Moment-Embedded Mean-Field Multi-Agent Reinforcement Learning for Continuous Action Space. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7250–7254, 2024. doi: 10.1109/ICASSP48485.2024.10448058.
  • Vieillard et al. [2020] Nino Vieillard, Olivier Pietquin, and Matthieu Geist. Munchausen Reinforcement Learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4235–4246. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/2c6a0bae0f071cbbf0bb3d5b11d90a82-Paper.pdf.
  • Wu et al. [2024c] Zida Wu, Mathieu Lauriere, Samuel Jia Cong Chua, Matthieu Geist, Olivier Pietquin, and Ankur Mehta. Population-aware Online Mirror Descent for Mean-Field Games by Deep Reinforcement Learning, 2024c.
  • Perrin et al. [2022] Sarah Perrin, Mathieu Laurière, Julien Pérolat, Romuald Élie, Matthieu Geist, and Olivier Pietquin. Generalization in mean field games by learning master policies. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 9413–9421, 2022.
  • Carmona et al. [2021] René Carmona, Mathieu Laurière, and Zongjun Tan. Model-Free Mean-Field Reinforcement Learning: Mean-Field MDP and Mean-Field Q-Learning, 2021.
  • Zhang and Sutton [2017] Shangtong Zhang and Richard S Sutton. A deeper look at experience replay. arXiv preprint arXiv:1712.01275, 2017.

Technical appendices to article
“Networked Communication for Decentralised Agents in Mean-Field Games”

Appendix A Further definitions and assumptions for theorems in Sec. 3.3

Assumption 1 (Lipschitz continuity of $P$ and $R$, from Assumption 1, Yardim et al. [6]).

There exist constants $K_{\mu},K_{s},K_{a},L_{\mu},L_{s},L_{a}\in\mathbb{R}_{\geq 0}$ such that $\forall s,s'\in\mathcal{S},\forall a,a'\in\mathcal{A},\forall\mu,\mu'\in\Delta_{\mathcal{S}}$,

$$||P(\cdot|s,a,\mu)-P(\cdot|s',a',\mu')||_{1}\leq K_{\mu}||\mu-\mu'||_{1}+K_{s}d(s,s')+K_{a}d(a,a'),$$
$$|R(s,a,\mu)-R(s',a',\mu')|\leq L_{\mu}||\mu-\mu'||_{1}+L_{s}d(s,s')+L_{a}d(a,a').$$ ∎
Definition 9 (Population update operator, from Definition 3.1, Yardim et al. [6]).

The single-step population update operator $\Gamma_{pop}:\Delta_{\mathcal{S}}\times\Pi\rightarrow\Delta_{\mathcal{S}}$ is defined as, $\forall s\in\mathcal{S}$:

$$\Gamma_{pop}(\mu,\pi)(s):=\sum_{s'\in\mathcal{S}}\sum_{a'\in\mathcal{A}}\mu(s')\pi(a'|s')P(s|s',a',\mu).$$

Let us use the shorthand notation $\Gamma^{n}_{pop}(\mu,\pi):=\underbrace{\Gamma_{pop}(\dots\Gamma_{pop}(\Gamma_{pop}(\mu,\pi),\pi),\dots,\pi)}_{n\text{ times}}$. ∎
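For finite $\mathcal{S}$ and $\mathcal{A}$, Definition 9 can be sketched in a few lines of Python. This is only an illustration: the names `gamma_pop`, `gamma_pop_n` and the two-state kernel `toy_kernel` are assumptions introduced here and are not part of Alg. 1.

```python
from typing import Callable, List

Dist = List[float]            # mu[s]: probability mass on state s
Policy = List[List[float]]    # pi[s][a]: probability of action a in state s
# kernel(s_next, s, a, mu) returns P(s_next | s, a, mu)
Kernel = Callable[[int, int, int, Dist], float]


def gamma_pop(mu: Dist, pi: Policy, kernel: Kernel) -> Dist:
    """One application of the population update operator of Definition 9."""
    n_states, n_actions = len(mu), len(pi[0])
    return [
        sum(
            mu[s_prev] * pi[s_prev][a] * kernel(s, s_prev, a, mu)
            for s_prev in range(n_states)
            for a in range(n_actions)
        )
        for s in range(n_states)
    ]


def gamma_pop_n(mu: Dist, pi: Policy, kernel: Kernel, n: int) -> Dist:
    """n-fold composition, i.e. the shorthand Gamma_pop^n(mu, pi)."""
    for _ in range(n):
        mu = gamma_pop(mu, pi, kernel)
    return mu


def toy_kernel(s_next: int, s: int, a: int, mu: Dist) -> float:
    """Hypothetical two-state kernel: action 0 tends to stay, action 1 tends
    to switch, and crowded states are left slightly more often."""
    p_stay = 0.9 if a == 0 else 0.1
    p_stay -= 0.05 * mu[s]    # dependence on the mean field
    return p_stay if s_next == s else 1.0 - p_stay


mu0 = [0.8, 0.2]
pi = [[0.5, 0.5], [0.5, 0.5]]
mu1 = gamma_pop(mu0, pi, toy_kernel)       # one step
mu5 = gamma_pop_n(mu0, pi, toy_kernel, 5)  # five steps
```

Because each row of the toy kernel sums to one, every iterate remains a probability distribution over the two states.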

We recall that $\Gamma_{pop}$ is known to be Lipschitz:

Lemma 1 (Lipschitz population updates, from Lemma 3.2, Yardim et al. [6]).

$\Gamma_{pop}$ is Lipschitz with

$$||\Gamma_{pop}(\mu,\pi)-\Gamma_{pop}(\mu',\pi')||_{1}\leq L_{pop,\mu}||\mu-\mu'||_{1}+\frac{K_{a}}{2}||\pi-\pi'||_{1},$$

where $L_{pop,\mu}:=\left(\frac{K_{s}}{2}+\frac{K_{a}}{2}+K_{\mu}\right)$, $\forall\pi\in\Pi,\mu\in\Delta_{\mathcal{S}}$. ∎

For stationary MFGs the population distribution must be stable with respect to a policy, requiring that $\Gamma_{pop}(\cdot,\pi)$ is contractive $\forall\pi\in\Pi$:

Assumption 2 (Stable population, from Assumption 2, Yardim et al. [6]).

Population updates are stable, i.e. $L_{pop,\mu}<1$. ∎
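As a small worked example of how Lemma 1 and Assumption 2 interact, the sketch below computes $L_{pop,\mu}$ from the kernel's Lipschitz constants and, for a fixed policy ($\pi=\pi'$ in Lemma 1), the number of population updates after which two initial distributions are guaranteed to be within a tolerance of each other. The function names and the numerical constants are hypothetical.

```python
import math


def l_pop_mu(k_s: float, k_a: float, k_mu: float) -> float:
    """L_{pop,mu} = K_s/2 + K_a/2 + K_mu, as defined in Lemma 1."""
    return k_s / 2 + k_a / 2 + k_mu


def steps_to_contract(l_pop: float, initial_gap: float, tol: float) -> int:
    """Smallest n with l_pop**n * initial_gap <= tol, for a fixed policy.

    Uses the geometric bound obtained by applying Lemma 1 n times
    with pi = pi'. This sketch needs 0 < l_pop < 1.
    """
    if not 0.0 < l_pop < 1.0:
        raise ValueError("this sketch needs 0 < L_pop,mu < 1 (Assumption 2)")
    if initial_gap <= tol:
        return 0
    return math.ceil(math.log(tol / initial_gap) / math.log(l_pop))


# Hypothetical Lipschitz constants of the transition kernel.
l_ok = l_pop_mu(k_s=0.5, k_a=0.5, k_mu=0.25)   # 0.75, Assumption 2 holds
l_bad = l_pop_mu(k_s=1.0, k_a=1.0, k_mu=0.5)   # 1.5, Assumption 2 fails

# The L1 distance between two distributions is at most 2.
n = steps_to_contract(l_ok, initial_gap=2.0, tol=1e-3)
```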

Definition 10 (Stable population operator $\Gamma_{pop}^{\infty}$, from Definition 3.3, Yardim et al. [6]).

Given Assumption 2, the operator $\Gamma_{pop}^{\infty}:\Pi\rightarrow\Delta_{\mathcal{S}}$ maps a given policy to its unique stable population distribution such that $\Gamma_{pop}(\Gamma_{pop}^{\infty}(\pi),\pi)=\Gamma_{pop}^{\infty}(\pi)$, i.e. the unique fixed point of $\Gamma_{pop}(\cdot,\pi):\Delta_{\mathcal{S}}\rightarrow\Delta_{\mathcal{S}}$.
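When $\Gamma_{pop}(\cdot,\pi)$ is a contraction, its fixed point can be approximated by simply iterating the map. The following sketch assumes a generic callable `update` standing in for $\Gamma_{pop}(\cdot,\pi)$ with $\pi$ held fixed; `stable_population` and `toy_update` are illustrative names, and the two-state dynamics are made up for the example.

```python
from typing import Callable, List

Dist = List[float]


def stable_population(update: Callable[[Dist], Dist], mu0: Dist,
                      tol: float = 1e-12, max_iter: int = 10_000) -> Dist:
    """Iterate mu <- update(mu) until the L1 change drops below tol.

    `update` stands for Gamma_pop(., pi) with the policy pi held fixed.
    Convergence is only guaranteed when that map is a contraction
    (Assumption 2); otherwise this raises after max_iter steps.
    """
    mu = list(mu0)
    for _ in range(max_iter):
        nxt = update(mu)
        if sum(abs(x - y) for x, y in zip(nxt, mu)) < tol:
            return nxt
        mu = nxt
    raise RuntimeError("no fixed point found; is the update contractive?")


def toy_update(mu: Dist) -> Dist:
    """Hypothetical Gamma_pop(., pi) on two states for some fixed pi:
    the probability of leaving a state grows with its occupancy."""
    leave = [0.2 + 0.2 * mu[0], 0.4 + 0.1 * mu[1]]
    return [
        mu[0] * (1 - leave[0]) + mu[1] * leave[1],
        mu[1] * (1 - leave[1]) + mu[0] * leave[0],
    ]


# Two different starting points should reach the same stable distribution.
mu_a = stable_population(toy_update, [1.0, 0.0])
mu_b = stable_population(toy_update, [0.0, 1.0])
```

For this toy map the mass on the first state evolves as $x\mapsto 0.5+0.2x-0.1x^{2}$, whose slope on $[0,1]$ is at most $0.2$, so both runs converge to the same point, $x=\sqrt{21}-4$.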

Definition 11 ($Q_{h}$ and $q_{h}$ functions).

We define, for any pair $(s,a)\in\mathcal{S}\times\mathcal{A}$:

$$Q_{h}(s,a|\pi,\mu):=\mathbb{E}\left[\sum^{\infty}_{t=0}\gamma^{t}\big(R(s_{t},a_{t},\mu)+h(\pi(s_{t}))\big)\,\middle|\,\begin{subarray}{c}s_{0}=s,\\ a_{0}=a\end{subarray},\begin{subarray}{c}s_{t+1}\sim P(\cdot|s_{t},a_{t},\mu),\\ a_{t+1}\sim\pi(\cdot|s_{t+1})\end{subarray},\forall t\geq 0\right]$$

and

$$q_{h}(s,a|\pi,\mu):=R(s,a,\mu)+\gamma\sum_{s',a'}P(s'|s,a,\mu)\pi(a'|s')Q_{h}(s',a'|\pi,\mu).$$ ∎
Definition 12 ($\Gamma_{q}$ operator).

The operator $\Gamma_{q}:\Pi\times\Delta_{\mathcal{S}}\rightarrow\mathcal{Q}$ mapping population-policy pairs to Q-functions is defined as $\Gamma_{q}(\pi,\mu):=q_{h}(\cdot,\cdot|\pi,\mu)\in\mathcal{Q}$ $\forall\pi\in\Pi,\mu\in\Delta_{\mathcal{S}}$. ∎
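For a finite game with the mean field held fixed, $Q_{h}$ and $q_{h}$ from Definition 11 (and hence $\Gamma_{q}$) can be evaluated by repeated Bellman backups. The sketch below is illustrative only: it assumes Shannon entropy as the regulariser $h$, and the names `q_functions`, `entropy` and the toy arrays `R`, `P`, `pi` are introduced here rather than taken from the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple

Table = List[List[float]]          # indexed as table[s][a]


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy, used here as an assumed choice of regulariser h."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def q_functions(
    R: Table,                       # R[s][a], reward at a fixed mean field mu
    P: List[List[List[float]]],     # P[s][a][s2] = P(s2 | s, a, mu)
    pi: Table,                      # pi[s][a]
    gamma: float,
    h: Callable[[Sequence[float]], float] = entropy,
    sweeps: int = 500,
) -> Tuple[Table, Table]:
    """Return approximations of (Q_h, q_h) from Definition 11."""
    n_s, n_a = len(R), len(R[0])
    bonus = [h(pi[s]) for s in range(n_s)]

    def discounted_next(Q: Table, s: int, a: int) -> float:
        # gamma * sum_{s2, a2} P(s2 | s, a, mu) * pi(a2 | s2) * Q(s2, a2)
        return gamma * sum(
            P[s][a][s2] * pi[s2][a2] * Q[s2][a2]
            for s2 in range(n_s)
            for a2 in range(n_a)
        )

    # Q_h satisfies Q_h(s,a) = R(s,a) + h(pi(s)) + discounted_next(Q_h, s, a).
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(sweeps):
        Q = [
            [R[s][a] + bonus[s] + discounted_next(Q, s, a) for a in range(n_a)]
            for s in range(n_s)
        ]
    # q_h drops the regulariser from the first step only.
    q = [
        [R[s][a] + discounted_next(Q, s, a) for a in range(n_a)]
        for s in range(n_s)
    ]
    return Q, q


# Hypothetical two-state, two-action game with the mean field held fixed.
R = [[1.0, 0.0], [0.0, 0.5]]
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]
pi = [[0.5, 0.5], [0.25, 0.75]]
Q_h, q_h = q_functions(R, P, pi, gamma=0.9)
```

Unrolling the first term of the sum in Definition 11 gives $q_{h}(s,a|\pi,\mu)=Q_{h}(s,a|\pi,\mu)-h(\pi(s))$, which offers a quick consistency check on the two returned tables.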

We also assume that the regulariser $h$ ensures that all actions at all states are explored with non-zero probability:

Assumption 3 (Persistence of excitation, from Assumption 3, Yardim et al. [6]).

We assume there exists $p_{inf}>0$ such that:

  1. $\pi_{max}(a|s)\geq p_{inf}$ $\forall s\in\mathcal{S},a\in\mathcal{A}$,

  2. For any $\pi\in\Pi$ and $q\in\mathcal{Q}$ that satisfy, $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$, $\pi(a|s)\geq p_{inf}$ and $0\leq q(s,a)\leq Q_{max}$, it holds that $\Gamma^{md}_{\eta}(q,\pi)(a|s)\geq p_{inf}$, $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$. ∎

Assumption 4 (Sufficient mixing, from Assumption 4, Yardim et al. [6]).

For any $\pi\in\Pi$ satisfying $\pi(a|s)\geq p_{inf}>0$ $\forall s\in\mathcal{S},a\in\mathcal{A}$, and any initial states $\{s^{i}_{0}\}_{i}\in\mathcal{S}^{N}$, there exist $T_{mix}>0,\delta_{mix}>0$ such that $\mathbb{P}(s^{j}_{T_{mix}}=s'|\{s^{i}_{0}\}_{i})\geq\delta_{mix}$, $\forall s'\in\mathcal{S},j\in[N]$. ∎

Definition 13 (Nested learning operator).

For a learning rate $\eta>0$, $\Gamma_{\eta}:\Pi\rightarrow\Pi$ is defined as

$$\Gamma_{\eta}(\pi):=\Gamma_{\eta}^{md}(\Gamma_{q}(\pi,\Gamma_{pop}^{\infty}(\pi)),\pi).$$ ∎
Lemma 2 (Lipschitz continuity of $\Gamma_{\eta}$, from Lemma 3.7, Yardim et al. [6]).

For any $\eta>0$, the operator $\Gamma_{\eta}:\Pi\rightarrow\Pi$ is Lipschitz with constant $L_{\Gamma_{\eta}}$ on $(\Pi,||\cdot||_{1})$. ∎

Appendix B Full theorems and proofs

B.1 Sample guarantees of independent-learning case

Lemma 3 (Independent learning, from Theorem 4.5, Yardim et al. [6]).

Define $t_{0}:=\frac{16(1+\gamma)^{2}}{((1-\gamma)\delta_{mix}p_{inf})^{2}}$. Assume that Assumptions 1, 2, 3 and 4 hold, that $\eta>0$ satisfies $L_{\Gamma_{\eta}}<1$, and that $\pi^{*}$ is the unique MFG-NE. Let the learning rates be $\beta_{m}=\frac{2}{(1-\gamma)(t_{0}+m-1)}$ $\forall m\geq 0$, and let $\varepsilon>0$ be arbitrary.
There exists a problem-dependent constant $a\in[0,\infty)$ such that if $K=\frac{\log 8\varepsilon^{-1}}{\log L^{-1}_{\Gamma_{\eta}}}$, $M_{pg}>\mathcal{O}(\varepsilon^{-2-a})$ and $M_{td}>\mathcal{O}(\log^{2}\varepsilon^{-1})$, then the random output $\{\pi^{i}_{K}\}_{i}$ of Alg. 1 when run with $C=0$ (such that there is no communication) satisfies, for all agents $i=1,\dots,N$,

$$\mathbb{E}\left[||\pi^{i}_{K}-\pi^{*}||_{1}\right]\;\leq\;\varepsilon+\mathcal{O}\left(\frac{1}{\sqrt{N}}\right).$$ ∎
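To get a feel for the magnitudes involved, the quantities $t_{0}$, $\beta_{m}$ and $K$ in Lemma 3 can be evaluated directly. The sketch below uses made-up values for $\gamma$, $\delta_{mix}$, $p_{inf}$, $\varepsilon$ and $L_{\Gamma_{\eta}}$; rounding $K$ up to an integer is an assumption of the sketch, since the lemma states $K$ as a ratio of logarithms.

```python
import math


def t0(gamma: float, delta_mix: float, p_inf: float) -> float:
    """t_0 = 16 (1 + gamma)^2 / ((1 - gamma) * delta_mix * p_inf)^2."""
    return 16 * (1 + gamma) ** 2 / ((1 - gamma) * delta_mix * p_inf) ** 2


def beta(m: int, gamma: float, t_0: float) -> float:
    """Learning rate beta_m = 2 / ((1 - gamma) * (t_0 + m - 1)), m >= 0."""
    return 2 / ((1 - gamma) * (t_0 + m - 1))


def outer_iterations(eps: float, lipschitz: float) -> int:
    """K = log(8 / eps) / log(1 / L), rounded up (rounding is assumed here)."""
    return math.ceil(math.log(8 / eps) / math.log(1 / lipschitz))


# Hypothetical problem constants, chosen only for illustration.
gamma, delta_mix, p_inf = 0.9, 0.1, 0.1
t_0 = t0(gamma, delta_mix, p_inf)        # about 5.8e7
beta_0 = beta(0, gamma, t_0)             # about 3.5e-7
K = outer_iterations(eps=0.1, lipschitz=0.5)
```

With these illustrative constants the first step size is already of order $10^{-7}$, which shows how conservative the theoretical learning rates can be.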

B.2 Full version of Theorem 1

Theorem 1 (Networked learning with random adoption).

For $p_{inf}$ and $\delta_{mix}$ defined in Assumptions 3 and 4 respectively, define $t_{0}:=\frac{16(1+\gamma)^{2}}{((1-\gamma)\delta_{mix}p_{inf})^{2}}$. Assume that Assumptions 1, 2, 3 and 4 hold, and that $\pi^{*}$ is the unique MFG-NE policy. For $L_{\Gamma_{\eta}}$ defined in Lemma 2, we assume $\eta>0$ satisfies $L_{\Gamma_{\eta}}<1$. Let the learning rates be $\beta_{m}=\frac{2}{(1-\gamma)(t_{0}+m-1)}$ $\forall m\geq 0$, and let $\varepsilon>0$ be arbitrary. Assume also that $C>0$, with $\tau_{k}\rightarrow\infty$.
There exists a problem-dependent constant $a\in[0,\infty)$ such that if $K=\frac{\log 8\varepsilon^{-1}}{\log L^{-1}_{\Gamma_{\eta}}}$, $M_{pg}>\mathcal{O}(\varepsilon^{-2-a})$ and $M_{td}>\mathcal{O}(\log^{2}\varepsilon^{-1})$, then the random output $\{\pi^{i}_{K}\}_{i}$ of Alg. 1 preserves the sample guarantees of the independent-learning case given in Lemma 3, i.e. the output satisfies, for all agents $i=1,\dots,N$,

$$\mathbb{E}\left[||\pi^{i}_{K}-\pi^{*}||_{1}\right]\;\leq\;\varepsilon+\mathcal{O}\left(\frac{1}{\sqrt{N}}\right).$$

(Proof in Appx. B.3.)∎

B.3 Proof of Theorem 1

Proof.

If $\tau_{k}\rightarrow\infty$, the softmax function that defines the probability of a received policy being adopted in Line 17 of Alg. 1 gives a uniform distribution. Policies are thus exchanged at random between communicating agents an arbitrary $C>0$ times, which does not affect the random output of the algorithm, such that the random output satisfies the same expectation as if $C=0$. ∎
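The limiting behaviour used in this proof is easy to check numerically. The sketch below assumes the adoption rule has the usual temperature-scaled form, with the probability of adopting policy $j$ proportional to $\exp(\sigma^{j}/\tau)$; the exact expression is the one in Line 17 of Alg. 1, and the function name and scores here are hypothetical.

```python
import math
from typing import List


def adoption_probabilities(sigmas: List[float], tau: float) -> List[float]:
    """Softmax over scores sigma^j with temperature tau > 0.

    Assumed form: Pr(adopt j) is proportional to exp(sigma^j / tau).
    Subtracting the maximum score keeps the exponentials well-behaved.
    """
    top = max(sigmas)
    weights = [math.exp((s - top) / tau) for s in sigmas]
    total = sum(weights)
    return [w / total for w in weights]


scores = [1.0, 2.0, 5.0]                           # hypothetical sigma values
sharp = adoption_probabilities(scores, tau=0.01)   # nearly always picks index 2
flat = adoption_probabilities(scores, tau=1e9)     # nearly uniform
```

As $\tau$ grows the scores stop influencing the choice, so adoption becomes a uniform random draw among the received policies, which is the case covered by Theorem 1.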

B.4 Conditional TD learning from a single continuous run of the empirical distribution of $N$ agents

Lemma 4 (Conditional TD learning from a single continuous run of the empirical distribution of $N$ agents, from Theorem 4.2, Yardim et al. [6]).

Define $t_{0}:=\frac{16(1+\gamma)^{2}}{((1-\gamma)\delta_{mix}p_{inf})^{2}}$. Assume Assumption 4 holds and let policies $\{\pi^{i}\}_{i}$ be given such that $\pi^{i}(a|s)\geq p_{inf}$ $\forall i$. Assume Lines 5-11 of Alg. 1 are run with policies $\{\pi^{i}\}_{i}$, arbitrary initial agent states $\{s^{i}_{0}\}_{i}$, learning rates $\beta_{m}=\frac{2}{(1-\gamma)(t_{0}+m-1)}$, $\forall m\geq 0$ and $M_{pg}>\mathcal{O}(\varepsilon^{-2})$, $M_{td}>\mathcal{O}(\log\varepsilon^{-1})$. If $\bar{\pi}\in\Pi$ is an arbitrary policy, $\Delta:=\sum_{i=1}^{N}||\pi^{i}-\bar{\pi}||_{1}$ and $Q^{*}:=Q_{h}(\cdot,\cdot|\bar{\pi},\mu_{\bar{\pi}})$, then the random output $\hat{Q}^{i}_{M_{pg}}$ of Lines 5-11 satisfies

$$\mathbb{E}\left[||\hat{Q}^{i}_{M_{pg}}-Q^{*}||_{\infty}\right]\leq\varepsilon+\mathcal{O}\left(\frac{1}{\sqrt{N}}+\frac{1}{N}\Delta+||\pi^{i}-\bar{\pi}||_{1}\right).$$

B.5 Full version of Theorem 2

Theorem 2 (Networked learning with non-random adoption).

Assume that Assumptions 1, 2, 3 and 4 hold, and that Alg. 1 is run with learning rates and constants as defined in Thm. 1, except now $\tau_k \in \mathbb{R}_{>0}$. Assume that $\sigma^{i}_{k+1}$ is generated uniquely for each $i$, in a manner independent of any metric related to $\pi^{i}_{k+1}$, e.g. $\sigma^{i}_{k+1}$ is random or related only to the index $i$ (so as not to bias the spread of any particular policy). Let the random output of this Algorithm be denoted as $\{\pi^{i,net}_{K}\}_i$. Also consider an independent-learning version of the algorithm (i.e. with the same parameters except $C = 0$) and denote its random output $\{\pi^{j,ind}_{K}\}_j$; and a centralised version of the algorithm with the same parameters (see Rem. 3) and denote its random output as $\pi^{cent}_{K}$.
Then for all agents $i = 1,\dots,N$ and $j = 1,\dots,N$, the random outputs $\{\pi^{i,net}_{K}\}_i$, $\{\pi^{j,ind}_{K}\}_j$ and $\pi^{cent}_{K}$ satisfy

\[ \mathbb{E}\left[\|\pi^{cent}_{K} - \pi^{*}\|_{1}\right] \;\leq\; \mathbb{E}\left[\|\pi^{i,net}_{K} - \pi^{*}\|_{1}\right] \;\leq\; \mathbb{E}\left[\|\pi^{j,ind}_{K} - \pi^{*}\|_{1}\right] \;\leq\; \varepsilon + \mathcal{O}\left(\frac{1}{\sqrt{N}}\right). \]

(Proof in Appx. B.6.)∎

B.6 Proof of Theorem 2

Proof.

We build on the proof of our Lemma 3, given in Theorem D.9 of Yardim et al. [6], where the sample guarantees of the independent case are worse than those of the centralised algorithm as a result of the divergence between the decentralised policies due to the stochasticity of the PMA updates. For an arbitrary policy $\bar{\pi}_k \in \Pi$, for all $k = 0,1,\dots,K$ define the policy divergence as the random variable $\Delta_k := \sum_{i=1}^{N} \|\pi^{i}_{k} - \bar{\pi}_{k}\|_{1}$. We can say that $\Delta_{k,cent} = 0$ $\forall k$ is the divergence in the centralised case, while in the networked case the policy divergence is $\Delta_{k+1,c}$ after communication round $c \in \{1,\dots,C\}$. The independent case is equivalent to the scenario when $C = 0$, such that its policy divergence can be written $\Delta_{k+1,0}$.
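As a minimal sketch of the definition above, the divergence can be computed for tabular policies as follows. The function names, the nested-list policy layout and the toy values are illustrative and not taken from the paper. The sketch also assumes that the norm on policies is the largest per-state $L_1$ distance between action distributions; if a different normalisation is intended, only `l1_policy_distance` would change.

```python
def l1_policy_distance(pi_a, pi_b):
    """Largest per-state L1 distance between two tabular policies.

    Each policy is a list over states of action-probability lists.
    Assumption of this sketch: the norm on policies is the maximum
    over states of the L1 distance between action distributions.
    """
    return max(
        sum(abs(p - q) for p, q in zip(row_a, row_b))
        for row_a, row_b in zip(pi_a, pi_b)
    )


def policy_divergence(policies, reference):
    """Delta_k: summed distance of every agent's policy to a reference policy."""
    return sum(l1_policy_distance(pi, reference) for pi in policies)


# Toy example: 2 states, 2 actions, 3 agents (all values are illustrative).
reference = [[0.5, 0.5], [0.5, 0.5]]
identical = [reference, reference, reference]  # every agent holds the same policy
diverged = [reference, [[1.0, 0.0], [0.5, 0.5]], [[0.75, 0.25], [0.25, 0.75]]]

print(policy_divergence(identical, reference))  # zero divergence
print(policy_divergence(diverged, reference))   # strictly positive divergence
```

When every agent holds the reference policy, as in the centralised case, the divergence is zero; any disagreement between agents makes it strictly positive.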

For $\tau_k \in \mathbb{R}_{>0}$, the adoption probability $\Pr\left(\sigma^{adopted}_{k+1} = \sigma^{j}_{k+1}\right) = \frac{\exp(\sigma^{j}_{k+1}/\tau_k)}{\sum^{[J^{i}_{t}]}_{x=1} \exp(\sigma^{x}_{k+1}/\tau_k)}$ (as in Line 17 of Alg. 1) is higher for some $j \in J^{i}_{t}$ than for others.
This means that for $c > 0$ for which there are communication links in the population, in expectation the number of unique policies in the population will decrease, since it becomes likely that $\pi^{i}_{k+1} = \pi^{j}_{k+1}$ for some $i,j \in \{1,\dots,N\}$. As such, $\Delta_{k+1,cent} \leq \mathbb{E}\left[\Delta_{k+1,c}\right] \leq \mathbb{E}\left[\Delta_{k+1,0}\right]$, i.e. the policy divergence in the independent-learning case is expected to be greater than or equal to that of the networked case.
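The softmax adoption rule can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the function names, the list-based neighbourhood and the example $\sigma$ values are hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step that leaves the probabilities unchanged.

```python
import math
import random


def adoption_probabilities(sigmas, tau):
    """Softmax over the received sigma values with temperature tau > 0."""
    peak = max(sigmas)  # subtract the maximum for numerical stability
    weights = [math.exp((s - peak) / tau) for s in sigmas]
    total = sum(weights)
    return [w / total for w in weights]


def adopt(sigmas, policies, tau, rng=random):
    """Sample one (sigma, policy) pair from the neighbourhood."""
    probs = adoption_probabilities(sigmas, tau)
    j = rng.choices(range(len(sigmas)), weights=probs, k=1)[0]
    return sigmas[j], policies[j]


# Illustrative neighbourhood of three agents (values are made up).
sigmas = [1.0, 2.0, 3.0]
print(adoption_probabilities(sigmas, tau=1.0))    # favours the largest sigma
print(adoption_probabilities(sigmas, tau=100.0))  # close to uniform
```

A small temperature concentrates the probability mass on the largest $\sigma$, while a large temperature spreads adoption almost uniformly over the neighbourhood. In both cases two neighbouring agents can end the round holding the same policy, which is what reduces the expected number of unique policies.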

The proof of Lemma 3 given in Theorem D.9 of Yardim et al. [6] ends with, for constants $\chi$ and $\xi$,

\[ \mathbb{E}\left[\|\pi^{i}_{K} - \pi^{*}\|_{1}\right] \leq 2L_{\Gamma_{\eta}}^{K} + \frac{\chi}{1 - L_{\Gamma_{\eta}}} + \xi \sum^{K-1}_{k=1} L_{\Gamma_{\eta}}^{K-k-1}\, \mathbb{E}\left[\Delta_{k}\right], \]

where in our context the policy divergence in the independent case $\mathbb{E}\left[\Delta_{k+1}\right]$ is equivalent to $\mathbb{E}\left[\Delta_{k+1,c}\right]$ when $C = 0$, i.e. $\mathbb{E}\left[\Delta_{k+1,0}\right]$.
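To see how the expected divergence enters the bound, its right-hand side can be evaluated numerically. The sketch below is only illustrative: the constants, the divergence sequences and the function name `error_bound` are placeholders chosen for the example, and it assumes $0 \leq L_{\Gamma_{\eta}} < 1$ so that the middle term is finite.

```python
def error_bound(L, chi, xi, expected_divergence, K):
    """Right-hand side of the bound for a contraction constant 0 <= L < 1.

    expected_divergence[k] stands for E[Delta_k]; index 0 is unused
    because the sum runs over k = 1, ..., K - 1.
    """
    tail = sum(L ** (K - k - 1) * expected_divergence[k] for k in range(1, K))
    return 2 * L ** K + chi / (1 - L) + xi * tail


K = 5
L, chi, xi = 0.5, 0.01, 0.1   # placeholder constants
centralised = [0.0] * K       # Delta_{k,cent} = 0 for all k
networked = [0.2] * K         # hypothetical smaller divergence
independent = [0.8] * K       # hypothetical larger divergence

bounds = [error_bound(L, chi, xi, d, K)
          for d in (centralised, networked, independent)]
print(bounds)  # non-decreasing: centralised <= networked <= independent
```

Because every coefficient multiplying $\mathbb{E}\left[\Delta_k\right]$ is non-negative, an ordering of the expected divergences carries over directly to an ordering of the bounds.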

Since $\Delta_{k+1,cent} \leq \mathbb{E}\left[\Delta_{k+1,c}\right] \leq \mathbb{E}\left[\Delta_{k+1,0}\right]$, we obtain our result, i.e. for all agents $i = 1,\dots,N$ and $j = 1,\dots,N$, the random outputs $\{\pi^{i,net}_{K}\}_i$, $\{\pi^{j,ind}_{K}\}_j$ and $\pi^{cent}_{K}$ satisfy

\[ \mathbb{E}\left[\|\pi^{cent}_{K} - \pi^{*}\|_{1}\right] \;\leq\; \mathbb{E}\left[\|\pi^{i,net}_{K} - \pi^{*}\|_{1}\right] \;\leq\; \mathbb{E}\left[\|\pi^{j,ind}_{K} - \pi^{*}\|_{1}\right] \;\leq\; \varepsilon + \mathcal{O}\left(\frac{1}{\sqrt{N}}\right). \]

Remark 5.

It may help to see that our result is a consequence of the following. Denote $\hat{Q}^{i,net}_{M_{pg}}$, $\hat{Q}^{i,ind}_{M_{pg}}$ and $\hat{Q}^{cent}_{M_{pg}}$ as the random outputs of Lines 5-11 of Alg. 1 in the networked, independent and centralised cases respectively. In Lemma 4, we can see that policy divergence gives bias terms in the estimation of the Q-value. Therefore, given $\Delta_{k+1,cent} \leq \mathbb{E}\left[\Delta_{k+1,c}\right] \leq \mathbb{E}\left[\Delta_{k+1,0}\right]$, we can also say

\[ \mathbb{E}\left[\|\hat{Q}^{cent}_{M_{pg}} - Q^{*}\|_{\infty}\right] \;\leq\; \mathbb{E}\left[\|\hat{Q}^{i,net}_{M_{pg}} - Q^{*}\|_{\infty}\right] \;\leq\; \mathbb{E}\left[\|\hat{Q}^{i,ind}_{M_{pg}} - Q^{*}\|_{\infty}\right]. \]

In other words, the networked case will require the same or fewer outer iterations $K$ to reduce the variance caused by this bias than the independent case requires (where the bias is non-vanishing), and the same or more iterations than the centralised case requires. ∎

B.7 Continuation of Rem. 4

Remark 6.

For an arbitrary policy $\bar{\pi}_k \in \Pi$, for all $k = 0,1,\dots,K$ define the policy divergence as the random variable $\Delta_k := \sum_{i=1}^{N} \|\pi^{i}_{k} - \bar{\pi}_{k}\|_{1}$. We can say that $\Delta_{k,cent} = 0$ $\forall k$ is the divergence in the centralised case, while in the networked case the policy divergence is $\Delta_{k+1,c}$ after communication round $c \in \{1,\dots,C\}$. As detailed in Appx. B.6 and as per Rem. 2, if $\tau_k \rightarrow 0$ and $C$ is large enough that the number of jointly connected collections of graphs occurring within $C$ is equal to the largest diameter of the union of any collection, then $\Delta_{k,C} = \Delta_{k,cent} = 0$ $\forall k \in \{0,\dots,K\}$.
(If it is always $\sigma^{1}_{k+1}$ and $\pi^{1}_{k+1}$ that is adopted by the whole population then this is the same as the centralised case; if the $\sigma^{i}_{k+1}$ and $\pi^{i}_{k+1}$ that gets adopted has different $i$ for each $k$ then this is akin to a version of the centralised setting where the central learning agent may differ for each $k$.) If instead $\tau_k \in \mathbb{R}_{>0}$, we have $\mathbb{E}\left[\Delta_{k+1,C}\right] \rightarrow 0$ as $C \rightarrow \infty$, assuming that the communication network becomes jointly connected infinitely often. ∎
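The $\tau_k \rightarrow 0$ case can be illustrated with a small deterministic simulation. The sketch makes several simplifying assumptions that are not part of the paper's setting: the graph is a static ring rather than a time-varying network, adoption is a hard argmax standing in for the zero-temperature limit, the $\sigma$ values are made up and distinct, and policies are represented only by the index of the agent that produced them.

```python
def argmax_round(sigmas, holders, neighbours):
    """One communication round in the tau -> 0 limit.

    Every agent adopts the (sigma, policy id) pair with the largest sigma
    among itself and its neighbours; all agents update simultaneously.
    """
    new_sigmas, new_holders = [], []
    for i in range(len(sigmas)):
        best = max([i] + neighbours[i], key=lambda j: sigmas[j])
        new_sigmas.append(sigmas[best])
        new_holders.append(holders[best])
    return new_sigmas, new_holders


N = 6
# Static ring graph (an illustrative choice); its diameter is N // 2 = 3.
neighbours = [[(i - 1) % N, (i + 1) % N] for i in range(N)]
sigmas = [0.3, 0.9, 0.1, 0.5, 0.7, 0.2]  # made-up, distinct values
holders = list(range(N))                 # holders[i] = id of the policy agent i holds

unique_counts = [len(set(holders))]
for c in range(N // 2):
    sigmas, holders = argmax_round(sigmas, holders, neighbours)
    unique_counts.append(len(set(holders)))

print(unique_counts)  # number of distinct policies after each round
print(holders)        # every agent ends up holding the same policy id
```

After a number of rounds equal to the diameter of the ring, all agents hold the policy with the largest $\sigma$, so the divergence with respect to that policy is zero, matching the centralised case.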

B.8 Remark on theoretical hyperparameters when used in practical settings

Remark 7.

The theoretical analysis in Sec. 3.3 and Appx. B requires algorithmic hyperparameters (Thms. 1 and 2) that render convergence impractically slow in all of the centralised, independent and networked cases. (Indeed Yardim et al. [6] do not provide empirical demonstrations of their algorithms for the centralised and independent cases.) In particular, the values of $\delta_{mix}$ and $p_{inf}$ give rise to very large $t_0$, causing very small learning rates $\{\beta_m\}_{m \in \{0,\dots,M_{pg}-1\}}$, and necessitating very large values for $M_{td}$ and $M_{pg}$.

Appendix C Algorithm acceleration by use of experience-replay buffer (further details)

The intuition behind the better learning efficiency resulting from our experience replay buffer in Alg. 2 is as follows. The value of a state-action pair $p$ is dependent on the values of subsequent states reached, but the value of $p$ is only updated when the TD update is conducted on $p$, rather than every time a subsequent pair is updated. By learning from each stored transition multiple times, we not only make repeated use of the reward and transition information in each costly experience, but also repeatedly update each state-action pair in light of its likewise updated subsequent states.
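This intuition can be sketched on a toy chain of three state-action pairs in which only the last one yields a reward. The code is an illustration under assumptions rather than the paper's implementation: the transition format `(pair, reward, next_pair)`, the function names and the parameter values are hypothetical, and values start from zero purely for readability (Alg. 2 initialises the Q-values to $Q_{max}$).

```python
import random


def td_update(q, transition, beta, gamma):
    """One tabular TD update on the pair stored in the transition."""
    pair, reward, next_pair = transition
    target = reward + (gamma * q[next_pair] if next_pair is not None else 0.0)
    q[pair] += beta * (target - q[pair])


def replay(buffer, cycles, beta=0.5, gamma=0.9, seed=0):
    """Run `cycles` shuffled passes over the buffer, starting from Q = 0."""
    rng = random.Random(seed)
    q = {pair: 0.0 for pair, _, _ in buffer}
    for _ in range(cycles):
        order = buffer[:]
        rng.shuffle(order)
        for transition in order:
            td_update(q, transition, beta, gamma)
    return q


# Chain p0 -> p1 -> p2 -> end, with reward 1 only on the final transition.
buffer = [("p0", 0.0, "p1"), ("p1", 0.0, "p2"), ("p2", 1.0, None)]

print(replay(buffer, cycles=1))   # reward has barely propagated back to p0
print(replay(buffer, cycles=50))  # values approach gamma^2, gamma and 1
```

With a single pass, the value of `p0` can only reflect whatever `p1` had learned at the moment `p0` was updated. Repeated passes let the reward information reach the start of the chain without collecting any new experience.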

We leave $\beta_m$ fixed across all iterations, as we found empirically that this yields sufficient learning. We have not experimented with decreasing $\beta$ as $l$ increases, though this may benefit learning.

The transitions in the buffer are discarded after the replay cycles and a new buffer is initialised for the next iteration $k$, as in Line 6. As such the space complexity of the buffer grows only linearly with the number $M_{pg}$ of iterations within each outer loop $k$, rather than with the number $K$ of outer loops.

Algorithm 2 Networked learning with experience replay
1:loop parameters $K, M_{pg}, M_{td}, C, L, E$, learning parameters $\eta, \beta, \lambda, \gamma, \{\tau_k\}_{k \in \{0,\dots,K-1\}}$
2:initial states $\{s^{i}_{0}\}_i$, $i = 1,\ldots,N$
3:Set $\pi^{i}_{0} = \pi_{max}, \forall i$ and $t \leftarrow 0$
4:for $k = 0,\dots,K-1$ do
5:     $\forall s,a,i: \hat{Q}^{i}_{0}(s,a) = Q_{max}$
6:     $\forall i$: Empty $i$’s buffer
7:     for $m = 0,\dots,M_{pg}-1$ do
8:         for $M_{td}$ iterations do
9:              Take step $\forall i: a^{i}_{t} \sim \pi^{i}_{k}(\cdot|s^{i}_{t}), r^{i}_{t} = R(s^{i}_{t},a^{i}_{t},\hat{\mu}_{t}), s^{i}_{t+1} \sim P(\cdot|s^{i}_{t},a^{i}_{t},\hat{\mu}_{t})$; $t \leftarrow t+1$
10:         end for
11:         $\forall i$: Add $\zeta^{i}_{t-2}$ to $i$’s buffer
12:     end for
13:     for $l = 0,\dots,L-1$ do
14:         $\forall i:$ Shuffle buffer
15:         for transition $\zeta^{i}_{b}$ in $i$’s buffer do ($\forall i$)
16:              Compute TD update ($\forall i$): $\hat{Q}^{i}_{m+1} = \tilde{F}^{\pi^{i}_{k}}_{\beta}(\hat{Q}^{i}_{m}, \zeta^{i}_{t-2})$ (see Def. 7)
17:         end for
18:     end for
19:     PMA step $\forall i: \pi^{i}_{k+1} = \Gamma^{md}_{\eta}(\hat{Q}^{i}_{M_{pg}}, \pi^{i}_{k})$ (see Def. 8)
20:     $\forall i:$ $\sigma^{i}_{k+1} = 0$
21:     for $e = 0,\dots,E-1$ evaluation steps do
22:         Take step $\forall i: a^{i}_{t} \sim \pi^{i}_{k+1}(\cdot|s^{i}_{t}), r^{i}_{t} = R(s^{i}_{t},a^{i}_{t},\hat{\mu}_{t}), s^{i}_{t+1} \sim P(\cdot|s^{i}_{t},a^{i}_{t},\hat{\mu}_{t})$
23:         $\forall i:$ $\sigma^{i}_{k+1} = \sigma^{i}_{k+1} + \gamma^{e}(r^{i}_{t} + h(\pi^{i}_{k+1}(s^{i}_{t})))$
24:         $t \leftarrow t+1$
25:     end for
26:     for $C$ rounds do
27:         $\forall i:$ Broadcast $\sigma^{i}_{k+1}, \pi^{i}_{k+1}$
28:         $\forall i: J^{i}_{t} = i \cup \{j \in \mathcal{N} : (i,j) \in \mathcal{E}_{t}\}$
29:         $\forall i:$ Select $\sigma^{adopted}_{k+1} \sim \Pr\left(\sigma^{adopted}_{k+1} = \sigma^{j}_{k+1}\right) = \frac{\exp(\sigma^{j}_{k+1}/\tau_k)}{\sum^{[J^{i}_{t}]}_{x=1} \exp(\sigma^{x}_{k+1}/\tau_k)}$ $\forall j \in J^{i}_{t}$
30:         $\forall i: \sigma^{i}_{k+1} = \sigma^{adopted}_{k+1}, \pi^{i}_{k+1} = \pi^{adopted}_{k+1}$
31:         Take step $\forall i: a^{i}_{t} \sim \pi^{i}_{k+1}(\cdot|s^{i}_{t}), r^{i}_{t} = R(s^{i}_{t},a^{i}_{t},\hat{\mu}_{t}), s^{i}_{t+1} \sim P(\cdot|s^{i}_{t},a^{i}_{t},\hat{\mu}_{t})$; $t \leftarrow t+1$
32:     end for
33:end for
34:Return policies $\{\pi^{i}_{K}\}_{i}$, $i=1,\ldots,N$
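The adoption step in lines 28 to 30 of the algorithm can be illustrated for a single agent and a single communication round. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the dictionary-based storage of $\sigma$ and policies, and the example edge set are all hypothetical, and the maximum is subtracted before exponentiation purely for numerical stability.

```python
import math
import random

def neighbourhood(i, edges):
    """J^i_t: agent i together with every j such that (i, j) is an edge at time t."""
    return [i] + sorted({j for (a, j) in edges if a == i and j != i})

def adoption_probabilities(sigmas, tau):
    """Softmax over the received sigma values with temperature tau.

    Subtracting the maximum first avoids overflow when tau is very small.
    """
    top = max(sigmas)
    weights = [math.exp((s - top) / tau) for s in sigmas]
    total = sum(weights)
    return [w / total for w in weights]

def adopt(i, edges, sigma, policies, tau, rng):
    """Sample one member of J^i_t and return its (sigma, policy) pair."""
    members = neighbourhood(i, edges)
    probs = adoption_probabilities([sigma[j] for j in members], tau)
    chosen = rng.choices(members, weights=probs, k=1)[0]
    return sigma[chosen], policies[chosen]

# Hypothetical example: three agents, agent 0 receives from agents 1 and 2.
edges = {(0, 1), (0, 2), (1, 0)}
sigma = {0: 0.2, 1: 0.9, 2: 0.5}
policies = {0: "pi_0", 1: "pi_1", 2: "pi_2"}
new_sigma, new_policy = adopt(0, edges, sigma, policies,
                              tau=1e-3, rng=random.Random(0))
```

With a small temperature such as the one used here, almost all probability mass falls on the neighbour with the largest $\sigma$, so agent 0 adopts the policy of agent 1.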

Appendix D Extended discussion on robustness of communication networks in MFGs and related experimental settings

We consider two scenarios to which we desire real-world many-agent systems (e.g. robotic swarms or autonomous vehicle traffic) to be robust; these scenarios form the basis of our experiments on robustness (see Sec. 4 and Figs. 3, 5 and 6). The networked setup affords population fault-tolerance and online scalability, which are motivating qualities of many-agent systems.

Firstly, we consider a scenario in which the learning/updating procedure of agents fails with a certain probability within each iteration, in which case $\pi^{i}_{k+1}=\pi^{i}_{k}$ (see Figs. 3 and 5 for our experimental results and discussion of this scenario). In real-life decentralised settings, this might be particularly liable to occur since the updating process might only be synchronised between agents by internal clock ticks, such that some agents may not complete their update in the allotted time but will nevertheless be required to take the next step in the environment. Such failures slow the improvement of the population in the independent case, and in the centralised setting they mean that no improvement occurs at all in any iteration in which failure occurs, as there is a single point of failure. Networked communication instead provides redundancy in case of failures, with the updated policies of any agents that have managed to learn spreading through the population to those that have not. This feature thus ensures that improvement can continue for potentially the whole population even if a high number of agents do not manage to learn at a given iteration.

Secondly, we may want to arbitrarily increase the size of a population of agents that are already learning or operating in the environment (we can imagine extra fleets of autonomous cars or drones being deployed) - see Appx. F for comparison with other works considering this type of robustness [75, 76, 77, 78]. A purely independent setting would require all the new agents to learn a policy individually given the existing distribution, and the process of their following and improving policies from scratch may itself disturb the NE that has already been achieved by the original population. With a communication network, however, the policies that have been learnt so far can quickly be shared with the new agents in a decentralised way, hopefully before their unoptimised policies can destabilise the current NE. This would provide, for example, a way to bootstrap a large population from a smaller pre-trained group, if training were considered expensive in a given setting. See Figs. 3 and 6 for our experimental results and discussion of this scenario.

Appendix E Experiments

Figure 4: ‘Cluster’ game. Even with only a single communication round, our networked architecture significantly outperforms the independent case, which hardly appears to be learning at all. At numerous points, some broadcast radii of our networked architecture outperform even the centralised case. CPU time for 5 trials = 128,134 secs.
Figure 5: ‘Target agreement’ game, testing robustness to 50% probability of policy update failure. All the networked cases significantly outperform the independent case and also learn much faster than the centralised case. The communication network allows agents that have successfully updated their policies to spread this information to those that have not, providing redundancy. Independent learners cannot do this so have even slower convergence than normal; likewise the centralised architecture is susceptible to its single point of failure, hence learning is slower than in the networked case. CPU time for 5 trials = 131,055 secs.
Figure 6: ‘Cluster’ game, testing robustness to a five-times increase in population. While the independent algorithm appears to enjoy similar exploitability to the other cases (see Rem. 8), we can see from its average return that it is not in fact learning at all; while the return rises after the increase in population size this is only because there are now more agents with which to be co-located, rather than because learning has progressed. Since here, unlike in the ‘target agreement’ game in Fig. 3, independent agents have hardly improved their return in the first place, we do not see the adverse effect that the addition of agents to the population has on the progress of learning. All networked cases apart from that with the largest broadcast radius (1.0) perform similarly to or significantly outperform the centralised case, and all significantly outperform the independent case in terms of return. The communication network allows the learnt policies to quickly spread to the newly arrived agents, such that the progression of learning is minimally disturbed, without needing to rely on the assumption of a centralised learner. The fact that the return in all cases, most notably that of the largest broadcast radius (1.0), is lower than in Fig. 4, is reflective of the fact that the error in the solution reduces as $N$ tends to infinity; it is only when the population in the case with the largest broadcast radius (1.0) increases from 50 to 250 that the agents are able to learn to increase their return. CPU time for 5 trials = 62,457 secs.
Figure 7: ‘Cluster’ game on the larger 16x16 grid. While the independent-learning case has similar exploitability to the other settings, we can see that it is not actually learning to increase its return at all, making this an undesirable equilibrium. (I.e. agents are moving about randomly so there is little a deviating agent can do to increase its reward, hence exploitability is low even though the agents are not in fact clustered - see Rem. 8.) All the networked settings significantly outperform the return of the independent agents. CPU time for 5 trials = 405,919 secs.
Figure 8: ‘Target agreement’ game on the larger 16x16 grid. There is greater differentiation in this setting than in the 8x8 grid (Fig. 1) between the different broadcast radii in the networked cases, as is to be expected in a less densely populated environment. The two largest broadcast radii (1.0 and 0.8), which have the most connected networks, outperform the independent case in terms of exploitability and even more so in terms of return; at times the largest broadcast radius even outperforms the centralised case. However, the other broadcast radii perform similarly to the independent case. CPU time for 5 trials = 412,964 secs.
Figure 9: ‘Cluster’ game with $\tau_{k}$ fixed as 100 for all $k$; compare this to Fig. 4 where $\tau_{k}$ is annealed. Without the annealing scheme, the networked architecture appears to perform similarly to the independent case in terms of exploitability, but several broadcast radii outperform the independent case in terms of return, demonstrating that our networked algorithm can still help agents find more ‘preferable’ equilibria. However, whereas with annealing the networked architecture converges similarly to the centralised case, here it performs less well. CPU time for 5 trials = 141,271 secs.
Figure 10: ‘Target agreement’ game with $\tau_{k}$ fixed as 100 for all $k$. Without our annealing scheme for the softmax temperature, the networked architecture does not outperform the independent case. Compare this to Fig. 1 which shows the benefit of annealing $\tau_{k}$. CPU time for 5 trials = 127,667 secs.

Experiments were conducted on a MacBook Pro, Apple M1 Max chip, 32 GB, 10 cores. We use scipy.optimize.minimize (employing Sequential Least Squares Programming) to conduct the optimisation step in Def. 8, and the JAX framework to accelerate and vectorise some elements of our code. We report the CPU time for each experiment in its caption; the total CPU time for the experiments presented in this work was 1,747,001 secs. ($\sim$485 hours), though the full research project required more compute time due to testing other hyperparameter values (see Table 1) for which we do not give plots here.

For reproducibility, the code to run our experiments is provided with our Supplementary Material, and will be made publicly available upon publication.

E.1 Games

We conduct numerical tests with two games (defined by the agents’ objectives), chosen for being particularly amenable to intuitive and visualisable understanding of whether the agents are learning behaviours that are appropriate and explainable for the respective objective functions. In all cases, rewards are normalised in [0,1] after they are computed.

Cluster.

This is the inverse of the ‘exploration’ game in [30], where in our case agents are encouraged to gather together by the reward function $R(s^{i}_{t},a^{i}_{t},\hat{\mu}_{t})=\log(\hat{\mu}_{t}(s^{i}_{t}))$. That is, agent $i$ receives a reward that is logarithmically proportional to the fraction of the population that is co-located with it at time $t$. We give the population no indication of where they should cluster; the agents must agree on this themselves over time.

Agree on a single target.

Unlike in the above ‘cluster’ game, the agents are given options of locations at which to gather, and they must reach consensus among themselves. If the agents are co-located with one of a number of specified targets $\phi\in\Phi$ (in our experiments we place one target in each of the four corners of the grid), and other agents are also at that target, they get a reward proportional to the fraction of the population found there; otherwise they receive a penalty of -1. In other words, the agents must coordinate on which of a number of mutually beneficial points will be their single gathering place. The reward function is given by $R(s^{i}_{t},a^{i}_{t},\hat{\mu}_{t})=r_{targ}(r_{collab}(\hat{\mu}_{t}(s^{i}_{t})))$, where

$$r_{targ}(x)=\begin{cases}x\quad&\text{if}\,\exists\phi\in\Phi\text{ s.t. dist}(s^{i}_{t},\phi)=0\\ -1\quad&\text{otherwise,}\end{cases}$$
$$r_{collab}(x)=\begin{cases}x\quad&\text{if}\,\hat{\mu}_{t}(s^{i}_{t})>1/N\\ -1\quad&\text{otherwise.}\end{cases}$$
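The two reward functions above can be written out as a short sketch. This is an illustration under assumptions rather than the code used in the experiments: states are taken to be hypothetical `(row, col)` grid cells, the empirical distribution $\hat{\mu}_{t}$ is stored as a dictionary, and the subsequent normalisation of rewards to [0,1] is not shown.

```python
import math
from collections import Counter

def empirical_distribution(states):
    """mu_hat: fraction of the population found in each occupied state."""
    counts = Counter(states)
    n = len(states)
    return {s: c / n for s, c in counts.items()}

def cluster_reward(mu_hat, state):
    """'Cluster' game: log of the fraction of agents co-located with the agent."""
    return math.log(mu_hat[state])

def target_reward(mu_hat, state, targets, n_agents):
    """'Target agreement' game: r_targ(r_collab(mu_hat(state)))."""
    x = mu_hat[state]
    # r_collab: pass x through only if at least one other agent shares the state.
    collab = x if x > 1.0 / n_agents else -1.0
    # r_targ: pass the value through only if the state is one of the targets.
    return collab if state in targets else -1.0

# Hypothetical 8x8 grid with one target in each corner and four agents.
targets = {(0, 0), (0, 7), (7, 0), (7, 7)}
states = [(0, 0), (0, 0), (7, 7), (3, 3)]
mu_hat = empirical_distribution(states)
rewards = [target_reward(mu_hat, s, targets, len(states)) for s in states]
```

In this example the two agents sharing the corner `(0, 0)` each receive 0.5, while the lone agent at the corner `(7, 7)` and the agent away from any target both receive the penalty of -1.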

E.2 Experimental metrics

To give as informative results as possible about both performance and proximity to the NE, we provide several metrics for each experiment. All metrics are plotted with 2-sigma error bars ($2\times$ standard deviation), computed over the five trials (each with a random seed) of the system evolution in each setting. This is computed based on a call to numpy.std for each metric over each run.
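As a minimal sketch of how such error bars can be obtained, the snippet below computes the mean and $2\times$ standard deviation across trials with `numpy.std`. The array layout (one row per trial, one column per iteration), the function name and the example values are assumptions made for illustration; `numpy.std` is used with its default `ddof=0`.

```python
import numpy as np

def mean_and_two_sigma(metric_by_trial):
    """Return the per-iteration mean and 2-sigma half-width across trials.

    metric_by_trial is assumed to have shape (n_trials, n_iterations).
    """
    values = np.asarray(metric_by_trial, dtype=float)
    mean = values.mean(axis=0)
    two_sigma = 2.0 * np.std(values, axis=0)  # population std (ddof=0)
    return mean, two_sigma

# Hypothetical exploitability curves for 5 trials over 3 iterations.
curves = [[1.0, 0.8, 0.5],
          [1.2, 0.7, 0.4],
          [0.9, 0.9, 0.6],
          [1.1, 0.6, 0.5],
          [0.8, 1.0, 0.5]]
mean, two_sigma = mean_and_two_sigma(curves)
```

The resulting `mean` and `two_sigma` arrays can be passed directly to a plotting routine, for example as the centre line and the half-width of a shaded band.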

E.2.1 Exploitability

Works on MFGs frequently use the exploitability metric to evaluate how close a given policy $\pi$ is to a NE policy $\pi^{*}$ [30, 31, 32, 40, 79]. The metric quantifies how much an agent can benefit by deviating from the policy pursued by the rest of the population, by measuring the difference between the return given by a policy that maximises the expected discounted regularised (via $h$) reward $V_{h}$ for a given population distribution, and the return given by the policy that gives rise to this distribution. If $\pi$ has a large exploitability then an agent can significantly improve its return by deviating from $\pi$, meaning that $\pi$ is far from $\pi^{*}$, whereas an exploitability of 0 implies that $\pi=\pi^{*}$. Denote by $\mu^{\pi}$ the distribution generated when $\pi$ is the policy followed by all of the population aside from the deviating agent; then the exploitability of policy $\pi$ is defined as:

$$\mathcal{E}(\pi)=\max_{\pi^{\prime}}V_{h}(\pi^{\prime},\mu^{\pi})-V_{h}(\pi,\mu^{\pi}).$$

Since we do not have access to the exact best response policy $\arg\max_{\pi^{\prime}}V_{h}(\pi^{\prime},\mu^{\pi})$, we instead approximate the exploitability metric, similarly to [34], as follows. We freeze the policy of all agents apart from a deviating agent, for which we store its current policy and then conduct 40 $k$ loops of policy improvement (we found that 40 iterations was enough to converge to a policy that maximised $V_{h}$ for the given population distribution). To approximate the expectations, we take the best return of the deviating agent across the 40 $k$ loops, as well as the mean of all the other agents’ returns across these same loops. We then revert the agent back to its stored policy, before learning continues for all agents. As such, the quality of our approximation is limited by the number of policy improvement rounds, which must be restricted for the sake of running speed of the experiments. Due to the expensive computations required for this metric, we evaluate it on alternate $k$ iterations.

Since prior works conducting empirical testing have generally focused on the centralised setting, evaluations have not had to consider the exploitability metric when not all agents are following a single policy $\pi_{k}$, as may occur in the independent or networked settings. The method described above for approximating exploitability involves calculating the mean return of all non-deviating agents’ policies. While these policies all coincide with $\pi_{k}$ in the centralised case, if the non-deviating agents do not share a single policy, then this method is in fact approximating the exploitability of their joint policy $\boldsymbol{\pi}^{-d}_{k}$, where $d$ is the deviating agent.
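The final arithmetic of the approximation described above can be sketched as follows. The snippet assumes that the returns have already been recorded during the 40 policy-improvement loops; the function name, the nested-list layout and the numbers are hypothetical, and the policy-improvement loops themselves are not shown.

```python
def approximate_exploitability(deviating_returns, other_returns):
    """Best return of the deviating agent minus the mean return of the others.

    deviating_returns: one return per policy-improvement loop.
    other_returns: for each loop, a list of the frozen agents' returns.
    """
    best_deviation = max(deviating_returns)
    flat = [r for loop in other_returns for r in loop]
    baseline = sum(flat) / len(flat)
    return best_deviation - baseline

# Hypothetical recordings over three loops with two frozen agents.
deviating = [0.1, 0.5, 0.4]
others = [[0.2, 0.2], [0.3, 0.3], [0.1, 0.1]]
value = approximate_exploitability(deviating, others)
```

Here the deviating agent's best return is 0.5 and the frozen agents average 0.2, giving an approximate exploitability of 0.3.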

E.2.2 Average discounted return

We record the average discounted return of the agents’ policies $\pi_{k}^{i}$ during the $M_{pg}$ steps - this allows us to observe that settings that converge to similar exploitability values may not have similar average agent returns, suggesting that some algorithms are better than others at finding not just NE, but preferable NE. See for example Fig. 10, where the networked agents converge to similar exploitability as the independent agents, but receive higher average reward.
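A minimal sketch of this metric is given below, assuming that each agent's rewards over the $M_{pg}$ steps are available as a list and that discounting starts from the first of those steps; both the data layout and the function names are assumptions for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over one agent's reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

def average_discounted_return(rewards_per_agent, gamma=0.9):
    """Mean of the agents' discounted returns."""
    returns = [discounted_return(r, gamma) for r in rewards_per_agent]
    return sum(returns) / len(returns)

# Hypothetical rewards for two agents over three steps.
avg = average_discounted_return([[1, 1, 1], [0, 0, 1]])
```

With $\gamma=0.9$ the two agents obtain 2.71 and 0.81 respectively, so the average is 1.76.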

E.2.3 Policy divergence

We record the population’s average policy divergence $\frac{1}{N}\Delta_{k} := \frac{1}{N}\sum_{i=1}^{N}||\pi_{k}^{i}-\pi_{k}^{1}||_{1}$ for the arbitrary policy $\bar{\pi}=\pi^{1}$. This allows us to demonstrate that populations approaching the NE (i.e. with joint exploitability approaching zero) do not necessarily actually share a single policy $\pi^{*}$ as suggested by the theoretical sample guarantees in Sec. 3.3. Our experimental plots show that this is particularly often the case in the independent setting. The greater divergence in the independent case also indicates why convergence is slower here (see Rem. 5).
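The divergence metric can be sketched as below. It is assumed here that each policy is stored as an array of action probabilities with shape `(n_states, n_actions)` and that the 1-norm is taken over all state-action entries; these choices, and the function name, are illustrative assumptions.

```python
import numpy as np

def average_policy_divergence(policies):
    """(1/N) * sum_i ||pi^i - pi^1||_1, with pi^1 the first policy in the stack.

    policies is assumed to have shape (N, n_states, n_actions).
    """
    stack = np.asarray(policies, dtype=float)
    reference = stack[0]
    per_agent = np.abs(stack - reference).reshape(len(stack), -1).sum(axis=1)
    return per_agent.mean()

# Hypothetical population of two agents in a one-state, two-action problem.
policies = [[[1.0, 0.0]],
            [[0.0, 1.0]]]
divergence = average_policy_divergence(policies)
```

A value of 0 means that every agent holds exactly the reference policy; in the example the second agent differs by 2 in 1-norm, so the population average is 1.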

E.3 Hyperparameters

See Table 1 for our hyperparameter choices. In general, we seek to show that our networked algorithm is robust to ‘poor’ choices of hyperparameters e.g. low numbers of iterations, as may be required when aiming for practical convergence times in complex real-world problems. By contrast, the convergence speed of the independent-learning algorithm (and sometimes also the centralised algorithm) suffers much more significantly without idealised hyperparameter choices. As such, our experimental demonstrations in the plots generally involve hyperparameter choices at the low end of the values we tested during our research.

We can group our hyperparameters into those controlling the size of the experiment, those controlling the number of iterations of each loop in the algorithm and those affecting the learning/policy updates or policy adoption ($\beta,\eta,\lambda,\tau,\gamma$).

Table 1: Hyperparameters
| Hyper-parameter | Value | Comment |
|---|---|---|
| Gridsize | 8x8 / 16x16 | Most experiments are run on the smaller grid, while Figures 7 and 8 showcase learning in a larger state space. |
| Trials | 5 | We run 5 trials with different random seeds for each experiment. We plot the mean and 2-sigma error bars for each metric across the trials. |
| Population | 250 | We tested $N$ in {25,50,100,200,250,500}, with the networked architecture generally performing equally well with all population sizes $\geq$ 50. We chose 250 for our demonstrations, to show that our algorithm can handle large populations, indeed often larger than those demonstrated in other mean-field works, especially for grid-world environments [50, 62, 74, 80, 81, 82, 83, 84, 85], while also being feasible to simulate wrt. time and computation constraints. In experiments testing robustness to population increase, the population instead begins at 50 agents and has 200 added at the marked point. |
| $K$ | 200 | $K$ is chosen to be large enough to see exploitability reducing, and converging where possible. |
| $M_{pg}$ | 500 / 1000 | We wish to illustrate the benefits of our networked architecture and replay buffer in reducing the number of loops required for convergence, i.e. we wish to select a low value that still permits learning. We tested $M_{pg}$ in {300,500,600,800,1000,1200,1300,1400,1500,1800,2000,2500,3000}, and chose 500 for demonstrations on the 8x8 grids, and 1000 for the 16x16 grids. It may be possible to optimise these values further in combination with other hyperparameters. |
| $M_{td}$ | 1 | We tested $M_{td}$ in {1,2,10,100}, and found that we were still able to achieve convergence with $M_{td}=1$. This is much lower than the requirements of the theoretical algorithms, essentially allowing us to remove the innermost nested learning loop. |
| $C$ | 1 | We tested $C$ in {1,20,50,300}. We choose 1 to show the convergence benefits brought by even a single communication round, even in networks that may have limited connectivity; higher $C$ has even better performance. |
| $L$ | 100 | As with $M_{pg}$, we wish to select a low value that still permits learning. We tested $L$ in {50,100,200,300,400,500}. In combination with our other hyperparameters, we found $L\leq 50$ led to less good results, but it may be possible to optimise this hyperparameter further. |
| $E$ | 100 | We tested $E$ in {100,300,1000}, and choose the lowest value to show the benefit to convergence even from very few evaluation steps. It may be possible to reduce this value further and still achieve similar results. |
| $\gamma$ | 0.9 | Standard choice across RL literature. |
| $\beta$ | 0.1 | We tested $\beta$ in {0.01,0.1} and found 0.1 to be small enough for sufficient learning at an acceptable speed. Further optimising this hyperparameter (including by having it decay with increasing $l\in 0,\dots,L-1$, rather than leaving it fixed) may lead to better results. |
| $\eta$ | 0.01 | We tested $\eta$ in {0.001,0.01,0.1,1,10} and found that 0.01 gave stable learning that progressed sufficiently quickly. |
| $\lambda$ | 0 | We tested $\lambda$ in {0,0.0001,0.001,0.01,0.1,1}. Since we are able to reduce $\lambda$ to 0 with no detriment to empirical convergence, we do so in order not to bias the NE. |
| $\tau_{k}$ | cf. comment | For fixed $\tau_{k}$ $\forall k$, we tested {1,10,100,1000}. In our experiments for fixed $\tau_{k}$ the value is 100 (see Figs. 9 and 10); this yields learning, but does not perform as well as if we anneal $\tau_{k}$ as follows. We begin with $\tau_{0}=10000/(10^{\lceil(K-1)/10\rceil})$, and multiply $\tau_{k}$ by 10 whenever $k\bmod 10=1$ i.e. every 10 iterations. Further optimising the annealing process may lead to better results. |

E.4 Additional experiments and discussion

In this section we showcase results with our standard hyperparameter choices continuing from those shown in Sec. 4.1 (Figs. 4, 5 and 6), and we also vary several hyperparameters to show their effects on convergence (Figs. 7 - 10).

Remark 8.

Note that the reward structure of our games is such that exploitability sometimes increases from its initial value before it decreases down to 0. This is because agents are rewarded proportionally to how many other agents are co-located with them: when agents are evenly dispersed at the beginning of the run, it is difficult for even a deviating, best-responding agent to significantly increase its reward. However, once some agents start to aggregate, a best-responding agent can take advantage of this to substantially increase its reward (giving higher exploitability), before all the other agents catch up and aggregate at a single point, reducing the exploitability down to 0. Due to this arc, in some of our plots the independent case may have lower exploitability at certain points than the other architectures, but this is not necessarily a sign of good performance. In fact, we can see in some such cases that the independent case is not learning at all, with the independent agents’ average return not increasing and the exploitability staying level rather than ultimately decreasing (see, for example, Figs. 3, 4, 6 and 7).∎

In our additional experiments, where the results are discussed fully in each figure’s caption, the factors we vary to show the effects on convergence are as follows:

  • Grid size. Figs. 7 and 8 show the result of learning on a grid of size 16x16 instead of 8x8 as in all other experiments. There is greater differentiation in this setting than in the 8x8 grid between the performances of the different broadcast radii of the networked architecture (as is to be expected in a less densely populated environment). The networked architecture continues to significantly outperform the independent case for most broadcast radii, and sometimes even the centralised case.

  • Choice of softmax temperature. Figs. 9 and 10 illustrate the effect of fixed $\{\tau_{k}\}_{k\in\{0,\dots,K-1\}}=100$, where the networked architecture does not perform as well as if we use the stepped annealing scheme employed in all the other experiments and detailed in Table 1. The intuition behind the better performance achieved with the annealing scheme is as follows. If we begin with $\tau_{k}\rightarrow 0$ (such that the softmax approaches being a max function), we heavily favour the adoption of the highest rewarded policies to speed up progress in the early stages of learning. Subsequently we increase $\tau_{k}$ in steps, promoting greater randomness in adoption, so that as the agents come closer to equilibrium, poorer policy updates that nevertheless receive a high return (due to randomness) do not introduce too much instability to learning and prevent convergence.
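The stepped annealing scheme from Table 1 and its effect on the adoption probabilities can be sketched as follows. This is an illustrative reading of the schedule rather than the authors' code: it assumes that the multiplication by 10 is applied to $\tau_{k}$ itself at every $k$ with $k \bmod 10 = 1$, and the function names and example return values are hypothetical.

```python
import math

def annealed_temperatures(K):
    """tau_0 = 10000 / 10**ceil((K-1)/10), multiplied by 10 whenever k % 10 == 1."""
    tau = 10000.0 / (10.0 ** math.ceil((K - 1) / 10))
    schedule = []
    for k in range(K):
        if k % 10 == 1:
            tau *= 10.0
        schedule.append(tau)
    return schedule

def softmax(values, tau):
    """Temperature softmax, shifted by the maximum for numerical stability."""
    top = max(values)
    weights = [math.exp((v - top) / tau) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

taus = annealed_temperatures(200)
returns = [0.2, 0.9, 0.5]           # hypothetical sigma values of three neighbours
early = softmax(returns, taus[0])   # essentially a max: all mass on the best
late = softmax(returns, taus[-1])   # close to uniform adoption
```

Under these assumptions and with $K=200$, the temperature starts at $10^{-16}$ and reaches $10^{4}$ by the final iteration, so adoption moves from almost always copying the highest-return neighbour to choosing among neighbours nearly uniformly.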

Appendix F Extended related work

Multi-agent reinforcement learning (MARL) [43, 86] is a generalisation of reinforcement learning [87] that has recently seen empirical success in a variety of domains, underpinned by breakthroughs in deep learning, including robotics [88, 89, 90, 91, 92], smart autonomy and infrastructures [93, 94], complex games [95, 96, 97], economics [98, 99], social science and cooperative AI [100, 101, 102, 103]. However, it has been computationally difficult to scale MARL algorithms beyond configurations with agents numbering in the low tens, as the joint state and action spaces grow exponentially with the number of agents [3, 30, 31, 99, 104, 105, 106]. Nevertheless, the value of reasoning about interactions among very large populations of agents has been recognised, and an informal distinction is sometimes drawn between multi- and many-agent systems [107, 108, 109]. The latter situation can be more useful (as in cases where better solutions arise from the presence of more agents [90, 110, 111, 75]), more parallelisable [112], more fault tolerant [113], or otherwise more reflective of certain real-world systems involving large numbers of decision makers [75, 98, 99, 114]. Indeed, MFGs have been applied to a wide variety of real world problems, including financial markets [11]; cryptocurrency mining [12]; autonomous vehicles [16]; traffic signal control [17]; resource management in fisheries [21]; crowdsensing [25]; electric vehicle charging [27]; communication networks [22, 28]; swarms [23]; data collection by UAVs [24]; edge computing [13, 14, 15]; cloud resource management [18]; smart grids, and other large-scale cyber-physical systems [19, 20, 26].

Our networked communication framework possesses all of the following desirable qualities for mean-field algorithms when applied to large, complex real-world many-agent systems: learning from the empirical distribution of $N$ agents without generation or manipulation of this distribution by the algorithm itself or by an external oracle; learning from a single continued system run that is not arbitrarily reset as in episodic learning (also referred to in other works as a single sample path/trajectory [5, 6]); model-free learning; decentralisation; fast practical convergence; and robustness to unexpected failures of decentralised learners or changes in the size of the population.

Conversely, as we emphasise in Sec. 1, the MFG framework was originally mainly theoretical [1, 2]. The MFG-NE is traditionally found by solving a coupled system of dynamical equations: a forward evolution equation for the mean-field distribution, and a backwards equation for the representative agent’s optimal response to the mean-field distribution, as in Def. 5 [21]; crucially, these methods relied on the assumption of an infinite population [32]. Early work solved the coupled equations using numerical methods that did not scale well for more complex state and action spaces [115, 116, 117, 118]; or, even if they could handle higher-dimensional problems, the methods were based on known models of the environment’s dynamics (i.e. they were model-based) [4, 33, 36, 37, 38, 39, 119, 120], and/or computed a best-response to the mean-field distribution [2, 30, 31, 32, 33, 34, 35, 40]. The latter approach is both computationally inefficient in non-trivial settings [6, 32], and in many cases is not convergent (as in general it does not induce a contractive operator) [30, 62]. Subsequent work, including our own, has therefore moved towards model-free and/or policy-improvement scenarios [20, 32, 80, 81, 121, 122, 123, 124, 125], possibly with learning taking place by observing N-agent empirical population distributions [6, 10, 50].

Most prior works, including algorithms designed to solve the MFG using an N-agent empirical distribution, have also assumed an oracle that can generate samples of the game dynamics (for any distribution) to be provided to the learning agent [4, 33, 80, 126, 127], or otherwise that the algorithm has direct control over the population distribution at each time step, such as in cases when the agents’ policies and distribution are updated on different timescales [41], with the fictitious play method being particularly popular [3, 5, 18, 30, 31, 34, 49, 61, 73, 81, 121, 128, 129, 130, 131]. In practice, many-agent problems may not admit such arbitrary generation or manipulation (for example, in the context of robotics or controlling vehicle traffic), and so a desirable quality of learning algorithms is that they update only the agents’ policies, rather than being able to arbitrarily reset their states. Learning may thus also need to leverage continuing, rather than episodic, tasks [87]. Yardim et al. [6], Yongacoglu et al. [50] and our own work therefore present algorithms that seek the MFG-NE using only a single run of the empirical population. Almost all prior work relies on a centralised controller to conduct learning on behalf of all the agents [3, 4, 5, 32, 42]. More recent work, including our own, has explored MFG algorithms for decentralised learning with N agents [6, 49, 50, 51, 52, 53, 54, 55].

Naturally, inter-agent communication is most applicable in settings where learning takes place along a continuing system run, rather than the distribution being manipulated by an oracle or arbitrarily reset for new episodes, since these imply a level of external control over the population that results in centralised learning. Equally, it is in situations of learning from finite numbers of real, deployed agents (rather than simulated settings) that we are most likely to be concerned with fault tolerance. As such, our work is most closely related to Yardim et al. [6] and Yongacoglu et al. [50], which provide algorithms for centralised and independent learning with empirical distributions along continued system runs: we contribute a networked learning algorithm in this setting. Yongacoglu et al. [50] empirically demonstrates an independent learning algorithm when agents observe compressed information about the mean-field distribution as well as their local state, but does not compare this to any other algorithms or baselines. Yardim et al. [6] compares algorithms for centralised and independent learning theoretically, but does not provide empirical demonstrations. In contrast, in addition to providing theoretical guarantees, we empirically demonstrate our networked learning algorithm, where agents observe only their local state, in comparison to both centralised and independent baselines, and, unlike these works, we also address the speed of practical convergence and robustness.

Improving the training speed and sample efficiency of (deep) (multi-agent) RL is gaining increasing attention [132, 133, 134, 135], though our own work is one of the few on MFGs to be concerned with this. Huang and Lai [136] trains on a distribution of MFG configurations to speed up inference on unseen problems, but does not learn online in a decentralised manner as in our own work. Similarly, while some attention has been given to the robustness of multi-agent systems to varying numbers of agents, where it is sometimes referred to as ‘ad-hoc teaming’, ‘open-agent systems’, ‘scalability’ or ‘generalisation’ [75], it has more commonly been addressed in MARL [76, 77] than in MFGs [78]. Wu et al. [78] presents an MFG approach that allows new agents to join the population during execution, but training itself takes place offline in a centralised, episodic manner. Our networked communication framework presented in the current work, on the other hand, allows decentralised agents to join the population during online learning and to have minimal impact on the learning process by adopting policies from existing members of the population through communication.

An existing area of work called ‘robust mean-field games’ studies the robustness of these games to uncertainty in the transition and reward functions [19, 137, 138, 139, 140, 141, 142, 143], but does not consider fault-tolerance, despite this being one of the original motivations behind many-agent systems. On the other hand, we focus on robustness to failures and changes in the agent population itself.

We note a similarity between 1. our method for deciding which policies to propagate through the population (described in Sec. 3.4.1) and 2. the computation of evaluation/fitness functions within evolutionary algorithms to indicate which solutions are desirable to keep in the population for the next generation [144]. Moreover, the research avenue broadly referred to as ‘distributed embodied evolution’ involves swarms of agents independently running evolutionary algorithms while operating within a physical/simulated environment and communicating behaviour parameters to neighbours [145, 146], and is therefore even more similar to our setting, where decentralised RL updates are computed locally and then shared with neighbours. In distributed embodied evolution, the computed fitness of solutions helps determine both which are preserved by agents during local updates, and also which are chosen for broadcast or adoption between neighbours [147, 148, 149]. Indeed, some works on distributed embodied evolution specifically consider features or rewards relating to the joint behaviour of the whole population [150, 151], similar to MFGs. The adjacent research area of cultural/language evolution for swarm robotics [152, 153, 154] has similarly demonstrated the combination of evolutionary approaches and multi-agent communication networks for self-organised behaviours in swarms. However, unlike our own work, none of these areas employ reinforcement learning in the update of policies or the computation of the fitness functions.

We preempt objections that communication with neighbours might violate the anonymity that is characteristic of the mean-field paradigm, by emphasising that the communication in our algorithm takes place outside of the ongoing learning-and-updating parts of each iteration. Thus the core learning assumptions of the mean-field paradigm are unaffected, as they essentially apply at a different level of abstraction (a convenient approximation) to the reality we face of $N$ agents that interact within the same environment. Indeed, prior works have combined networks with mean-field theory, such as using a mean field to describe adaptive dynamical networks [155].

Appendix G Limitations and ongoing/future work

Our algorithm for the networked case (Alg. 1), as well as prior work on the centralised and independent cases [6], all have multiple nested loops. This is a potential limitation for real-world implementation, since the decentralised agents might be sensitive to failures in synchronising these loops. However, in practice, we show that our networked architecture provides redundancy and robustness (which the independent-learning algorithm lacks) in case of learning failures that may result from the necessities of synchronisation (see Appx. D). We have also shown that networked communication in combination with the replay buffer allows us to reduce the hyperparameter $M_{td}$ to 1, essentially removing the inner ‘waiting’ loop. Nevertheless, our algorithm still features multiple loops, and future work lies in simplifying the algorithms further to aid practical implementation, possibly by techniques such as asynchronous communication [156].

Since the MFG setting is technically non-cooperative, we have preempted objections that agents would not have incentive to communicate their policies by focusing on coordination games, i.e. where agents seek to maximise only their individual returns, but receive higher rewards when they follow the same strategy as other agents. In this case they stand to benefit by exchanging their policies with others. Nevertheless, in real-world settings, the communication network could be vulnerable to malfunctioning agents or adversarial actors poisoning the equilibrium by broadcasting untrue policy information. It is outside the scope of this paper to analyse how much false information would have to be broadcast by how many agents to affect the equilibrium, but real-world applications may need to compute this and prevent it. Future research to mitigate this risk might build on work such as Piazza et al. [157], where ‘power regularisation’ of information flow is proposed to limit the adverse effects of communication by misaligned agents.

While our MFG algorithms are designed to handle arbitrarily large numbers of agents (and theoretically perform better as $N\rightarrow\infty$), the code for our experiments naturally still suffers from a bottleneck of computational speed when simulating agents that in the real world would be acting and learning in parallel, since the GPU can only process JAX-vectorised elements in batches of a certain size.

Our experiments are based on relatively small toy examples that clearly demonstrate the advantages of our new approach, but which lack the complexity of the real-world applications to which we wish to address the approach. It is feasible that in more complex problems, it may not be possible to remove theoretical assumptions and reduce hyperparameter values to the same extent we have demonstrated in our experimental examples.

Moreover, real-world examples would likely require handling larger and continuous state/action spaces (the latter perhaps building on related work such as Tang et al. [158]), which in turn may require (non-linear) function approximation. Our ongoing work therefore involves incorporating neural networks into our networked communication architecture for oracle-free, non-episodic MFG settings. Extending our algorithms in this way, which depends on modifying the PMA step [159, 160], allows us to introduce communication networks to MFGs with non-stationary equilibria, in addition to those with larger state/action spaces. Our method for games with non-stationary equilibria is to have agents’ policies depend both on their local state and also on the population distribution [32, 122, 161, 162], but such a high-dimensional observation object requires moving beyond tabular settings to those of function approximation. The present work demonstrates the benefits of the networked architecture when the Q-function is poorly estimated and introduces experience replay buffers to the setting of learning from a continuous run of the empirical system. Both elements are an important bridge to employing (non-linear) function approximation in this setting, where the problems of data efficiency and imprecise value estimation can be even more acute, and where we also employ experience replay buffers to provide uncorrelated data to train the neural networks [163]. When the policy functions are approximated rather than tabular, our agents communicate the functions’ parameters instead of the whole policy as now.
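For readers unfamiliar with the mechanism, the following is a generic sketch of an experience replay buffer of the kind referred to above. It is not the implementation in our code: the class name, the fixed capacity, the transition format and uniform sampling without replacement are all illustrative assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling (illustrative)."""

    def __init__(self, capacity, seed=0):
        # A bounded deque silently discards the oldest transition once full.
        self._data = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self._data.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Sampling uniformly at random breaks the temporal correlation between
        # consecutive transitions collected along a single continuing run.
        k = min(batch_size, len(self._data))
        return self._rng.sample(list(self._data), k)

    def __len__(self):
        return len(self._data)


buffer = ReplayBuffer(capacity=3)
for t in range(5):  # five steps of a single, non-episodic run
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1)
batch = buffer.sample(2)
```

Because the capacity is 3, only the transitions from the three most recent steps remain in the buffer, and each sampled batch is drawn from those.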

In our future work with non-stationary equilibria, where agents’ policies will also depend on the population distribution, it may be a strong assumption to suppose that decentralised agents with local state observations and limited communication radius would be able to observe the entire population distribution. We will therefore explore a framework of networked agents estimating the empirical distribution from only their local neighbourhood as in [83], and possibly also improving this estimation by communicating with neighbours [50], such that this useful information spreads through the network along with policy parameters.

Appendix H Negative impacts and mitigation

As with many advances in machine learning, and those relating to multi-agent systems in particular, in the long term our research on large populations of coordinating agents could have negative social outcomes if pursued by malicious actors, including surveillance and military uses. However, our work is primarily foundational and far from deployments, and it also has a large range of potential beneficial applications (such as smart grids and disaster response). Moreover, better understanding the dynamics of large multi-agent systems (as we seek to do in this paper) can contribute to ensuring safety by reducing the risks of unintended failures or outcomes.

We hope to help mitigate potential harmful consequences of this research by fostering transparency through submitting our code in the Supplementary Material, which we commit to publishing online under license upon acceptance of the paper. Details of how to install the dependencies and run the code are provided in the README file, and our hyperparameter choices are provided in Table 1. Details of the computational resources required are given in Appx. E.