Proc. of the Adaptive and Learning Agents Workshop (ALA 2024), May 6-7, 2024, Online, https://ala2024.github.io/. Avalos, Müller, Wang, Yates (eds.). 2024.

Affiliations, in the order listed: ETH Zurich, Zurich, Switzerland; National University of Singapore, Singapore; ETH Zurich, Zurich, Switzerland; A*STAR, Singapore; National University of Singapore, Singapore; ETH Zurich, Zurich, Switzerland.

CAESAR: Enhancing Federated RL in Heterogeneous MDPs through Convergence-Aware Sampling with Screening

Hei Yi Mak, Flint Xiaofeng Fan, Luca A. Lanzendörfer, Cheston Tan, Wei Tsang Ooi, and Roger Wattenhofer
Abstract.

In this study, we delve into Federated Reinforcement Learning (FedRL) in the context of value-based agents operating across diverse Markov Decision Processes (MDPs). Existing FedRL methods typically aggregate agents’ learning by averaging the value functions across them to improve their performance. However, this aggregation strategy is suboptimal in heterogeneous environments where agents converge to diverse optimal value functions. To address this problem, we introduce the Convergence-AwarE SAmpling with scReening (CAESAR) aggregation scheme designed to enhance the learning of individual agents across varied MDPs. CAESAR is an aggregation strategy used by the server that combines convergence-aware sampling with a screening mechanism. By exploiting the fact that agents learning in identical MDPs are converging to the same optimal value function, CAESAR enables the selective assimilation of knowledge from more proficient counterparts, thereby significantly enhancing the overall learning efficiency. We empirically validate our hypothesis and demonstrate the effectiveness of CAESAR in enhancing the learning efficiency of agents, using both a custom-built GridWorld environment and the classical FrozenLake-v1 task, each presenting varying levels of environmental heterogeneity.

Key words and phrases:
Reinforcement learning, federated reinforcement learning, heterogeneous environments.

1. Introduction

Federated Reinforcement Learning (FedRL) Qi et al. (2021); Yang et al. (2020) is a burgeoning field in Reinforcement Learning. Distinct for its collaborative learning approach, FedRL enables distributed agents to learn collectively while maintaining the privacy of their local data — the raw trajectories sampled from the local environments. FedRL leverages techniques in Federated Learning (FL), notably Federated Averaging McMahan et al. (2017), to aggregate agent parameters to improve learning efficiency. While existing research on FedRL Khodadadian et al. (2022); Zhuo et al. (2019); Fan et al. (2023); Woo et al. (2023); Fan et al. (2021); Zhang et al. (2022); Dai et al. (2024) predominantly assumes homogeneous environments, where all local environments correspond to the same Markov Decision Process (MDP) Sutton and Barto (2018) with identical dynamics and rewards, real-world applications often defy this assumption. For instance, in the healthcare domain, FedRL holds promise for optimizing predictive models across various hospitals, each characterized by distinct patient demographics and disease patterns Xue et al. (2021). This diversity among patient populations and clinical manifestations leads to inherent heterogeneity within the data environments shaped by the MDPs.

This challenge is underscored in the research by Hao et al. (2022). While their work investigates FedRL in the context of heterogeneous environments, it primarily focuses on training a unified model to perform consistently across disparate local environments. This approach, akin to implementing a standard healthcare protocol across hospitals serving diverse patient populations, may prove to be impractical. Such a one-size-fits-all approach fails to accommodate the unique healthcare needs and specific disease prevalence of different communities, potentially resulting in suboptimal or even detrimental outcomes. This underscores the critical need for tailored approaches that respect and respond to the unique characteristics of each environment.

In contrast, our research is centered on scenarios where each agent learns a localized policy for its designated MDP. This is analogous to designing customized healthcare strategies for each hospital, taking into account the unique health demographics and local environmental influences of their patient population. We explore the potential of these agents to collaboratively enhance the learning of localized policies, each specifically tailored to the corresponding environment. A pivotal assumption in our work is the unknown nature of both the number of distinct MDPs and the specific assignments of agents to these MDPs.

In response to these challenges, we propose a convergence-aware adaptive sampling strategy for value-based agents in FedRL settings characterized by heterogeneous environments. This strategy is based on the insight that the value functions of agents optimizing for the same MDP are expected to converge towards a single optimal value function over time, thereby naturally reducing the variance in learning trajectories among these agents, or "peers." Preliminary experiments suggest that this strategy is effective in filtering out "non-peers" (agents whose MDPs diverge significantly from one another, leading to disparate optimal policies and value functions), but that it might inadvertently prioritize the inclusion of suboptimal peers. These are agents within the same MDP whose learning progress is less advanced, potentially anchoring the group to suboptimal points. To address this, we introduce an additional screening process, aimed at incorporating only those agents that exhibit better performance. This dual approach of adaptive sampling and selective screening mitigates the risk of suboptimal peer selection, enhancing the learning efficacy of agents in their respective MDPs.

In this paper, we address the challenges of training individual policies with environmental heterogeneity in FedRL. We begin by formulating the problem setup of FedRL in heterogeneous environments (Sec. 3) and proceed to examine various aggregation schemes (Sec. 4). We then introduce the Convergence-AwarE SAmpling with scReening (CAESAR) aggregation scheme, which tailors the averaged value function to each individual agent to effectively improve its learning (Sec. 4.5). CAESAR stands out for its dual-layered approach: first, it uses a convergence-aware sampling mechanism for efficient peer identification in diverse MDPs (Sec. 4.4); second, it incorporates a selective screening process (Sec. 4.6) to refine agent interactions, prioritizing only those agents that demonstrate superior performance. We empirically validate the effectiveness and robustness of CAESAR in improving agents' learning using GridWorld and FrozenLake-v1 environments engineered to exhibit environmental heterogeneity (Sec. 5). We have made our work publicly available and open-sourced (https://github.com/hughiemak/CAESAR), providing new perspectives and viable approaches for tackling the challenges of FedRL in heterogeneous settings.

2. Preliminaries

Markov Decision Processes (MDPs). In the realm of reinforcement learning, sequential decision-making problems are commonly modeled using MDPs Sutton and Barto (2018). An MDP is characterized by a 6-tuple $(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma,\rho)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ represents the action space, $\mathcal{P}(s'|s,a)$ defines the transition probabilities between states, $\mathcal{R}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ is the reward function, $\gamma$ is the discount factor, and $\rho$ is the initial state distribution.

Q-learning. Q-learning Watkins and Dayan (1992) stands as a cornerstone in classical reinforcement learning, operating as an off-policy temporal difference algorithm. In Q-learning, an agent learns an action-value function $Q:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ using a table. Each entry $Q(s,a)$, known as a Q-value, estimates the expected return of taking action $a\in\mathcal{A}$ in state $s\in\mathcal{S}$. The value of $Q(s,a)$ is updated by applying the Bellman equation:

\[ Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left(r + \gamma \max_{a'} Q(s',a')\right) \tag{1} \]

where $s$ is the current state, $a$ is the current action to be executed, $r$ is the immediate reward, $s'$ is the next state, and $\alpha$ is the learning rate. A decision policy $\pi_Q$ can then be obtained by exploiting the updated Q-values:

\[ \pi_Q(s_t) \leftarrow a_t = \arg\max_{a} Q(s_t,a). \tag{2} \]

The optimal action at state $s_t$ is defined as $a^*_t=\arg\max_{a} Q^*(s_t,a)$, where $Q^*(s_t,a)$ is the optimal Q-function, which gives the expected return for starting in state $s_t$, taking action $a$, and following the optimal policy thereafter.
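As a minimal illustration of Eqs. (1) and (2), the following Python sketch applies one tabular update and reads off the greedy action. The function names `q_update` and `greedy_action` and the nested-list Q-table layout are assumptions made for this example; they are not taken from the authors' code, and terminal-state handling is omitted.

```python
from typing import List

QTable = List[List[float]]  # Q[s][a]; hypothetical layout for this sketch


def q_update(Q: QTable, s: int, a: int, r: float, s_next: int,
             alpha: float, gamma: float) -> None:
    """Apply Eq. (1) in place for one transition (s, a, r, s_next)."""
    target = r + gamma * max(Q[s_next])  # bootstrapped one-step target
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target


def greedy_action(Q: QTable, s: int) -> int:
    """Eq. (2): index of the largest Q-value in state s (first index on ties)."""
    row = Q[s]
    return max(range(len(row)), key=row.__getitem__)


# Two states, two actions, all Q-values initialised to zero.
Q = [[0.0, 0.0], [0.0, 0.0]]
q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
best = greedy_action(Q, 0)  # action 1 now has the larger value in state 0
```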

Federated Reinforcement Learning (FedRL). Initially introduced by Zhuo et al. (2019), Federated Reinforcement Learning (FedRL) has gained increasing prominence, evidenced by its extensive application in various real-world scenarios (Nadiger et al., 2019; Liu et al., 2019; Yu et al., 2020; Wang et al., 2020; Fujita et al., 2022; Liang et al., 2023; Zhang et al., 2022; Fan et al., 2023) and its substantial theoretical development (Fan et al., 2021; Hao et al., 2022; Khodadadian et al., 2022; Woo et al., 2023; Jordan et al., 2024). Notably, Fan et al. (2021) conducted pioneering work on the robust convergence of federated policy gradients, demonstrating sublinear speedup. Khodadadian et al. (2022) and Woo et al. (2023) further advanced the field by showcasing linear speedup in federated Q-learning under Markovian sampling. Shen et al. (2023) established a linear speedup for federated Actor-Critic algorithms under i.i.d. sampling. A common assumption in these related works is the homogeneity of MDPs across all agents participating in FedRL. This perspective was expanded by Hao et al. (2022), who explored FedRL in the context of environmental heterogeneity. Their research primarily aimed at developing a global shared policy model within an imaginary MDP framework.

3. Federated Reinforcement Learning with Heterogeneous Environments

MDP Configuration. In the FedRL setting under consideration, we have $N$ agents and a collection of MDPs, where the number of distinct MDPs, $K$, is less than or equal to $N$. Each MDP, denoted as $M_k$, shares a common state space $\mathcal{S}$ and action space $\mathcal{A}$, but is uniquely defined by its transition dynamics $\mathcal{P}_k(s'|s,a)$ and reward function $\mathcal{R}_k:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$. Thus, the MDP $M_k$ is represented by the 6-tuple $(\mathcal{S},\mathcal{A},\mathcal{P}_k,\mathcal{R}_k,\gamma,\rho_k)$, where $\gamma$ is the discount factor and $\rho_k$ is the initial state distribution specific to MDP $M_k$. An assignment function $f:[N]\rightarrow[K]$ determines the allocation of each agent to these MDPs.

Heterogeneity in MDPs. The core of heterogeneity in this setting stems from the differences in transition dynamics and reward functions among the MDPs. An example of this heterogeneity is depicted in Fig. 1, where two MDPs share the same state and action spaces but have distinct reward functions. This diversity in dynamics and rewards exemplifies the complexity and variability agents encounter in heterogeneous environments.

Figure 1. Two heterogeneous MDPs. MDP $M_1$ rewards $-1$ for action $0$ and $+1$ for action $1$, while MDP $M_2$ rewards $+1$ for action $0$ and $-1$ for action $1$. The optimal value functions are $Q_1(s_0,0)=-1$, $Q_1(s_0,1)=1$ for $M_1$, and $Q_2(s_0,0)=1$, $Q_2(s_0,1)=-1$ for $M_2$, respectively. Averaging these value functions results in $\bar{Q}(s_0,0)=\bar{Q}(s_0,1)=0$, showing a misrepresentation of optimal values for both MDPs.

Operational Assumptions. A pivotal assumption in our approach is the unknown nature of both $K$ (the number of MDPs) and the assignment function $f$. This uncertainty adds a layer of complexity to the learning process, as agents must navigate and adapt to their assigned MDPs without prior knowledge of the overall system configuration.

Agent Learning and Objectives. Each agent in our system, denoted as $i$, is a value-based learner, employing techniques such as Q-learning for policy optimization. Every agent $i$ interacts solely with a local instantiation of its assigned MDP, $M_{f(i)}$, from which it gathers and analyzes sample trajectories to inform its learning. The primary goal for each agent is to optimize its action-value function $Q_i$, aiming to achieve optimal expected performance within its unique local environment. This focus on individual optimization within a shared learning framework underscores the challenge of balancing local adaptation with collaborative learning in FedRL.

Federated Updates in FedRL. Following existing FedRL work Fan et al. (2021); Khodadadian et al. (2022); Zhuo et al. (2019); Fan et al. (2023), a central server is available to coordinate the federated learning. We consider a FedRL training process where a federated update takes place every $H$ local updates. At each local update step, each agent performs a standard learning step using the value-based RL algorithm. During the federated update, for each agent $i$, the server will select a subset of agents $S\subseteq[N]$ and aggregate their value functions into a new value function $\bar{Q}$. For tabular Q-learning, a straightforward aggregation method is averaging the value functions (Q-tables) across selected agents:

\[ \bar{Q}(s,a) = \frac{1}{|S|}\sum_{j\in S} Q_j(s,a), \quad \forall s\in\mathcal{S}, a\in\mathcal{A}. \tag{3} \]

After aggregation, the server sends $\bar{Q}$ to agent $i$, which updates $Q_i$ towards $\bar{Q}$:

\[ Q_i(s,a) \leftarrow \beta\,Q_i(s,a) + (1-\beta)\,\bar{Q}(s,a), \quad \forall s\in\mathcal{S}, a\in\mathcal{A} \tag{4} \]

where $\beta\in[0,1]$ is a blending parameter controlling the extent of the update from the federated value function: $\beta=1$ leaves $Q_i$ unchanged, while $\beta=0$ replaces it with $\bar{Q}$.
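The federated update of Eqs. (3) and (4) can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the aggregation is the element-wise mean described in the text, and the helper names `average_q` and `blend`, the nested-list table layout, and the numbers used are hypothetical.

```python
from typing import List, Sequence

QTable = List[List[float]]  # Q[s][a]; hypothetical layout for this sketch


def average_q(tables: Sequence[QTable]) -> QTable:
    """Element-wise mean of the selected agents' Q-tables (Eq. (3))."""
    n = len(tables)
    n_states, n_actions = len(tables[0]), len(tables[0][0])
    return [[sum(t[s][a] for t in tables) / n for a in range(n_actions)]
            for s in range(n_states)]


def blend(Q_i: QTable, Q_bar: QTable, beta: float) -> QTable:
    """Eq. (4): keep a fraction beta of the local table, take the rest from Q_bar."""
    return [[beta * q + (1 - beta) * qb for q, qb in zip(row, row_bar)]
            for row, row_bar in zip(Q_i, Q_bar)]


# One state, two actions; two selected agents (made-up values).
Q_a = [[0.0, 2.0]]
Q_b = [[2.0, 4.0]]
Q_bar = average_q([Q_a, Q_b])          # element-wise mean of the two tables
Q_a_new = blend(Q_a, Q_bar, beta=0.75)  # agent a moves a quarter of the way
```

A value of `beta` close to 1 keeps the agent near its own estimate, which is the conservative choice when the composition of the selected subset is uncertain.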

A key challenge in the federated update process is determining the optimal subset $S$ for each agent without prior knowledge of $f$, the agent's specific environment, or direct access to its local trajectories. The selection of $S$ is pivotal in ensuring that the aggregated value function $\bar{Q}$ is conducive to agent $i$'s learning in its MDP. In Sec. 4, we will explore various aggregation schemes for selecting $S$.

4. Aggregation Schemes

In this section, we explore various schemes for selecting the subset of agents, $S$, for each agent $i$, culminating in the introduction of our novel CAESAR scheme.

4.1. Self

The Self scheme serves as a baseline, where agents learn independently, without federated updates. Using this scheme, Eq. (3) can be viewed as:

\[ \bar{Q} = Q_i, \quad \forall i\in[N] \]

implying no external influence during the federated update phase. Consequently, the selected subset $S$ only includes the agent itself:

\[ S=\{i\}. \]

It is a fundamental expectation that, for agents to be incentivized to engage in the federated process, any employed selection scheme must yield performance beyond what is achievable with Self. This is essential to justify the collaborative effort in the federated learning context.

4.2. All

The All scheme is another baseline, corresponding to the canonical FedRL averaging scheme where all agents are included in the aggregation of Eq. (3):

\[ \bar{Q}(s,a) = \frac{1}{N}\sum_{j=1}^{N} Q_j(s,a). \]

In this scheme, the selected subset always includes all agents:

\[ S=\{1,2,\dots,N\}. \]

In a FedRL setting characterized by heterogeneous local environments, the All scheme may impede the learning process and potentially obstruct convergence. This issue arises because the agents' value functions, denoted as $Q_j$, are being optimized for different MDPs. In essence, they are converging towards disparate optimal value functions. Consequently, value functions optimized for one MDP might adversely affect the aggregated value function $\bar{Q}$, resulting in misleading guidance for agent $i$. To illustrate, consider the scenario with two simple MDPs, $M_1$ and $M_2$, as shown in Fig. 1. Suppose agents 1 and 2 are learning in these MDPs respectively and have both reached their optimal value functions: $Q_1(s_0,0)=-1$, $Q_1(s_0,1)=1$ for MDP $M_1$, and $Q_2(s_0,0)=1$, $Q_2(s_0,1)=-1$ for MDP $M_2$. However, the averaged Q-values, $\bar{Q}(s_0,0)$ and $\bar{Q}(s_0,1)$, are both 0. These averaged values are suboptimal for both MDPs. Updating $Q_1$ and $Q_2$ based on $\bar{Q}$ would therefore misguide the agents and steer them away from their currently optimal values, highlighting the challenge of aggregation in heterogeneous environments.
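To make the effect concrete, the arithmetic for the Fig. 1 example can be written out. The blending weight $\beta=0.5$ used in the last line is an assumed value chosen purely for illustration:

```latex
\[
\begin{aligned}
\bar{Q}(s_0,0) &= \tfrac{1}{2}\bigl(Q_1(s_0,0)+Q_2(s_0,0)\bigr) = \tfrac{1}{2}(-1+1) = 0,\\
\bar{Q}(s_0,1) &= \tfrac{1}{2}\bigl(Q_1(s_0,1)+Q_2(s_0,1)\bigr) = \tfrac{1}{2}(1-1) = 0,\\
Q_1(s_0,1) &\leftarrow \beta\,Q_1(s_0,1) + (1-\beta)\,\bar{Q}(s_0,1) = 0.5\cdot 1 + 0.5\cdot 0 = 0.5.
\end{aligned}
\]
```

Under this assumed weight, a single federated update halves agent 1's estimate for its best action even though that estimate was already optimal, and the same happens symmetrically to agent 2's estimate for action 0.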

4.3. Peers

The Peers scheme is an unrealistic scheme in our setting that serves as a hypothetical benchmark. It operates under the assumption of having prior knowledge of the MDP assignments, denoted as $f$, and includes only those agents assigned to the same MDP as agent $i$. These agents are referred to as the ‘peers’ of agent $i$. The selected subset of agents is therefore

\[ S=\{j\in[N] : f(i)=f(j)\}. \]

Such a presumption renders it impractical in scenarios where this information is not available, i.e., the server lacks insight into the peers of agent i𝑖iitalic_i. Despite this, the scheme serves as a valuable benchmark, illustrating the potential advantages of precise, environment-specific aggregation, such that:

\[ \bar{Q}(s,a)=\frac{1}{|S|}\sum_{j\in S}Q_j(s,a), \quad S=\{j\in[N] : f(i)=f(j)\}. \]

Contrasting with the All scheme (Sec. 4.2), this conceptual approach offers greater efficiency by exclusively incorporating value functions that are optimized for the same MDP. This selective aggregation ensures that value functions from disparate MDPs, which could potentially mislead the learning process, are not included. Furthermore, this scheme provides a distinct advantage over the Self scheme, where agents learn in isolation. By leveraging the collective knowledge of agents assigned to the same MDP, it enables a more targeted and effective aggregation of value functions, enhancing the overall learning effectiveness.
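The three baseline schemes differ only in how the subset $S$ is chosen for agent $i$, which the following Python sketch makes explicit. The function names are hypothetical, agents are indexed from 0 rather than 1, and the assignment `f` is a made-up example; in the actual setting the server does not know `f`, which is why Peers is only a benchmark.

```python
from typing import List


def select_self(i: int) -> List[int]:
    """Self: agent i aggregates only its own value function."""
    return [i]


def select_all(n_agents: int) -> List[int]:
    """All: every agent is included, regardless of its MDP."""
    return list(range(n_agents))


def select_peers(i: int, f: List[int]) -> List[int]:
    """Peers: agents assigned to the same MDP as agent i (requires the true f)."""
    return [j for j in range(len(f)) if f[j] == f[i]]


# Hypothetical assignment: agents 0 and 2 share one MDP, agents 1 and 3 another.
f = [0, 1, 0, 1]
peers_of_0 = select_peers(0, f)
```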

4.4. Sampling

Inspired by the advantageous attributes of the hypothetical Peers scheme, we explore the feasibility of devising a similar selection scheme. Our goal is to accurately identify the peers of agent $i$ without relying on the prior assumption of peer knowledge inherent to the Peers approach. This task is especially challenging in our scenario due to the lack of prior knowledge about each agent's assigned MDP and the absence of direct access to local trajectories at the server.

To circumvent this, we propose utilizing the convergence of agent value functions as a heuristic for peer detection. The main idea is that if the value functions $Q_i$ and $Q_j$ are both being optimized for the same MDP, they should converge towards a unique optimal value function $Q^*$ over time. As a result, the values $Q_i(s,a)$ and $Q_j(s,a)$ for all state-action pairs $(s,a)$ will progressively become more similar. We empirically validate this convergence behavior in a gridworld setting, as detailed in Fig. 4 (Sec. 5.2).

Given this intuition, the convergence of value functions emerges as a practical heuristic for estimating whether two agents are learning in the same MDP. This insight leads to our convergence-aware sampling scheme, Sampling, wherein the subset $S$ is sampled based on probabilities $p_{i1},\dots,p_{iN}$. Each probability $p_{ij}$ quantifies the likelihood of including agent $j$ in $S$ and is dynamically adjusted based on the observed convergence between $Q_i$ and $Q_j$. At the onset of training, the server initializes an $N\times N$ matrix $p$, where the entry $p_{ij}$ is set as:

\[ p_{ij} = \begin{cases} p_0 & i\neq j,\\ 1 & i=j, \end{cases} \quad \forall i,j\in[N] \tag{5} \]

where $p_0$ functions as an initial assumption or ‘prior’ about the task, reflecting the preliminary likelihood of agents being peers before any learning occurs. By default, it can be assigned a value of $0$ to encourage self-learning at the start of training when the convergence information is insufficient.
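A minimal Python sketch of the initialization in Eq. (5) is given below. The function name `init_probabilities` and the nested-list representation of the matrix are assumptions made for illustration only.

```python
from typing import List


def init_probabilities(n_agents: int, p0: float = 0.0) -> List[List[float]]:
    """Build the N x N matrix of Eq. (5): 1 on the diagonal, p0 elsewhere."""
    return [[1.0 if i == j else p0 for j in range(n_agents)]
            for i in range(n_agents)]


# With the default prior p0 = 0, every agent initially selects only itself.
p = init_probabilities(3)
```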

Prior to each federated update, the server updates the entries $p_{ij}$ of the probability matrix $p$. This update is contingent upon evaluating the evolving similarity between the value functions $Q_i$ and $Q_j$. Specifically, the server assesses how the dissimilarity of these value functions has changed relative to their values observed $H$ steps ago. The dissimilarity between two value functions $Q$ and $Q'$ is defined as the mean absolute difference across all state-action pairs:

\[ d(Q,Q') = \frac{1}{|\mathcal{S}|\times|\mathcal{A}|}\sum_{s,a}\left|Q(s,a)-Q'(s,a)\right|. \]
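For tabular value functions, this dissimilarity is straightforward to compute. The sketch below assumes Q-tables stored as nested lists of equal shape; the function name `dissimilarity` is hypothetical.

```python
from typing import List

QTable = List[List[float]]  # Q[s][a]; hypothetical layout for this sketch


def dissimilarity(Q: QTable, Q_other: QTable) -> float:
    """Mean absolute difference over all state-action pairs."""
    n_entries = len(Q) * len(Q[0])
    total = sum(abs(q - qo)
                for row, row_other in zip(Q, Q_other)
                for q, qo in zip(row, row_other))
    return total / n_entries


# The optimal tables of the two MDPs in Fig. 1 differ by 2 in every entry.
d_fig1 = dissimilarity([[-1.0, 1.0]], [[1.0, -1.0]])
```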

Let $Q_k^{(t)}$ be agent $k$'s current value function and $Q_k^{(t-H)}$ be agent $k$'s value function $H$ steps ago. For each pair of agents $\{i,j\}$, we update

pij{min(pij+δ,1)if d(Qi(tH),Qj(tH))d(Qi(t),Qj(t))>ξ,max(pijδ,0)otherwise.subscript𝑝𝑖𝑗casessubscript𝑝𝑖𝑗𝛿1if 𝑑superscriptsubscript𝑄𝑖𝑡𝐻superscriptsubscript𝑄𝑗𝑡𝐻𝑑superscriptsubscript𝑄𝑖𝑡superscriptsubscript𝑄𝑗𝑡𝜉subscript𝑝𝑖𝑗𝛿0otherwise.p_{ij}\leftarrow\begin{cases}\min(p_{ij}+\delta,1)&\text{if }d(Q_{i}^{(t-H)},Q% _{j}^{(t-H)})-d(Q_{i}^{(t)},Q_{j}^{(t)})>\xi,\\ \max(p_{ij}-\delta,0)&\text{otherwise.}\end{cases}italic_p start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ← { start_ROW start_CELL roman_min ( italic_p start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT + italic_δ , 1 ) end_CELL start_CELL if italic_d ( italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t - italic_H ) end_POSTSUPERSCRIPT , italic_Q start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t - italic_H ) end_POSTSUPERSCRIPT ) - italic_d ( italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT , italic_Q start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) > italic_ξ , end_CELL end_ROW start_ROW start_CELL roman_max ( italic_p start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT - italic_δ , 0 ) end_CELL start_CELL otherwise. end_CELL end_ROW

where $\delta > 0$ and $\xi \geq 0$. Under this rule, if the dissimilarity between $Q_i$ and $Q_j$ decreases by more than $\xi$ over the time window $H$, the server increases $p_{ij}$ by $\delta$, which raises the likelihood of agent $j$ being selected for agent $i$’s aggregation subset. The time window $H$ acts as a temporal frame of reference, enabling the server to assess changes in similarity over a defined period. Conversely, if $Q_i$ and $Q_j$ do not exhibit the required degree of convergence over the time window, $p_{ij}$ is reduced by $\delta$. Persistent convergence trends therefore lead to a gradual increase in $p_{ij}$, so the Sampling scheme adapts its selection over time and increasingly favors agents whose value functions tend to converge. The parameters $\delta$ and $\xi$ control the sensitivity of the $p_{ij}$ adjustments and the required degree of convergence, respectively.
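The update rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `dissimilarity` and `update_p_matrix`, the list-of-lists Q-table layout, and the toy values are all hypothetical. It also assumes that $p$ is kept symmetric, since the rule is stated per unordered pair of agents.

```python
from typing import List

QTable = List[List[float]]  # Q[s][a], one row per state (illustrative layout)


def dissimilarity(q: QTable, q_prime: QTable) -> float:
    """Mean absolute difference d(Q, Q') over all state-action pairs."""
    total = 0.0
    count = 0
    for row, row_prime in zip(q, q_prime):
        for value, value_prime in zip(row, row_prime):
            total += abs(value - value_prime)
            count += 1
    return total / count


def update_p_matrix(p, delta, xi, q_old, q_new):
    """Apply the update rule in place to every unordered pair {i, j}.

    Assumption: p[i][j] and p[j][i] are kept equal.
    """
    n = len(p)
    for i in range(n):
        for j in range(i + 1, n):
            # Positive when the two value functions moved closer together.
            drop = dissimilarity(q_old[i], q_old[j]) - dissimilarity(q_new[i], q_new[j])
            if drop > xi:
                new_value = min(p[i][j] + delta, 1.0)
            else:
                new_value = max(p[i][j] - delta, 0.0)
            p[i][j] = p[j][i] = new_value
    return p


# Toy case: agents 0 and 1 move closer, agent 2 drifts away from both.
q_old = [[[0.0], [0.0]], [[1.0], [1.0]], [[0.5], [0.5]]]
q_new = [[[0.4], [0.4]], [[0.6], [0.6]], [[2.0], [2.0]]]
p = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
update_p_matrix(p, delta=0.1, xi=0.05, q_old=q_old, q_new=q_new)
```

In this toy case only the pair $\{0, 1\}$ gains probability mass, while the entries involving agent 2 stay clipped at $0$.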

ServerExecutes($H, \sigma, p_0, \delta, \xi, \beta$):
    initialize value functions $Q_i$ for each agent $i \in [N]$
    initialize $Q_i^{old} \leftarrow Q_i$ for each agent $i \in [N]$
    initialize matrix $p$: $p_{ij} = \begin{cases} p_0 & i \neq j, \\ 1 & i = j, \end{cases} \quad \forall i, j \in [N]$
    for each step $t = 1, \dots, T$ do
        LocalUpdate($i$) for each agent $i \in [N]$
        if $t \bmod H = 0$ then
            $g_k \leftarrow$ EvalLocalPerformance($k$), $\forall k \in [N]$
            UpdatePMatrix($p, \delta, \xi, \{Q_k^{old}\}_k, \{Q_k\}_k$)
            FederatedUpdate($i, \beta, p, \{Q_k\}_k, \{g_k\}_k$) for each agent $i \in [N]$
            $Q_i^{old} \leftarrow Q_i$ for each agent $i \in [N]$

UpdatePMatrix($p, \delta, \xi, \{Q_k^{old}\}_k, \{Q_k\}_k$):
    for agent $i = 1$ to $N$ do
        for agent $j = i + 1$ to $N$ do
            $p_{ij} \leftarrow \begin{cases} \min(p_{ij} + \delta, 1) & \text{if } d(Q_i^{old}, Q_j^{old}) - d(Q_i, Q_j) > \xi, \\ \max(p_{ij} - \delta, 0) & \text{otherwise.} \end{cases}$

FederatedUpdate($i, \beta, p, \{Q_k\}_k, \{g_k\}_k$):
    initialize $S' = \{\}$
    for $j \in [N]$ do
        add $j$ to $S'$ with probability $p_{ij}$
    $S = \{j : j \in S' \text{ and } g_j > g_i\}$
    Construct $\bar{Q}$: $\bar{Q}(s,a) = \sum_{j \in S} Q_j(s,a), \quad \forall s \in \mathcal{S}, a \in \mathcal{A}$
    Update $Q_i$: $Q_i(s,a) \leftarrow \beta Q_i(s,a) + (1 - \beta) \bar{Q}(s,a), \quad \forall s \in \mathcal{S}, a \in \mathcal{A}$
Algorithm 1 CAESAR

4.5. CAESAR

As will be discussed in Sec. 5, the Sampling scheme excels at filtering out non-peers from the set $S$, but it also has a potential downside: the server might inadvertently incorporate only peers that are underperforming, confining slow-progressing agents to suboptimal points in the value function space. To mitigate this issue, we introduce an additional screening process to refine agent interactions, prioritizing only those agents that demonstrate superior performance, culminating in the CAESAR aggregation scheme.

In the CAESAR scheme, we initially select a subset of agents based on the probabilities $p_{i1}, \dots, p_{iN}$, following the same process as in Sampling. The primary objective of this sampling step, as in Sampling, is to identify probable peers by assessing the convergence trends of their value functions. Subsequently, we introduce a screening layer, which focuses on the comparative performance of these selected agents. The rationale behind this additional step is to avoid updating the value function $Q_i$ towards the average of lower-performing peers, which could hinder the convergence of $Q_i$ to its optimal state.

To implement this, the local performance of each agent $k$, $g_k \approx \mathbb{E}_{M_k, \pi_{Q_k}}\left[\sum_{t=0}^{\infty} \gamma^t \mathcal{R}_k(s_t, a_t)\right]$, is measured prior to the federated update. During the update, for a given agent $i$, the server first samples a preliminary subset of agents, $S'$, according to the probabilities $p_{i1}, \dots, p_{iN}$. It then refines $S'$ by retaining only those agents whose performance $g_j$ exceeds that of agent $i$ ($g_j > g_i$). The final subset for aggregation is thus defined as:

\[
S = \{j : j \in S' \text{ and } g_j > g_i\}.
\]

This resulting subset $S$ is then used to assemble the aggregated value function $\bar{Q}$ for updating $Q_i$ of each agent. This completes the outline of the CAESAR scheme. For a detailed procedural breakdown, refer to the pseudocode presented in Algorithm 1.
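As a rough illustration of the sampling and screening steps for a single agent $i$, the following Python sketch may help. It is not the authors' code: the function name `caesar_federated_update` and the toy values are hypothetical. Two assumptions are made explicit. First, Algorithm 1 writes $\bar{Q}$ as a sum over $S$, whereas the sketch divides by $|S|$ to match the prose description of averaging. Second, when $S$ is empty, the sketch leaves $Q_i$ unchanged, a case the text does not specify.

```python
import random
from typing import List, Optional, Sequence

QTable = List[List[float]]  # Q[s][a] (illustrative layout)


def caesar_federated_update(
    i: int,
    beta: float,
    p: Sequence[Sequence[float]],
    q_tables: Sequence[QTable],
    g: Sequence[float],
    rng: Optional[random.Random] = None,
) -> QTable:
    """Return agent i's value function after one CAESAR-style update."""
    if rng is None:
        rng = random.Random()
    n = len(q_tables)

    # Sampling: agent j enters S' with probability p[i][j].
    sampled = [j for j in range(n) if rng.random() < p[i][j]]
    # Screening: keep only agents that outperform agent i.
    selected = [j for j in sampled if g[j] > g[i]]

    if not selected:
        # Assumption: with an empty S, agent i keeps its current values.
        return [row[:] for row in q_tables[i]]

    updated = []
    for s, row in enumerate(q_tables[i]):
        new_row = []
        for a, value in enumerate(row):
            # Assumption: Q-bar is the mean over S, not the raw sum.
            q_bar = sum(q_tables[j][s][a] for j in selected) / len(selected)
            new_row.append(beta * value + (1.0 - beta) * q_bar)
        updated.append(new_row)
    return updated


# Deterministic toy case: probabilities of 0 or 1 remove the randomness.
q_tables = [[[0.0, 0.0]], [[1.0, 2.0]], [[9.0, 9.0]]]
p = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
g = [0.2, 0.8, 0.9]
new_q0 = caesar_federated_update(0, beta=0.5, p=p, q_tables=q_tables, g=g)
new_q2 = caesar_federated_update(2, beta=0.5, p=p, q_tables=q_tables, g=g)
```

Here agent 0 moves halfway towards agent 1, the only sampled agent that outperforms it. Agent 2 has the highest performance but is never sampled for agent 0 because $p_{02} = 0$, and its own update is a no-op because no agent outperforms it.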

4.6. Screen

As a complementary approach, the Screen scheme focuses solely on the screening process based on local performance, without considering convergence trends:

\[
S = \{j : g_j > g_i\}.
\]

Screen selects agents that are performing better than the target agent, but may include those from different MDPs. This scheme tests the efficacy of performance-based selection in isolation.
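The following small Python sketch illustrates why performance-only selection can pull in non-peers. The function name `screen_subset`, the performance values, and the agent-MDP assignment are hypothetical and chosen purely for illustration.

```python
from typing import List, Sequence


def screen_subset(i: int, g: Sequence[float]) -> List[int]:
    """Indices of agents whose local performance exceeds agent i's."""
    return [j for j in range(len(g)) if g[j] > g[i]]


# Hypothetical performances; agents 0-1 share one MDP, agents 2-3 another.
g = [0.1, 0.9, 0.7, 0.3]
mdp_of = [1, 1, 2, 2]

subset = screen_subset(3, g)
# Agents selected for agent 3 that do not share its MDP.
non_peers = [j for j in subset if mdp_of[j] != mdp_of[3]]
```

For agent 3, the selected subset contains agent 1, which performs well but learns in a different MDP.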

Each scheme presents a distinct approach to aggregating value functions within a FedRL framework. In Sec. 5, we assess how effectively these schemes enhance individual agent performance, particularly in heterogeneous environments, with the aim of identifying which aggregation strategies are most effective in practical FedRL settings.

5. Empirical Evaluation

5.1. Experimental Settings

In this study, we conduct a comparative analysis of the six aggregation schemes discussed in Sec. 4. For this comparison, we employ Q-learning agents within two distinct environments: a custom-built GridWorld environment and the well-known FrozenLake-v1 task from the OpenAI Gym toolkit (Brockman et al., 2016).

The GridWorld is designed as a 1-dimensional discrete environment, characterized by a state space $\mathcal{S} = \{-5, -4, \dots, 4, 5\}$ and a binary action space $\mathcal{A} = \{0, 1\}$. The initial state for each episode is $0$, and the terminal states are $5$ and $-5$. The agent’s actions determine the state transitions: action $0$ moves the state from $x$ to $x-1$, while action $1$ advances the state from $x$ to $x+1$. Two distinct versions of this environment are considered, corresponding to two different MDPs. Fig. 2 provides a visual representation of these GridWorld environments. In the first MDP (MDP 1), a transition from state $4$ to $5$ yields a reward of $+1$, and a transition from $-4$ to $-5$ results in a reward of $-1$. All other state transitions yield a reward of $0$. The second MDP (MDP 2) inverts the reward structure of MDP 1, such that $r_2(s,a) = -r_1(s,a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$.
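The environment described above is small enough to sketch directly. The class below is an illustrative reconstruction from the description, not the authors' code: the class name `GridWorld`, its `reset`/`step` interface, and the helper `rollout` are assumptions.

```python
class GridWorld:
    """1-D GridWorld: states -5..5, start at 0, terminal states -5 and 5."""

    def __init__(self, mdp: int = 1):
        if mdp not in (1, 2):
            raise ValueError("mdp must be 1 or 2")
        # MDP 2 flips the sign of every reward of MDP 1.
        self.sign = 1 if mdp == 1 else -1
        self.state = 0
        self.done = False

    def reset(self) -> int:
        self.state = 0
        self.done = False
        return self.state

    def step(self, action: int):
        if self.done:
            raise RuntimeError("episode has ended; call reset()")
        if action not in (0, 1):
            raise ValueError("action must be 0 (left) or 1 (right)")
        self.state += 1 if action == 1 else -1
        if self.state == 5:
            reward = self.sign       # +1 in MDP 1, -1 in MDP 2
        elif self.state == -5:
            reward = -self.sign      # -1 in MDP 1, +1 in MDP 2
        else:
            reward = 0
        self.done = self.state in (-5, 5)
        return self.state, reward, self.done


def rollout(env: GridWorld, action: int, steps: int = 5):
    """Repeat one action from the start state and return the last transition."""
    env.reset()
    result = None
    for _ in range(steps):
        result = env.step(action)
    return result


right_in_mdp1 = rollout(GridWorld(mdp=1), action=1)
right_in_mdp2 = rollout(GridWorld(mdp=2), action=1)
left_in_mdp2 = rollout(GridWorld(mdp=2), action=0)
```

Walking right for five steps ends the episode with reward $+1$ in MDP 1 and $-1$ in MDP 2, which is why the two groups of agents favor opposite directions.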

Figure 2. Two GridWorld MDPs. Their initial states are $0$. In MDP 1 (top), transitioning from state $4$ to $5$ generates a reward of $+1$ and transitioning from state $-4$ to $-5$ yields a reward of $-1$. In MDP 2 (bottom), the signs of the rewards are flipped.

The FrozenLake-v1 environment is a 2-dimensional discrete task that lends itself to modeling environmental heterogeneity. In this environment, agents are tasked with navigating to a designated goal while avoiding hazardous holes. The environment has a four-directional action space, and episodes end with a reward of $+1$ upon reaching the goal, or $0$ if the agent falls into a hole or exhausts the allowed steps. The heterogeneity of the FrozenLake-v1 environment is induced by distinct map configurations, as shown in Fig. 3. Each map represents a unique instantiation of a local MDP within the environment, characterized by its own specific arrangement of holes and paths, necessitating different strategies for successful navigation. This diversity in maps provides a practical scenario to assess how FedRL algorithms perform across dynamically varied MDPs.

Figure 3. FrozenLake-v1 environments generated by three different maps: (a) Map 0, (b) Map 1, (c) Map 2. The agent’s task is to navigate to the goal (the gift box) without falling into the holes.

5.2. Hypothesis Verification Using GridWorld

For GridWorld, we partition $N=20$ agents into $K=2$ groups, each comprising $10$ agents. These groups are then assigned to two different MDPs, $M_1$ and $M_2$, as depicted in Fig. 2. Each agent is trained for $T=10000$ steps with an exploration rate $\epsilon=0.1$, and receives a federated update every $H=100$ steps.
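For readers less familiar with the local learner, a standard tabular Q-learning step with $\epsilon$-greedy exploration can be sketched as follows. This is the textbook update, not code from the paper. The function names, the learning rate `alpha`, and the discount factor `gamma` are assumptions, since this section states only the exploration rate.

```python
import random
from typing import Dict, List


def epsilon_greedy(q_row: List[float], epsilon: float, rng: random.Random) -> int:
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties between equally valued actions at random.
    return rng.choice([a for a, v in enumerate(q_row) if v == best])


def q_learning_step(
    q: Dict[int, List[float]],
    s: int,
    a: int,
    r: float,
    s_next: int,
    done: bool,
    alpha: float = 0.1,   # assumed learning rate
    gamma: float = 0.9,   # assumed discount factor
) -> float:
    """One tabular Q-learning update; returns the new Q(s, a)."""
    target = r if done else r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]


# Toy usage on the GridWorld state space: moving right from 4 ends the episode.
q = {s: [0.0, 0.0] for s in range(-5, 6)}
new_value = q_learning_step(q, s=4, a=1, r=1.0, s_next=5, done=True)
greedy_action = epsilon_greedy([0.2, 0.7], epsilon=0.0, rng=random.Random(0))
```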

Convergence Among Peers. In the relatively simple GridWorld environment, we capture the agents’ Q-tables every $H$ steps. Fig. 4 shows how the Q-values $Q_i(s,a)$ for various state-action pairs $(s,a)$ evolve for all agents $i \in [N]$ under Self (independent learning). The optimal values of these state-action pairs differ between $M_1$ and $M_2$. Notably, Q-values among peers converge towards the optimal values for their respective MDPs over time, supporting the use of Q-value convergence as a heuristic for detecting probable peers, as elaborated in Sec. 4.4.
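To make the difference between the optimal values of the two MDPs concrete, the sketch below computes $Q^*$ for both GridWorld variants by value iteration. It is an illustration only: the function names are hypothetical, and the discount factor $\gamma = 0.9$ is an assumption, as this section does not state the value used in the experiments.

```python
GAMMA = 0.9  # assumed discount factor, not taken from the paper

NON_TERMINAL = range(-4, 5)
ACTIONS = (0, 1)  # 0 moves left, 1 moves right


def next_state(s: int, a: int) -> int:
    return s + (1 if a == 1 else -1)


def reward(s: int, a: int, sign: int) -> int:
    """Reward of MDP 1 when sign = +1 and of MDP 2 when sign = -1."""
    nxt = next_state(s, a)
    if nxt == 5:
        return sign
    if nxt == -5:
        return -sign
    return 0


def optimal_q(sign: int, gamma: float = GAMMA, sweeps: int = 200):
    """In-place value iteration over all non-terminal state-action pairs."""
    q = {(s, a): 0.0 for s in NON_TERMINAL for a in ACTIONS}
    for _ in range(sweeps):
        for s in NON_TERMINAL:
            for a in ACTIONS:
                nxt = next_state(s, a)
                # Terminal states have no future value.
                future = 0.0 if nxt in (-5, 5) else max(q[(nxt, b)] for b in ACTIONS)
                q[(s, a)] = reward(s, a, sign) + gamma * future
    return q


q1 = optimal_q(sign=1)    # MDP 1
q2 = optimal_q(sign=-1)   # MDP 2
```

Under this assumed $\gamma$, the pair $(s=-4, a=0)$ has optimal value $-1$ in MDP 1 and $+1$ in MDP 2, while $(s=-4, a=1)$ has optimal value $\gamma^8$ in MDP 1 and $\gamma^2$ in MDP 2, so averaging the two groups' Q-tables would pull both away from their targets.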

Figure 4. Convergence of Q-values among peers in GridWorld under Self. Q-values of $M_1$ agents (blue) and $M_2$ agents (orange) converge to their respective optimal values (black dotted lines) for state-action pairs $(s=-4, a=\cdot)$ and $(s=-3, a=\cdot)$ in GridWorld. $\epsilon$ is set to $0.9$ to speed up convergence.

Comparative Performance Analysis. Fig. 5 illustrates the average performance (over 30 random seeds) of all agents under different aggregation schemes in GridWorld. We observe that All is outperformed by Peers, affirming our hypothesis that including all agents in $S$ to compute $\bar{Q}$ according to Eq. (3) is less effective in heterogeneous environments. The slower learning progress observed under Screen is attributed to its selection based solely on local performance, which often includes high-performing agents from different MDPs. Significantly, CAESAR shows comparable results to the hypothetical approach Peers, which operates under the assumption of perfect knowledge about agent-MDP assignments. Remarkably, CAESAR surpasses both Sampling and Screen, highlighting the synergistic effect of their combination on learning enhancement.

Figure 5. Average performance of the $N=20$ agents in GridWorld under different averaging schemes with exploration rate $\epsilon=0.1$. The plot averages independent runs over 30 random seeds, and the shaded regions represent the $95\%$ confidence intervals.

Figure 6. The matrix $p$ as a heatmap (yellow and purple indicate $1$ and $0$ respectively) at $5$ different time points under Sampling. The numbers on the axes correspond to the agents, where agents $0$ to $9$ are assigned to $M_1$ and agents $10$ to $19$ are assigned to $M_2$. The color of the cell $(i,j)$ indicates the probability of selecting agent $j$ for agent $i$.

Analysis of Q-Value Evolution. To understand the critical role of the screening process in the CAESAR scheme, we track the progression of Q-values throughout the training period. Fig. 7 and Fig. 8 present the evolution of Q-values for agents assigned to $M_1$ under the Sampling and CAESAR schemes, respectively. These plots are generated from training sessions with the same random seed and initial agent configurations. Under Sampling, only two agents are able to approximate the optimal Q-values (indicated by black dotted lines), while the remaining agents stagnate at suboptimal points. In contrast, under CAESAR, all agents converge uniformly and rapidly to the optimal values.

Figure 7. Q-values of the $M_1$ agents under Sampling. Two $M_1$ agents, Agent 5 (red curve) and Agent 6 (green curve), exhibit fast learning progress and converge to the true optimal values (black dotted lines), but the remaining $M_1$ agents (blue curves), Agents 0, 1, 2, 3, 4, 7, 8, 9, converge to non-optimal values.

Figure 8. Under CAESAR, Q-values of all $M_1$ agents converge quickly to the true optimal values (black dotted lines).

Figure 9. Average performance of the $N=20$ agents in FrozenLake-v1 under different averaging schemes with exploration rate $\epsilon=0.1$ in all three settings: (a) Homogeneous environments, (b) Random heterogeneous environments, (c) Strongly heterogeneous environments. The plots average independent runs over 30 random seeds, and the shaded regions represent the $95\%$ confidence intervals.

To gain a deeper understanding of the dynamics at play within the Sampling scheme, we analyze the changes in the $p$-matrix (Sec. 4.4) over various training stages, as illustrated in Fig. 6. This analysis reveals that Sampling is highly effective in filtering out non-peers, consistently selecting them with near-zero probability from timestep $t=4000$ onwards. However, an intriguing behavior is observed: Sampling tends to overlook agents who are advancing quickly in their learning curve, opting instead for peers with slower progress rates. Specifically, at $t=4000$ (the third plot of Fig. 6), Sampling assigns negligible probabilities to aggregating the values of the fast learners, Agents $5$ and $6$, into the updates of the slower-progressing peers, namely Agents $0, 1, 2, 3, 4, 7, 8, 9$ (see Fig. 7). Despite the evident progress of the fast learners, the Sampling scheme leads slower peers to learn primarily from each other, gravitating towards a consensus that strays from the optimal value. Such a strategy, while fostering a form of convergence, risks cementing the learning of slower-progressing agents around suboptimal values.

In contrast, CAESAR circumvents this issue through its dual-layered approach: it employs the sampling step of Sampling to identify and exclude non-peers, and the screening step of Screen to prioritize fast-progressing agents based on their local performance, as evident in Fig. 8. This selection tends to aggregate knowledge from faster-progressing peers, whose value functions $Q_i$ are closer to optimal for the same MDP. Hence, CAESAR not only avoids the pitfalls of Sampling but also facilitates a more effective knowledge transfer, significantly enhancing the learning efficiency across agents.

5.3. Effectiveness Evaluation Using FrozenLake-v1

In our study using FrozenLake-v1, we maintain the same experimental settings as in GridWorld, with $N=20$ agents divided into $K=2$ groups of 10 agents each, assigned to two MDPs $M_1$ and $M_2$, respectively. We assess the performance of the aggregation schemes under the following scenarios, each offering a different level of environmental heterogeneity:

  (1) Homogeneous environments: $M_1$ and $M_2$ are identical, both generated using the same random map with 4 holes. An example of such a map is shown in Fig. 3(a).

  (2) Randomly heterogeneous environments: $M_1$ and $M_2$ are distinct, created using two random maps with differing positions of the 4 holes.

  (3) Strongly heterogeneous environments: Maps 1 and 2, as depicted in Fig. 3(b) and (c) respectively, exhibit a significant disparity in difficulty levels, with Map 1 being the easier and Map 2 the more challenging. The two maps are designed to have substantial differences in their optimal Q-functions.

Fig. 9 shows the average performance of all agents across the different aggregation schemes in these scenarios. The results reveal that CAESAR consistently demonstrates robust performance in all three scenarios, contrasting with other schemes that struggle in at least one scenario.

In Scenario 1 (Fig. 9(a)), with identical MDPs $M_1 = M_2$, All demonstrates superior learning outcomes compared to Peers. This aligns with expectations, as all agents are engaged in the same task, making the inclusion of the entire agent pool in $S$ more effective for leveraging collective insights. In this context, All benefits from a broader knowledge base than Peers, which limits its focus to peers, hence reducing the number of participating agents.

Conversely, in Scenario 2 (Fig. 9(b)), where $M_1$ and $M_2$ differ, Peers slightly outperforms All. This indicates that peer-based learning is more advantageous when agents are dealing with different MDPs, as it enables more targeted knowledge sharing.

The contrast becomes more pronounced in Scenario 3 (Fig. 9(c)), where Peers significantly surpasses All, with the latter even falling behind Self (independent learning). This scenario underscores the importance of excluding non-peers from $S$ in heterogeneous environments. The discrepancy arises from the differing difficulty levels of $M_1$ (Map 1) and $M_2$ (Map 2). Agents in the simpler $M_1$ quickly master the task, leading to high value estimates of $Q(s_0, a = \rightarrow)$, which are not optimal for $M_2$, where the action $a = \rightarrow$ often leads to holes. Therefore, including $M_1$ agents’ value functions in $S$ can detrimentally affect the learning progress of $M_2$ agents, as their optimal Q-value functions diverge significantly.

Screen displays a notably inconsistent performance pattern across different levels of heterogeneity, excelling in Scenarios 1 and 2 but faltering in Scenario 3, where its results are even inferior to those of the independent learning approach, Self. This stems from the fact that the Screen scheme is effectively the All scheme with an additional screening process. In homogeneous environments (Scenario 1), where $M_1 = M_2$, this screening process boosts performance by prioritizing agents with superior performance. However, in the more complex Scenario 3, Screen tends to erroneously include high-performing agents from the simpler $M_1$, whose optimal values are counterproductive in $M_2$. As a result, Screen inadvertently hinders the learning process for $M_2$ agents by propagating Q-values that are suboptimal for their environment. This issue is clearly demonstrated in Fig. 9(c), where Screen achieves an average performance of only 0.5, suggesting that half of the agents are unable to effectively address their assigned tasks.
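The failure mode of performance-only screening can be sketched as follows. The screening rule used here (keep every other agent whose score is at least the receiving agent's score) is an assumption made for illustration, as are the agent ids, MDP assignments, and scores.

```python
def screen(target_id, scores):
    """Keep every other agent whose score is at least the target's score.

    Hypothetical rule: it looks only at performance, not at the agent's MDP.
    """
    return [i for i, s in scores.items()
            if i != target_id and s >= scores[target_id]]

# Hypothetical agent-to-MDP assignment (unknown to the server in practice).
mdp = {"a": "M1", "b": "M1", "c": "M2", "d": "M2"}
# Hypothetical evaluation scores: the M1 agents look strong because M1 is easier.
scores = {"a": 1.0, "b": 0.9, "c": 0.1, "d": 0.3}

selected = screen("c", scores)
non_peers = [i for i in selected if mdp[i] != mdp["c"]]

print(selected)   # ['a', 'b', 'd']
print(non_peers)  # ['a', 'b']
```

Because the rule ranks agents by score alone, the strong $M_1$ agents pass the screen for an $M_2$ agent even though their value functions target a different MDP.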

CAESAR demonstrates remarkable robustness across all three scenarios. Notably, in Scenario 1 (Fig. 9(a)), where $M_1$ and $M_2$ are identical and thus all agents are peers, the inclusion of the sampling process within CAESAR does not impede learning gains. This is evidenced by its performance being on par with Screen, suggesting that the additional process does not detract from learning efficiency in homogeneous environments. In scenarios where $M_1$ and $M_2$ differ, particularly in the more complex Scenario 3, CAESAR continues to show strong performance, in stark contrast to the diminishing results of Screen and All. This resilience is primarily attributed to the sampling process integral to CAESAR, which effectively filters out non-peers, thereby ensuring that agents are exposed to relevant and beneficial strategies for their specific environments. It is important to note that while Peers excels in Scenario 3, it is not practical in real applications where the agent-MDP assignments are not known, as discussed in Sec. 4.3. CAESAR stands out in practical settings where the degree of heterogeneity among environments might be unknown or unpredictable. Its consistent performance across diverse scenarios underscores its suitability as a versatile and reliable aggregation strategy for FedRL in a practical setting.
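The interplay of sampling and screening can be sketched roughly as below. This is not the update rule defined earlier in the paper: the distance measure, the fixed step size, and the function names are all hypothetical stand-ins chosen to show the idea that agents whose Q-values move closer together are treated as more likely peers, and that screening is applied only to the sampled candidates.

```python
import random

def q_distance(q_a, q_b):
    """Mean absolute difference between two flat Q-tables (lists of floats)."""
    return sum(abs(x - y) for x, y in zip(q_a, q_b)) / len(q_a)

def update_peer_prob(prob, prev_dist, curr_dist, step=0.1):
    """Raise the peer probability if the Q-tables moved closer, else lower it.

    Hypothetical rule with a fixed step; the result is clipped to [0, 1].
    """
    delta = step if curr_dist < prev_dist else -step
    return min(1.0, max(0.0, prob + delta))

def select_sources(probs, scores, target_score, rng):
    """Sample candidates by peer probability, then screen them by score."""
    sampled = [j for j, p in probs.items() if rng.random() < p]
    return [j for j in sampled if scores[j] >= target_score]

# Hypothetical round: the distance to one agent shrinks, to the other it grows.
probs = {"peer": 0.5, "non_peer": 0.5}
probs["peer"] = update_peer_prob(probs["peer"], prev_dist=0.4, curr_dist=0.2)
probs["non_peer"] = update_peer_prob(probs["non_peer"], prev_dist=0.4, curr_dist=0.6)

rng = random.Random(0)
sources = select_sources(probs, {"peer": 0.8, "non_peer": 1.0}, 0.5, rng)
```

Under these assumptions a high-scoring non-peer is sampled less and less often as its Q-values drift away from those of the receiving agent, so it rarely reaches the screening step at all.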

6. Conclusion

In this study, we have tackled the intricate challenge of training distinct policies for agents across diverse environments within the realm of Federated Reinforcement Learning. Our investigation entailed a thorough analysis of six different aggregation strategies within the FedRL paradigm.

The experiments conducted in both the customized GridWorld and FrozenLake-v1 demonstrated the efficacy of Q-value convergence as a heuristic for peer detection in FedRL. Notably, the proposed CAESAR scheme stood out for its adaptability and resilience across a spectrum of environmental heterogeneity, consistently surpassing other evaluated baselines. This adaptability makes CAESAR particularly advantageous for real-world FedRL applications, where the unique characteristics of each environment must be accommodated.

While our exploration focused on a tabular setting, future research directions include extending our methodologies to more complex and dynamic environments, especially those featuring a continuous control space. Furthermore, this work is primarily centered around agents that employ Q-value-based strategies. Acknowledging this as a limitation, another valuable direction for future research would be the incorporation of policy-based methods.
