Towards Energy-Aware Federated Learning via MARL: A Dual-Selection Approach for Model and Client

Jun Xia, University of Notre Dame, Notre Dame, IN, USA, and Yiyu Shi, University of Notre Dame, Notre Dame, IN, USA
Abstract.

Although Federated Learning (FL) is promising for knowledge sharing among heterogeneous Artificial Intelligence of Things (AIoT) devices, its training performance and energy efficiency are severely restricted in practical battery-driven scenarios by the “wooden barrel effect” caused by the mismatch between homogeneous model paradigms and heterogeneous device capabilities. As a result, given the many differences among devices, existing FL methods struggle to train effectively in energy-constrained scenarios, such as those with limited device batteries. To tackle these issues, we propose an energy-aware FL framework named DR-FL, which considers the energy constraints of both clients and heterogeneous deep learning models to enable energy-efficient FL. Unlike Vanilla FL, DR-FL adopts our proposed Multi-Agent Reinforcement Learning (MARL)-based dual-selection method, which allows participating devices to contribute to the global model effectively and adaptively according to their computing capabilities and energy capacities. Experiments on various well-known datasets show that DR-FL can not only maximize knowledge sharing among heterogeneous models under the energy constraints of large-scale AIoT systems but also improve the model performance of each involved heterogeneous device.

1. Introduction

The increasing popularity of Artificial Intelligence (AI) techniques, especially Deep Learning (DL), accelerates the evolution of the Internet of Things (IoT) toward the Artificial Intelligence of Things (AIoT), where various AIoT devices are equipped with DL models to enable accurate perception and intelligent control (Bhardwaj et al., 2020). Although AIoT systems (e.g., autonomous driving, intelligent control (Samarjit and Faruque, 2016), and healthcare systems (Wu et al., 2022; Baghersalimi et al., 2021)) play an important role in various safety-critical domains, both the limited classification capabilities of local device models and the restricted access to private local data make it hard to guarantee the training and inference performance of AIoT devices in Federated Learning (FL) (McMahan et al., 2016a), especially when they are powered by batteries and deployed in an uncertain dynamic environment (Cui et al., 2022). To speed up the training and inference of devices, more and more large-scale AIoT systems rely on cloud computing (Zhang et al., 2021b), which offers tremendous computing power and flexible device management schemes. However, such a cloud-based architecture still cannot fundamentally improve the inference accuracy of AIoT devices, since they are not allowed to transmit private local data to each other. Due to these data privacy concerns, both the training and inference performance of local models are greatly suppressed.

As a promising collaborative machine learning paradigm, FL allows local DL model training among various devices without compromising their local data privacy. Instead of sharing sensitive local data among devices, FL only needs to send the gradients or weights of local device models to a cloud server for knowledge aggregation, thus enhancing both the training and inference capability of local models. Although FL is promising in knowledge sharing, it faces the problems of both large-scale deployment and quick adaptation to dynamic environments, where local models must be trained frequently to accommodate an ever-changing world. In practice, such problems are hard to solve, since classic federated averaging methods (e.g., FedAvg) require all devices to have homogeneous local models with the same architecture.

According to the well-known “wooden barrel effect” caused by the homogeneous assumption, as shown in Figure 1, energy is wasted in Vanilla FL for two main reasons: the mismatch between computing power and the homogeneous model, and the mismatch between power consumption and the homogeneous model. The former spends device energy on waiting time, while the latter spends device energy on useless training time (the device has enough power to support training but not communication). Thus, the homogeneous model assumption strongly limits the overall energy efficiency of the entire FL system, because that efficiency is mainly determined by how much power is spent on effective model learning rather than on waiting, during which a device consumes energy without training or communicating.

Figure 1. Energy waste caused by the “wooden barrel effect” in Vanilla FL usually has two sources: the mismatch between computing power and the homogeneous model, and the mismatch between power consumption and the homogeneous model. The former spends device energy on waiting time, while the latter spends device energy on useless training time (enough power to support training but not communication).

Typically, an AIoT system involves various types of devices with different settings (i.e., computing power and remaining power). If all devices are equipped with homogeneous local models, the inference potential of devices with superior computing power will be eclipsed. Things become even worse when the devices of AIoT applications are powered by batteries. In this case, devices with less battery energy will be reluctant to participate in frequent interactions with the cloud server. Moreover, if a device runs out of power at an early stage of FL training, the global model can hardly achieve the expected inference performance, since the absence of the exhausted device from the remaining training process strongly degrades the overall inference performance of the global model. Therefore, how to fully exploit the potential of energy-constrained heterogeneous devices to enable high-performance and energy-efficient FL is becoming a major bottleneck in the design of AIoT systems.

Although various heterogeneous FL methods (e.g., HeteroFL (Diao et al., 2021), Scale-FL (Ilhan et al., 2023), PervasiveFL (Xia et al., 2022)) and energy-saving techniques (Li et al., 2020, 2019) have been investigated to address the above issue, most of them focus on either enabling effective knowledge sharing between heterogeneous models or reducing the energy consumption of devices. Because they build on coarse-grained FedAvg operations, few existing FL methods can substantially address the above challenges and quickly adapt to new environments in an energy-constrained scenario. Inspired by the concepts of BranchyNet (Teerapittayanon et al., 2016) and multi-agent reinforcement learning (Zhang et al., 2021a), in this paper we propose a novel FL framework named DR-FL, which takes both the layer-wise structure information of DL models and the remaining energy of each client into account to enable energy-efficient federated training. Unlike traditional FedAvg-based FL methods that rely on homogeneous device models, DR-FL maintains a layer-wise global model on the cloud server, while each device installs only a subset of the layer-wise model according to its computing power and remaining battery. In this way, all the heterogeneous local models can effectively contribute to the global model based on their computing capabilities and remaining energy in a MARL-based manner. Meanwhile, by adopting MARL, DR-FL can balance the trade-off between training performance and energy consumption, thus ensuring energy-efficient FL training that accommodates various energy-constrained environments. This paper makes the following three major contributions:

  • We establish a novel, lightweight cloud-based FL framework named DR-FL, which is easy to implement and uses layer-wise model aggregation to let heterogeneous DNNs on heterogeneous devices share knowledge without compromising their data privacy.

  • We propose a dual-selection approach based on MARL to control energy-efficient learning from the perspectives of both layer-wise models and participating clients, which can maximize the efficacy of the entire AIoT system.

  • Experimental results obtained from both simulation and real test-bed platforms show that, compared with various state-of-the-art approaches, DR-FL can not only achieve better inference performance within various non-IID scenarios, but also have superior scalability for large-scale AIoT systems.

The rest of this paper is organized as follows. Section 2 discusses related work on heterogeneous FL and energy-aware FL training. After giving the preliminaries of FL and multi-agent reinforcement learning in Section 3, Section 4 details our proposed DR-FL method. Section 5 presents experimental results on well-known benchmarks. Finally, Section 6 concludes the paper.

2. Related Work

Although FL is good at knowledge sharing without compromising the data privacy of devices in AIoT system design, due to the homogeneous assumption that all the involved devices should have local DL models with the same architecture, Vanilla FL methods inevitably suffer from the problems of low inference accuracy and invalid energy consumption, thus impeding the deployment of FL methods in large-scale AIoT system designs (Khan et al., 2020; Nguyen et al., 2021; Xia et al., 2022; Zhu et al., 2021), especially for non-IID scenarios.

To enable collaborative learning among heterogeneous device models, various solutions have been extensively studied, which can be primarily classified into two categories, i.e., subnetwork aggregation-based methods and knowledge distillation-based methods. The basic idea of subnetwork aggregation-based methods is to allow knowledge aggregation on top of subnetworks of local device models, which enables knowledge sharing among heterogeneous device models. For instance, Diao et al. (Diao et al., 2021) presented an effective heterogeneous FL framework named HeteroFL, which can train heterogeneous local models with varying computation complexities but still produce a single global inference model, assuming that device models are subnetworks of the global model. By integrating FL and width-adjustable slimmable neural networks, Yun et al. (Yun et al., 2023) proposed a novel learning framework named ScaleFL, which jointly utilizes superposition coding for global model aggregation and superposition training for updating local models. In (Xia et al., 2022), Xia et al. developed a novel framework named PervasiveFL, which utilizes a small uniform model (i.e., “modellet”) to enable heterogeneous FL. Although all the above heterogeneous FL methods are promising, most focus on improving inference performance. Few of them take the issues of real-time training and energy efficiency into account.

Since a large-scale FL-based AIoT application typically involves a variety of battery-powered devices, how to conduct energy-efficient FL training is becoming an important issue (Shi et al., 2021; Zhang and Tao, 2020). To address this issue, various methods have been investigated to reduce the energy consumed by FL training and device-server communication. For example, Hamdi et al. (Hamdi et al., 2022) studied the FL deployment problem in an energy-harvesting wireless network, where a certain number of users may be unable to participate in FL due to interference and energy constraints. They formalized this deployment scenario as a joint energy management and user scheduling problem over wireless systems and solved it efficiently. In (Sun et al., 2019), Sun et al. presented an online energy-aware dynamic worker scheduling policy, which can maximize the average number of workers scheduled for gradient updates under a long-term energy constraint. In (Yang et al., 2019), Yang et al. formulated energy-efficient transmission and computation resource allocation for FL over wireless communication networks as a joint learning and communication problem. To minimize system energy consumption under a latency constraint, they presented an iterative algorithm that can derive optimal solutions considering various factors (e.g., bandwidth allocation, power control, computation frequency, and learning accuracy). Although all the above energy-saving methods can effectively reduce energy consumption in both FL training and communication, few of them can guarantee the training time requirement of FL within a complex dynamic environment.

To the best of our knowledge, DR-FL is the first attempt to investigate MARL-based dual selection of both layer-wise models and participating clients to enable fine-grained heterogeneous FL, where heterogeneous devices can adaptively and efficiently contribute to the global model based on their computing capabilities and remaining energy. Compared with state-of-the-art heterogeneous FL methods, DR-FL can not only maximize knowledge sharing among various heterogeneous models under energy constraints but also significantly improve both the model performance of each involved device and the energy efficiency of the entire FL system.

3. Preliminaries

3.1. Federated Learning

With the prosperity of distributed machine learning technologies (Verbraeken et al., 2019), privacy-aware FL has been proposed to solve the problem of data silos, allowing multiple AIoT devices to share knowledge without leaking their private data. Since the physical environment is volatile (e.g., high network latency and unstable connections) in real AIoT scenarios, Vanilla FL randomly selects a number of AIoT devices in each communication round to train a homogeneous DNN model. Suppose $N$ devices are selected at the $t^{th}$ communication round in FL. After the $t^{th}$ communication round, the update process of each device model is defined as follows:

$$\mathbb{W}_{t+1}^{n} \leftarrow \mathbb{W}_{t}^{n} - \eta \nabla \mathbb{W}_{t}^{n}, \tag{1}$$

where $\mathbb{W}_{t}^{n}$ and $\mathbb{W}_{t+1}^{n}$ represent the global models at round $t$ and round $t+1$ in the $n^{th}$ device, respectively. $\eta$ indicates the learning rate and $\nabla \mathbb{W}_{t}^{n}$ is the gradient obtained by the $n^{th}$ device model after the $t^{th}$ training round. To protect data privacy, at the end of each communication round, FL uploads the weight differences (i.e., model gradients) of each device instead of the newly updated models to the cloud for aggregation. After gathering the gradients from all the participating devices, the cloud updates the parameters of the shared global model based on the FedAvg (McMahan et al., 2016b) algorithm, which is defined as follows:

$$\mathbb{W}_{t+1} \leftarrow \mathbb{W}_{t} + \frac{\sum_{n=1}^{N} \mathbb{L}_{n} \nabla \mathbb{W}_{t}^{n}}{N}, \tag{2}$$

where $\frac{\sum_{n=1}^{N} \nabla \mathbb{W}_{t}^{n}}{N}$ denotes the average gradient of the $N$ participating devices in communication round $t$, $\mathbb{W}_{t}$ and $\mathbb{W}_{t+1}$ represent the global models after the $t^{th}$ and $(t+1)^{th}$ communication rounds, respectively, and $\mathbb{L}_{n}$ denotes the training data size of device $n$. Although Vanilla FL methods (e.g., FedAvg) perform remarkably well in distributed machine learning, they cannot be directly applied to AIoT scenarios. This is because heterogeneous AIoT devices lead to different training speeds in Vanilla FL, resulting in additional energy waste, which is unacceptable for an energy-constrained system.
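The update rules in Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: models are flat lists of floats, the names `local_update` and `aggregate` are hypothetical, each device uploads its weight difference as described above, and `data_weights` plays the role of $\mathbb{L}_{n}$ and is assumed to be pre-scaled by the caller.

```python
from typing import List

Vector = List[float]


def local_update(weights: Vector, grad: Vector, lr: float) -> Vector:
    """Eq. (1): one gradient step on a device model."""
    return [w - lr * g for w, g in zip(weights, grad)]


def aggregate(global_w: Vector, deltas: List[Vector],
              data_weights: List[float]) -> Vector:
    """Eq. (2): add the L_n-weighted sum of uploaded updates, divided by N."""
    n_devices = len(deltas)
    if n_devices == 0 or n_devices != len(data_weights):
        raise ValueError("need one data weight per participating device")
    new_w = list(global_w)
    for delta, l_n in zip(deltas, data_weights):
        for i, d in enumerate(delta):
            new_w[i] += l_n * d / n_devices
    return new_w


# Hypothetical round with two devices and a two-parameter model.
global_w = [1.0, 2.0]
grads = [[0.5, 0.0], [1.5, 1.0]]
local_models = [local_update(global_w, g, lr=0.1) for g in grads]
# Each device uploads its weight difference rather than the full model.
deltas = [[lw - gw for lw, gw in zip(m, global_w)] for m in local_models]
new_global = aggregate(global_w, deltas, data_weights=[1.0, 1.0])
```

With unit data weights the result is the plain average of the two local models; other scalings of `data_weights` are a design choice left to the caller in this sketch.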

3.2. Multi-Agent Reinforcement Learning

In cooperative Multi-Agent Reinforcement Learning (MARL), a set of $N$ agents is trained to produce optimal actions that lead to maximum team rewards. Specifically, at each timestamp $t$, each agent $n$ (where $1 \leq n \leq N$) observes its state $s_{t}^{n}$ and selects an action $a_{t}^{n}$ based on $s_{t}^{n}$. After all agents have completed their actions, the team receives a joint reward $r_{t}$ and transitions to the next state $s_{t+1}^{n}$. The goal is to maximize the total expected discounted reward $R=\sum_{t=1}^{T}\gamma r_{t}$ by selecting the optimal actions for each agent, where $\gamma \in [0,1]$ is the discount factor.

Recently, QMIX (Rashid et al., 2018) has emerged as a promising solution for jointly training agents in cooperative MARL. In QMIX, each agent $n$ employs a Deep Neural Network (DNN) to infer its actions. This DNN implements the $Q$-function $Q^{\theta}(s,a)=E[R_{t} \mid s_{t}^{n}=s, a_{t}^{n}=a]$, where $\theta$ represents the parameters of the DNN, and $R_{t}=\sum_{i=t}^{T}\gamma r_{i}$ is the total discounted team reward received at $t$. During MARL execution, each agent $n$ selects the action $a^{*}$ with the highest $Q$-value (i.e., $a^{*}=\arg\max_{a}Q^{\theta}(s_{n},a)$).
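The execution-time rule above, where each agent picks the action with the highest $Q$-value, can be illustrated as follows. This is only a sketch: the per-agent $Q$-values are given as plain dictionaries instead of DNN outputs, and the action names `"skip"` and `"train"` are hypothetical placeholders.

```python
from typing import Dict, List


def greedy_actions(q_values: List[Dict[str, float]]) -> List[str]:
    """Return a* = argmax_a Q(s_n, a) independently for each agent n."""
    return [max(q, key=q.get) for q in q_values]


# One dictionary per agent, mapping each candidate action to its Q-value.
per_agent_q = [
    {"skip": 0.1, "train": 0.7},
    {"skip": 0.4, "train": 0.2},
]
chosen = greedy_actions(per_agent_q)
```

Ties are broken in favor of the first action encountered, which is a property of Python's `max` rather than anything prescribed by the text.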

To train QMIX, a replay buffer is employed to store transition tuples $(s_{t}^{n}, a_{t}^{n}, s_{t+1}^{n}, r_{t})$ for each agent $n$. The joint $Q$-function, $Q_{\text{tot}}(\cdot)$, is represented as the element-wise summation of all individual $Q$-functions (i.e., $Q_{\text{tot}}(s_{t},a_{t})=\sum_{n}Q^{\theta}_{n}(s_{t}^{n},a_{t}^{n})$), where $s_{t}=\{s_{t}^{n}\}$ and $a_{t}=\{a_{t}^{n}\}$ are the states and actions collected from all agents $n \in N$ at timestamp $t$. The agent DNNs can be recursively trained by minimizing the loss $L=E_{s_{t},a_{t},r_{t},s_{t+1}}[y_{t}-Q_{\text{tot}}(s_{t},a_{t})]^{2}$, where $y_{t}=r_{t}+\gamma\sum_{n}\max_{a}Q^{\theta'}_{n}(s_{t+1}^{n},a)$ and $\theta'$ represents the parameters of the target network, which are periodically copied from $\theta$ during the training phase.
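To make the training objective concrete, the sketch below computes the joint value, the target $y_{t}$, and the squared error for a single transition, following the additive form written above. It assumes that the per-agent $Q$-values have already been produced by the online and target networks; the function names are hypothetical and no neural network is involved.

```python
from typing import Sequence


def joint_q(chosen_q: Sequence[float]) -> float:
    """Q_tot(s_t, a_t): sum of each agent's Q-value for its chosen action."""
    return sum(chosen_q)


def td_target(reward: float, gamma: float,
              next_q_target: Sequence[Sequence[float]]) -> float:
    """y_t = r_t + gamma * sum_n max_a Q'_n(s_{t+1}^n, a), from the target network."""
    return reward + gamma * sum(max(q_n) for q_n in next_q_target)


def squared_td_error(y: float, q_tot: float) -> float:
    """Single-sample version of the loss (y_t - Q_tot)^2."""
    return (y - q_tot) ** 2


# Two agents: Q-values of the actions they took, and target-network
# Q-values for every action in their next states.
q_tot = joint_q([0.5, 1.0])
y = td_target(reward=1.0, gamma=0.9, next_q_target=[[0.2, 0.8], [1.0, 0.4]])
loss = squared_td_error(y, q_tot)
```

In practice the loss would be averaged over a mini-batch sampled from the replay buffer; the single-transition form here only shows how the terms fit together.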

4. Method

4.1. Problem Formulation

Assume that an energy-constrained FL system contains a cloud server and $N$ heterogeneous AIoT devices, represented as $D=\{D_{1},\ldots,D_{n},\ldots,D_{N}\}$. These heterogeneous AIoT devices can be classified into three categories according to their computing capability, i.e., small, middle, and large, which indicate the level of a device's computing and storage resources. In this paper, the performance of the entire FL system is significantly influenced by three key factors: running time, energy consumption, and model accuracy. Running time determines the training efficiency of the FL system in a real scenario. Energy consumption is also a significant factor, particularly for AIoT devices powered by limited energy resources. Lastly, model accuracy ensures that the system produces reliable and valuable predictions. Therefore, to optimize the overall performance of the FL system, it is crucial to strike a balance among these three factors.

Running Time Model: Considering the differences in network delay and computing resources of heterogeneous AIoT devices, the energy-constrained FL system aims to minimize the total running time $T_{all}$ among all the devices, which is given by

$$T_{all}=\max_{\forall n}T_{all}^{D_{n}}. \tag{3}$$

Let $T_{com}^{D_{n}}$ and $T_{tra}^{D_{n}}$ be the communication time of device $D_{n}$ and the training time of the layer-wise model on device $D_{n}$, respectively. Note that due to the abundant computing resources in the cloud server, its running time is negligible compared to that on devices. The total running time for each device $T_{all}^{D_{n}}$ is defined as

$$T_{all}^{D_{n}}=T_{com}^{D_{n}}+T_{tra}^{D_{n}}. \tag{4}$$

Here, the communication time for each device $T_{com}^{D_{n}}$ can be regarded as the ratio of the size of a model $S_{D_{n}}$ with different layers to the bandwidth speed $V_{net}$. Since the training time of each device $T_{tra}^{D_{n}}$ is determined by the computation capability of the local device $C_{D_{n}}$ and the training data size on the device $L_{D_{n}}$, we formalize the communication time $T_{com}^{D_{n}}$ and the training time $T_{tra}^{D_{n}}$ as

$$T_{com}^{D_{n}}=\frac{S_{D_{n}}}{V_{net}},\qquad T_{tra}^{D_{n}}=\frac{L_{D_{n}}}{C_{D_{n}}}, \tag{5}$$

where $O_{D_n}$ is reflected by the computation capability of the device $C_{D_n}$. We assume that the network transmission speed remains relatively stable.

Energy Consumption Model: The energy consumed by the overall FL system plays an important role in ensuring the system operates smoothly. The calculation of the total remaining energy can be expressed as

(6) $E_{all} = \sum_{n=1}^{N} \left( E_{remain}^{D_n} - E_{tra}^{D_n} - E_{com}^{D_n} \right).$

Note that both the training and the communication energy consumption are determined by two factors, i.e., the size of the training model and the power mode of the AIoT device. The training energy consumption $E_{tra}^{D_n}$ and the communication energy consumption $E_{com}^{D_n}$ of device $D_n$ are calculated as

(7) $E_{tra}^{D_n} = P_{train} \times T_{tra}^{D_n}, \qquad E_{com}^{D_n} = P_{com} \times T_{com}^{D_n},$

where $P_{train}$ is the energy consumption per unit training time, and $P_{com}$ is the energy consumption per unit network transmission time. Because the actual energy consumption depends on the size of the trained model, changing the model size changes the energy consumed during both training and communication. These energy dynamics must therefore be taken into account when solving the optimization model.
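The time and energy models in Equations 5 to 7 can be sketched in a few lines of Python. This is only an illustrative sketch: the `Device` class, its field names, the units in the comments, and the choice to store $P_{train}$ and $P_{com}$ per device are our assumptions and do not come from the DR-FL implementation.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical container for the per-device quantities in Eqs. 5-7."""
    model_size: float        # S_{D_n}, e.g. megabits (assumed unit)
    data_size: float         # L_{D_n}, number of local samples
    compute: float           # C_{D_n}, samples per second (assumed unit)
    remaining_energy: float  # E_remain^{D_n}, joules
    p_train: float           # P_train, watts (stored per device: assumption)
    p_com: float             # P_com, watts (stored per device: assumption)


def round_cost(dev: Device, v_net: float):
    """Return (T_com, T_tra, E_tra, E_com) for one device and one round."""
    t_com = dev.model_size / v_net      # Eq. 5, communication time
    t_tra = dev.data_size / dev.compute  # Eq. 5, training time
    e_tra = dev.p_train * t_tra          # Eq. 7, training energy
    e_com = dev.p_com * t_com            # Eq. 7, communication energy
    return t_com, t_tra, e_tra, e_com


def total_remaining_energy(devices, v_net: float) -> float:
    """Eq. 6: sum over devices of remaining minus consumed energy."""
    total = 0.0
    for dev in devices:
        _, _, e_tra, e_com = round_cost(dev, v_net)
        total += dev.remaining_energy - e_tra - e_com
    return total
```

For example, a device with `model_size=40.0`, `data_size=1000.0`, `compute=200.0`, `p_train=10.0`, and `p_com=2.0` on a link with `v_net=8.0` spends 5 time units on each of communication and training, and 50 and 10 energy units, respectively.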

Model Accuracy: In a heterogeneous scenario, how to effectively leverage the heterogeneity in heterogeneous models and devices to enhance the performance of aggregated models is an urgent issue that needs to be solved in FL. Furthermore, resource-constrained heterogeneous AIoT devices that participate in aggregation pose a considerable obstacle to the application of energy-constrained FL. Inspired by the work (Li et al., 2019) where the performance of model inference is affected by the number of successful aggregations for its device, we can deduce that the accuracy of heterogeneous models is proportional to the total number of aggregated models participating in each round. However, since devices consume energy every time they participate in each round of aggregation, how to reasonably select aggregation devices in an energy-constrained environment to improve model accuracy has become a major challenge in the design of an FL framework.

Optimization Objective: Taking energy information into account, we propose an optimization model for energy-constrained FL. The model minimizes the total running time $T_{all}$ and maximizes the model accuracy $M_{acc}$, subject to a constraint on the total energy consumption $E_{all}$. It is defined as follows:

(8) $\begin{matrix} \min T_{all}, \quad \max M_{acc}, \\ \text{s.t.}\ E_{all} \leq E, \end{matrix}$

where $E$ is the energy budget of an FL system.

Figure 2. Framework and workflow of our method.

4.2. Workflow of DR-FL

In DR-FL, heterogeneous AIoT devices and a cloud server cooperate to achieve high performance of the various layer-wise models deployed on edge devices. Before training, every device participating in DR-FL initializes and installs a layer-wise model, which consists of a subset of the layers of the global model on the cloud server. Then, the cloud server sends a part of the global model to the AIoT devices for local training. At the end of local training, DR-FL performs layer-wise model aggregation on the cloud server. Note that hot-plug AIoT devices are permissible in DR-FL, where newly involved devices only inherit the parameters of the global model on the cloud server. Figure 2 shows the workflow of DR-FL, which consists of the following five steps.

Step 1 (Battery or Model Information Upload): During the initialization step of DR-FL, each device intending to participate in FL should upload its device information to the cloud, which includes the power, computing, and storage capabilities of devices and the overclocking potential of models. This collected information is used for energy-aware dual-selection for the layer-wise model and client in subsequent steps to optimize the entire system’s energy efficiency.

Step 2 (Layer-Wise Model Aggregation): After receiving the local model gradients from the participating devices, the server performs layer-aligned averaging of these gradients (i.e., the parts of the network that are shared across models are aggregated together) and uses the previous-round global model stored on the server to construct a new global model.

Step 3 (Energy-Aware MARL-based Dual-Selection): Then, to prevent selected devices from dropping out of the FL process due to energy limitations, we design a MARL-based selector that chooses an appropriate model for each AIoT device based on its remaining energy and computing capabilities. This not only improves the efficiency of device resource usage but also ensures the devices' active participation in FL (see more details in Section 4.3). Furthermore, apart from selecting a layer-wise model for each AIoT device, the selector can also adjust the computing capability of AIoT devices, aiming to achieve a trade-off between energy consumption and computing efficiency.

Step 4 (Layer-Wise Model Dispatching): Based on an energy-aware MARL-based dual-selection strategy, the cloud server dispatches part of the global model parameters to each heterogeneous AIoT device.

Step 5 (Local Training): Based on the received global model parameters, each heterogeneous AIoT device builds an initial local model (i.e., layer-wise model), which is trained using cross-entropy loss based on local training samples to obtain the gradients of the local model for gradient upload.

DR-FL repeats all five steps above until the global model and all its local models converge.
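To make the layer-aligned averaging of Step 2 concrete, the following minimal sketch averages gradients layer by layer over only those clients that actually trained a given layer. The dictionary layout, the function name `layerwise_aggregate`, and the plain gradient-descent update with learning rate `lr` are our assumptions for illustration; the paper does not specify how the averaged gradients are applied to the previous-round global model.

```python
def layerwise_aggregate(global_model, client_updates, lr=1.0):
    """Layer-aligned averaging of client gradients (illustrative sketch).

    global_model:   dict mapping layer name -> list of float parameters
    client_updates: list of dicts; each holds gradients only for the
                    layers contained in that client's layer-wise model
    lr:             step size for applying averaged gradients (assumption)
    """
    new_global = {}
    for name, weights in global_model.items():
        # Collect gradients from the clients whose model includes this layer.
        grads = [update[name] for update in client_updates if name in update]
        if not grads:
            # No client trained this layer: keep previous-round parameters.
            new_global[name] = list(weights)
            continue
        # Element-wise mean across the contributing clients.
        avg = [sum(vals) / len(grads) for vals in zip(*grads)]
        new_global[name] = [w - lr * g for w, g in zip(weights, avg)]
    return new_global
```

A client running a shallow layer-wise model therefore contributes only to the early blocks, while deeper blocks are updated solely by the clients that hold them.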

4.3. Dual-Selection for Local Model and Client

Figure 3. Maximum Q Value Guided Dual-selection. There are two networks here, i.e., the model selection network and the device evaluation network. The model selection network takes the value $O$ observed by the agent from the environment and the action set $A_{t-1}$ of the previous round, and outputs the latest action and its corresponding Q value. The device evaluation network obtains the Q values of all devices and then uses the hybrid network to combine all Q values and the current state $S_t$ through a two-layer weight matrix into an overall Q value $Q_{tot}$. The network then uses the discounted rewards given by the environment for MARL, so that the agents can obtain their own rewards from the environment. $h$ denotes the MLP for extracting deep representations of states or actions. $|\cdot|$ denotes the dot product.

4.3.1. MARL Training Process:

In DR-FL, each device uses an energy-aware MARL-based dual-selection method to select the participating devices and the layers of the corresponding local model running on each device. To better capture connections between long-term/short-term rewards and strategies, each MARL agent is designed with two Multi-Layer Perceptrons (MLPs) and a Gated Recurrent Unit (GRU) (Cho et al., 2014), as shown in Figure 3. During the training procedure of MARL, each agent acquires its current state $S_t$ and selects an action $a_t^n$ for each client. Based on both client selection and layer-wise model considerations, the central server computes team rewards by considering the validation accuracy improvement of the global model $M_{acc}$, the total runtime $T_{all}$, the computation capabilities $C$, and the remaining energy of each device $E_{all}$. The MARL agents are then trained with the QMIX algorithm (Rashid et al., 2018) to maximize the system rewards (see the design details in Section 4.3.4).
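For reference, the mixing step of QMIX in its standard form (Rashid et al., 2018) can be written as follows. The symbols $W_1$, $W_2$, $b_1$, $b_2$ and the hypernetworks $f_1$, $f_2$ are our notation, introduced only for illustration; the exact mixer used in DR-FL may differ in its details.

```latex
\begin{aligned}
Q_{tot} &= W_2^{\top}\,\mathrm{ELU}\!\left(W_1^{\top}\mathbf{Q} + b_1\right) + b_2,
\qquad \mathbf{Q} = [Q_1, \ldots, Q_N]^{\top}, \\
W_1 &= \mathrm{abs}\!\left(f_1(S_t)\right), \quad
W_2 = \mathrm{abs}\!\left(f_2(S_t)\right)
\;\;\Longrightarrow\;\;
\frac{\partial Q_{tot}}{\partial Q_n} \ge 0 \quad \forall n .
\end{aligned}
```

Because the mixing weights are generated from the global state $S_t$ and forced to be non-negative, $Q_{tot}$ is monotone in every per-agent $Q_n$, so each agent maximizing its own Q value is consistent with maximizing $Q_{tot}$.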

4.3.2. MARL Agent State Design:

The state of each MARL agent $D_n$ comprises three components: the remaining energy $E_{all}^{D_n}$, the computation capability in each communication round $C_{D_n}$, and the size of the local training dataset $L_{D_n}$. At each training round $t$, each agent first conducts the training procedure and transmits its gradients to the central server. Furthermore, to estimate the current training and communication delays at client device $n$, each MARL agent keeps a record of the training latency $T_{tra}^{D_n}$ and the communication latency $T_{com}^{D_n}$, which denote the latency of local training and of model uploading for agent $n$ during communication round $t$. As shown in Figure 3, the parameter $\tau$ represents the trajectory of historical data from training, and $h$ represents the MLP layer for knowledge extraction. Moreover, each MARL agent $n$ also calculates the energy consumption of training and communication based on Equation 7. This inclusion is crucial because these energy costs contribute to the overall energy cost, while the remaining energy of the agent influences both training latency and model accuracy. The state vector $s_t^n$ of agent $n$ in communication round $t$ is defined as:

(9) $s_t^n = [L_t^n, C_{D_n}, E_{D_n}, t].$

Finally, to decrease storage overhead and accelerate the speed of agent convergence, all MLPs and GRUs within the MARL agents share their weights.

4.3.3. Agent Action Design:

Given the input state shown in Equation 9, each MARL agent $n$ determines which layers of the local model should be used for the local training process on each device. Specifically, the MARL agent generates $Q$ values for the current action set $[a^0, \ldots, a^M]$, where $M$ represents the number of model selections available to the client. Note that when the selected action is zero, the client device runs the first model, and when the selected action is $M$, the client does not participate in FL. After a layer-wise model has been selected for each heterogeneous device, the Q values obtained by the agents are ranked, and the Top-K algorithm selects the devices with the highest Q values to participate in the FL process.
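The two-stage selection described above (per-device model choice followed by Top-K device choice) can be sketched as follows. The function name `dual_select` and the purely greedy argmax are our assumptions; during MARL training an exploration strategy such as epsilon-greedy would typically be used instead, and the paper does not detail this.

```python
def dual_select(q_values, k):
    """Pick a layer-wise model per device, then keep the Top-K devices.

    q_values: list over devices; entry n is a list of M+1 Q values for
              actions a^0..a^M, where the last action (index M) means
              "do not participate in this round".
    k:        number of devices to keep.
    Returns a dict {device_index: chosen_model_index}.
    """
    candidates = []
    for dev_id, q in enumerate(q_values):
        # Greedy action: index of the largest Q value (assumption).
        action = max(range(len(q)), key=q.__getitem__)
        if action == len(q) - 1:
            continue  # the agent chose not to participate
        candidates.append((q[action], dev_id, action))
    # Top-K: keep the k devices whose chosen action has the highest Q value.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return {dev_id: action for _, dev_id, action in candidates[:k]}
```

With four devices and `k=2`, a device whose best action is the opt-out action is dropped first, and the two remaining devices with the largest Q values are dispatched their chosen layer-wise models.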

4.3.4. Reward Function Design:

To optimize the objective described in Equation 8, the reward function should reflect the changes in the model accuracy, processing latency (training, communication, and waiting latency), and processing energy consumption after executing the dual-selection strategy generated by the MARL agents. The reward $r_t$ at training round $t$ is defined as follows:

(10) $r_t = w_1 \cdot (M_{Acc}^{t} - M_{Acc}^{t-1}) - w_2 \cdot (E_{all}^{t-1} - E_{all}^{t}) - w_3 \cdot \max_{1 \leq n \leq N} T_{all}^{t,n}.$

Here, $\max_{1 \leq n \leq N} T_{all}^{t,n}$ represents the total time needed for local training of all selected devices. The MARL agents use the evaluation accuracy, calculated on a small validation dataset on the cloud server, to select the layer-wise model that will be dispatched to each local device; the devices then continue local training and upload their model updates. Moreover, $w_1$, $w_2$, and $w_3$ (we used $w_1 = 1000$, $w_2 = 0.01$, and $w_3 = 1$ in our experiments) are normalization ratios that keep the individual reward terms on a comparable scale within the overall reward. $E_{all}^{t}$ is the total remaining energy of the $t^{th}$ communication round as defined in Equation 6. The MARL agents are trained using QMIX as described in Figure 3.
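Equation 10 translates directly into code. In the sketch below, the function and argument names are ours, the default weights are the values reported for the experiments, and treating accuracy as a fraction in $[0, 1]$ is an assumption, since the paper does not state the unit.

```python
def reward(acc_t, acc_prev, e_all_t, e_all_prev, round_times,
           w1=1000.0, w2=0.01, w3=1.0):
    """Team reward r_t of Eq. 10 (illustrative sketch).

    acc_t, acc_prev:     validation accuracy after rounds t and t-1
                         (as fractions in [0, 1]: assumption)
    e_all_t, e_all_prev: total remaining energy after rounds t and t-1
    round_times:         per-device running times T_all^{t,n} of round t
    """
    acc_gain = acc_t - acc_prev         # rewarded
    energy_used = e_all_prev - e_all_t  # penalized
    slowest = max(round_times)          # the straggler sets the round time
    return w1 * acc_gain - w2 * energy_used - w3 * slowest
```

For instance, a round that raises accuracy from 0.50 to 0.52, consumes 1,000 J in total, and whose slowest device needs 12 s yields a reward of about $20 - 10 - 12 = -2$, which shows how the three terms end up on a comparable scale with these weights.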

Table 1. Test accuracy (%) comparison for different models and dataset settings under specific energy constraints with 40 clients.
Methods: HeteroFL (Diao et al., 2021), ScaleFL (Ilhan et al., 2023), and DR-FL (Ours). Each column gives the method and the Dirichlet parameter $\alpha$ of the data distribution.

| Dataset | Model | HeteroFL, $\alpha$=0.1 | HeteroFL, $\alpha$=0.5 | HeteroFL, $\alpha$=1.0 | ScaleFL, $\alpha$=0.1 | ScaleFL, $\alpha$=0.5 | ScaleFL, $\alpha$=1.0 | DR-FL, $\alpha$=0.1 | DR-FL, $\alpha$=0.5 | DR-FL, $\alpha$=1.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR10 | Model_1 | 30.46 ± 1.10 | 46.11 ± 3.32 | 65.23 ± 1.45 | 29.25 ± 1.17 | 54.44 ± 0.87 | 58.15 ± 4.32 | 58.69 ± 0.73 | 59.01 ± 0.85 | 76.46 ± 0.12 |
| CIFAR10 | Model_2 | 48.41 ± 1.24 | 62.55 ± 3.45 | 62.10 ± 3.24 | 41.66 ± 5.43 | 55.46 ± 3.87 | 71.48 ± 1.23 | 65.31 ± 1.54 | 75.93 ± 0.62 | 77.43 ± 2.77 |
| CIFAR10 | Model_3 | 34.85 ± 5.79 | 65.01 ± 1.79 | 74.78 ± 2.76 | 39.92 ± 2.75 | 60.07 ± 0.68 | 70.83 ± 1.43 | 72.71 ± 0.58 | 70.64 ± 1.40 | 71.54 ± 1.54 |
| CIFAR10 | Model_4 | 45.26 ± 3.68 | 69.65 ± 2.99 | 75.14 ± 1.13 | 46.59 ± 3.43 | 70.60 ± 4.54 | 73.90 ± 1.17 | 70.76 ± 1.30 | 69.37 ± 0.45 | 72.27 ± 1.73 |
| CIFAR100 | Model_1 | 11.86 ± 0.78 | 22.56 ± 2.13 | 25.66 ± 1.13 | 13.14 ± 1.96 | 21.39 ± 1.59 | 17.58 ± 0.43 | 26.25 ± 0.23 | 33.59 ± 3.32 | 39.65 ± 1.35 |
| CIFAR100 | Model_2 | 16.33 ± 3.34 | 25.98 ± 1.72 | 28.68 ± 0.57 | 12.67 ± 2.13 | 28.77 ± 4.33 | 29.84 ± 1.39 | 17.83 ± 0.75 | 39.50 ± 1.08 | 33.55 ± 0.45 |
| CIFAR100 | Model_3 | 14.18 ± 0.29 | 31.99 ± 0.53 | 31.31 ± 3.34 | 17.12 ± 2.88 | 30.04 ± 1.91 | 33.92 ± 2.34 | 26.46 ± 0.24 | 32.10 ± 1.12 | 33.40 ± 0.13 |
| CIFAR100 | Model_4 | 15.66 ± 0.78 | 29.33 ± 0.85 | 35.44 ± 1.54 | 19.24 ± 1.22 | 30.29 ± 1.03 | 33.23 ± 1.32 | 22.55 ± 0.73 | 32.55 ± 1.45 | 33.80 ± 1.25 |
| SVHN | Model_1 | 60.08 ± 3.23 | 46.02 ± 3.32 | 60.38 ± 1.39 | 47.90 ± 0.53 | 85.79 ± 2.22 | 88.91 ± 1.11 | 67.19 ± 0.32 | 91.58 ± 0.21 | 68.78 ± 1.33 |
| SVHN | Model_2 | 65.11 ± 4.32 | 54.83 ± 1.28 | 68.90 ± 2.87 | 50.26 ± 2.21 | 86.82 ± 2.51 | 85.16 ± 4.13 | 79.86 ± 0.87 | 85.30 ± 1.19 | 91.72 ± 0.94 |
| SVHN | Model_3 | 65.93 ± 4.56 | 69.20 ± 4.19 | 75.97 ± 1.84 | 76.73 ± 2.23 | 84.91 ± 0.68 | 88.70 ± 3.25 | 91.47 ± 0.17 | 88.61 ± 1.72 | 93.45 ± 0.37 |
| SVHN | Model_4 | 66.31 ± 3.09 | 71.34 ± 0.79 | 76.14 ± 1.90 | 55.27 ± 3.23 | 86.10 ± 3.56 | 92.47 ± 0.51 | 91.11 ± 1.32 | 89.26 ± 0.75 | 92.78 ± 0.54 |
| Fashion-MNIST | Model_1 | 45.06 ± 2.01 | 85.58 ± 1.31 | 87.00 ± 1.93 | 53.78 ± 0.98 | 74.26 ± 2.34 | 87.29 ± 0.93 | 80.15 ± 0.23 | 82.25 ± 0.19 | 87.10 ± 0.37 |
| Fashion-MNIST | Model_2 | 59.76 ± 0.46 | 85.75 ± 0.63 | 88.60 ± 0.34 | 57.19 ± 3.13 | 85.32 ± 2.51 | 87.44 ± 0.55 | 82.10 ± 0.39 | 88.76 ± 0.23 | 85.22 ± 0.34 |
| Fashion-MNIST | Model_3 | 57.25 ± 0.98 | 83.26 ± 3.27 | 87.75 ± 1.25 | 62.26 ± 1.34 | 87.69 ± 1.07 | 88.47 ± 0.97 | 86.88 ± 0.23 | 89.34 ± 0.62 | 90.52 ± 0.13 |
| Fashion-MNIST | Model_4 | 56.32 ± 4.07 | 87.82 ± 1.28 | 87.83 ± 0.56 | 55.85 ± 1.51 | 86.78 ± 3.27 | 88.40 ± 0.69 | 85.80 ± 0.17 | 89.36 ± 0.11 | 89.60 ± 0.29 |

5. Experimental Results

To evaluate the performance of our proposed method, we implemented the DR-FL algorithm using PyTorch (version 1.4.0). Similar to FedAvg, we assume that only 10% of AIoT devices are involved in each round of FL communication during the training period. For DR-FL and the other heterogeneous FL methods, we set the mini-batch size to 32. The number of local training epochs and the initial learning rate were 5 and 0.05, respectively. To simulate a variety of energy-constrained scenarios, we assume that each device is powered by a battery with a maximum capacity of 7,560 joules. In other words, each battery capacity is 1500 mA at a rated voltage of 5.04 V. We conducted comprehensive experiments to answer the following four Research Questions (RQs).

RQ1: (Superiority of DR-FL): What advantages can DR-FL achieve compared with state-of-the-art heterogeneous FL methods?

RQ2: (Benefits of MARL-based Dual-Selection): What benefits does MARL-based Dual-Selection provide during DR-FL learning, especially under constraints such as device energy and overall training time, compared with other SOTA heterogeneous FL methods?

RQ3: (Scalability of DR-FL): How does the number of AIoT devices participating in knowledge sharing affect the performance of DR-FL?

RQ4: (Exploration of the Validation Data Ratio): How does the proportion of validation data in MARL affect the performance of DR-FL?

5.1. Experimental Settings

5.1.1. Model Settings

We compared our DR-FL method with two typical state-of-the-art heterogeneous FL methods, i.e., HeteroFL (Diao et al., 2021) and ScaleFL (Ilhan et al., 2023), which belong to subnetwork aggregation-based methods and knowledge distillation-based methods, respectively. We set the ResNet-18 model (He et al., 2015) as the backbone, where each block of the ResNet-18 model is followed by a new pair of the bottleneck and classifier, thus forming four new heterogeneous layer-wise models to simulate four types of heterogeneous models (i.e., Models 1-4 shown in Table 1). Note that each layer-wise model can be reused with the same backbone for the purpose of model inference.

5.1.2. Dataset Settings

To evaluate the effectiveness of DR-FL, we considered four training datasets, i.e., CIFAR10, CIFAR100 (Krizhevsky, 2009), Street View House Numbers (SVHN) (Netzer et al., 2011), and Fashion-MNIST (Xiao et al., 2017). CIFAR10: The CIFAR10 dataset consists of 60,000 $32 \times 32$ colour images across ten classes, with 6,000 images per class. The dataset is split into 50,000 training images and 10,000 testing images. CIFAR100: The CIFAR100 dataset is similar to CIFAR10 but contains 100 classes instead of 10, with 600 images per class. The dataset also comprises 50,000 training images and 10,000 testing images. SVHN: The SVHN dataset is a real-world image dataset derived from house numbers in Google Street View images. It contains over 600,000 labelled digit images, where each image is a $32 \times 32$ colour image representing a single digit (0-9). Fashion-MNIST: The Fashion-MNIST dataset is a dataset of Zalando's article images, consisting of 70,000 $28 \times 28$ grayscale images of 10 different fashion categories. In subsequent experiments, we investigated three non-Independent and Identically Distributed (non-IID) distributions for each dataset. Similar to HeteroFL (Diao et al., 2021), we constructed non-IID local training datasets using heterogeneous data splits following a Dirichlet distribution controlled by a variable $\alpha$. Typically, a smaller value of $\alpha$ corresponds to a higher degree of non-IID skew. Meanwhile, we used the same data augmentation techniques as HeteroFL (Diao et al., 2021) to fully utilize the natural image datasets. To enable MARL training on the cloud server in DR-FL, we used 4% of the overall training data as the validation set on the server. Note that the validation set on the server does not overlap with the local training datasets hosted by AIoT devices.
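A common way to build such Dirichlet-based non-IID splits is to draw, for every class, a vector of client proportions from $\mathrm{Dir}(\alpha)$ and cut that class's samples accordingly. The sketch below follows this recipe using only the Python standard library; it is not necessarily the exact procedure used in DR-FL or HeteroFL, and the function name and seed handling are our assumptions.

```python
import random
from collections import defaultdict


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet weights.

    labels: list of class labels, one per sample
    Returns a list of index lists, one per client. A smaller alpha gives
    more skewed (more strongly non-IID) client datasets.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet(alpha) sample is a normalized vector of Gamma draws.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        if total == 0.0:  # guard against numerical underflow for tiny alpha
            gammas, total = [1.0] * num_clients, float(num_clients)
        start, cum = 0, 0.0
        for c in range(num_clients):
            cum += gammas[c] / total
            # The last client takes whatever remains, so no sample is lost.
            end = len(idxs) if c == num_clients - 1 else round(cum * len(idxs))
            end = min(max(end, start), len(idxs))
            clients[c].extend(idxs[start:end])
            start = end
    return clients
```

Calling `dirichlet_partition(labels, 40, 0.1)` would correspond to the most skewed of the three settings considered here, with 40 clients.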

Figure 4. Real test-bed platform for our experiment: (a) AIoT devices; (b) the server.

5.1.3. Test-bed Settings

Besides simulation-based evaluation, we constructed a physical test-bed platform as shown in Figure 4 to check the performance of our DR-FL in a real-world environment. The test-bed consists of four parts: i) the cloud server that is built on top of an Ubuntu workstation equipped with an Intel i9 CPU, 32G memory, and a GTX3090 GPU; ii) the Jetson Nano boards, where each of them has a quad-core ARM A57 CPU, a 128-core NVIDIA Maxwell GPU, and 4GB LPDDR4 RAM; iii) the Jetson AGX Xavier boards, where each of them is equipped with an 8-core CPU and a 512-core Volta GPU; and iv) an HP 9800 power meter (see the top-left part in Figure 4(a)) produced by Shenzhen HOPI Electronic Technology Ltd. Note that, along with the federated training process, we used the power meter to record the energy consumption of all the AIoT devices every second for the MARL environment construction.

5.2. Accuracy Comparison (RQ1)

To evaluate the effectiveness of our proposed DR-FL, Table 1 presents the best test accuracy of HeteroFL, ScaleFL, and DR-FL under specific energy constraints along the FL processes on the four datasets, assuming that all device batteries are initially full. For each combination of dataset and FL method, we considered three kinds of data distributions for all local AIoT devices, where the non-IID settings follow Dirichlet distributions controlled by $\alpha$. Note that the baseline approaches (HeteroFL and ScaleFL) do not consider energy constraints in their FL procedures. To make a fair comparison, we added a greedy energy-aware algorithm to the two baselines in this experiment: model selection picks the largest model that the device can still train for FL. The experiments were repeated five times to calculate the mean and variance.
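One possible reading of this greedy energy-aware rule is sketched below: each device picks the largest model whose estimated per-round energy cost still fits within its remaining battery energy. The function name, the cost list, and the convention of returning `None` when no model is affordable are our assumptions.

```python
def greedy_model_choice(remaining_energy, model_costs):
    """Return the index of the largest affordable model, or None.

    remaining_energy: energy left in the device battery (joules)
    model_costs:      estimated per-round energy cost (training plus
                      communication) of each layer-wise model, ordered
                      from the smallest model to the largest
    """
    # Scan from the largest model down to the smallest.
    for idx in range(len(model_costs) - 1, -1, -1):
        if model_costs[idx] <= remaining_energy:
            return idx
    return None  # the device cannot afford any model and sits out
```

Unlike the MARL-based selector, this rule looks only at the current round, so a device may exhaust its battery early by repeatedly training the largest model.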

From Table 1, it is evident that, under the restricted battery energy set for each device, DR-FL exhibits superior inference performance, achieving the best result in 29 out of the 36 evaluated scenarios in comparison with the baseline algorithms. Specifically, regardless of the dataset, our method outperforms the baseline algorithms in the scenario of $\alpha = 0.1$. Moreover, the performance of some models at $\alpha = 0.1$ in DR-FL exceeds the performance of the two baselines at $\alpha = 0.5$. For example, in the non-IID scenario of SVHN with $\alpha = 0.1$, the test accuracy of DR-FL on Model_3 reaches 91.47%, while HeteroFL only attains 65.93% and ScaleFL only gets 76.73%. This is because our MARL-based dual-selection method can efficiently utilize the available energy of devices by assigning to participating devices the specific layer-wise models that are more suitable for heterogeneous federated learning.

5.3. Comparison of Energy Consumption (RQ2)

To validate DR-FL in terms of energy consumption and running time, we conducted an experiment involving 40 devices (i.e., 20 Jetson Nano boards and 20 Jetson AGX Xavier boards). Figure 5 compares the variation of the total remaining energy and the running time of federated training under HeteroFL and DR-FL, respectively (ScaleFL has the same energy consumption and running time as HeteroFL under the greedy algorithm). In each subfigure, we use the notation $X\_Y$ to denote the total result of all devices of type $Y$ using method $X$. If $Y$ is omitted, $X$ denotes the total result over all devices. For example, in Figure 5(a), the legend DR-FL denotes the overall remaining energy of all 40 devices, while DR-FL_Nano represents the overall remaining energy of the 20 Jetson Nano boards.
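
The legend notation can be made concrete with a short aggregation sketch. The readings below are hypothetical placeholders rather than measured values, and the helper `aggregate_remaining_energy` and the label `DR-FL_Xavier` are assumed names introduced only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_remaining_energy(
        method: str,
        devices: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum remaining energy per legend entry.

    devices yields (device_type, remaining_energy) pairs. The key
    "<method>" holds the total over all devices, and "<method>_<type>"
    holds the total over the devices of one type.
    """
    totals: Dict[str, float] = defaultdict(float)
    for device_type, energy in devices:
        totals[method] += energy
        totals[f"{method}_{device_type}"] += energy
    return dict(totals)


# Hypothetical remaining-energy readings (arbitrary units).
readings = [("Nano", 3.0), ("Nano", 2.0), ("Xavier", 5.0)]
print(aggregate_remaining_energy("DR-FL", readings))
# {'DR-FL': 10.0, 'DR-FL_Nano': 5.0, 'DR-FL_Xavier': 5.0}
```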

Figure 5. Comparison of total energy consumption and running time: (a) total energy variation; (b) total running time.

Figure 5(a) shows that our method completes more training rounds under the same energy constraints, leading to better overall test accuracy and energy efficacy. For example, with HeteroFL the Jetson AGX Xavier-based devices ran out of battery in the 12th round, whereas with DR-FL they ran out of battery in the 18th round. Moreover, Figure 5(b) shows a clear inflection point at the 12th round for HeteroFL, after which only Jetson Nano-based devices take part in federated training. For DR-FL, the inflection point appears at the 15th round, indicating that the MARL algorithm effectively limits the energy wasted by devices by reducing useless waiting and training time.

5.4. Scalability Analysis (RQ3)

Figure 6 compares the test accuracy of the three methods (i.e., HeteroFL, ScaleFL, and DR-FL) in various non-IID scenarios with different numbers of devices under specific energy constraints. The figure shows that as more heterogeneous devices participate in FL, the advantage of DR-FL over the other two methods becomes more pronounced. For example, in the non-IID scenario of CIFAR10 (with $\alpha = 0.1$), DR-FL consistently achieves higher test accuracy than ScaleFL and HeteroFL.

Figure 6. Learning curves of DR-FL and other baselines in AIoT systems with different numbers of devices under limited energy constraints (panels (a)-(f)).

5.5. Ablation Study (RQ4)

To explore the role of the validation set proportion in our method, we ran experiments with validation sets of different proportions (1%-10%), using the non-IID CIFAR10 setting ($\alpha = 0.1$) as the exploration scenario. Table 2 shows that the overall test accuracy first rises as the validation set grows and then generally declines once the proportion exceeds 4%. This indicates that the validation set proportion can serve as an effective tuning knob for exploring the trade-off between the amount of cloud validation data and the overall DR-FL performance. A validation data ratio of 4% provided a reasonable balance, so we used 4% in all experiments.

Table 2. Average model accuracy with different percentages of the validation dataset
Percentage 1% 2% 3% 4% 5% 6% 7% 8% 9% 10%
Accuracy (%) 57.72 63.23 64.35 65.04 63.16 59.18 58.86 52.21 54.9975 55.69
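
The choice of 4% corresponds to taking the best entry of Table 2. As a small illustration, the snippet below copies the table values as reported and picks the percentage with the highest accuracy; the variable names are our own.

```python
# Accuracy values copied from Table 2, keyed by validation-set percentage.
accuracy_by_percentage = {
    1: 57.72, 2: 63.23, 3: 64.35, 4: 65.04, 5: 63.16,
    6: 59.18, 7: 58.86, 8: 52.21, 9: 54.9975, 10: 55.69,
}

# Pick the percentage whose reported accuracy is highest.
best_percentage = max(accuracy_by_percentage, key=accuracy_by_percentage.get)
print(best_percentage, accuracy_by_percentage[best_percentage])  # 4 65.04
```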

6. Conclusion

Federated Learning (FL) is expected to enable privacy-preserving collaborative learning among Artificial Intelligence of Things (AIoT) devices. However, due to various heterogeneous settings (e.g., non-IID data and device models with different architectures) and device resource constraints (e.g., computing power and energy capacity), existing FL-based AIoT designs suffer greatly from low inference accuracy, rapid battery consumption, and long training time. To address these issues, this paper introduces a novel FL framework, DR-FL, that enables efficient knowledge sharing between heterogeneous devices under specific energy constraints. Based on our proposed layer-wise aggregation method and MARL-based dual-selection mechanism, AIoT devices with different computational and energy capabilities can adaptively select appropriate local models to participate in global model training, where devices can effectively learn from each other through the appropriate parts of different layer-wise models. Comprehensive experiments on well-known datasets demonstrate the effectiveness of DR-FL in terms of inference performance, energy consumption, and scalability.

References

  • Baghersalimi et al. (2021) Saleh Baghersalimi, Tomás Teijeiro, David Atienza Alonso, and Amir Aminifar. 2021. Personalized Real-Time Federated Learning for Epileptic Seizure Detection. IEEE Journal of Biomedical and Health Informatics 26 (2021), 898–909. https://api.semanticscholar.org/CorpusID:235786959
  • Bhardwaj et al. (2020) Kartikeya Bhardwaj, Wei Chen, and Radu Marculescu. 2020. INVITED: New Directions in Distributed Deep Learning: Bringing the Network at Forefront of IoT Design. Proceedings of 57th ACM/IEEE Design Automation Conference (DAC) (2020), 1–6. https://api.semanticscholar.org/CorpusID:221293302
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of Conference on Empirical Methods in Natural Language Processing.
  • Cui et al. (2022) Yangguang Cui, Kun Cao, Junlong Zhou, and Tongquan Wei. 2022. HELCFL: High-Efficiency and Low-Cost Federated Learning in Heterogeneous Mobile-Edge Computing. 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE) (2022), 1227–1232. https://api.semanticscholar.org/CorpusID:248922002
  • Diao et al. (2021) Enmao Diao, Jie Ding, and Vahid Tarokh. 2021. HeteroFL: Computation and communication efficient federated learning for heterogeneous clients. In Proceedings of International Conference on Learning Representations (ICLR).
  • Hamdi et al. (2022) Rami Hamdi, Mingzhe Chen, Ahmed Ben Said, Marwa Qaraqe, and H. Vincent Poor. 2022. Federated Learning Over Energy Harvesting Wireless Networks. IEEE Internet of Things Journal 9, 1 (2022), 92–103.
  • He et al. (2015) Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
  • Ilhan et al. (2023) Fatih Ilhan, Gong Su, and Ling Liu. 2023. ScaleFL: Resource-Adaptive Federated Learning with Heterogeneous Clients. In Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Khan et al. (2020) Latif Ullah Khan, Walid Saad, Zhu Han, Ekram Hossain, and Choong Seon Hong. 2020. Federated Learning for Internet of Things: Recent Advances, Taxonomy, and Open Challenges. IEEE Communications Surveys & Tutorials 23 (2020), 1759–1799. https://api.semanticscholar.org/CorpusID:221970627
  • Krizhevsky (2009) Alex Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images. https://api.semanticscholar.org/CorpusID:18268744
  • Li et al. (2020) Liang Li, Dian Shi, Ronghui Hou, Hui Li, Miao Pan, and Zhu Han. 2020. To Talk or to Work: Flexible Communication Compression for Energy Efficient Federated Learning over Heterogeneous Mobile Edge Devices. IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, 1–10. https://api.semanticscholar.org/CorpusID:229349304
  • Li et al. (2019) Li Li, Haoyi Xiong, Zhishan Guo, Jun Wang, and Chengzhong Xu. 2019. SmartPC: Hierarchical Pace Control in Real-Time Federated Learning System. 2019 IEEE Real-Time Systems Symposium (RTSS) (2019), 406–418. https://api.semanticscholar.org/CorpusID:203582658
  • McMahan et al. (2016a) H. B. McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2016a. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of International Conference on Artificial Intelligence and Statistics. https://api.semanticscholar.org/CorpusID:14955348
  • McMahan et al. (2016b) H. B. McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2016b. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of International Conference on Artificial Intelligence and Statistics.
  • Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, A. Bissacco, Bo Wu, and A. Ng. 2011. Reading Digits in Natural Images with Unsupervised Feature Learning. https://api.semanticscholar.org/CorpusID:16852518
  • Nguyen et al. (2021) Dinh C. Nguyen, Ming Ding, Pubudu N. Pathirana, Aruna Prasad Seneviratne, Jun Li, and H. Vincent Poor. 2021. Federated Learning for Internet of Things: A Comprehensive Survey. IEEE Communications Surveys & Tutorials 23 (2021), 1622–1658. https://api.semanticscholar.org/CorpusID:233289549
  • Rashid et al. (2018) Tabish Rashid, Mikayel Samvelyan, C. S. D. Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. 2018. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. ArXiv abs/1803.11485 (2018). https://api.semanticscholar.org/CorpusID:4533648
  • Samarjit and Faruque (2016) Samarjit and Al Faruque. 2016. Automotive Cyber-Physical Systems: A Tutorial Introduction. https://api.semanticscholar.org/CorpusID:247235211
  • Shi et al. (2021) Dian Shi, Liang Li, Rui Chen, Pavana Prakash, Miao Pan, and Yuguang Fan. 2021. Toward Energy-Efficient Federated Learning Over 5G+ Mobile Devices. IEEE Wireless Communications 29 (2021), 44–51. https://api.semanticscholar.org/CorpusID:231592874
  • Sun et al. (2019) Yuxuan Sun, Sheng Zhou, and Deniz Gündüz. 2019. Energy-Aware Analog Aggregation for Federated Learning with Redundant Data. In ICC 2020 - 2020 IEEE International Conference on Communications (ICC). 1–7. https://api.semanticscholar.org/CorpusID:207869996
  • Teerapittayanon et al. (2016) Surat Teerapittayanon, Bradley McDanel, and H. T. Kung. 2016. BranchyNet: Fast inference via early exiting from deep neural networks. In Proceedings of 23rd International Conference on Pattern Recognition (ICPR). 2464–2469.
  • Verbraeken et al. (2019) Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S. Rellermeyer. 2019. A Survey on Distributed Machine Learning. ACM Computing Surveys (CSUR) 53 (2019), 1–33. https://api.semanticscholar.org/CorpusID:209439571
  • Wu et al. (2022) Yawen Wu, Dewen Zeng, Zhepeng Wang, Yi Sheng, Lei Yang, Alaina J. James, Yiyu Shi, and Jingtong Hu. 2022. Federated Contrastive Learning for Dermatological Disease Diagnosis via On-device Learning. ArXiv abs/2202.07470 (2022). https://api.semanticscholar.org/CorpusID:245446614
  • Xia et al. (2022) Jun Xia, Tian Liu, Zhiwei Ling, Ting Wang, Xin Fu, and Mingsong Chen. 2022. PervasiveFL: Pervasive Federated Learning for Heterogeneous IoT Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 11 (2022), 4100–4111.
  • Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. ArXiv:1708.07747 (2017).
  • Yang et al. (2019) Zhaohui Yang, Mingzhe Chen, Walid Saad, Choong Seon Hong, and Mohammad R. Shikh-Bahaei. 2019. Energy Efficient Federated Learning Over Wireless Communication Networks. IEEE Transactions on Wireless Communications 20 (2019), 1935–1949. https://api.semanticscholar.org/CorpusID:207880723
  • Yun et al. (2023) Won Joon Yun, Yunseok Kwak, Hankyul Baek, Soyi Jung, Mingyue Ji, Mehdi Bennis, Jihong Park, and Joongheon Kim. 2023. SlimFL: Federated Learning With Superposition Coding Over Slimmable Neural Networks. IEEE/ACM Transactions on Networking (TON) 31, 6 (2023), 2499–2514.
  • Zhang and Tao (2020) Jing Zhang and Dacheng Tao. 2020. Empowering Things With Intelligence: A Survey of the Progress, Challenges, and Opportunities in Artificial Intelligence of Things. IEEE Internet of Things Journal 8 (2020), 7789–7817. https://api.semanticscholar.org/CorpusID:226975900
  • Zhang et al. (2021a) Linfeng Zhang, Chenglong Bao, and Kaisheng Ma. 2021a. Self-Distillation: Towards Efficient and Compact Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 8 (2021), 4388–4403. https://api.semanticscholar.org/CorpusID:232302458
  • Zhang et al. (2021b) Xinqian Zhang, Ming Hu, Jun Xia, Tongquan Wei, Mingsong Chen, and Shiyan Hu. 2021b. Efficient Federated Learning for Cloud-Based AIoT Applications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 40, 11 (2021), 2211–2223. https://doi.org/10.1109/TCAD.2020.3046665
  • Zhu et al. (2021) Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. 2021. Data-Free Knowledge Distillation for Heterogeneous Federated Learning. Proceedings of machine learning research 139 (2021), 12878–12889. https://api.semanticscholar.org/CorpusID:235125689