Towards Energy-Aware Federated Learning via MARL: A Dual-Selection Approach for Model and Client

Jun Xia, University of Notre Dame, Notre Dame, IN, USA, [email protected], and Yiyu Shi, University of Notre Dame, Notre Dame, IN, USA, [email protected]

Abstract.

Although Federated Learning (FL) is promising for knowledge sharing among heterogeneous Artificial Intelligence of Things (AIoT) devices, its training performance and energy efficiency are severely restricted in practical battery-driven scenarios by the “wooden barrel effect”, which is caused by the mismatch between homogeneous model paradigms and heterogeneous device capabilities. As a result, given the many differences among devices, it is hard for existing FL methods to train effectively in energy-constrained scenarios, such as those where devices are limited by battery capacity. To tackle these issues, we propose an energy-aware FL framework named DR-FL, which considers the energy constraints of both clients and heterogeneous deep learning models to enable energy-efficient FL. Unlike Vanilla FL, DR-FL adopts our proposed Multi-Agent Reinforcement Learning (MARL)-based dual-selection method, which allows participating devices to contribute to the global model effectively and adaptively based on their computing capabilities and energy capacities. Experiments on various well-known datasets show that DR-FL can not only maximize knowledge sharing among heterogeneous models under the energy constraints of large-scale AIoT systems but also improve the model performance of each involved heterogeneous device.

1. Introduction

The increasing popularity of Artificial Intelligence (AI) techniques, especially Deep Learning (DL), is accelerating the evolution of the Internet of Things (IoT) toward the Artificial Intelligence of Things (AIoT), where various AIoT devices are equipped with DL models to enable accurate perception and intelligent control (Bhardwaj et al., 2020). Although AIoT systems (e.g., autonomous driving, intelligent control (Samarjit and Faruque, 2016), and healthcare systems (Wu et al., 2022; Baghersalimi et al., 2021)) play an important role in various safety-critical domains, both the limited classification capabilities of local device models and the restricted access to private local data make it hard to guarantee the training and inference performance of AIoT devices in Federated Learning (FL) (McMahan et al., 2016a), especially when they are powered by batteries and deployed in an uncertain, dynamic environment (Cui et al., 2022). To speed up the training and inference of devices, more and more large-scale AIoT systems rely on cloud computing (Zhang et al., 2021b), which offers tremendous computing power and flexible device management schemes. However, such a cloud-based architecture still cannot fundamentally improve the inference accuracy of AIoT devices, since they are not allowed to transmit private local data to each other. Because of these data privacy concerns, both the training and inference performance of local models are greatly suppressed.

As a promising collaborative machine learning paradigm, FL allows local DL model training among various devices without compromising their local data privacy. Instead of sharing sensitive local data among devices, FL only needs to send the gradients or weights of local device models to a cloud server for knowledge aggregation, thus enhancing both the training and inference capability of local models. Although FL is promising for knowledge sharing, it faces the problems of both large-scale deployment and quick adaptation to dynamic environments, where local models must be frequently retrained to accommodate an ever-changing world. In practice, such problems are hard to solve, since classic Federated Averaging methods (e.g., FedAvg) require all devices to have homogeneous local models with the same architecture.

According to the well-known “wooden barrel effect” caused by the homogeneous model assumption, as shown in Figure 1, energy is wasted in Vanilla FL for two main reasons: the mismatch between computing power and the homogeneous model, and the mismatch between power consumption and the homogeneous model. The former spends device energy on waiting, while the latter spends device energy on useless training (the device has enough power to support training but not the subsequent communication). Such a homogeneous model assumption therefore strongly limits the energy efficiency of the entire FL system, because that efficiency is mainly determined by how much energy goes into effective model learning rather than into waiting, during which a device consumes energy without training or communicating.

Figure 1. The energy consumption waste of the “wooden barrel effect” in Vanilla FL is usually due to the following two reasons, i.e., the mismatch between computing power and homogeneous model, and the mismatch between power consumption and homogeneous model. The former uses device energy for waiting time, while the latter uses device energy for useless training time (only enough power to support training but not support communication).

Typically, an AIoT system involves various types of devices with different settings (i.e., computing power and remaining power). If all devices are equipped with homogeneous local models, the inference potential of devices with superior computing power will be eclipsed. Things become even worse when the devices of AIoT applications are powered by batteries. In this case, devices with less battery energy will be reluctant to participate in frequent interactions with the cloud server. Otherwise, if a device runs out of power at an early stage of FL training, it is hard for the global model to achieve the expected inference performance, and the overall inference performance of the global model will deteriorate sharply because such exhausted devices are absent from the rest of the training process. Therefore, how to fully exploit the potential of energy-constrained heterogeneous devices to enable high-performance and energy-efficient FL is becoming a major bottleneck in the design of an AIoT system.

Although various heterogeneous FL methods (e.g., HeteroFL (Diao et al., 2021), ScaleFL (Ilhan et al., 2023), PervasiveFL (Xia et al., 2022)) and energy-saving techniques (Li et al., 2020, 2019) have been investigated to address the above issue, most of them focus on either enabling effective knowledge sharing between heterogeneous models or reducing the energy consumption of devices. Because they build on coarse-grained FedAvg operations, few existing FL methods can substantially address the above challenges and quickly adapt to new environments within an energy-constrained scenario. Inspired by the concepts of BranchyNet (Teerapittayanon et al., 2016) and multi-agent reinforcement learning (Zhang et al., 2021a), in this paper we propose a novel FL framework named DR-FL, which takes both the layer-wise structure information of DL models and the remaining energy of each client into account to enable energy-efficient federated training. Unlike traditional FedAvg-based FL methods that rely on homogeneous device models, DR-FL maintains a layer-wise global model on the cloud server, while each device only installs a layer-wise model consisting of a subset of its layers, chosen according to the device's computing power and remaining battery. In this way, all the heterogeneous local models can effectively contribute to the global model based on their computing capabilities and remaining energy in a MARL-based manner. Meanwhile, by adopting MARL, DR-FL can make the trade-off between training performance and energy consumption, thus ensuring energy-efficient FL training that accommodates various energy-constrained environments. This paper makes the following three major contributions:

• We establish a novel, lightweight cloud-based FL framework named DR-FL, which can be easily implemented and enables various heterogeneous DNNs to share knowledge without compromising their data privacy in FL for heterogeneous devices by layer-wise model aggregation.

• We propose a dual-selection approach based on MARL to control energy-efficient learning from the perspectives of both layer-wise models and participating clients, which can maximize the efficacy of the entire AIoT system.

• Experimental results obtained from both simulation and real test-bed platforms show that, compared with various state-of-the-art approaches, DR-FL can not only achieve better inference performance within various non-IID scenarios, but also have superior scalability for large-scale AIoT systems.

The rest of this paper is organized as follows. Section 2 discusses related work on heterogeneous FL and energy-aware FL training. After giving the preliminaries of FL and multi-agent reinforcement learning in Section 3, Section 4 details our proposed DR-FL method. Section 5 presents experimental results on well-known benchmarks. Finally, Section 6 concludes the paper.

2. Related Work

Although FL is good at knowledge sharing without compromising the data privacy of devices in AIoT system design, due to the homogeneous assumption that all the involved devices should have local DL models with the same architecture, Vanilla FL methods inevitably suffer from the problems of low inference accuracy and invalid energy consumption, thus impeding the deployment of FL methods in large-scale AIoT system designs (Khan et al., 2020; Nguyen et al., 2021; Xia et al., 2022; Zhu et al., 2021), especially for non-IID scenarios.

To enable collaborative learning among heterogeneous device models, various solutions have been extensively studied, which can be primarily classified into two categories, i.e., subnetwork aggregation-based methods and knowledge distillation-based methods. The basic idea of subnetwork aggregation-based methods is to allow knowledge aggregation on top of subnetworks of local device models, which enables knowledge sharing among heterogeneous device models. For instance, Diao et al. (Diao et al., 2021) presented an effective heterogeneous FL framework named HeteroFL, which can train heterogeneous local models with varying computation complexities but still produce a single global inference model, assuming that device models are subnetworks of the global model. By integrating FL and width-adjustable slimmable neural networks, Yun et al. (Yun et al., 2023) proposed a novel learning framework named SlimFL, which jointly utilizes superposition coding for global model aggregation and superposition training for updating local models. In (Xia et al., 2022), Xia et al. developed a novel framework named PervasiveFL, which utilizes a small uniform model (i.e., “modellet”) to enable heterogeneous FL. Although all the above heterogeneous FL methods are promising, most focus on improving inference performance, and few of them take real-time training and energy efficiency into account.

Since a large-scale FL-based AIoT application typically involves a variety of devices that are powered by batteries, how to conduct energy-efficient FL training is becoming an important issue (Shi et al., 2021; Zhang and Tao, 2020). To address this issue, various methods have been investigated to reduce the energy consumed by FL training and device-server communication. For example, Hamdi et al. (Hamdi et al., 2022) studied the FL deployment problem in an energy-harvesting wireless network, where a certain number of users may be unable to participate in FL due to interference and energy constraints. They formalized such a deployment scenario as a joint energy management and user scheduling problem over wireless systems, and solved it efficiently. In (Sun et al., 2019), Sun et al. presented an online energy-aware dynamic worker scheduling policy, which can maximize the average number of workers scheduled for gradient update under a long-term energy constraint. In (Yang et al., 2019), Yang et al. formulated the energy-efficient transmission and computation resource allocation for FL over wireless communication networks as a joint learning and communication problem. To minimize system energy consumption under a latency constraint, they presented an iterative algorithm that can derive optimal solutions considering various factors (e.g., bandwidth allocation, power control, computation frequency, and learning accuracy). Although all the above energy-saving methods can effectively reduce energy consumption in both FL training and communication, few of them can guarantee the training time requirement of FL training within a complex dynamic environment.

To the best of our knowledge, DR-FL is the first attempt to investigate MARL-based dual selection of both layer-wise models and participating clients to enable fine-grained heterogeneous FL, where heterogeneous devices can adaptively and efficiently contribute to the global model based on their computing capabilities and remaining energy. Compared with state-of-the-art heterogeneous FL methods, DR-FL can not only maximize the knowledge sharing among various heterogeneous models under energy constraints but also significantly improve both the model performance of each involved device and the energy efficiency of the entire FL system.

3. Preliminaries

3.1. Federated Learning

With the prosperity of distributed machine learning technologies (Verbraeken et al., 2019), privacy-aware FL is proposed to effectively solve the problem of data silos, where multiple AIoT devices can achieve knowledge sharing without leaking their data privacy. Since the physical environment is volatile (i.e., high latency network and unstable connection) in real AIoT scenarios, Vanilla FL randomly selects a number of AIoT devices for each communication round of training a homogeneous DNN model. Suppose there are $N$ devices selected at the $t^{th}$ communication round in FL. After the $t^{th}$ communication round, the update process of each device model is defined as follows

$$\mathbb{W}_{t+1}^{n}\leftarrow\mathbb{W}_{t}^{n}-\eta\nabla\mathbb{W}_{t}^{n}, \tag{1}$$

where $\mathbb{W}_{t}^{n}$ and $\mathbb{W}_{t+1}^{n}$ represent the copies of the global model held by the $n^{th}$ device at round $t$ and round $t+1$, respectively, $\eta$ indicates the learning rate, and $\nabla\mathbb{W}_{t}^{n}$ is the gradient obtained by the $n^{th}$ device model after the $t^{th}$ training round. To protect data privacy, at the end of each communication round, FL uploads the weight differences (i.e., model gradients) of each device instead of the newly updated models to the cloud for aggregation. After gathering the gradients from all the participating devices, the cloud updates the parameters of the shared global model based on the FedAvg (McMahan et al., 2016b) algorithm, which is defined as follows:

$$\mathbb{W}_{t+1}\leftarrow\mathbb{W}_{t}+\frac{\sum_{n=1}^{N}\mathbb{L}_{n}\nabla\mathbb{W}_{t}^{n}}{N}, \tag{2}$$

where $\frac{\sum_{n=1}^{N}\mathbb{L}_{n}\nabla\mathbb{W}_{t}^{n}}{N}$ denotes the average of the gradients of the $N$ participating devices in communication round $t$, each weighted by $\mathbb{L}_{n}$, the training data size of device $n$, and $\mathbb{W}_{t}$ and $\mathbb{W}_{t+1}$ represent the global models after the $t^{th}$ and $(t+1)^{th}$ communication rounds, respectively. Although Vanilla FL methods (e.g., FedAvg) perform remarkably well in distributed machine learning, they cannot be directly applied to AIoT scenarios. This is because heterogeneous AIoT devices train at different speeds under Vanilla FL, which results in additional energy waste that is unacceptable for an energy-constrained system.
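As a rough illustration of Equations 1 and 2, the following minimal Python sketch performs one local gradient step per device and then aggregates the uploaded weight differences on the server. The function names (`local_step`, `server_aggregate`), the flat-list representation of a model, and the toy numbers are hypothetical. The sketch also assumes that $\mathbb{L}_{n}$ is expressed relative to the mean data size, so that the weights average to one; the text does not specify this normalization.

```python
from typing import List


def local_step(weights: List[float], grads: List[float], lr: float) -> List[float]:
    """Equation 1: one gradient step on a device's copy of the model."""
    return [w - lr * g for w, g in zip(weights, grads)]


def server_aggregate(global_w: List[float],
                     updates: List[List[float]],
                     data_weights: List[float]) -> List[float]:
    """Equation 2: add the data-weighted mean of the uploaded updates."""
    n = len(updates)
    new_w = []
    for j, w in enumerate(global_w):
        weighted_sum = sum(lw * upd[j] for lw, upd in zip(data_weights, updates))
        new_w.append(w + weighted_sum / n)
    return new_w


global_w = [1.0, -2.0]                      # toy global model W_t
device_grads = [[0.5, 0.1], [0.3, -0.1]]    # toy gradients of two devices
data_sizes = [300, 100]                     # toy training data sizes
lr = 0.1

# Assumption: L_n is the data size divided by the mean data size.
mean_size = sum(data_sizes) / len(data_sizes)
data_weights = [s / mean_size for s in data_sizes]

# Each device uploads its weight difference, not the updated model.
updates = []
for grads in device_grads:
    local_w = local_step(global_w, grads, lr)
    updates.append([lw - gw for lw, gw in zip(local_w, global_w)])

new_global = server_aggregate(global_w, updates, data_weights)
print(new_global)
```

With these toy values the device holding more data pulls the global model more strongly toward its own update.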

3.2. Multi-Agent Reinforcement Learning

In cooperative Multi-Agent Reinforcement Learning (MARL), a set of $N$ agents is trained to produce optimal actions that lead to the maximum team reward. Specifically, at each time step $t$, each agent $n$ (where $1\leq n\leq N$) observes its state $s_{t}^{n}$ and selects an action $a_{t}^{n}$ based on $s_{t}^{n}$. After all agents have completed their actions, the team receives a joint reward $r_{t}$ and each agent transitions to its next state $s_{t+1}^{n}$. The goal is to maximize the total expected discounted reward $R=\sum_{t=1}^{T}\gamma^{t-1}r_{t}$ by selecting the optimal actions for each agent, where $\gamma\in[0,1]$ is the discount factor.

Recently, QMIX (Rashid et al., 2018) has emerged as a promising solution for jointly training agents in cooperative MARL. In QMIX, each agent $n$ employs a Deep Neural Network (DNN) to infer its actions. This DNN implements the $Q$-function $Q^{\theta}(s,a)=E[R_{t}|s_{t}^{n}=s,a_{t}^{n}=a]$, where $\theta$ represents the parameters of the DNN, and $R_{t}=\sum_{i=t}^{T}\gamma^{i-t}r_{i}$ is the total discounted team reward received from time step $t$ onward. During MARL execution, each agent $n$ selects the action $a^{*}$ with the highest $Q$-value (i.e., $a^{*}=\arg\max_{a}Q^{\theta}(s_{t}^{n},a)$).

To train the agents with QMIX, a replay buffer is employed to store the transition tuples $(s_{t}^{n},a_{t}^{n},s_{t+1}^{n},r_{t})$ of each agent $n$. The joint $Q$-function, $Q_{\text{tot}}(\cdot)$, is represented as the summation of all individual $Q$-functions (i.e., $Q_{\text{tot}}(s_{t},a_{t})=\sum_{n}Q^{\theta}_{n}(s_{t}^{n},a_{t}^{n})$), where $s_{t}=\{s_{t}^{n}\}$ and $a_{t}=\{a_{t}^{n}\}$ are the states and actions collected from all agents $n\in\{1,\ldots,N\}$ at time step $t$. The agent DNNs can be recursively trained by minimizing the loss $L=E_{s_{t},a_{t},r_{t},s_{t+1}}[(y_{t}-Q_{\text{tot}}(s_{t},a_{t}))^{2}]$, where $y_{t}=r_{t}+\gamma\sum_{n}\max_{a}Q^{\theta^{\prime}}_{n}(s_{t+1}^{n},a)$ and $\theta^{\prime}$ represents the parameters of the target network, which are periodically copied from $\theta$ during the training phase.
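To make the training target concrete, the sketch below evaluates $Q_{\text{tot}}$, the target $y_{t}$, and the squared error for a single transition, using the additive form of the joint $Q$-function stated above. It is only an illustration: the function names and toy $Q$-values are hypothetical, and the original QMIX algorithm combines the per-agent values through a learned monotonic mixing network rather than a plain sum.

```python
from typing import List


def q_tot(chosen_q: List[float]) -> float:
    """Joint Q-value as the sum of each agent's Q-value for its chosen action."""
    return sum(chosen_q)


def td_target(reward: float, gamma: float,
              next_q_per_agent: List[List[float]]) -> float:
    """y_t = r_t + gamma * sum_n max_a Q'_n(s_{t+1}^n, a), using target-network values."""
    return reward + gamma * sum(max(qs) for qs in next_q_per_agent)


def td_loss(chosen_q: List[float], reward: float, gamma: float,
            next_q_per_agent: List[List[float]]) -> float:
    """Squared TD error for one transition sampled from the replay buffer."""
    y = td_target(reward, gamma, next_q_per_agent)
    return (y - q_tot(chosen_q)) ** 2


# Toy transition with two agents and two actions each (hypothetical values).
chosen_q = [1.0, 2.0]                    # Q_n(s_t^n, a_t^n) for the taken actions
next_q = [[0.5, 1.5], [2.0, 1.0]]        # target-network Q-values at s_{t+1}^n
reward, gamma = 1.0, 0.9

y_t = td_target(reward, gamma, next_q)   # 1.0 + 0.9 * (1.5 + 2.0)
loss = td_loss(chosen_q, reward, gamma, next_q)
print(y_t, loss)
```

In practice the loss would be averaged over a mini-batch of transitions and minimized with respect to $\theta$ by gradient descent.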

4. Method

4.1. Problem Formulation

We assume that an energy-constrained FL system contains a cloud server and $N$ heterogeneous AIoT devices, which can be represented as $D=\left\{D_{1},...,D_{n},...,D_{N}\right\}$. All these heterogeneous AIoT devices can be classified into three categories according to their computing capability, i.e., small, middle, and large, which indicate the level of the computing and storage resources of a device. In this paper, the performance of the entire FL system is significantly influenced by three key factors: running time, energy consumption, and model accuracy. Running time determines the training efficiency of the FL system in a real scenario. Energy consumption is also a significant factor, particularly for AIoT devices powered by limited energy resources. Lastly, model accuracy ensures that the system produces reliable and valuable predictions. Therefore, to optimize the overall performance of the FL system, it is crucial to strike a balance among these three factors.

Running Time Model: Considering the differences in network delay and computing resources of heterogeneous AIoT devices, the energy-constrained FL system aims to minimize the total running time $T_{all}$ among all the devices, which is shown as

$$T_{all}=\max_{\forall n}T_{all}^{D_{n}}. \tag{3}$$

Let $T_{com}^{D_{n}}$ and $T_{tra}^{D_{n}}$ be the communication time of the device $D_{n}$ and the training time of the layer-wise model on device $D_{n}$ , respectively. Note that due to the abundant computing resources in the cloud server, its running time is negligible compared to that on devices. The total running time for each device $T_{all}^{D_{n}}$ is defined as

$$T_{all}^{D_{n}}=T_{com}^{D_{n}}+T_{tra}^{D_{n}}. \tag{4}$$

Here, the communication time of each device $T_{com}^{D_{n}}$ can be regarded as the ratio of the size $S_{D_{n}}$ of its model, which varies with the layers it contains, to the network bandwidth $V_{net}$. Since the training time of each device $T_{tra}^{D_{n}}$ is determined by the computation capability of the local device $C_{D_{n}}$ and the training data size on the device $L_{D_{n}}$, we formalize the communication time $T_{com}^{D_{n}}$ and the training time $T_{tra}^{D_{n}}$ as

$$T_{com}^{D_{n}}=\frac{S_{D_{n}}}{V_{net}},\qquad T_{tra}^{D_{n}}=\frac{L_{D_{n}}}{C_{D_{n}}}, \tag{5}$$

where $O_{D_{n}}$ is reflected by the computation capability of the device $C_{D_{n}}$. We assume that the network transmission speed remains relatively stable.
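Equations 3 to 5 can be checked with a small numerical sketch. The `Device` class, the units (MB, MB/s, samples, samples per second), and the toy values below are assumptions for illustration and do not come from our test-bed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Device:
    model_size: float   # S_{D_n}: size of the dispatched layer-wise model (assumed MB)
    data_size: float    # L_{D_n}: local training data size (assumed samples)
    compute: float      # C_{D_n}: computation capability (assumed samples/s)


def comm_time(dev: Device, v_net: float) -> float:
    """Equation 5 (left): T_com = S / V_net."""
    return dev.model_size / v_net


def train_time(dev: Device) -> float:
    """Equation 5 (right): T_tra = L / C."""
    return dev.data_size / dev.compute


def device_time(dev: Device, v_net: float) -> float:
    """Equation 4: per-device running time."""
    return comm_time(dev, v_net) + train_time(dev)


def system_time(devices: List[Device], v_net: float) -> float:
    """Equation 3: the round lasts as long as the slowest device."""
    return max(device_time(d, v_net) for d in devices)


v_net = 5.0  # assumed bandwidth in MB/s
devices = [
    Device(model_size=10.0, data_size=1000.0, compute=100.0),  # weak device, small model
    Device(model_size=40.0, data_size=1000.0, compute=500.0),  # strong device, large model
]

per_device = [device_time(d, v_net) for d in devices]
t_all = system_time(devices, v_net)
print(per_device, t_all)  # [12.0, 10.0] 12.0
```

In this toy setting the weak device still determines $T_{all}$ even though it runs the smaller model, which is the kind of mismatch the dual-selection method is meant to reduce.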

Energy Consumption Model: The energy consumed by the overall FL system plays an important role in ensuring the system operates smoothly. The calculation of the total remaining energy can be expressed as

$$E_{all}=\sum_{n=1}^{N}\left(E_{remain}^{D_{n}}-E_{tra}^{D_{n}}-E_{com}^{D_{n}}\right). \tag{6}$$

Note that both training and communication energy consumption are determined by two factors, i.e., the size of the training model and the power mode of the AIoT devices. The training energy consumption $E_{tra}^{D_{n}}$ and communication energy consumption $E_{com}^{D_{n}}$ of device $D_{n}$ are calculated as

$$E_{tra}^{D_{n}}=P_{train}\times T_{tra}^{D_{n}},\qquad E_{com}^{D_{n}}=P_{com}\times T_{com}^{D_{n}}, \tag{7}$$

where $P_{train}$ is the energy consumption per unit training time, and $P_{com}$ is the energy consumption per unit network transmission time. Note that since actual energy consumption is intrinsically related to the size of the trained model, variations in the size of the model lead to fluctuations in the energy consumed during both the training and communication processes. Therefore, it is of utmost importance to consider these energy dynamics when addressing the optimization model.
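The energy model of Equations 6 and 7 can be sketched in the same way. The power values, the joule and second units, and the helper names are hypothetical and serve only to show how the per-round cost is subtracted from the remaining energy.

```python
from typing import List

P_TRAIN = 2.0  # assumed energy per unit training time (J/s)
P_COM = 1.0    # assumed energy per unit transmission time (J/s)


def round_energy(t_tra: float, t_com: float,
                 p_train: float = P_TRAIN, p_com: float = P_COM) -> float:
    """Equation 7: E_tra + E_com for one device in one round."""
    return p_train * t_tra + p_com * t_com


def total_remaining_energy(remaining: List[float],
                           t_tra: List[float],
                           t_com: List[float]) -> float:
    """Equation 6: sum over devices of remaining energy minus the round cost."""
    return sum(e - round_energy(tt, tc)
               for e, tt, tc in zip(remaining, t_tra, t_com))


remaining = [100.0, 50.0]   # E_remain for two devices (J)
t_tra = [10.0, 2.0]         # training times (s)
t_com = [2.0, 8.0]          # communication times (s)

costs = [round_energy(tt, tc) for tt, tc in zip(t_tra, t_com)]
e_all = total_remaining_energy(remaining, t_tra, t_com)
print(costs, e_all)  # [22.0, 12.0] 116.0
```

A device whose remaining energy is smaller than its round cost cannot complete both training and communication, which corresponds to the useless training case in Figure 1.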

Model Accuracy: In a heterogeneous scenario, how to effectively leverage the heterogeneity in heterogeneous models and devices to enhance the performance of aggregated models is an urgent issue that needs to be solved in FL. Furthermore, resource-constrained heterogeneous AIoT devices that participate in aggregation pose a considerable obstacle to the application of energy-constrained FL. Inspired by the work (Li et al., 2019) where the performance of model inference is affected by the number of successful aggregations for its device, we can deduce that the accuracy of heterogeneous models is proportional to the total number of aggregated models participating in each round. However, since devices consume energy every time they participate in each round of aggregation, how to reasonably select aggregation devices in an energy-constrained environment to improve model accuracy has become a major challenge in the design of an FL framework.

Optimization Objective: Taking energy information into account, an optimization model is proposed for energy-constrained FL. This model balances the three factors above: it minimizes the total running time $T_{all}$ and maximizes the model accuracy $M_{acc}$ under a constraint on the total energy consumption $E_{all}$, which is defined as follows

$$\begin{matrix}\min T_{all},\quad\max M_{acc},\\ \text{s.t.}\ E_{all}\leq E,\end{matrix} \tag{8}$$

where $E$ is the energy budget of an FL system.

4.2. Workflow of DR-FL

In DR-FL, heterogeneous AIoT devices and a cloud server cooperate to achieve high performance for the various layer-wise models deployed on edge devices. Before training, every device participating in DR-FL initializes and installs a layer-wise model, which consists of a subset of the layers of the global model on the cloud server. Then, the cloud server sends a part of the global model to the AIoT devices for local training. At the end of local training, DR-FL performs layer-wise model aggregation on the cloud server. Note that hot-plug AIoT devices are permissible in DR-FL, where newly involved devices only inherit the parameters of the global model on the cloud server. Figure 2 shows the workflow of DR-FL, which consists of the following five steps.

Step 1 (Battery or Model Information Upload): During the initialization step of DR-FL, each device intending to participate in FL should upload its device information to the cloud, which includes the power, computing, and storage capabilities of devices and the overclocking potential of models. This collected information is used for energy-aware dual-selection for the layer-wise model and client in subsequent steps to optimize the entire system’s energy efficiency.

Step 2 (Layer-Wise Model Aggregation): After receiving the local model gradients of the participating devices, this step performs layer-aligned averaging of these gradients (i.e., the parts of the network that devices have in common are aggregated together) and combines the result with the previous-round global model stored on the server to construct a new global model.

Step 3 (Energy-Aware MARL-based Dual-Selection): Then, to prevent selected devices from dropping out of the FL process due to energy limitations, we design a MARL-based selector that chooses an appropriate model for each AIoT device based on its remaining energy and computing capabilities, which can not only improve the efficiency of device resource usage but also ensure active participation in FL (see more details in Section 4.3). Furthermore, apart from selecting a layer-wise model for each AIoT device, the selector can also adjust the computing capability of AIoT devices, aiming to achieve a trade-off between energy consumption and computing efficiency.

Step 4 (Layer-Wise Model Dispatching): Based on an energy-aware MARL-based dual-selection strategy, the cloud server dispatches part of the global model parameters to each heterogeneous AIoT device.

Step 5 (Local Training): Based on the received global model parameters, each heterogeneous AIoT device builds an initial local model (i.e., layer-wise model), which is trained using cross-entropy loss based on local training samples to obtain the gradients of the local model for gradient upload.

DR-FL repeats all five steps above until the global model and all its local models converge.
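The layer-wise aggregation in Step 2 can be sketched as follows. This is a minimal illustration under several assumptions: the model is a dictionary from hypothetical layer names to flat parameter lists, each device uploads updates only for the layers it holds, every layer is averaged over the devices that trained it, and layers that no device trained keep their previous values.

```python
from typing import Dict, List

Layer = List[float]


def layerwise_aggregate(global_model: Dict[str, Layer],
                        updates: List[Dict[str, Layer]]) -> Dict[str, Layer]:
    """Average each layer's update over the devices that hold that layer."""
    new_model: Dict[str, Layer] = {}
    for name, params in global_model.items():
        holders = [u[name] for u in updates if name in u]
        if not holders:
            # No device trained this layer: keep the previous-round parameters.
            new_model[name] = list(params)
            continue
        avg = [sum(vals) / len(holders) for vals in zip(*holders)]
        new_model[name] = [p + d for p, d in zip(params, avg)]
    return new_model


# Toy global model with three blocks (hypothetical names and values).
global_model = {"block1": [1.0, 1.0], "block2": [2.0], "block3": [3.0]}

# A small device holds only block1; a larger device holds block1 and block2.
updates = [
    {"block1": [0.2, 0.0]},
    {"block1": [0.4, -0.2], "block2": [0.5]},
]

new_model = layerwise_aggregate(global_model, updates)
print(new_model)
```

Shallow layers are therefore updated by more devices than deep ones, so even the least capable devices contribute to the shared part of the global model.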

4.3. Dual-Selection for Local Model and Client

4.3.1. MARL Training Process:

In DR-FL, an energy-aware MARL-based dual-selection method, with one agent per device, selects both the participating devices and the layers of the local model that runs on each of them. To better capture the connections between long-term/short-term rewards and strategies, each MARL agent is designed with two Multi-Layer Perceptrons (MLPs) and a Gated Recurrent Unit (GRU) (Cho et al., 2014), as shown in Figure 3. During the training procedure of MARL, each agent $n$ acquires its current state $s_{t}^{n}$ and selects an action $a_{t}^{n}$ for its client. Based on both client selection and layer-wise model considerations, the central server computes team rewards by considering the validation accuracy improvement of the global model $M_{acc}$, the total runtime $T_{all}$, the computation capabilities $C$, and the remaining energy of the devices $E_{all}$. The MARL agents are then trained with the QMIX algorithm (Rashid et al., 2018) to maximize the system rewards (see the design details in Section 4.3.4).

4.3.2. MARL Agent State Design:

The state of each MARL agent $D_{n}$ is comprised of three components: the remaining energy $E_{all}^{D_{n}}$ , the computation capability of each communication round $C_{D_{n}}$ , and the size of the local training dataset $L_{D_{n}}$ . At each training round $t$ , each agent initially conducts the training procedure and transmits its gradients to the central server. Furthermore, to estimate the current training and communication delays at client device $n$ , each MARL agent is equipped with a record of training latency $T_{tra}^{D_{n}}$ and communication latency $T_{com}^{D_{n}}$ , where $T_{tra}^{D_{n}}$ and $T_{com}^{D_{n}}$ denote the latency in local training and model uploading for agent $n$ during the communication round $t$ . As shown in Figure 3, the parameter $\tau$ represents the trajectory of historical data from training, and $h$ represents the MLP layer for knowledge extraction. Moreover, each MARL agent $n$ also calculates the energy consumption of training and communication based on Equation 7. This inclusion is crucial as the energy costs contribute to the overall energy cost, while the remaining energy of the agent influences both training latency and model accuracy. The state vector $s_{t}^{n}$ of agent $n$ in communication round $t$ is defined as:

$$s_{t}^{n}=[L_{t}^{n},C_{D_{n}},E_{D_{n}},t]. \tag{9}$$

Finally, to decrease storage overhead and accelerate the speed of agent convergence, all MLPs and GRUs within the MARL agents share their weights.

4.3.3. Agent Action Design:

Given the input state shown in Equation 9, each MARL agent $n$ determines which layers of the local model should be used for the local training process on its device. Specifically, the MARL agent generates $Q$-values for the current action set $[a^{0},\ldots,a^{M}]$, where $M$ represents the number of model selections available to the client. Note that when the selected action is zero, the client device runs the first model, and when the selected action is $M$, the client does not participate in FL. After a layer-wise model has been selected for each heterogeneous device, the agents' $Q$-values are ranked and the Top-K algorithm selects the devices with the highest $Q$-values to participate in the FL process.
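The two selection stages described above can be sketched in a few lines. The function names and toy $Q$-values are hypothetical, and the sketch assumes that agents whose best action is $a^{M}$ (no participation) are removed before the Top-K ranking, a detail the description leaves open.

```python
from typing import List, Tuple

Choice = Tuple[int, int, float]  # (agent index, chosen action, Q-value)


def select_actions(q_values: List[List[float]]) -> List[Choice]:
    """Model selection: each agent takes the action with its highest Q-value."""
    choices = []
    for agent_id, qs in enumerate(q_values):
        action = max(range(len(qs)), key=qs.__getitem__)
        choices.append((agent_id, action, qs[action]))
    return choices


def top_k_clients(choices: List[Choice], k: int, skip_action: int) -> List[Choice]:
    """Client selection: keep the k participating agents with the highest Q-values."""
    candidates = [c for c in choices if c[1] != skip_action]
    candidates.sort(key=lambda c: c[2], reverse=True)
    return candidates[:k]


M = 3  # actions 0..2 pick a layer-wise model, action 3 means "do not participate"
q_values = [
    [0.10, 0.90, 0.20, 0.00],  # agent 0 prefers model 1
    [0.30, 0.20, 0.10, 0.80],  # agent 1 prefers to skip this round
    [0.70, 0.40, 0.60, 0.10],  # agent 2 prefers model 0
    [0.20, 0.50, 0.95, 0.30],  # agent 3 prefers model 2
]

choices = select_actions(q_values)
selected = top_k_clients(choices, k=2, skip_action=M)
print(selected)  # [(3, 2, 0.95), (0, 1, 0.9)]
```

Here agents 3 and 0 are dispatched models 2 and 1, respectively, while agent 2 is ranked out and agent 1 opts out of the round.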

4.3.4. Reward Function Design:

To optimize the objective described in Equation 8, the reward function should reflect the changes in the model accuracy, processing latency (training, communication and waiting latency), and processing energy consumption after executing the dual-selection strategy generated by MARL agents. The reward $r_{t}$ at training round $t$ is defined as follows:

$$r_{t}=w_{1}\cdot(M_{Acc}^{t}-M_{Acc}^{t-1})-w_{2}\cdot(E_{all}^{t-1}-E_{all}^{t})-w_{3}\cdot\max_{1\leq n\leq N}T_{all}^{t,n}. \tag{10}$$

Here, $\max_{1\leq n\leq N}T_{all}^{t,n}$ is the running time of the slowest selected device in round $t$, i.e., the total time needed for all selected devices to finish local training. The MARL agents use the evaluation accuracy, computed on a small dataset held on the cloud server, to select the layer-wise model that is dispatched to each local device, which then continues local training and uploads its model updates. Moreover, $w_{1}$, $w_{2}$, and $w_{3}$ are normalization weights that keep the three reward terms on comparable scales (we used $w_{1}=1000$, $w_{2}=0.01$, and $w_{3}=1$ in our experiments). $E_{all}^{t}$ is the total remaining energy after the $t^{th}$ communication round, as defined in Equation 6. The MARL agents are trained using QMIX, as illustrated in Figure 3.
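As a worked instance of Equation 10, the sketch below evaluates the reward for one round with the weights used in our experiments. The function name and the toy inputs are hypothetical, and the accuracy is assumed to be expressed as a fraction in $[0,1]$, a convention the equation itself does not fix.

```python
from typing import List


def round_reward(acc_now: float, acc_prev: float,
                 e_prev: float, e_now: float,
                 device_times: List[float],
                 w1: float = 1000.0, w2: float = 0.01, w3: float = 1.0) -> float:
    """Equation 10: accuracy gain minus energy spent minus slowest-device time."""
    accuracy_gain = acc_now - acc_prev        # M_Acc^t - M_Acc^{t-1}
    energy_spent = e_prev - e_now             # E_all^{t-1} - E_all^t
    slowest = max(device_times)               # max_n T_all^{t,n}
    return w1 * accuracy_gain - w2 * energy_spent - w3 * slowest


# Toy round: accuracy rises from 0.50 to 0.52, the system spends 200 units of
# energy, and the slowest selected device needs 12 time units.
r_t = round_reward(acc_now=0.52, acc_prev=0.50,
                   e_prev=5000.0, e_now=4800.0,
                   device_times=[12.0, 10.0])
print(round(r_t, 6))  # 6.0
```

The three terms here contribute 20, 2, and 12, respectively, which shows how the weights bring quantities with very different units onto a comparable scale.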

Table 1. Test accuracy (%) comparison for different models and dataset settings under specific energy constraints with 40 clients.

Dataset	CIFAR10
Methods	HeteroFL (Diao et al., 2021)			ScaleFL (Ilhan et al., 2023)			DR-FL (Ours)
Distribution	$\alpha$ =0.1	$\alpha$ =0.5	$\alpha$ =1.0	$\alpha$ =0.1	$\alpha$ =0.5	$\alpha$ =1.0	$\alpha$ =0.1	$\alpha$ =0.5	$\alpha$ =1.0
Model_1	30.46 $\pm$ 1.10	46.11 $\pm$ 3.32	65.23 $\pm$ 1.45	29.25 $\pm$ 1.17	54.44 $\pm$ 0.87	58.15 $\pm$ 4.32	58.69 $\pm$ 0.73	59.01 $\pm$ 0.85	76.46 $\pm$ 0.12
Model_2	48.41 $\pm$ 1.24	62.55 $\pm$ 3.45	62.10 $\pm$ 3.24	41.66 $\pm$ 5.43	55.46 $\pm$ 3.87	71.48 $\pm$ 1.23	65.31 $\pm$ 1.54	75.93 $\pm$ 0.62	77.43 $\pm$ 2.77
Model_3	34.85 $\pm$ 5.79	65.01 $\pm$ 1.79	74.78 $\pm$ 2.76	39.92 $\pm$ 2.75	60.07 $\pm$ 0.68	70.83 $\pm$ 1.43	72.71 $\pm$ 0.58	70.64 $\pm$ 1.40	71.54 $\pm$ 1.54
Model_4	45.26 $\pm$ 3.68	69.65 $\pm$ 2.99	75.14 $\pm$ 1.13	46.59 $\pm$ 3.43	70.60 $\pm$ 4.54	73.90 $\pm$ 1.17	70.76 $\pm$ 1.30	69.37 $\pm$ 0.45	72.27 $\pm$ 1.73
Dataset	CIFAR100
Methods	HeteroFL (Diao et al., 2021)			ScaleFL (Ilhan et al., 2023)			DR-FL (Ours)
Distribution	$\alpha$ =0.1	$\alpha$ =0.5	$\alpha$ =1.0	$\alpha$ =0.1	$\alpha$ =0.5	$\alpha$ =1.0	$\alpha$ =0.1	$\alpha$ =0.5	$\alpha$ =1.0
Model_1	11.86 $\pm$ 0.78	22.56 $\pm$ 2.13	25.66 $\pm$ 1.13	13.14 $\pm$ 1.96	21.39 $\pm$ 1.59	17.58 $\pm$ 0.43	26.25 $\pm$ 0.23	33.59 $\pm$ 3.32	39.65 $\pm$ 1.35
Model_2	16.33 $\pm$ 3.34	25.98 $\pm$ 1.72	28.68 $\pm$ 0.57	12.67 $\pm$ 2.13	28.77 $\pm$ 4.33	29.84 $\pm$ 1.39	17.83 $\pm$ 0.75	39.50 $\pm$ 1.08	33.55 $\pm$ 0.45
Model_3	14.18 $\pm$ 0.29	31.99 $\pm$ 0.53	31.31 $\pm$ 3.34	17.12 $\pm$ 2.88	30.04 $\pm$ 1.91	33.92 $\pm$ 2.34	26.46 $\pm$ 0.24	32.10 $\pm$ 1.12	33.40 $\pm$ 0.13
Model_4	15.66 $\pm$ 0.78	29.33 $\pm$ 0.85	35.44 $\pm$ 1.54	19.24 $\pm$ 1.22	30.29 $\pm$ 1.03	33.23 $\pm$ 1.32	22.55 $\pm$ 0.73	32.55 $\pm$ 1.45	33.80 $\pm$ 1.25
Dataset	SVHN
Methods	HeteroFL (Diao et al., 2021)			ScaleFL (Ilhan et al., 2023)			DR-FL (Ours)
Distribution	$\alpha$ =0.1	$\alpha$ =0.5	$\alpha$ =1.0	$\alpha$ =0.1	$\alpha$ =0.5	$\alpha$ =1.0	$\alpha$ =0.1	$\alpha$ =0.5	$\alpha$ =1.0
Model_1	60.08 $\pm$ 3.23	46.02 $\pm$ 3.32	60.38 $\pm$ 1.39	47.90 $\pm$ 0.53	85.79 $\pm$ 2.22	88.91 $\pm$ 1.11	67.19 $\pm$ 0.32	91.58 $\pm$ 0.21	68.78 $\pm$ 1.33
Model_2	65.11 $\pm$ 4.32	54.83 $\pm$ 1.28	68.90 $\pm$ 2.87	50.26 $\pm$ 2.21	86.82 $\pm$ 2.51	85.16 $\pm$ 4.13	79.86 $\pm$ 0.87	85.30 $\pm$ 1.19	91.72 $\pm$ 0.94
Model_3	65.93 $\pm$ 4.56	69.20 $\pm$ 4.19	75.97 $\pm$ 1.84	76.73 $\pm$ 2.23	84.91 $\pm$ 0.68	88.70 $\pm$ 3.25	91.47 $\pm$ 0.17	88.61 $\pm$ 1.72	93.45 $\pm$ 0.37
Model_4	66.31 $\pm$ 3.09	71.34 $\pm$ 0.79	76.14 $\pm$ 1.90	55.27 $\pm$ 3.23	86.10 $\pm$ 3.56	92.47 $\pm$ 0.51	91.11 $\pm$ 1.32	89.26 $\pm$ 0.75	92.78 $\pm$ 0.54
Dataset	Fashion-MNIST
Methods	HeteroFL (Diao et al., 2021)			ScaleFL (Ilhan et al., 2023)			DR-FL (Ours)
Distribution	$\alpha$ =0.1	$\alpha$ =0.5	$\alpha$ =1.0	$\alpha$ =0.1	$\alpha$ =0.5	$\alpha$ =1.0	$\alpha$ =0.1	$\alpha$ =0.5	$\alpha$ =1.0
Model_1	45.06 $\pm$ 2.01	85.58 $\pm$ 1.31	87.00 $\pm$ 1.93	53.78 $\pm$ 0.98	74.26 $\pm$ 2.34	87.29 $\pm$ 0.93	80.15 $\pm$ 0.23	82.25 $\pm$ 0.19	87.10 $\pm$ 0.37
Model_2	59.76 $\pm$ 0.46	85.75 $\pm$ 0.63	88.60 $\pm$ 0.34	57.19 $\pm$ 3.13	85.32 $\pm$ 2.51	87.44 $\pm$ 0.55	82.10 $\pm$ 0.39	88.76 $\pm$ 0.23	85.22 $\pm$ 0.34
Model_3	57.25 $\pm$ 0.98	83.26 $\pm$ 3.27	87.75 $\pm$ 1.25	62.26 $\pm$ 1.34	87.69 $\pm$ 1.07	88.47 $\pm$ 0.97	86.88 $\pm$ 0.23	89.34 $\pm$ 0.62	90.52 $\pm$ 0.13
Model_4	56.32 $\pm$ 4.07	87.82 $\pm$ 1.28	87.83 $\pm$ 0.56	55.85 $\pm$ 1.51	86.78 $\pm$ 3.27	88.40 $\pm$ 0.69	85.80 $\pm$ 0.17	89.36 $\pm$ 0.11	89.60 $\pm$ 0.29

5. Experimental Results

To evaluate the performance of our proposed method, we implemented the DR-FL algorithm using PyTorch (version 1.4.0). Similar to FedAvg, we assume that only 10% of the AIoT devices are involved in each round of FL communication during training. For DR-FL and the other heterogeneous FL methods, we set the mini-batch size to 32. The number of local training epochs and the initial learning rate were 5 and 0.05, respectively. To simulate a variety of energy-constrained scenarios, we assume that each device is powered by a battery with a maximum capacity of 7,560 joules. In other words, each battery capacity is 1500 mA at a rated voltage of 5.04 V. We conducted comprehensive experiments to answer the following four Research Questions (RQs).

RQ1: (Superiority of DR-FL): What advantages can DR-FL achieve compared with state-of-the-art heterogeneous FL methods?

RQ2: (Benefits of MARL-based Dual-Selection): What benefits does MARL-based dual-selection provide during DR-FL training, especially under constraints such as device energy and overall training time, compared with other state-of-the-art heterogeneous FL methods?

RQ3: (Scalability of DR-FL): How does the number of AIoT devices participating in knowledge sharing affect the performance of DR-FL?

RQ4: (Exploration of the Validation Data Ratio): How does the proportion of validation data in MARL affect the performance of DR-FL?

5.1. Experimental Settings

5.1.1. Model Settings

We compared our DR-FL method with two typical state-of-the-art heterogeneous FL methods, i.e., HeteroFL (Diao et al., 2021) and ScaleFL (Ilhan et al., 2023), which belong to subnetwork aggregation-based methods and knowledge distillation-based methods, respectively. We used the ResNet-18 model (He et al., 2015) as the backbone, where each block of ResNet-18 is followed by a new bottleneck-classifier pair. This yields four layer-wise models that simulate four types of heterogeneous models (i.e., Model_1 to Model_4 in Table 1). Note that all layer-wise models share the same backbone, so they can be reused for model inference.
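
The construction of the four layer-wise models can be sketched as follows. This is a minimal illustration under our own assumptions, not the released DR-FL code: the component names (`block1`, `bottleneck1`, and so on) are hypothetical, the ResNet-18 stem is folded into `block1` for brevity, and Model_k is assumed to consist of the first k backbone blocks plus its own bottleneck-classifier pair.

```python
# Hypothetical component names; the paper does not specify them.
BACKBONE_BLOCKS = ["block1", "block2", "block3", "block4"]


def layerwise_model(k):
    """Components of Model_k: the first k backbone blocks plus its own exit head."""
    if not 1 <= k <= len(BACKBONE_BLOCKS):
        raise ValueError(f"k must be in 1..{len(BACKBONE_BLOCKS)}, got {k}")
    return BACKBONE_BLOCKS[:k] + [f"bottleneck{k}", f"classifier{k}"]


def shared_components(k_a, k_b):
    """Components that Model_{k_a} and Model_{k_b} have in common."""
    other = set(layerwise_model(k_b))
    return [c for c in layerwise_model(k_a) if c in other]


models = {f"Model_{k}": layerwise_model(k) for k in range(1, 5)}
for name, parts in models.items():
    print(name, parts)
```

Under these assumptions, any two models overlap exactly in the backbone prefix of the smaller one, which is the part through which devices holding different models could exchange knowledge.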

5.1.2. Dataset Settings

To evaluate the effectiveness of DR-FL, we considered four datasets, i.e., CIFAR10 and CIFAR100 (Krizhevsky, 2009), Street View House Numbers (SVHN) (Netzer et al., 2011), and Fashion-MNIST (Xiao et al., 2017). CIFAR10: The CIFAR10 dataset consists of 60,000 $32\times 32$ colour images across ten classes, with 6,000 images per class. The dataset is split into 50,000 training images and 10,000 testing images. CIFAR100: The CIFAR100 dataset is similar to CIFAR10 but contains 100 classes instead of 10, with 600 images per class. The dataset also comprises 50,000 training images and 10,000 testing images. SVHN: The SVHN dataset is a real-world image dataset derived from house numbers in Google Street View images. It contains over 600,000 labelled digit images, where each image is a $32\times 32$ colour image representing a single digit (0-9). Fashion-MNIST: The Fashion-MNIST dataset is a dataset of Zalando's article images, consisting of 70,000 $28\times 28$ grayscale images of 10 different fashion categories.

In subsequent experiments, we investigated three non-Independent and Identically Distributed (non-IID) distributions for each dataset. Similar to HeteroFL (Diao et al., 2021), we constructed non-IID local training datasets using heterogeneous data splits that follow a Dirichlet distribution controlled by a variable $\alpha$. Typically, a smaller value of $\alpha$ corresponds to a higher degree of data heterogeneity, i.e., a more strongly non-IID distribution. We also used the same data augmentation techniques as HeteroFL (Diao et al., 2021) to make full use of the natural image datasets. To enable MARL training on the cloud server in DR-FL, we used 4% of the overall training data as the validation set on the server. Note that the validation set on the server does not overlap with the local training datasets hosted by AIoT devices.
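
The data preparation described above can be sketched as follows. This is a hedged illustration, not the splitting code used in the experiments: the function names are hypothetical, toy labels stand in for a real dataset, and a common label-wise Dirichlet scheme is assumed (for each class, the share given to each client is drawn from a symmetric Dirichlet with parameter $\alpha$).

```python
import random


def dirichlet_sample(alpha, k, rng):
    """One draw from a symmetric Dirichlet(alpha) over k entries via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow for very small alpha
        return [1.0 / k] * k
    return [d / total for d in draws]


def holdout_validation(num_samples, ratio, rng):
    """Split sample indices into a server-side validation set and the remaining training pool."""
    idx = list(range(num_samples))
    rng.shuffle(idx)
    n_val = round(num_samples * ratio)
    return idx[:n_val], idx[n_val:]


def dirichlet_partition(indices, labels, num_clients, alpha, rng):
    """Assign each training index to one client, class by class, with Dirichlet proportions."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        props = dirichlet_sample(alpha, num_clients, rng)
        start, cum = 0, 0.0
        for c, p in enumerate(props):
            cum += p
            # The last client takes the remainder so that no sample is dropped.
            end = len(idxs) if c == num_clients - 1 else round(cum * len(idxs))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


rng = random.Random(0)
labels = [i % 10 for i in range(1000)]  # toy labels standing in for a 10-class dataset
val_idx, train_idx = holdout_validation(len(labels), 0.04, rng)
clients = dirichlet_partition(train_idx, labels, num_clients=40, alpha=0.1, rng=rng)
print(len(val_idx), sorted(len(c) for c in clients)[-5:])
```

With a small $\alpha$ such as 0.1, most of each class ends up on a few clients, which is what makes the split strongly non-IID; the held-out validation indices never appear in any client split.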

5.1.3. Test-bed Settings

Besides simulation-based evaluation, we constructed a physical test-bed platform, as shown in Figure 4, to check the performance of DR-FL in a real-world environment. The test-bed consists of four parts: i) the cloud server, which is built on top of an Ubuntu workstation equipped with an Intel i9 CPU, 32 GB of memory, and an RTX 3090 GPU; ii) the Jetson Nano boards, each of which has a quad-core ARM A57 CPU, a 128-core NVIDIA Maxwell GPU, and 4 GB of LPDDR4 RAM; iii) the Jetson AGX Xavier boards, each of which is equipped with an 8-core CPU and a 512-core Volta GPU; and iv) an HP 9800 power meter (see the top-left part of Figure 4(a)) produced by Shenzhen HOPI Electronic Technology Ltd. Note that, throughout the federated training process, we used the power meter to record the energy consumption of all the AIoT devices every second for the construction of the MARL environment.

5.2. Accuracy Comparison (RQ1)

To evaluate the effectiveness of our proposed DR-FL, Table 1 presents the best test accuracy of HeteroFL, ScaleFL, and DR-FL under the specific energy constraints over the FL process on the four datasets, assuming that all device batteries are initially full. For each combination of dataset and FL method, we considered three kinds of data distributions for the local AIoT devices, where the non-IID settings follow Dirichlet distributions controlled by $\alpha$. Note that the baseline approaches (HeteroFL and ScaleFL) do not consider energy constraints in their FL procedures. To make a fair comparison, we added a greedy algorithm for energy awareness to the two baselines in this experiment, in which model selection picks the largest model that the device can still train. The experiments were repeated five times to calculate the mean and variance.
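
The greedy energy-aware selection added to the baselines, together with the aggregation over repeated runs, can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the experiments: the per-round energy costs in `ENERGY_COST` are made-up placeholders, the "largest" model is assumed to be the one with the highest cost, and `summarize` reports the sample standard deviation, which may differ from the statistic behind the $\pm$ values in Table 1.

```python
import statistics

# Hypothetical per-round training cost in joules for each layer-wise model.
ENERGY_COST = {"Model_1": 40.0, "Model_2": 70.0, "Model_3": 110.0, "Model_4": 160.0}


def greedy_select(remaining_energy, cost=ENERGY_COST):
    """Return the largest model the device can still afford, or None if none fits."""
    affordable = [m for m, c in cost.items() if c <= remaining_energy]
    if not affordable:
        return None
    return max(affordable, key=lambda m: cost[m])


def summarize(runs):
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(runs), statistics.stdev(runs)


print(greedy_select(7560.0))  # a full battery affords the largest model
print(greedy_select(100.0))   # a nearly empty battery falls back to a smaller model
print(summarize([1.0, 2.0, 3.0, 4.0, 5.0]))  # toy numbers, not results from the paper
```

A greedy rule of this kind spends energy on the largest model as early as possible, which is the behaviour DR-FL's learned selection is compared against.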

Table 1 shows that, under the restricted battery energy budget set for each device, DR-FL achieves the best inference performance in 38 of the 48 evaluated scenarios. Specifically, for $\alpha=0.1$, DR-FL outperforms both baselines on every dataset and every model. Moreover, some DR-FL models at $\alpha=0.1$ even exceed the accuracy that the two baselines reach at $\alpha=0.5$. For example, in the non-IID scenario of SVHN with $\alpha=0.1$, DR-FL reaches a test accuracy of 91.47% on Model_3, while HeteroFL attains only 65.93% and ScaleFL only 76.73%. This is because our MARL-based dual-selection method makes efficient use of the available device energy by assigning each participating device the layer-wise model that suits it best in heterogeneous federated learning.

5.3. Comparison of Energy Consumption (RQ2)

To validate the performance of our DR-FL method in terms of energy consumption and running time, we conducted an experiment involving a total of 40 devices (i.e., 20 Jetson Nano boards and 20 Jetson AGX Xavier boards). Figure 5 compares the variation of the total remaining energy and the running time during federated training using HeteroFL and DR-FL, respectively (ScaleFL has the same energy consumption and running time as HeteroFL under the greedy algorithm). For each subfigure, we use the notation $X\_Y$ to represent the total result of all the devices of type $Y$ using method $X$. If $Y$ is omitted, $X$ denotes the total result over all the devices. For example, in Figure 5(a), the legend DR-FL denotes the overall remaining energy of all 40 devices, while DR-FL_Nano represents the overall remaining energy of the 20 Jetson Nano boards.

Figure 5(a) shows that our method can complete more training rounds under the same energy constraints, leading to better overall test accuracy and energy efficiency. For example, with HeteroFL, the Jetson AGX Xavier-based devices ran out of battery in the 12th round, whereas with DR-FL they ran out of battery in the 18th round. Moreover, Figure 5(b) shows a clear inflexion point in the 12th round for HeteroFL, after which only Jetson Nano-based devices are involved in federated training. For DR-FL, the inflexion point appears in the 15th round, indicating that the MARL algorithm effectively limits wasted device energy by reducing idle waiting and unnecessary training time.

5.4. Scalability Analysis (RQ3)

Figure 6 compares the test accuracy of the three methods (i.e., HeteroFL, ScaleFL, and DR-FL) for various non-IID scenarios with different numbers of devices under specific energy constraints. From this figure, we can observe that as more heterogeneous devices participate in FL, the advantage of DR-FL over the other two methods becomes more pronounced. For example, for the non-IID scenario of CIFAR10 (with $\alpha=0.1$), DR-FL consistently achieves higher test accuracy than ScaleFL and HeteroFL.

5.5. Ablation Study (RQ4)

To explore the role of the validation set proportion in our method, we varied the proportion of validation data from 1% to 10% and used the non-IID CIFAR10 setting ($\alpha=0.1$) as the exploration scenario. Table 2 shows that the average test accuracy first rises as the validation proportion increases, peaks at 4%, and generally declines once the proportion exceeds 4%. The validation proportion can therefore serve as an effective tuning knob for exploring the trade-off between the amount of cloud validation data and the overall DR-FL performance. Since a validation data ratio of 4% provided the best balance, we used 4% in all experiments.

Table 2. Average model accuracy with different percentages of the validation dataset

Percentage	1%	2%	3%	4%	5%	6%	7%	8%	9%	10%
Accuracy (%)	57.72	63.23	64.35	65.04	63.16	59.18	58.86	52.21	54.9975	55.69
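
The choice of the validation ratio can be read off Table 2 directly, as the short sketch below illustrates. The accuracy values are copied from Table 2; the variable names are our own.

```python
# Average accuracy (%) from Table 2, keyed by validation percentage.
table2 = {
    1: 57.72, 2: 63.23, 3: 64.35, 4: 65.04, 5: 63.16,
    6: 59.18, 7: 58.86, 8: 52.21, 9: 54.9975, 10: 55.69,
}

# Pick the validation percentage with the highest average accuracy.
best_ratio = max(table2, key=table2.get)

# Check that accuracy rises monotonically up to the selected percentage.
rises_to_peak = all(table2[p] < table2[p + 1] for p in range(1, best_ratio))

print(best_ratio, table2[best_ratio], rises_to_peak)
```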

6. Conclusion

Federated Learning (FL) is expected to enable privacy-preserving collaborative learning among Artificial Intelligence of Things (AIoT) devices. However, due to various heterogeneous settings (e.g., non-IID data and device models with different architectures) and device resource constraints (e.g., computing power and energy capacity), existing FL-based AIoT designs suffer greatly from low inference accuracy, rapid battery consumption, and long training time. To address these issues, this paper introduces DR-FL, a novel FL framework that enables efficient knowledge sharing between heterogeneous devices under specific energy constraints. Based on our proposed layer-wise aggregation method and MARL-based dual-selection mechanism, AIoT devices with different computational and energy capabilities can adaptively select appropriate local models to participate in global model training, and devices can effectively learn from each other through the parts shared by different layer-wise models. Comprehensive experiments on well-known datasets demonstrate the effectiveness of DR-FL in terms of inference performance, energy consumption, and scalability.

References

Baghersalimi et al. (2021) Saleh Baghersalimi, Tomás Teijeiro, David Atienza Alonso, and Amir Aminifar. 2021. Personalized Real-Time Federated Learning for Epileptic Seizure Detection. IEEE Journal of Biomedical and Health Informatics 26 (2021), 898–909. https://api.semanticscholar.org/CorpusID:235786959
Bhardwaj et al. (2020) Kartikeya Bhardwaj, Wei Chen, and Radu Marculescu. 2020. INVITED: New Directions in Distributed Deep Learning: Bringing the Network at Forefront of IoT Design. Proceedings of 57th ACM/IEEE Design Automation Conference (DAC) (2020), 1–6. https://api.semanticscholar.org/CorpusID:221293302
Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of Conference on Empirical Methods in Natural Language Processing.
Cui et al. (2022) Yangguang Cui, Kun Cao, Junlong Zhou, and Tongquan Wei. 2022. HELCFL: High-Efficiency and Low-Cost Federated Learning in Heterogeneous Mobile-Edge Computing. 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE) (2022), 1227–1232. https://api.semanticscholar.org/CorpusID:248922002
Diao et al. (2021) Enmao Diao, Jie Ding, and Vahid Tarokh. 2021. HeteroFL: Computation and communication efficient federated learning for heterogeneous clients. In Proceedings of International Conference on Learning Representations (ICLR).
Hamdi et al. (2022) Rami Hamdi, Mingzhe Chen, Ahmed Ben Said, Marwa Qaraqe, and H. Vincent Poor. 2022. Federated Learning Over Energy Harvesting Wireless Networks. IEEE Internet of Things Journal 9, 1 (2022), 92–103.
He et al. (2015) Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
Ilhan et al. (2023) Fatih Ilhan, Gong Su, and Ling Liu. 2023. ScaleFL: Resource-Adaptive Federated Learning with Heterogeneous Clients. In Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Khan et al. (2020) Latif Ullah Khan, Walid Saad, Zhu Han, Ekram Hossain, and Choong Seon Hong. 2020. Federated Learning for Internet of Things: Recent Advances, Taxonomy, and Open Challenges. IEEE Communications Surveys & Tutorials 23 (2020), 1759–1799. https://api.semanticscholar.org/CorpusID:221970627
Krizhevsky (2009) Alex Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images. https://api.semanticscholar.org/CorpusID:18268744
Li et al. (2020) Liang Li, Dian Shi, Ronghui Hou, Hui Li, Miao Pan, and Zhu Han. 2020. To Talk or to Work: Flexible Communication Compression for Energy Efficient Federated Learning over Heterogeneous Mobile Edge Devices. IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, 1–10. https://api.semanticscholar.org/CorpusID:229349304
Li et al. (2019) Li Li, Haoyi Xiong, Zhishan Guo, Jun Wang, and Chengzhong Xu. 2019. SmartPC: Hierarchical Pace Control in Real-Time Federated Learning System. 2019 IEEE Real-Time Systems Symposium (RTSS) (2019), 406–418. https://api.semanticscholar.org/CorpusID:203582658
McMahan et al. (2016a) H. B. McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2016a. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of International Conference on Artificial Intelligence and Statistics. https://api.semanticscholar.org/CorpusID:14955348
McMahan et al. (2016b) H. B. McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2016b. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of International Conference on Artificial Intelligence and Statistics.
Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, A. Bissacco, Bo Wu, and A. Ng. 2011. Reading Digits in Natural Images with Unsupervised Feature Learning. https://api.semanticscholar.org/CorpusID:16852518
Nguyen et al. (2021) Dinh C. Nguyen, Ming Ding, Pubudu N. Pathirana, Aruna Prasad Seneviratne, Jun Li, and H. Vincent Poor. 2021. Federated Learning for Internet of Things: A Comprehensive Survey. IEEE Communications Surveys & Tutorials 23 (2021), 1622–1658. https://api.semanticscholar.org/CorpusID:233289549
Rashid et al. (2018) Tabish Rashid, Mikayel Samvelyan, C. S. D. Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. 2018. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. ArXiv abs/1803.11485 (2018). https://api.semanticscholar.org/CorpusID:4533648
Samarjit and Faruque (2016) Samarjit and Al Faruque. 2016. Automotive Cyber-Physical Systems: A Tutorial Introduction. https://api.semanticscholar.org/CorpusID:247235211
Shi et al. (2021) Dian Shi, Liang Li, Rui Chen, Pavana Prakash, Miao Pan, and Yuguang Fan. 2021. Toward Energy-Efficient Federated Learning Over 5G+ Mobile Devices. IEEE Wireless Communications 29 (2021), 44–51. https://api.semanticscholar.org/CorpusID:231592874
Sun et al. (2019) Yuxuan Sun, Sheng Zhou, and Deniz Gündüz. 2019. Energy-Aware Analog Aggregation for Federated Learning with Redundant Data. In ICC 2020 - 2020 IEEE International Conference on Communications (ICC). 1–7. https://api.semanticscholar.org/CorpusID:207869996
Teerapittayanon et al. (2016) Surat Teerapittayanon, Bradley McDanel, and H. T. Kung. 2016. BranchyNet: Fast inference via early exiting from deep neural networks. In Proceedings of 23rd International Conference on Pattern Recognition (ICPR). 2464–2469.
Verbraeken et al. (2019) Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S. Rellermeyer. 2019. A Survey on Distributed Machine Learning. ACM Computing Surveys (CSUR) 53 (2019), 1–33. https://api.semanticscholar.org/CorpusID:209439571
Wu et al. (2022) Yawen Wu, Dewen Zeng, Zhepeng Wang, Yi Sheng, Lei Yang, Alaina J. James, Yiyu Shi, and Jingtong Hu. 2022. Federated Contrastive Learning for Dermatological Disease Diagnosis via On-device Learning. ArXiv abs/2202.07470 (2022). https://api.semanticscholar.org/CorpusID:245446614
Xia et al. (2022) Jun Xia, Tian Liu, Zhiwei Ling, Ting Wang, Xin Fu, and Mingsong Chen. 2022. PervasiveFL: Pervasive Federated Learning for Heterogeneous IoT Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 11 (2022), 4100–4111.
Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. ArXiv:1708.07747 (2017).
Yang et al. (2019) Zhaohui Yang, Mingzhe Chen, Walid Saad, Choong Seon Hong, and Mohammad R. Shikh-Bahaei. 2019. Energy Efficient Federated Learning Over Wireless Communication Networks. IEEE Transactions on Wireless Communications 20 (2019), 1935–1949. https://api.semanticscholar.org/CorpusID:207880723
Yun et al. (2023) Won Joon Yun, Yunseok Kwak, Hankyul Baek, Soyi Jung, Mingyue Ji, Mehdi Bennis, Jihong Park, and Joongheon Kim. 2023. SlimFL: Federated Learning With Superposition Coding Over Slimmable Neural Networks. IEEE/ACM Transactions on Networking (TON) 31, 6 (2023), 2499–2514.
Zhang and Tao (2020) Jing Zhang and Dacheng Tao. 2020. Empowering Things With Intelligence: A Survey of the Progress, Challenges, and Opportunities in Artificial Intelligence of Things. IEEE Internet of Things Journal 8 (2020), 7789–7817. https://api.semanticscholar.org/CorpusID:226975900
Zhang et al. (2021a) Linfeng Zhang, Chenglong Bao, and Kaisheng Ma. 2021a. Self-Distillation: Towards Efficient and Compact Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 8 (2021), 4388–4403. https://api.semanticscholar.org/CorpusID:232302458
Zhang et al. (2021b) Xinqian Zhang, Ming Hu, Jun Xia, Tongquan Wei, Mingsong Chen, and Shiyan Hu. 2021b. Efficient Federated Learning for Cloud-Based AIoT Applications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 40, 11 (2021), 2211–2223. https://doi.org/10.1109/TCAD.2020.3046665
Zhu et al. (2021) Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. 2021. Data-Free Knowledge Distillation for Heterogeneous Federated Learning. Proceedings of machine learning research 139 (2021), 12878–12889. https://api.semanticscholar.org/CorpusID:235125689