FedPEAT: Convergence of 6G Enabled Federated Learning, Parameter-Efficient Fine Tuning, and Emulator Assisted Tuning for AI Foundation Models

Terence Jie Chua Nanyang Technological University, Graduate College, Singapore, 637335, Singapore These authors contributed equally to this work. Wenhan Yu Nanyang Technological University, Graduate College, Singapore, 637335, Singapore These authors contributed equally to this work. Yang Li Nanyang Technological University, Graduate College, Singapore, 637335, Singapore Jun Zhao Nanyang Technological University, School of Computer Science and Engineering, Singapore, 639798, Singapore [email protected]

Abstract

The advent of foundation models like GPT-3 and BERT has revolutionized artificial intelligence, providing unparalleled capabilities across various applications and the potential to transform industries from healthcare to entertainment. Deploying and fine-tuning these models pose unique challenges, making it imperative to address issues like model ownership, collaborative training, and computation and communication limitations for realizing their full potential. We generalize the offsite tuning approach to Emulator-Assisted Tuning (EAT) and combine it with Parameter-Efficient Fine-Tuning (PEFT) to create Parameter-Efficient Emulator-Assisted Tuning (PEAT), expanding its use into 6G-enabled Federated Learning (FL) as Federated Parameter-Efficient Emulator-Assisted Tuning (FedPEAT). The FedPEAT framework proposes a solution using adapters, emulators, and PEFT techniques for federated model fine-tuning. This approach enhances model privacy and streamlines downstream fine-tuning. Our approach, adaptable to diverse neural network architectures, incorporates an adaptive control mechanism utilizing the novel Single-Agent Action Branching Proximal Policy Optimization (SABPPO) algorithm. The proposed SABPPO is tailored for high-dimensional action spaces, featuring short training delays, essential for scalable FedPEAT involving a large number of users and variables to optimize. Our experimental results demonstrate the practicality and efficacy of our proposed framework and algorithm in addressing the complex challenges associated with large foundation model fine-tuning.

In the vibrant landscape of artificial intelligence (AI), colossal foundation models like GPT-3 [1], CLIP [2] and BERT [3] have revolutionized AI, venturing beyond traditional machine learning approaches. These models, trained on massive datasets, possess the uncanny ability to generate images, texts, and audio with unparalleled accuracy. With sizes reaching billions of parameters, they capture intricate linguistic nuances and showcase human-level proficiency across diverse applications. Large foundation models have garnered attention for their capacity to adapt to new tasks and domains through a transfer learning approach called fine-tuning [4, 5]. Leveraging these models saves time and resources compared to training models from the ground up, especially for large models like GPT-3 with 175B+ parameters. The dawn of 6G technologies, boasting broad bandwidths (1 THz to 3 THz) and unprecedented data communication speeds [6], potentially reaching terabits per second, opens avenues for federated fine-tuning of these expansive models.

Figure 1: Intersection of Federated Learning (FL), Parameter-Efficient Fine-Tuning (PEFT), and Emulator-Assisted Tuning (EAT). Here we illustrate the intersection of FL, PEFT, and EAT. The main contribution of our current paper is to introduce Federated Parameter-Efficient Emulator-Assisted Tuning (FedPEAT), as a convergence of EAT, PEFT, and FL, while EAT and Parameter-Efficient Emulator-Assisted Tuning (PEAT) are also terms coined by our paper.

One of the significant challenges associated with fine-tuning large language models lies in the distribution of data. Many real-world applications necessitate the utilization of data that resides on user devices, such as smartphones, laptops, and mobile edge devices, rather than on centralized servers. The need for decentralization hinders foundation model fine-tuning. Federated learning (FL) has emerged as a promising solution to this issue. Federated learning is a decentralized machine learning approach that enables privacy-preserving model training without the need to centralize data [7, 8, 9]. Instead of sending raw data to a central server, federated learning trains models directly on the user’s device. These models are then aggregated to create a global model, preserving data privacy while achieving the desired performance.

However, fine-tuning large language models is computationally intensive. Training a model with hundreds of millions or billions of parameters demands substantial computational resources, often beyond the reach of individual users or small organizations. This computational bottleneck can limit the widespread adoption of these models and impede their deployment in resource-constrained environments. Moreover, fine-tuning on local devices, such as smartphones or edge devices, is often not feasible due to their limited computational capabilities. Distributing the model fine-tuning process across devices while ensuring data privacy and model performance adds another layer of complexity. In response to these challenges, various methods have been explored to make fine-tuning of pre-trained models more efficient. Efforts in model tuning have extended to the realm of adapters [10, 11, 12], which encode task-specific representations within intermediate layers while preserving pre-training knowledge. Different Parameter-Efficient Fine Tuning (PEFT) techniques have been proposed, encompassing approaches such as Low Rank Adapters (LoRA) [13], prompt tuning [14, 15], prefix-tuning [16], adapters [12], P-tuning V2 [17], tuning embedding layer inputs [18], tuning hidden states [19], and more. These methods aim to update or add only a limited number of model parameters, reducing resource requirements and allowing for the sharing of parameters from the pre-trained model. Several authors of the works [20, 21, 22, 23, 24] noticed the prowess of PEFT techniques and proposed Federated-PEFT approaches.

However, large language models are often owned by research institutions or companies that bear the responsibility of maintaining and updating them. These model owners typically cannot directly share the entire model with external devices due to various reasons, including privacy concerns, intellectual property rights, and the potential for misuse. The lack of easy sharing mechanisms hampers the democratization of large language models and their use in applications that require continuous updates and fine-tuning. As a result, there is a need to develop mechanisms that allow model owners to collaborate with external parties or distribute portions of models securely. Federated fine-tuning for downstream tasks of local devices often necessitates knowledge of the entire model’s weights, potentially raising privacy concerns. Furthermore, the process of fine-tuning and deploying foundation models can pose significant resource challenges due to their substantial parameter sizes [25, 26]. Xiao et al. [27] proposed an approach to fine-tune large foundation models using the combination of an emulator, which is a compressed version of a subset of the original large foundation model, and an adapter, which comprises the trainable weights to be shared. Nevertheless, these authors do not consider federated and collaborative tuning between devices. Ding et al. [28] introduced an approach that involves model compression and an emulator-adapter-like strategy for collaborative tuning of large vision models in a device-server setting. Kuang et al. [29] proposed FedOT, which is a federated version of offsite tuning. Although Kuang et al. [29] briefly touch upon an architecture similar to our proposed Federated Parameter-Efficient Emulator-Assisted Tuning (FedPEAT), they do not provide detailed discussions.

Overview of FedPEAT with adaptive control

Proposed EAT structure

In addressing the pressing issues of model and data privacy and ownership, as well as the imperative need for memory and computation-efficient downstream model fine-tuning, we propose a novel Emulator-Assisted Tuning (EAT) structure which generalizes the offsite tuning approach introduced by Xiao et al. [27] to encompass all possible combinations of adapter and emulator configurations for large foundation model fine-tuning. Our proposed EAT structure offers the flexibility to adapt the adapter and emulator to the specific requirements of a given application. The adapter and emulator can take any form, whether encompassing layers within a transformer architecture, multi-layer perceptron, or any other neural network structure. The emulator can have a variable number of neural network layers, a variable number of nodes per layer, and even variable arrangements of transformer attention units. This adaptability ensures that the model can be fine-tuned efficiently across a wide spectrum of tasks, from simple to complex.

Expansion to PEAT architecture

In the field of model tuning, various Parameter-Efficient Fine Tuning (PEFT) methods such as Low-rank Adapters (LoRA) [13], prompt tuning [14], and adapters [10, 11, 12] have been explored to make fine-tuning of pre-trained models more efficient. We combine EAT and Parameter-Efficient Fine-Tuning (PEFT) to present Parameter-Efficient Emulator-Assisted Tuning (PEAT).

FedPEAT framework

We extend the use of PEAT into the domain of Federated Learning (FL) and introduce a novel framework, Federated Parameter-Efficient Emulator-Assisted Tuning (FedPEAT). This integration addresses model and data privacy concerns, because the model owner need not transmit the entire model to the client and the client need not send its local data to the model owner. It also substantially improves the memory and computational efficiency of collaborative, downstream federated model fine-tuning. We illustrate the intersection of our proposed EAT, PEAT, and FedPEAT in Figure 1.

FedPEAT adaptive control mechanism

To optimize and streamline this adaptive combination of adapters and emulators, we propose coupling them with an adaptive control mechanism. This mechanism employs a deep reinforcement learning orchestrator to control critical hyper-parameters, such as emulator model compression ratios, adapter parameter-efficient fine-tuning parameters, and even device selection for participation in collaborative federated learning during each iteration (shown in Figure 2). This integration facilitates the efficient orchestration of resources, ensuring that the fine-tuning process remains memory, computation, and communication-efficient. This orchestration ensures that participating devices possess the necessary computational resources to carry out fine-tuning effectively. This contribution is essential in guaranteeing the successful application of our model adaptation and fine-tuning technique in real-world, resource-constrained environments.

Server-Device collaborative tuning

The FedPEAT framework applies to collaborative FL settings that differ in where the data resides. We note two distinct cases. The first case involves FL in which all data resides on mobile edge devices (i.e., clients) and the central server holds no data. In this scenario, model tuning is performed entirely on the clients, while the server’s role is restricted to aggregating adapter module parameters. The second case entails FL in which data is distributed across both client devices and a central server. Fine-tuning occurs on both client devices and the server, presenting a more complex but realistic setting that highlights the adaptability and versatility of our proposed framework. In our experiments, we consider the special case of the FedPEAT framework in which the server possesses data and partakes in the collaborative federated foundation model fine-tuning process instead of acting purely as an aggregator. Through these experiments, we aim to demonstrate the practical applicability and efficacy of our approach.

Parameter-Efficient Emulator-Assisted Tuning (PEAT) Sub-units

Emulator

The emulator represents a collection of neural network weights meticulously designed to mimic the behavior of the original foundation model. Through the compression of extensive neural network knowledge into a more compact architecture, emulators aim to deliver performance that closely rivals their larger counterparts while dramatically reducing computational and storage requirements. The decision to share emulators with client devices, rather than the original foundation model, serves a dual purpose: firstly, it safeguards the proprietary nature of model ownership by obviating the need to divulge the complete model to local devices; secondly, it empowers local devices to store and undertake model fine-tuning using a significantly smaller-sized emulator. In essence, an emulator serves as a streamlined and resource-efficient rendition of a more extensive model, crafted through techniques such as pruning [30], layer drop [31], or knowledge distillation [32]. Importantly, our approach employs emulators with fixed-parameter values, without fine-tuning, to encapsulate the bulk of knowledge and information derived from pre-trained foundation models.
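As a rough illustration of how layer drop could produce a smaller emulator, the sketch below keeps a uniformly spaced subset of frozen layers. The helper name `layer_drop_emulator` and the representation of layers as list items are assumptions made for this example; the paper does not prescribe a specific selection rule.

```python
def layer_drop_emulator(frozen_layers, keep):
    """Return `keep` uniformly spaced layers from `frozen_layers`.

    Hypothetical helper: the first and last layers are always retained
    when keep >= 2, and the selected layers stay frozen (never tuned).
    """
    total = len(frozen_layers)
    if not 1 <= keep <= total:
        raise ValueError("keep must be between 1 and the number of layers")
    if keep == 1:
        return [frozen_layers[0]]
    # Integer arithmetic gives evenly spaced, strictly increasing indices.
    indices = [(i * (total - 1)) // (keep - 1) for i in range(keep)]
    return [frozen_layers[i] for i in indices]


layers = [f"block_{i}" for i in range(12)]
emulator = layer_drop_emulator(layers, keep=4)
print(emulator)
```

A smaller `keep` yields a more aggressively compressed emulator, which is the kind of knob an orchestrator could adjust per device.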

Adapter

Adapters are modular additions to pre-existing foundation models such as large language models (LLMs), designed to facilitate task-specific adaptations with minimal modifications to the original model [27]. Essentially, an adapter is a smaller set of neural network weights with tunable parameters that encodes information at the user device for downstream task fine-tuning. The smaller adapter size serves two main purposes. Firstly, the adapter is designed to be a plug which can be conveniently placed at the end of the original foundation model at the server and also at the end of the emulator on the local devices. Secondly, the smaller adapter size reduces adapter transmission costs. By only tuning the parameters of these added layers, one can harness the generalized capabilities of large models while efficiently tailoring them for specific tasks.

PEFT integration

PEFT methods like LoRA [13] and Adapter [12] can significantly reduce the number of trainable parameters, and consequently the memory footprint, while achieving model performance comparable to that of a model which does not use PEFT approaches [27]. The integration of PEFT methods is seamless: they can be applied directly to the adapter module in each federated learning iteration.
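To make the parameter saving concrete, the following is a minimal sketch of a LoRA-style linear layer in plain Python. It assumes the standard LoRA formulation, in which a frozen weight matrix $W$ is augmented by a trainable low-rank product scaled by `alpha / rank`; the class name `LoRALinear` and all hyper-parameter values are illustrative and not taken from the paper.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight                      # frozen, never updated
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the update is
        # initially zero and the layer reproduces the frozen model.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


dim, rank = 8, 2
identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
layer = LoRALinear(identity, rank=rank)
x = [float(i) for i in range(dim)]
print(layer.trainable_parameters(), "trainable vs", dim * dim, "frozen")
```

In this toy setting only 32 values would be tuned and transmitted instead of 64; the gap widens quickly as the layer dimension grows relative to the rank.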

Federated Parameter-Efficient Emulator-Assisted Tuning (FedPEAT) Framework

Emulators and Adapters

The server houses a foundation model $M_{\theta_{g}}$, while each user device (UE), labeled by index $n$, receives and holds:

$$\begin{cases}\text{An adapter $A_{\phi_{n}}$ specifically tuned for the downstream task.}\\ \text{An emulator $E_{\theta_{n}}$, which is a tailored version of the foundation model, represented by $E_{\theta_{n}}=f(M_{\theta_{g}}-A_{\phi_{n}})$.}\end{cases}$$

The adapter, denoted as $A$, aligns with the definition put forth by [27], comprising sets of layers embedded within the foundation model’s architecture. These layers feature tunable parameters, specifically designed to facilitate model fine-tuning by encoding new information from downstream tasks. In contrast, the emulators, denoted by $E$, encapsulate a version of the original foundation model that may have undergone modifications. The adaptation of the emulator occurs after the removal of the adapter layers and serves as a guiding framework for tuning the adapter parameters. The parameter values of the emulators are fixed, and the emulators aim to emulate the large foundation model. The transformation function $f()$, in this context, refers to model compression algorithms such as layer dropping [33] and model pruning [30]. Let $\omega$ represent the weights that are collaboratively trainable on both the server and device. $\omega^{\prime}_{s}$ refers to the untrainable weights specific to the server, excluding $\omega$. $\omega^{\prime}_{c}$ denotes the untrainable weights specific to a device, distinct from $\omega$.

Given this, we can generalize emulator-assisted tuning to three cases:

• Case 1: $\omega^{\prime}_{s}\neq\omega^{\prime}_{c}\neq\varnothing$. This is our proposed, more generalized framework, in which we permit various user devices (UEs) to employ distinct emulators, denoted as $E_{\theta_{n}}$. These emulators correspond to the untrainable weights on a device, $\omega^{\prime}_{c}$, which are maintained at fixed values. Similarly, the subset of the model with untrainable parameters $M^{\prime}_{\theta}$ corresponds to $\omega^{\prime}_{s}$. This flexibility is particularly important since UEs frequently operate with constrained storage and computational resources. Therefore, emphasizing the efficient decompression and adaptiveness of the foundation model becomes essential.

• Case 2: $\omega^{\prime}_{s}=\omega^{\prime}_{c}\neq\varnothing$. This scenario is a subset of Case 1. Here, the emulator designated for UE $n$ aligns with the static parameters of the overarching foundation model (i.e., $\omega^{\prime}_{s}=\omega^{\prime}_{c}$). In this setup, we synergize Federated Learning (FL) training with Parameter-Efficient Fine-Tuning (PEFT) techniques, reflecting strategies showcased in previous research such as [34, 35, 21, 23].

• Case 3: $\omega^{\prime}_{s}=\omega^{\prime}_{c}=\varnothing$. This is another specific instance within the purview of Case 1. In this scenario, all participants utilize the adjustable parameters of the global foundation model $\omega$. Essentially, no weights remain untrainable beyond the collaboratively trainable ones (i.e., $\omega^{\prime}_{s}=\omega^{\prime}_{c}=\varnothing$). This methodology closely mirrors the conventional federated learning (FL) paradigm, where individual model parameters are amalgamated to shape the global model.

The details of the cases are further illustrated in Figure 3.

Tuning Process

The server model, denoted as $M_{\theta_{g}}$, can be decomposed into two primary components: the untrainable subset of weights of the foundation model $M^{\prime}_{\theta}$, and the adapter $A_{\phi}$. After such decomposition, the server model is expressed as $M^{\prime}_{\theta}\circ A_{\phi}$, with the symbol “$\circ$” signifying the neural network connections between $M^{\prime}_{\theta}$ and $A_{\phi}$. It is important to note that the arrangement of layers $M^{\prime}_{\theta}$ and $A_{\phi}$ is flexible and can be configured in various orders. Emulator-assisted tuning (EAT) generalizes all emulator-adapter-based configurations: it includes those proposed by [27] and extends to cases beyond them, such as the “vertical” splitting of the foundation model [28]. Furthermore, the term “offsite” in [27] covers only single-device tuning and not a multi-device collaborative training scenario. Our proposed EAT approach generalizes the emulator and adapter approach to collaborative tuning between multiple devices. Moreover, Xiao et al. [27] do not consider a collaborative fine-tuning scenario in which datasets are stored at the server and the server partakes in collaborative fine-tuning, as proposed by [28]. The emulator-to-be, represented as $M^{\prime}_{\theta}$, can be customized to create an emulator $E_{\theta_{n}}$ specific to UE $n$, taking into account UE $n$’s device hardware configuration and the conditions of its environment. Subsequently, this tailored emulator, $E_{\theta_{n}}$, is distributed to each respective UE. In our work, we extend our proposed PEAT approach to a collaborative, federated model fine-tuning context and establish the Federated Parameter-Efficient Emulator-Assisted Tuning (FedPEAT) framework.

We denote $\mathcal{N}=\{1,2,\ldots,N\}$ and $\mathcal{T}=\{1,2,\ldots,T\}$ as the sets of UEs and training iterations, respectively. At the start of the first iteration, each adapter $A^{0}_{\phi_{n}}$ with randomly initialized parameter values is disseminated to UE $n$, so that the complete user device model for UE $n$ is $E^{0}_{\theta_{n}}\circ A^{0}_{\phi_{n}}$. Then, the server orchestrator determines the user selection $\{U_{n}^{t}|\forall n\in\mathcal{N}\}$ for participating in the current update, where $U_{n}^{t}=0$ signifies non-participation and $U_{n}^{t}=1$ indicates participation. Each selected UE then carries out emulator-assisted fine-tuning with its local dataset $D^{t}_{n}$ and updates the parameter values of its adapter to produce $A^{1}_{\phi_{n}}$ with the assistance of the emulator. Each UE $n$ then uploads its adapter parameters to the server for adapter parameter aggregation as follows:

$$A^{t+1}_{\phi_{g}}=\frac{1}{\sum\limits_{n\in\mathcal{N}:U_{n}^{t}=1}|D^{t}_{n}|}\cdot\sum^{N}_{n=1}\left(|D^{t}_{n}|\cdot A^{t}_{\phi_{n}}\right), \tag{1}$$

where $|D^{t}_{n}|$ is the size of the data being trained on at UE $n$. The server then disseminates this global adapter $A^{t+1}_{\phi_{g}}$. The tuning process above proceeds for further iterations until the model converges or a specified stopping criterion is met. This process is summarized in Algorithm 1.
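The aggregation step can be sketched as a data-size-weighted average of adapter parameters. The snippet below assumes, as our reading of the surrounding text, that only selected UEs ($U_{n}^{t}=1$) contribute to the sum; adapters are represented as flat lists of floats, and the function name `aggregate_adapters` is hypothetical.

```python
def aggregate_adapters(adapters, data_sizes, selected):
    """Average adapter vectors of selected UEs, weighted by local data size.

    Assumption: unselected UEs (selected[n] == 0) contribute nothing.
    """
    total = sum(size for size, use in zip(data_sizes, selected) if use)
    if total == 0:
        raise ValueError("no selected device holds data")
    global_adapter = [0.0] * len(adapters[0])
    for params, size, use in zip(adapters, data_sizes, selected):
        if not use:
            continue
        for i, value in enumerate(params):
            global_adapter[i] += size * value / total
    return global_adapter


adapters = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
data_sizes = [10, 30, 50]
selected = [1, 1, 0]          # the third UE sits out this round
global_adapter = aggregate_adapters(adapters, data_sizes, selected)
print(global_adapter)
```

Because the third UE is not selected, its adapter does not affect the result, and the second UE's adapter carries three times the weight of the first.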

FedPEAT Adaptive Control Mechanism

The FedPEAT framework facilitates the collaborative, federated fine-tuning of models for downstream tasks. However, successful adoption of the framework for downstream task fine-tuning requires adaptive control of the hyper-parameters related to the FedPEAT framework, the PEFT methods, and the FL process. As there are potentially many variables of concern and diverse scenarios, we designed an adaptive control system which is able to handle the control of multiple variables. To illustrate FedPEAT with the adaptive control mechanism, we consider a situation where UEs are moving within a fixed geographic space, so that the channel gain between each user device and the server changes over time.

Algorithm 1 FedPEAT with Adaptive Control

Input: device set $\mathcal{N}=\{1,2,\ldots,N\}$, initial global adapter $A_{\phi_{g}}^{0}$, foundation model $M_{\theta_{g}}$
1: for iteration $t\in\{1,2,\ldots,T\}$ do
2:   Adaptive control to decide user selection, emulator sizes, downlink bandwidth, and transmission power resources for every user: $\{(U_{n}^{t},E_{\theta_{n}}^{t},B_{n}^{t},P_{n}^{t})|\forall n\in\mathcal{N}\}$, based on Section FedPEAT Adaptive Control Mechanism
3:   Transmit global adapter $A_{\phi_{g}}^{t}$, and if the current emulator on user devices needs to be changed, also transmit changed emulators $\{E_{\theta_{n}}^{t}|\forall n\in\mathcal{N}:E_{\theta_{n}}^{t}\neq E_{\theta_{n}}^{t-1}\}$ to devices
4:   for device $n\in\mathcal{N}$ in parallel do
5:     for epoch $\nu=1$ to $V$ do
6:       $A^{t}_{\phi_{n}}[\nu+1]=\text{ModelTuning}(D^{t}_{n},E^{t}_{\theta_{n}},A^{t}_{\phi_{n}}[\nu])$
7:     end for
8:     $A^{t}_{\phi_{n}}\leftarrow A^{t}_{\phi_{n}}[V]$
9:     Transmit local $A^{t}_{\phi_{n}}$ to server.
10:   end for
11:   Perform adapter parameter aggregation with equation (1) to obtain $A^{t+1}_{\phi_{g}}$
12: end for
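The control flow of Algorithm 1 can be illustrated with a deliberately tiny, runnable toy. Everything in it is an assumption made for illustration: the "emulator" is a frozen scalar scaling, the "adapter" is a single trainable scalar, `model_tuning` is one epoch of gradient descent on a squared loss, and the adaptive control step is replaced by fixed choices (all UEs selected, emulators never changed).

```python
def model_tuning(data, emulator_scale, adapter, lr=0.1):
    """One epoch of gradient descent on the adapter; the emulator stays frozen."""
    grad = 0.0
    for x, y in data:
        feature = emulator_scale * x              # frozen emulator output
        grad += 2.0 * (adapter * feature - y) * feature
    return adapter - lr * grad / len(data)


def fedpeat_round(global_adapter, datasets, emulator_scales, selected, epochs=5):
    """One iteration: broadcast, local tuning, data-size-weighted aggregation."""
    weighted_sum, total = 0.0, 0
    for data, scale, use in zip(datasets, emulator_scales, selected):
        if not use:
            continue
        local = global_adapter                    # step 3: receive global adapter
        for _ in range(epochs):                   # steps 5-7: local epochs
            local = model_tuning(data, scale, local)
        weighted_sum += len(data) * local         # step 9: upload local adapter
        total += len(data)
    # Step 11: aggregate; keep the old adapter if nobody participated.
    return weighted_sum / total if total else global_adapter


# Toy data drawn from y = 2 * x, so with emulator scale 1 the ideal adapter is 2.
datasets = [[(1.0, 2.0), (2.0, 4.0)], [(1.0, 2.0), (3.0, 6.0), (0.5, 1.0)]]
emulator_scales = [1.0, 1.0]
adapter = 0.0
for t in range(20):
    adapter = fedpeat_round(adapter, datasets, emulator_scales, selected=[1, 1])
print(round(adapter, 6))
```

In a real deployment the fixed `selected` vector and emulator choices would instead come from the orchestrator at step 2 of each iteration.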

Adaptive Control Scenario

In each iteration, the server orchestrator performs several key tasks. It begins by selecting users ($\{U_{n}^{t}|n\in\mathcal{N}\}$) for participation in the FL process. Next, it determines emulators for each user ($\{E_{\theta_{n}}^{t}|n\in\mathcal{N}\}$), arranges downlink bandwidth resources ($\{B_{n}^{t}|n\in\mathcal{N}\}$), and allocates downlink transmission power ($\{P_{n}^{t}|n\in\mathcal{N}\}$) for effective UE engagement. The Frequency Division Multiple Access (FDMA) communication technique is adopted to mitigate the interference between UEs associated with different edge servers. Similar to the setting in [36], we assume the central server allocates its dedicated bandwidth to the UEs it is associated with. According to Shannon’s formula, the achievable transmission rate between UE $n$ and the central server can be formulated as

$$r_{n}^{t}(B_{n}^{t},P_{n}^{t})=B_{n}^{t}\log_{2}\left(1+\frac{g_{n}^{t}P_{n}^{t}}{B_{n}^{t}\sigma_{0}^{2}}\right), \tag{2}$$

where $r_{n}^{t}(B_{n}^{t},P_{n}^{t})$ means that the transmission rate $r_{n}^{t}$ is a function of $B_{n}^{t}$ and $P_{n}^{t}$. $g_{n}^{t}$ is the channel gain between UE $n$ and the central server, with Rician fading being the small-scale fading [37], and $\sigma_{0}^{2}$ is the power spectral density of additive white Gaussian noise. Note that the total bandwidth the central server can allocate is $B_{\max}$, so we have $\sum_{n\in\mathcal{N}:U_{n}^{t}=1}B_{n}^{t}\leq B_{\max}$. We also optimize the power allocated by the central server for the downlink transmission of emulators and adapters. Note that the total power the central server can allocate is $P_{\max}$, so we have $\sum_{n\in\mathcal{N}:U_{n}^{t}=1}P_{n}^{t}\leq P_{\max}$. As the size of the adapter is small and negligible in the context of emulator-assisted tuning, we assume the adapter is transmitted via a dedicated channel, and ignore the uplink energy and time overhead for adapters. The updated emulator is transmitted only if the current emulator is designated for modification. We introduce an indicator function $\chi[x]$ that equals 1 when event $x$ occurs and 0 otherwise. Then the transmission delay from the server to UE $n$ within one iteration can be given as:

$$d_{n,trans}^{t}(U_{n}^{t},E_{\theta_{n}}^{t},B_{n}^{t},P_{n}^{t})=U_{n}^{t}\times\chi[E_{\theta_{n}}^{t}\neq E_{\theta_{n}}^{t-1}]\times\frac{D(E_{\theta_{n}}^{t})}{r_{n}^{t}}, \tag{3}$$

where $D(E_{\theta_{n}}^{t})$ is the allocated emulator size. Then, the time for one round of local training and model transmission for UE $n$ is $Q_{n}^{t}=d_{n,comp}^{t}+d_{n,trans}^{t}$, where $d_{n,comp}^{t}$ is the model fine-tuning time taken for iteration $t$ of local training at UE $n$, computed empirically.
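Equations (2) and (3) translate directly into two small functions. The sketch below assumes SI-style units (bandwidth in Hz, power in W, noise power spectral density in W/Hz, emulator size in bits); the function names and the numeric values are illustrative only.

```python
import math


def transmission_rate(bandwidth_hz, power_w, channel_gain, noise_psd):
    """Shannon rate in bit/s, following equation (2)."""
    snr = channel_gain * power_w / (bandwidth_hz * noise_psd)
    return bandwidth_hz * math.log2(1.0 + snr)


def transmission_delay(selected, emulator_changed, emulator_bits, rate_bps):
    """Downlink emulator delay in seconds, following equation (3).

    The delay is zero unless the UE is selected AND its emulator changed.
    """
    if not (selected and emulator_changed):
        return 0.0
    return emulator_bits / rate_bps


# Illustrative numbers chosen so that the SNR is 3, i.e. log2(1 + 3) = 2.
rate = transmission_rate(bandwidth_hz=1e6, power_w=1.0,
                         channel_gain=3e-6, noise_psd=1e-12)
delay = transmission_delay(selected=1, emulator_changed=True,
                           emulator_bits=8e6, rate_bps=rate)
print(rate, delay)
```

With these values the rate is about 2 Mbit/s, so an 8 Mbit emulator takes about 4 s to send, while an unchanged emulator incurs no downlink delay at all.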

Therefore, we formulate the problem as follows:

$$\min_{\{(U_{n}^{t},E_{\theta_{n}}^{t},B_{n}^{t},P_{n}^{t})\mid\forall n\in\mathcal{N},\forall t\in\mathcal{T}\}}\Bigg\{\xi_{p}\cdot\frac{1}{N}\sum_{t=1}^{T}\sum_{n=1}^{N}p^{t}_{n}+\xi_{f}\cdot\sum_{t=1}^{T}\max_{n\in\mathcal{N}}Q^{t}_{n}+\xi_{s}\cdot\frac{1}{N}\cdot\sum_{t=1}^{T}\sum_{n=1}^{N}\chi[E_{\theta_{n}}^{t}\neq E_{\theta_{n}}^{t-1}]\Bigg\}, \tag{4}$$

subject to:

$$m^{t}_{n}\leq\frac{1}{q}\cdot m^{t}_{\max,n},\ \forall n\in\mathcal{N}, \tag{4a}$$

$$\sum_{t=2}^{T}\chi[E_{\theta_{n}}^{t}\neq E_{\theta_{n}}^{t-1}]\leq T\cdot\frac{1}{c},\ \forall n\in\mathcal{N},\ \forall t\in\mathcal{T}, \tag{4b}$$

$$\sum_{n\in\mathcal{N}}B_{n}^{t}\leq B_{\max},\ \forall t\in\mathcal{T}, \tag{4c}$$

$$\sum_{n\in\mathcal{N}}P_{n}^{t}\leq P_{\max},\ \forall t\in\mathcal{T}. \tag{4d}$$

In the objective function (4) above, $p^{t}_{n}$ stands for the perplexity score achieved by UE $n$ at iteration $t$; perplexity is a performance measure of how well a language model predicts a set of data. $Q^{t}_{n}$ stands for the log of the total time taken for a single round of adapter and emulator transmission. $\chi[E_{\theta_{n}}^{t}\neq E_{\theta_{n}}^{t-1}]$ represents the emulator exchange count. $\xi_{p}$, $\xi_{f}$, and $\xi_{s}$ stand for the weight balancing parameters for these objectives. $m^{t}_{n}$ represents the memory space taken by the model assigned to UE $n$, while $m^{t}_{\max,n}$ represents the available memory capacity of device $n$ at iteration $t$. $c$ and $q$ are numerical constants. Constraint (4a) ensures that the total memory consumed on any device in each round stays below a predefined fraction of its memory capacity. Constraint (4b) prevents excessive emulator switch counts to reduce transmission costs. Constraints (4c) and (4d) are the limits on the total bandwidth and power resources of the server. Essentially, the objective function (4) aims to minimize three quantities: the sum of perplexity scores across $T$ iterations, which is synonymous with achieving a quicker rate of model tuning convergence; the maximum training time amongst all devices for the federated fine-tuning process; and the emulator exchange count. It does so by optimizing the emulator compression parameter, device selection vector, bandwidth selection vector, and downlink power selection vector. For the sake of simplicity in our demonstration, we use $\frac{1}{N}\sum_{n=1}^{N}p^{t}_{n}$ as an estimate of global $p^{t}$. The rationale behind such a formulation is to expedite model convergence, all while maintaining the constraint of the maximum total transmission and computation delay among UEs. Additionally, this approach ensures that the sizes of both the emulator and adapters remain within a practical fraction of the local devices’ memory capacity.
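For a given trajectory of decisions, the weighted objective and the memory constraint can be evaluated as in the sketch below. The data layout is an assumption made for this example: `perplexity[t][n]` and `round_time[t][n]` hold $p^{t}_{n}$ and $Q^{t}_{n}$, and `emulator_ids` has one extra leading row for the initial emulator assignment so that switches at the first iteration can be counted.

```python
def control_objective(perplexity, round_time, emulator_ids, xi_p, xi_f, xi_s):
    """Weighted sum of perplexity, worst-case round time, and emulator switches."""
    n_devices = len(perplexity[0])
    perplexity_term = sum(sum(row) for row in perplexity) / n_devices
    # The slowest device determines the duration of each round.
    delay_term = sum(max(row) for row in round_time)
    # Count every (iteration, device) pair whose emulator differs from the previous one.
    switches = sum(
        cur != prev
        for prev_row, cur_row in zip(emulator_ids, emulator_ids[1:])
        for prev, cur in zip(prev_row, cur_row)
    )
    return xi_p * perplexity_term + xi_f * delay_term + xi_s * switches / n_devices


def memory_feasible(model_memory, max_memory, q):
    """Constraint (4a): each device uses at most 1/q of its memory capacity."""
    return all(m <= cap / q for m, cap in zip(model_memory, max_memory))


perplexity = [[10, 12], [8, 9]]                  # T = 2 iterations, N = 2 devices
round_time = [[3, 5], [4, 2]]
emulator_ids = [["small", "small"], ["small", "large"], ["small", "large"]]
value = control_objective(perplexity, round_time, emulator_ids,
                          xi_p=1.0, xi_f=0.5, xi_s=2.0)
print(value, memory_feasible([2.0, 3.0], [8.0, 8.0], q=2))
```

Here the perplexity term contributes 19.5, the delay term 4.5, and the single emulator switch 1.0, which shows how the weights $\xi_{p}$, $\xi_{f}$, and $\xi_{s}$ trade the three goals against each other.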

Deep reinforcement learning approach

We devise a deep reinforcement learning approach to drive our adaptive control mechanism, because the proposed problem is highly sequential and is a mixed-integer non-linear programming problem.

State

To effectively execute the FedPEAT approach, we include the following variables within the state: (1) the user device-server channel gain $g_{n}^{t}$, which is required for the computation of $r^{t}_{n}$, (2) the user device available memory capacity $m^{t}_{n}$, and (3) the FedPEAT UE $n$ emulator exchange count $\chi[E_{\theta_{n}}^{t}\neq E_{\theta_{n}}^{t-1}]$, which keeps track of the number of times UE $n$ has undergone emulator exchange. Output information from each successive actor branch is appended to the state to be fed into the next actor branch, as shown in Figure 4. This additional information includes the user selection $U^{t}_{n}$, bandwidth selection $B^{t}_{n}$, and power selection $P^{t}_{n}$ at the current time step.

Action

In this study, the agent action space includes four actions: (1) the UE selection vector $\{U_{n}^{t}|\forall n\in\mathcal{N}\}$, (2) the downlink bandwidth selection vector $\{B_{n}^{t}|\forall n\in\mathcal{N}\}$, (3) the downlink power selection vector $\{P_{n}^{t}|\forall n\in\mathcal{N}\}$, and (4) the choice of emulator compression parameter $\{E_{\theta_{n}}^{t}|\forall n\in\mathcal{N}\}$ for each device, stored in a vector.

Reward

We formulate our reward function as per our objective function, where we assign our reinforcement learning agent the reward as follows in each iteration:

$$R^{t}_{d}=-\xi_{f}\cdot\frac{1}{T}\sum_{t=1}^{T}\max_{n\in\mathcal{N}}Q^{t}_{n},\quad R^{t}_{p}=-\xi_{p}\frac{1}{TN}\sum_{t=1}^{T}\sum_{n=1}^{N}p^{t}_{n},\quad R^{t}_{s}=-\xi_{s}\cdot\frac{1}{TN}\cdot\sum_{t=1}^{T}\sum_{n=1}^{N}\chi[E_{\theta_{n}}^{t}\neq E_{\theta_{n}}^{t-1}]. \tag{5}$$

In addition, we assign the agent a very large penalty $\varkappa$ when (1) the memory size of emulator $E_{\theta_{n}}^{t}$ and adapter $A_{\phi_{n}}^{t}$ exceeds an allowable fraction of the memory capacity of local device $n$ , in accordance with constraint (4a), or (2) the emulator exchange count $\chi[E_{\theta_{n}}^{t}\neq E_{\theta_{n}}^{t-1}]$ exceeds a given fraction of the total number of iterations, in accordance with constraint (4b).
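The reward terms and penalties described above can be sketched as a per-round computation. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `RewardConfig` and `step_reward_terms` are hypothetical, the penalty is applied once per violating device and constraint, and the terms are returned separately because the text does not state how they are combined. Averaging the three per-round terms over $T$ rounds recovers the corresponding terms of equation (5).

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class RewardConfig:
    xi_f: float               # weight on the delay term
    xi_p: float               # weight on the power term
    xi_s: float               # weight on the emulator-exchange term
    penalty: float            # large penalty (expected to be negative)
    mem_fraction: float       # allowable fraction of device memory (assumed input)
    exchange_fraction: float  # allowable fraction of rounds with an exchange (assumed input)


def step_reward_terms(
    delays: Sequence[float],         # Q_n^t for each device
    powers: Sequence[float],         # p_n^t for each device
    exchanged: Sequence[bool],       # True if device n swapped emulator this round
    model_mem: Sequence[float],      # emulator + adapter memory per device
    mem_capacity: Sequence[float],   # available memory per device
    exchange_counts: Sequence[int],  # cumulative exchanges per device
    total_rounds: int,
    cfg: RewardConfig,
) -> Dict[str, float]:
    n = len(delays)
    r_d = -cfg.xi_f * max(delays)                           # slowest device dominates
    r_p = -cfg.xi_p * sum(powers) / n                       # mean power across devices
    r_s = -cfg.xi_s * sum(1 for e in exchanged if e) / n    # mean exchange indicator
    mem_violations = sum(
        1 for used, cap in zip(model_mem, mem_capacity) if used > cfg.mem_fraction * cap
    )
    exchange_violations = sum(
        1 for c in exchange_counts if c > cfg.exchange_fraction * total_rounds
    )
    penalty = cfg.penalty * (mem_violations + exchange_violations)
    return {"delay": r_d, "power": r_p, "exchange": r_s, "penalty": penalty}
```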

Reinforcement Learning Algorithm

We adopt the Proximal Policy Optimization (PPO) algorithm developed by OpenAI [38], which is an advancement over traditional policy gradient algorithms. In sequential problems such as reinforcement learning, even minor parameter adjustments can strongly affect performance, which makes parameter tuning difficult. PPO addresses sensitive and noisy advantage estimates by updating the policy cautiously: it limits each policy adjustment, either through a Kullback–Leibler (KL) divergence penalty or through the clipped surrogate objective given below. PPO also uses importance sampling [39], with separate policies for training and data collection, which improves overall efficiency. The loss function for the actor is formally defined as follows [38]:

$$L^{CLIP}(\varphi)=\mathbb{E}^{t}\left[\min\left(\mathfrak{r}^{t}(\varphi)\varpi^{t},\text{clip}\left(\mathfrak{r}^{t}(\varphi),1-\epsilon,1+\epsilon\right)\varpi^{t}\right)\right].$$

In this context, $\varphi$ represents the policy (actor) parameters, $\mathbb{E}^{t}$ signifies the empirical expectation over the trajectory, $\mathfrak{r}^{t}$ represents the ratio of the current policy to the old policy, $\varpi^{t}$ denotes the estimated advantage at time $t$ , and $\epsilon$ denotes the clip value. This clipping mechanism acts as a safeguard, preventing significant bias and ensuring that the policy remains within a trusted range.
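As a small numerical illustration of the clipped objective, the sketch below evaluates it for given probability ratios and advantage estimates. The function name `clipped_surrogate` and the default $\epsilon=0.2$ are assumptions made for the example; the clip value used in the experiments is not stated in this section.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Empirical mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for r, adv in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - epsilon), 1.0 + epsilon)
        terms.append(min(r * adv, clipped_r * adv))
    return sum(terms) / len(terms)


# A ratio far above 1 with a positive advantage is capped at (1 + eps) * A,
# so the update gains nothing from moving the policy further in that direction.
print(clipped_surrogate([1.5], [1.0]))
# A ratio far below 1 with a negative advantage keeps the more pessimistic value.
print(clipped_surrogate([0.5], [-1.0]))
```

Taking the minimum of the clipped and unclipped terms means the objective never rewards moving the ratio outside $[1-\epsilon,1+\epsilon]$ in the direction that the advantage favors.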

Algorithm 2 SABPPO adaptive control algorithm

0: Initialize: critic parameter $\phi$ , critic target parameter $\phi^{\prime}$ , $\mathcal{A}$ , actor parameter $\varphi$ and data-sampling parameter $\varphi^{\prime}$ ; initialize state $s^{t}_{g}=s^{1}_{g}$ ;
1: for iteration = $1,2,...$ do
2:   for action = $1,2,...,\mathcal{A}_{\text{action}}$ do
3:     Choose action vector $a^{t}_{\text{action}}$ according to $\pi_{\varphi^{\prime}}(a^{t}_{\text{action}}|s^{t}_{\text{action}})$ , based on SABPPO section.
4:     $s^{t}_{\text{action}+1}$ = concatenate $\{s^{t}_{\text{action}},a^{t}_{\text{action}}\}$
5:   end for
6:   Get $R^{t}_{d}$ , $R^{t}_{p}$ , $R^{t}_{s}$ based on equation (5) and next state $s^{t+1}_{1}$ from the environment.
7:   Collect trajectories: $\tau=\{s^{t}_{g},\mathfrak{a}^{t},s^{t+1}_{g},R^{t}_{d},R^{t}_{p},R^{t}_{s}\}$ iteratively till end of episode.
8:   $s^{t}_{g}\leftarrow s^{t+1}_{g}$ ;
9:   Compute advantage $\varpi^{t}$ based on equation (8)
10:   for $o=1,2,...,O$ do
11:     Group trajectories into batches
12:     for each batch do
13:       Compute gradient for actor: $\triangledown\varphi$ based on equation (6)
14:       Apply gradient ascent on $\varphi$ using $\triangledown\varphi$
15:       Update critic model through back-propagation of loss using equation based on (7)
16:     end for
17:     Update parameters of critic target network $\phi^{\prime}$ with parameters of critic network $\phi$ , every $C$ number of iterations, where $C$ denotes the interval for critic parameter update;
18:   end for
19: end for
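Steps 2 to 5 of Algorithm 2, in which each branch's action is appended to the state before the next branch is queried, can be sketched as follows. This is only an illustration of the data flow: the branches here are hypothetical random stand-ins rather than trained actor networks, and the names `rollout_branches` and `make_random_branch` are invented for the example.

```python
import random


def rollout_branches(initial_state, branches, rng):
    """Query each branch in order, appending its action vector to the state."""
    state = list(initial_state)
    actions = []
    for branch in branches:
        action = branch(state, rng)    # a_k drawn from the branch given s_k
        actions.append(action)
        state = state + list(action)   # s_{k+1} = concatenate{s_k, a_k}
    return actions, state


def make_random_branch(n_devices, n_choices):
    """Stand-in for an actor branch: one uniformly random choice per device."""
    def branch(state, rng):
        return [rng.randrange(n_choices) for _ in range(n_devices)]
    return branch


rng = random.Random(0)
n_devices = 2
# Hypothetical initial state: channel gain, memory, and exchange count per device.
s1 = [0.8, 4.0, 0.0, 0.5, 2.0, 1.0]
# Branch order follows the text: user selection, bandwidth, power, emulator compression.
branches = [make_random_branch(n_devices, k) for k in (2, 4, 4, 3)]
actions, final_state = rollout_branches(s1, branches, rng)
print(len(actions), len(final_state))
```

Because every later branch sees the earlier branches' outputs, the decisions are conditioned on one another even though each branch has its own output head.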

Single-Agent Action Branching Proximal Policy Optimization (SABPPO)

To ensure scalable federated fine-tuning of foundation models, multiple variables must be optimized. As the number of optimization variables increases, the number of actions that need to be explicitly represented grows exponentially with the action dimensionality: the total number of actions equals $\prod^{\mathfrak{D}}_{\mathfrak{d}=1}|\mathfrak{a}_{\mathfrak{d}}|$ , where $\mathfrak{D}$ is the total number of action dimensions and $\mathfrak{a}_{\mathfrak{d}}$ is the action space of action dimension $\mathfrak{d}$ . Traditional deep reinforcement learning architectures do not handle this exponential growth well. We therefore propose a novel Single-Agent Action Branching Proximal Policy Optimization (SABPPO) algorithm, inspired by the action branching approach of [40]. The SABPPO architecture builds on the state-of-the-art Proximal Policy Optimization algorithm [38] and distributes the representation of the action controllers across individual network branches, while maintaining a shared decision module among them that encodes a latent representation of the input and helps coordinate the branches. With this approach, the total number of network outputs grows linearly with the action dimensionality, as opposed to the combinatorial growth in current discrete-action algorithms.

SABPPO extends the PPO architecture with a single critic and a single actor. In each FL iteration, the actor's user-selection branch takes the state $s^{t}_{1}$ as input and produces user-selection information, which is concatenated with the state to form $s^{t}_{2}$ . This concatenated state is then input to the actor's bandwidth-selection branch, which generates bandwidth-selection information that is concatenated with the state to form $s^{t}_{3}$ . The same process occurs with the power-selection branch, whose output is concatenated with the state to form $s^{t}_{4}$ . Lastly, $s^{t}_{4}$ is fed into the actor's emulator-compression selection branch, which yields the emulator compression selection (shown in Figure 4). The SABPPO actor is updated as follows:

$$\Delta\varphi=\mathbb{E}^{t}\left[\nabla_{\varphi}\min\left\{\mathfrak{r}^{t}(\varphi)\varpi^{t},\text{clip}\left(\mathfrak{r}^{t}(\varphi),1-\epsilon,1+\epsilon\right)\varpi^{t}\right\}\right], \tag{6}$$

while the SABPPO critic is updated as follows:

$$L^{t}(\phi)=\left[V_{\phi}(s^{t}_{g})-\left(\varpi^{t}+V_{\phi^{\prime}}(s^{t}_{g})\right)\right]^{2}. \tag{7}$$

Here, $s^{t}_{g}$ is the global state, which is the concatenation of all states, $V$ is the state-value function, and $\phi$ and $\phi^{\prime}$ are the state-value function parameter and target state-value function parameter, respectively. $\mathfrak{r}^{t}(\varphi)$ represents the ratio between the current policy and the data-sampling policy: $\mathfrak{r}^{t}(\varphi)=\frac{\pi_{\varphi}(a^{t}_{1},a^{t}_{2},a^{t}_{3},a^{t}_{4}|s^{t}_{g})}{\pi_{\varphi^{\prime}}(a^{t}_{1},a^{t}_{2},a^{t}_{3},a^{t}_{4}|s^{t}_{g})}$ . The advantage function $\varpi^{t}$ is calculated via Generalized Advantage Estimation (GAE) [41]:

$$\varpi^{t}=\delta^{t}+(\gamma\lambda)\delta^{t+1}+...+(\gamma\lambda)^{\bar{T}-1}\delta^{t+\bar{T}-1},\quad\text{where}\quad\delta^{t}=R^{t}+\gamma V_{\phi^{\prime}}(s^{t+1}_{g})-V_{\phi^{\prime}}(s^{t}_{g}), \tag{8}$$

Here, $\bar{T}$ is the length of the trajectory segment, $\lambda$ is the trace decay parameter, and $\gamma$ is the discount rate.
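The advantage estimate in equation (8) is usually computed with a backward recursion over a trajectory segment rather than by evaluating the sum directly. The sketch below shows that recursion under the assumption that the sum for each step is truncated at the end of the collected segment; the function name `gae_advantages` and the toy inputs are illustrative only.

```python
def gae_advantages(rewards, values, next_values, gamma, lam):
    """Return the GAE advantage for every step of one trajectory segment.

    rewards[t]     : R^t
    values[t]      : target-critic value of s^t
    next_values[t] : target-critic value of s^{t+1}
    """
    # Temporal-difference residuals: delta^t = R^t + gamma * V(s^{t+1}) - V(s^t)
    deltas = [r + gamma * nv - v for r, v, nv in zip(rewards, values, next_values)]
    advantages = [0.0] * len(deltas)
    running = 0.0
    # Backward pass: adv^t = delta^t + (gamma * lam) * adv^{t+1}
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy segment of two steps with zero value estimates.
print(gae_advantages([1.0, 1.0], [0.0, 0.0], [0.0, 0.0], gamma=0.5, lam=0.5))
```

Setting $\lambda=0$ reduces each advantage to the one-step residual $\delta^{t}$ , while $\lambda=1$ sums the discounted residuals over the rest of the segment.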

Numerical Experiments

Experiment configuration

We substantiate our study with experiments showing that the FedPEAT framework with adaptive control works, and we compare it against Federated full model fine-tuning (Fed-FT). To simplify the workflow and for ease of demonstration, we use numerical results from [27]. We use the OPT-1.3B [2] large language model as the foundation model, which has 1208 million parameters and occupies approximately 2.63 gigabytes (GB) of storage memory. We use the layer-drop approach [33] for emulator compression. We adopt the perplexity versus layer-drop retention results from [27] and approximate the relationship by $P=25.2\varrho^{2}-43.1\varrho+31.9$ for $0<\varrho\leq 1$ , with an $R^{2}$ score of $0.97$ , where $\varrho$ is the layer-drop retention ratio. We also adopt the perplexity improvement from using LoRA reported in [27], and set the perplexity change upon applying LoRA to $-0.78$ . The trainable layers of the adapter are the 2 layers at each of the top and bottom of the neural network. We assume that model storage memory usage scales linearly with the number of model parameters. We set $T$ , the number of federated fine-tuning rounds in an episode, to 100. We consider a scenario with 1 main server and 10 user devices. In each round of federated fine-tuning, the adaptive control orchestrator selects 5 devices for fine-tuning. To facilitate the collaborative training scenario, the server holds $30\%$ of the total training data. We set the large penalty $\varkappa$ to -50.
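The fitted perplexity curve and the linear memory assumption can be evaluated directly, as in the sketch below. Two points are assumptions made for illustration: the LoRA effect is treated as an additive shift of $-0.78$ in perplexity, and the emulator's parameter count is taken to be proportional to the retention ratio, which ignores components such as embeddings that layer dropping does not remove. The function names are hypothetical.

```python
def emulator_perplexity(retention, use_lora=False):
    """Perplexity predicted by the quadratic fit for a layer-drop retention ratio."""
    if not 0.0 < retention <= 1.0:
        raise ValueError("retention ratio must lie in (0, 1]")
    perplexity = 25.2 * retention ** 2 - 43.1 * retention + 31.9
    if use_lora:
        perplexity += -0.78  # assumed additive LoRA improvement
    return perplexity


def emulator_memory_gb(retention, full_model_gb=2.63):
    """Storage estimate, assuming memory scales linearly with retained parameters."""
    if not 0.0 < retention <= 1.0:
        raise ValueError("retention ratio must lie in (0, 1]")
    return full_model_gb * retention


for rho in (0.25, 0.5, 1.0):
    print(rho, round(emulator_perplexity(rho, use_lora=True), 2), round(emulator_memory_gb(rho), 3))
```

At full retention the fit gives a perplexity of approximately 14.0, so a smaller retention ratio trades a higher perplexity for a smaller emulator to transmit and store.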

Because we consider communications over 6G networks, the bandwidth $B$ is selected from the range between $7$ and $20$ GHz, and the noise $\sigma^{2}$ is set to $-174$ dBm. We initialize and constrain the main server power output to the range $(0.0,15.0)$ W. User-device channel gains are calculated from the path loss [42] and the user's distance from the server. $\xi_{p}$ , $\xi_{f}$ , and $\xi_{s}$ are set to 5, -10, and 25, respectively; these values are derived empirically with the aim of balancing the variables in the objective function. We adopt the Adam optimizer [43] for the algorithms implemented in our study. The models are trained for 5,000,000 steps and evaluated every 5000 steps.
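To illustrate how the bandwidth, power, channel gain, and noise settings interact, the sketch below computes a downlink rate with the standard Shannon capacity formula. This is an assumption made for illustration: the rate expression used by the framework is defined elsewhere in the paper, and the example interprets $-174$ dBm as a noise power spectral density per hertz. The channel gain value is a hypothetical placeholder.

```python
import math


def dbm_to_watt(dbm):
    """Convert a power level from dBm to watts."""
    return 10 ** ((dbm - 30) / 10)


def shannon_rate_bps(bandwidth_hz, power_w, channel_gain, noise_psd_dbm_per_hz=-174.0):
    """Shannon rate B * log2(1 + g * p / (N0 * B)), with N0 given in dBm per Hz (assumed)."""
    noise_w = dbm_to_watt(noise_psd_dbm_per_hz) * bandwidth_hz
    snr = channel_gain * power_w / noise_w
    return bandwidth_hz * math.log2(1.0 + snr)


# Hypothetical link: 7 GHz of bandwidth, 10 W transmit power, channel gain 1e-10.
rate = shannon_rate_bps(7e9, 10.0, 1e-10)
print(f"{rate / 1e9:.2f} Gbit/s")
```

Under this model the rate grows with both bandwidth and power, which is why the controller must balance delay against the power term of the reward.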

Experiment results

Our findings show that FedPEAT with adaptive control significantly outperforms Fed-FT in both communication and computation delay. Specifically, in a single round of computation and communication, Fed-FT incurs a delay 4.60 $\times$ longer than that of FedPEAT with adaptive control, as illustrated in Figure 5(a). FedPEAT with adaptive control achieves this improvement despite requiring emulator exchanges, which occur 2.10 times on average in every 10 iterations (Figure 5(b)) and are not required in Fed-FT. This ability to reduce delay, even after accounting for the additional emulator exchanges, underscores the framework's effectiveness in improving the efficiency of FL systems. However, FedPEAT with adaptive control exhibits a perplexity score 3.49 points higher than that of Fed-FT (Figure 5(c)).

We next compare our proposed SABPPO adaptive control algorithm with two baseline algorithms: iterative Reinforcement Learning Agents (iterRL) and Heterogeneous Action Proximal Policy Optimization (HAPPO). IterRL employs independent actors and critics, while HAPPO, based on Centralized Training and Decentralized Execution (CTDE) [44], features three separate actors sharing a single critic model. Our experimental results, depicted in Figure 5(h), show that SABPPO is superior in terms of model convergence and reward. Specifically, SABPPO achieves a reward of -168 after 5,000,000 training steps, outperforming HAPPO and iterRL, which attain -225 and -214, respectively. This result is corroborated by lower log(delay) values (Figure 5(e)) of -4.02 for SABPPO compared to -1.33 and -3.27 for HAPPO and iterRL, respectively. SABPPO also exhibits fewer emulator exchanges (2.10 $\times$ ) than HAPPO (4.91 $\times$ ) and iterRL (3.95 $\times$ ) (Figure 5(f)), as well as a lower perplexity score (15.03 compared to 17.86 and 16.60 for HAPPO and iterRL, respectively), as seen in Figure 5(g).

Furthermore, SABPPO has a significantly shorter model training delay (0.308 seconds) than HAPPO (0.534 seconds) and iterRL (0.549 seconds), as shown in Figure 5(d). Together, these findings underscore the efficiency gains of the SABPPO algorithm over the baseline algorithms across the performance metrics considered.

Discussion and Conclusion

In summary, the deployment and refinement of large foundation models present multifaceted challenges, necessitating solutions that address collaborative training, model ownership, and computational constraints to fully unlock their potential. In response to these challenges, we extend the offsite tuning paradigm to introduce Emulator-Assisted Tuning (EAT) and integrate it with Parameter-Efficient Fine-Tuning (PEFT), resulting in the creation of Parameter-Efficient Emulator-Assisted Tuning (PEAT). This novel approach is further extended to the FL domain, resulting in Federated Parameter-Efficient Emulator-Assisted Tuning (FedPEAT).

Our proposed FedPEAT framework, featuring adaptive control and a unique fusion of adapters and emulators, represents a pioneering avenue for advancing model privacy and optimizing memory-efficient downstream federated fine-tuning. Adapters, endowed with trainable neural network parameters, tailor models for specific tasks, while emulators offer compressed, fixed-parameter representations. This innovative approach not only addresses concerns regarding model privacy by eliminating the need to transmit complete models to edge devices but also substantially enhances memory and computational efficiency. Its adaptability to diverse neural network architectures is complemented by an adaptive control mechanism, employing deep reinforcement learning to optimize critical hyper-parameters, thus ensuring efficient resource orchestration.

In the broader context, our FedPEAT framework, empowered by the SABPPO adaptive control optimizer, facilitates federated fine-tuning by combining Parameter-Efficient Fine-Tuning (PEFT) and Emulator-Assisted Tuning (EAT). The framework upholds user data privacy through Federated Learning and protects the model owner's intellectual property (IP) through EAT. Furthermore, our experimental results demonstrate that FedPEAT with adaptive control significantly outperforms Federated full model fine-tuning (Fed-FT) in communication and computation efficiency, because FedPEAT reduces both the memory footprint of the foundation model and the number of parameters to tune. This efficiency gain enables low-resource devices to take part in federated fine-tuning of foundation models. While FedPEAT with adaptive control exhibits a higher perplexity score than Fed-FT (3.49 points in our experiments), this gap is outweighed by the substantial reduction in communication and computation overhead. Notably, the adaptive control mechanism can be tuned to prioritize lower perplexity should specific preferences or requirements call for it.

Moreover, our investigation shows that the training time required by the proposed SABPPO adaptive control optimizer is significantly lower than that of the HAPPO and iterRL algorithms. We attribute this gain to the streamlined training of a single actor and a single critic in SABPPO, as opposed to the more resource-intensive training of multiple actors in HAPPO and of multiple actors and critics in iterRL. Training fewer neural networks concurrently contributes significantly to the observed decrease in training time. In conclusion, our framework, FedPEAT with adaptive control, provides a balance between model performance, privacy, and resource efficiency in the complex landscape of federated learning and large model fine-tuning.

References

[1] Brown, T. et al. Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020).
[2] Radford, A. et al. Language models are unsupervised multitask learners. OpenAI blog 1, 9 (2019).
[3] Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[4] Wei, J. et al. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652 (2021).
[5] Muennighoff, N. et al. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786 (2022).
[6] Letaief, K. B., Chen, W., Shi, Y., Zhang, J. & Zhang, Y.-J. A. The roadmap to 6G: AI empowered wireless networks. IEEE Communications Magazine 57, 84–90 (2019).
[7] McMahan, B., Moore, E., Ramage, D., Hampson, S. & y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282 (PMLR, 2017).
[8] Konečný, J. et al. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016).
[9] Bonawitz, K. et al. Towards federated learning at scale: System design. Proceedings of Machine Learning and Systems 1, 374–388 (2019).
[10] Rebuffi, S.-A., Bilen, H. & Vedaldi, A. Learning multiple visual domains with residual adapters. Advances in Neural Information Processing Systems 30 (2017).
[11] He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T. & Neubig, G. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366 (2021).
[12] Houlsby, N. et al. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, 2790–2799 (PMLR, 2019).
[13] Hu, E. J. et al. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
[14] Qin, G. & Eisner, J. Learning how to ask: Querying LMs with mixtures of soft prompts. arXiv preprint arXiv:2104.06599 (2021).
[15] Lester, B., Al-Rfou, R. & Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 (2021).
[16] Li, X. L. & Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021).
[17] Liu, X. et al. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602 (2021).
[18] An, S. et al. Input-tuning: Adapting unfamiliar inputs to frozen pretrained models. arXiv preprint arXiv:2203.03131 (2022).
[19] Liu, H. et al. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems 35, 1950–1965 (2022).
[20] Zhang, Z. et al. FedPETuning: When federated learning meets the parameter-efficient tuning methods of pre-trained language models. In Annual Meeting of the Association of Computational Linguistics 2023, 9963–9977 (Association for Computational Linguistics (ACL), 2023).
[21] Zhao, H., Du, W., Li, F., Li, P. & Liu, G. FedPrompt: Communication-efficient and privacy-preserving prompt tuning in federated learning. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2023).
[22] Zhang, J. et al. Towards building the federated GPT: Federated instruction tuning. arXiv preprint arXiv:2305.05644 (2023).
[23] Guo, T., Guo, S., Wang, J., Tang, X. & Xu, W. PromptFL: Let federated participants cooperatively learn prompts instead of models-federated learning in age of foundation model. IEEE Transactions on Mobile Computing (2023).
[24] Cai, D., Wu, Y., Wang, S., Lin, F. X. & Xu, M. FedAdapter: Efficient federated learning for modern NLP. In ACM 29th Annual International Conference on Mobile Computing and Networking (MobiCom) (2023).
[25] Smith, S. et al. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022).
[26] Xiao, G. et al. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, 38087–38099 (PMLR, 2023).
[27] Xiao, G., Lin, J. & Han, S. Offsite-tuning: Transfer learning without full model. arXiv preprint arXiv:2302.04870 (2023).
[28] Ding, Y. et al. DC-CCL: Device-cloud collaborative controlled learning for large vision models. arXiv preprint arXiv:2303.10361 (2023).
[29] Kuang, W. et al. FederatedScope-LLM: A comprehensive package for fine-tuning large language models in federated learning. arXiv preprint arXiv:2309.00363 (2023).
[30] Han, S., Mao, H. & Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149 (2015).
[31] Sajjad, H., Dalvi, F., Durrani, N. & Nakov, P. On the effect of dropping layers of pre-trained transformer models. Computer Speech & Language 77, 101429 (2023).
[32] Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
[33] Zhang, M. & He, Y. Accelerating training of transformer-based language models with progressive layer dropping. Advances in Neural Information Processing Systems 33, 14011–14023 (2020).
[34] Zhang, Z. et al. FedPETuning: When federated learning meets the parameter-efficient tuning methods of pre-trained language models. In Findings of the Association for Computational Linguistics: ACL 2023, 9963–9977, DOI: 10.18653/v1/2023.findings-acl.632 (Association for Computational Linguistics, Toronto, Canada, 2023).
[35] Zhang, J. et al. Towards building the federated GPT: Federated instruction tuning. arXiv preprint arXiv:2305.05644 (2023).
[36] Lim, W. Y. B. et al. Dynamic edge association and resource allocation in self-organizing hierarchical federated learning networks. IEEE Journal on Selected Areas in Communications 39, 3640–3653 (2021).
[37] Xiao, C., Zheng, Y. R. & Beaulieu, N. C. Statistical simulation models for Rayleigh and Rician fading. In IEEE International Conference on Communications, 2003. ICC'03., vol. 5, 3524–3529 (IEEE, 2003).
[38] Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
[39] Kahn, H. & Harris, T. E. Estimation of particle transmission by random sampling. National Bureau of Standards Applied Mathematics Series 12, 27–30 (1951).
[40] Tavakoli, A., Pardo, F. & Kormushev, P. Action branching architectures for deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018).
[41] Schulman, J., Moritz, P., Levine, S., Jordan, M. & Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015).
[42] Erceg, V. et al. An empirically based path loss model for wireless channels in suburban environments. IEEE Journal on Selected Areas in Communications 17, 1205–1211 (1999).
[43] Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[44] Lowe, R. et al. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems 30 (2017).

Acknowledgements

This research is supported in part by Nanyang Technological University (NTU), the NTU-Wallenberg AI, Autonomous Systems and Software Program (WASP) Joint Project; NTU Startup Grant; the Singapore Ministry of Education Academic Research Fund under Grant Tier 1 RG97/20, Grant Tier 1 RG24/20 and Grant Tier 2 MOE2019-T2-1-176.

Author contributions statement

T.J.C, WH.Y, Y.L, and J.Z contributed equally, wrote the main manuscript text, and developed the software.

Competing interests statement

The authors declare no competing interests.

Legends

Figure 1. Intersection of Federated Learning (FL), Parameter-Efficient Fine-Tuning (PEFT), and Emulator-Assisted Tuning (EAT). Here we illustrate the intersection of FL, PEFT, and EAT. The main contribution of this paper is to introduce Federated Parameter-Efficient Emulator-Assisted Tuning (FedPEAT) as a convergence of EAT, PEFT, and FL; EAT and Parameter-Efficient Emulator-Assisted Tuning (PEAT) are also terms coined in this paper.

Figure 2. Overview of FedPEAT with adaptive control. This figure shows how the adaptive control orchestrator decides on important parameters, such as device selection, emulator compression parameter, transmission bandwidth, and power, to facilitate the FedPEAT process.

Figure 3. Emulator-Assisted Tuning generalized to three cases. The figure illustrates how the neural network structures at the server and local devices differ in each case. Case 1 represents our proposed FedPEAT framework. Case 2 represents the integration of Federated Learning and PEFT. Case 3 represents a traditional Federated Learning scenario.

Figure 4. Our proposed SABPPO algorithm and architecture. The figure illustrates the underlying actor and critic architecture, their interaction with the environment, and the model update process.

Figure 5. Comparison between FL and FedPEAT, and comparison between adaptive control algorithms. Figures 5(a), 5(b), and 5(c) illustrate the performance difference between FL and FedPEAT with regard to delay, emulator exchange count, and perplexity, respectively. Figure 5(d) illustrates the time taken for model training by each adaptive control algorithm. Figures 5(e), 5(f), 5(g), and 5(h) illustrate the performance of each adaptive control algorithm across the training process, in terms of log(delay), emulator exchange count, perplexity, and reward, respectively.

Algorithm 1. FedPEAT with adaptive control mechanism.

Algorithm 2. SABPPO adaptive control algorithm.