Decoupling General and Personalized Knowledge in Federated Learning via Additive and Low-rank Decomposition

Xinghao Wu1,∗, Xuefeng Liu1,2, Jianwei Niu1,2,†, Haolin Wang1, Shaojie Tang3,
Guogang Zhu1, and Hao Su1
1State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China
2Zhongguancun Laboratory, Beijing, China
3Jindal School of Management, University of Texas at Dallas
Abstract.

To address data heterogeneity, the key strategy of Personalized Federated Learning (PFL) is to decouple general knowledge (shared among clients) and client-specific knowledge, as the latter can have a negative impact on collaboration if not removed. Existing PFL methods primarily adopt a parameter partitioning approach, where the parameters of a model are designated as one of two types: parameters shared with other clients to extract general knowledge and parameters retained locally to learn client-specific knowledge. However, as these two types of parameters are put together like a jigsaw puzzle into a single model during the training process, each parameter may simultaneously absorb both general and client-specific knowledge, thus struggling to separate the two types of knowledge effectively. In this paper, we introduce FedDecomp, a simple but effective PFL paradigm that employs parameter additive decomposition to address this issue. Instead of assigning each parameter of a model as either a shared or personalized one, FedDecomp decomposes each parameter into the sum of two parameters: a shared one and a personalized one, thus achieving a more thorough decoupling of shared and personalized knowledge compared to the parameter partitioning method. In addition, as we find that retaining the local knowledge of specific clients requires much lower model capacity than retaining general knowledge across all clients, we constrain the matrix containing personalized parameters to be low rank during the training process. Moreover, a new alternating training strategy is proposed to further improve the performance. Experimental results across multiple datasets and varying degrees of data heterogeneity demonstrate that FedDecomp outperforms state-of-the-art methods by up to 4.9%.

Keywords: Personalized Federated Learning, Data Heterogeneity, Parameter Decomposition
CCS Concepts: Computing methodologies → Distributed artificial intelligence

1. Introduction

∗ [email protected]; † Corresponding author

Federated learning (FL) (McMahan et al., 2016) allows clients to collaboratively train a global model without directly sharing their raw data. It has garnered widespread attention in the design of multimedia artificial intelligence systems (Che et al., 2023; Xiong et al., 2023; Zhang et al., 2023b; Lao et al., 2023). A central challenge in FL is data heterogeneity, where the data distributions across diverse clients are not independently and identically distributed (non-IID). Such disparities in data distributions hamper the training of the global model, leading to a decrease in the performance of FL (Zhang et al., 2021a; Gong et al., 2021; Li et al., 2021a).

To confront this challenge, the concept of Personalized Federated Learning (PFL) has been introduced. Within PFL studies, it is widely accepted that the knowledge learned by a client should be decoupled into two categories: the general knowledge across all clients and the client-specific knowledge of this client (Wu et al., 2023; Zhang et al., 2023a; Chen and Chao, 2022). The former is shared among clients to promote collaboration, while the latter is retained locally to keep personalization and reduce the impact of data heterogeneity on collaboration. This understanding prompts mainstream PFL research to propose a partition-based method where each parameter in the client’s personalized model is designated as one of two types before training begins: parameters shared with other clients to extract general knowledge and parameters retained locally to learn client-specific knowledge. A multitude of studies have emerged. For instance, FedPer (Arivazhagan et al., 2019) focuses on personalizing the classifier, whereas FedBN (Li et al., 2021c) targets the personalization of Batch Normalization layers. FedCAC (Wu et al., 2023) proposes to select personalized parameters based on measurable metrics.


Figure 1. A toy example to illustrate the partition based method.

Although the aforementioned methods have achieved some success and attracted widespread attention, they still have not effectively separated the two types of knowledge. This is because the personalized parameters and shared parameters of a model learn from local data together as a whole, allowing each parameter to potentially absorb both general and client-specific knowledge simultaneously. This means that personalized parameters may contain some general knowledge that should be shared, and shared parameters may contain some client-specific knowledge that should be personalized. Fig. 1 is a toy example that illustrates this point intuitively. The square on the far left represents a personalized model parameter matrix containing nine parameters. The blue diagonal rectangle and the orange diagonal rectangle within each parameter represent the general knowledge and personalized knowledge contained in that parameter, respectively, with their area indicating the amount of knowledge. It can be seen that each parameter contains both general and personalized knowledge. The partition-based method divides some parameters into shared parameters (as shown by the blue squares), and the rest into personalized parameters (as shown by the orange squares). This “black or white” partitioning cannot achieve knowledge decoupling within parameters. Some client-specific knowledge in shared parameters is shared, which reduces the degree of model personalization, while some general knowledge in personalized parameters is personalized, which reduces the level of client collaboration.


Figure 2. A toy example to illustrate our method. The depth of blue/orange in the shared/personalized parameters indicates the amount of knowledge from the corresponding parameters in the original parameter matrix.

In this paper, we propose a new PFL paradigm based on parameter additive decomposition, called FedDecomp, to address the aforementioned issues. Unlike methods based on partitioning, which classify a parameter as either personalized or shared, FedDecomp decomposes each parameter into the sum of two parameters before training begins: one shared to facilitate knowledge exchange among clients and one retained locally to maintain personalization. Furthermore, we find that in PFL, general knowledge should be retained in shared parameters with high model capacity to cover all clients, while a client’s specific knowledge can be learned by personalized parameters with lower model capacity as a supplement to general knowledge. Hence, FedDecomp constrains the matrix containing personalized parameters to be low rank. This allows the personalized part to focus its learning on the most critical aspects of the local knowledge and reduce the overfitting to the local knowledge. Consequently, it helps to retain a significant portion of the general knowledge acquired from other clients, thereby enhancing generalization. Fig. 2 illustrates our proposed method. It can be seen that in our method, both the general knowledge and client-specific knowledge in each parameter can be decoupled and accordingly captured by shared and personalized parameters, achieving more efficient client collaboration and personalization.

In addition, different from the current methods which simply train personalized parameters and shared parameters simultaneously, we examine the training order of shared and personalized parameter matrices during local updates. Specifically, we propose to initially train the personalized low-rank part to mitigate the influence of non-IID data, followed by training the shared full-rank part. Our findings suggest that adopting an alternating approach, unlike concurrent training methods, yields greater benefits.

Our primary contributions in this paper can be summarized as follows:

  • We introduce a new method of decomposing shared and personalized parameters in PFL, namely FedDecomp. Specifically, we decompose each layer of the personalized model into the sum of a shared full-rank part to preserve general knowledge and a personalized low-rank part to preserve client-specific knowledge.

  • We introduce an innovative training strategy designed to optimize FedDecomp, effectively mitigating the implications of non-IID data and significantly boosting performance.

  • We evaluate FedDecomp across multiple datasets and under varied non-IID conditions. Our findings underscore the efficacy of the FedDecomp method we propose.

2. Related Work

PFL has emerged as a prevalent research direction to handle the non-IID problem in FL. Current PFL methods can be mainly divided into meta-learning-based methods (Fallah et al., 2020; Acar et al., 2021), fine-tuning-based methods (** et al., 2022; Chen et al., 2023), clustering-based methods (Sattler et al., 2020; Cai et al., 2023), model-regularization-based methods (T Dinh et al., 2020; Li et al., 2021b), personalized-aggregation-based methods (Huang et al., 2021; Zhang et al., 2021b), and parameter-partition-based methods. Among these methods, the parameter-partition-based method has attracted a lot of attention due to its simplicity and effectiveness.

Parameter-partition-based method. The core idea of this kind of method is to share part of the original model’s parameters while personalizing the other part. Representative works select specific layers for personalization: FedPer (Arivazhagan et al., 2019), FedRep (Collins et al., 2021), and GPFL (Zhang et al., 2023a) propose to personalize classifiers. FedBN (Li et al., 2021c) and MTFL (Mills et al., 2021) suggest personalizing the Batch Normalization (BN) layers. LG-FedAvg (Liang et al., 2020) and FedGH (Yi et al., 2023) propose to personalize the feature extractor. Other works employ Deep Reinforcement Learning (DRL) or hypernetwork technologies to automate the selection of specific layers for personalization (Sun et al., 2021; Ma et al., 2022). Still other research no longer selects personalized parameters by layer but by individual parameter, making more fine-grained choices to personalize parameters sensitive to non-IID data (Wu et al., 2023). In recent years, some studies have proposed another kind of personalized parameter partitioning method. Unlike the previous methods, the core idea is to add additional personalized layers to the original model. For example, ChannelFed (Zheng et al., 2022) introduces a personalized attention layer to redistribute weights for different channels in a personalized manner. Pillutla et al. (2022) propose to add a bottleneck module for personalization after each feedforward layer.

Parameter-decomposition-based method. A few PFL works also utilize parameter decomposition techniques. For instance, FedPara (Hyeon-Woo et al., 2022) decomposes the personalized model parameter matrix into the Hadamard product of two low-rank matrices. It has been proven that the parameter matrix after the Hadamard product can still possess a high rank, thus not sacrificing model capacity. In this way, only the low-rank matrices with fewer parameters need to be uploaded during training, thereby reducing communication overhead. Factorized-FL (Jeong and Hwang, 2022) decomposes the model parameter matrix into the product of a column vector and a row vector. During training, it personalizes the row vector while sharing the column vector. This is essentially a low-rank decomposition technique aimed at reducing communication overhead by only uploading the column vector. FedSLR (Huang et al., 2023), on the other hand, performs a low-rank decomposition of the parameter matrix when the server distributes the model, thus reducing the communication overhead of the downlink.

It is evident that our approach differs significantly from the current decomposition-based methods, from objectives to methodologies. The current methods mainly focus on the communication issue in PFL by reducing the amount of communication between clients and the server through low-rank decomposition of parameters. Our paper, however, focuses on the decoupling and extraction of knowledge in PFL. Through additive decomposition, it decouples the learning of general knowledge and client-specific knowledge. By constraining the personalized matrix to be low-rank, it coordinates the relationship between general and client-specific knowledge.

3. Method


Figure 3. Overview of one client in FedDecomp in one communication round.

3.1. Overview of FedDecomp

We first give an overview of FedDecomp. As illustrated in Fig. 3, each layer of client $i$’s personalized model is decomposed into the sum of a full-rank matrix and a low-rank matrix. The training process in each communication round can be summarized as follows: 1) Each client $i$ freezes its full-rank matrices $\boldsymbol{\sigma}_i$ and updates the low-rank matrices $\boldsymbol{\tau}_i$. 2) Then, each client $i$ updates $\boldsymbol{\sigma}_i$ while freezing $\boldsymbol{\tau}_i$. After local updating, all clients upload the full-rank part to the server while keeping the low-rank part private. 3) The server receives the clients’ full-rank matrices and aggregates them to generate a global model $\overline{\boldsymbol{\sigma}}$. The server then sends $\overline{\boldsymbol{\sigma}}$ back to all clients. 4) Each client receives the global model and uses it to initialize its full-rank matrices.

3.2. Problem Definition of PFL

PFL, in contrast to traditional FL algorithms that train a general model for all clients, strives to develop a personalized model for each client $i$, denoted as $w_i$, specializing in capturing the unique characteristics of its local data distribution $D_i$. In recent PFL research, there is a consensus that the knowledge acquired by individual clients comprises both general knowledge and client-specific knowledge. In non-IID scenarios, since different clients have distinct data distributions (i.e., $D_i \neq D_j$, $i \neq j$), it is difficult to extract general knowledge, which brings challenges to client collaboration. To address this problem, PFL decouples $w_i$ into a shared part $\sigma$ and a personalized part $\tau_i$ to learn general knowledge and client-specific knowledge, respectively. Formally, the training objective can be formulated as

(1) $\min_{\sigma,\tau_1,\tau_2,\ldots,\tau_N} \sum_{i=1}^{N} L_i(\sigma,\tau_i;D_i),$

where $L_i(\sigma,\tau_i;D_i)$ denotes the loss function of client $i$ and $N$ is the total number of clients. To optimize the target function in Eq. (1), recent studies have put forth various PFL methods to partition $\tau_i$ and $\sigma$. While these endeavors have shown promise, the question of how to further refine the decomposition of these two parameter components still presents an unresolved challenge.

3.3. Low-rank Parameter Decomposition

We observe that shared parameters responsible for extracting general knowledge benefit from a high model capacity. In contrast, personalized parameters are tasked with learning knowledge that complements the general understanding for specific local tasks (i.e., client-specific knowledge), therefore, it is sufficient to use a low-rank matrix to represent these personalized parameters. Based on this observation, we propose FedDecomp, an additive low-rank decomposition technique. Details about this method are as follows.

Additive Decomposition of Personalized Models: Assume that each personalized model has a set of weights $\boldsymbol{\theta}_i = \{\theta_i^k\}_{k=1}^{L}$, where $\theta_i^k$ is the weight matrix of the $k$-th layer and $L$ is the total number of layers. Each weight matrix $\theta_i^k$ is originally full-rank. In FedDecomp, we decompose $\theta_i^k$ as

(2) $\theta_i^k = \sigma_i^k + \tau_i^k, \quad k \in [1, L],$

where $\sigma_i^k$, $k \in [1, L]$, is a full-rank parameter matrix that is shared across all clients, and $\tau_i^k$, $k \in [1, L]$, is a personalized low-rank parameter matrix. In the following, we employ the notation $\boldsymbol{\theta}_i$, $\boldsymbol{\sigma}_i$, and $\boldsymbol{\tau}_i$ to denote the complete model parameter set, the full-rank parameter set, and the low-rank parameter set specific to client $i$, respectively. Additionally, we use $\theta_i^k$, $\sigma_i^k$, and $\tau_i^k$ to represent the parameter matrices for layer $k$ within client $i$.

Next, we present the methods for imposing low-rank constraints on $\boldsymbol{\tau}_i$.

Low-rank Decomposition of Fully-Connected Layers: For fully-connected layers, the dimension of $\tau_i^k$ is $I \times O$, where $I$ and $O$ represent the input and output dimensions. We constrain $\tau_i^k$ through a low-rank decomposition as follows:

(3) $\tau_i^k = B_i^k A_i^k, \quad \text{where}\ B_i^k \in \mathbb{R}^{I \times (R_l \cdot \min(I,O))}\ \text{and}\ A_i^k \in \mathbb{R}^{(R_l \cdot \min(I,O)) \times O}.$

$R_l$ is a hyper-parameter that regulates the rank of $\tau_i^k$ within fully-connected layers. Its value falls within the range $0 < R_l \leq 1$.
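As a rough illustration of Eqs. (2) and (3), the following sketch builds the factors $B$ and $A$ for one fully-connected layer and forms $\theta = \sigma + BA$. It uses only the Python standard library. The helper names (`lowrank_rank`, `matmul`, `effective_weight`), the layer sizes, and the rounding of $R_l \cdot \min(I,O)$ to an integer are assumptions made for this example and are not specified in the text.

```python
import random

def lowrank_rank(in_dim, out_dim, ratio):
    """Inner dimension r = ratio * min(I, O).

    Rounding to an integer >= 1 is an assumption of this sketch.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must lie in (0, 1]")
    return max(1, round(ratio * min(in_dim, out_dim)))

def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def effective_weight(sigma, B, A):
    """theta = sigma + B A, i.e. Eq. (2) with tau taken from Eq. (3)."""
    tau = matmul(B, A)
    return [[s + t for s, t in zip(s_row, t_row)]
            for s_row, t_row in zip(sigma, tau)]

# Hypothetical layer sizes and rank ratio, chosen only for illustration.
I, O, R_l = 64, 32, 0.25
r = lowrank_rank(I, O, R_l)  # 0.25 * min(64, 32) = 8

rng = random.Random(0)
sigma = [[rng.gauss(0.0, 1.0) for _ in range(O)] for _ in range(I)]  # I x O
B = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(I)]      # I x r
A = [[rng.gauss(0.0, 1.0) for _ in range(O)] for _ in range(r)]      # r x O

theta = effective_weight(sigma, B, A)  # I x O, same shape as sigma

full_params = I * O           # entries in the shared matrix sigma
lowrank_params = r * (I + O)  # entries in the personalized factors B and A
```

With these sizes the personalized factors hold 768 values against 2048 in the shared matrix, which reflects the lower capacity assigned to client-specific knowledge.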

Low-rank Decomposition of Convolutional Layers: In contrast to fully-connected layers, convolutional layers involve multiple kernels, resulting in $\tau_i^k \in \mathbb{R}^{I \times O \times K \times K}$, where $K$ is the kernel size. However, we can still apply a low-rank decomposition to constrain $\tau_i^k$ as follows:

(4) $\tau_i^{k*} = B_i^k A_i^k \in \mathbb{R}^{(I \cdot K) \times (O \cdot K)},$
$B_i^k \in \mathbb{R}^{(I \cdot K) \times (R_c \cdot \min(I,O) \cdot K)}\ \text{and}\ A_i^k \in \mathbb{R}^{(R_c \cdot \min(I,O) \cdot K) \times (O \cdot K)},$
$\tau_i^k = \mathrm{Reshape}(\tau_i^{k*}) \in \mathbb{R}^{I \times O \times K \times K}.$

$R_c$ is a hyper-parameter used to control the rank of $\tau_i^k$ within convolutional layers. Its value is within the range $0 < R_c \leq 1$.

During training, both $B$ and $A$ serve as trainable parameter matrices. We initialize $A$ with random Gaussian values and $B$ with zeros, which means $\tau_i^k$ is zero at the beginning of training.
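The convolutional case and the initialization can be sketched in the same way. The example below builds $B$ (zeros) and $A$ (Gaussian) with the shapes given in Eq. (4), multiplies them, and reshapes the result into an $I \times O \times K \times K$ tensor. The function name `conv_lowrank_tau`, the standard-normal scale for $A$, the rounding of the rank, and the row-major interpretation of Reshape are all assumptions of this sketch, since the text does not fix them.

```python
import random

def conv_lowrank_tau(in_ch, out_ch, k, ratio, seed=0):
    """Build tau for one convolutional layer following Eq. (4).

    Assumptions of this sketch: the rank is rounded to an integer >= 1,
    A is drawn from a standard normal, and Reshape reinterprets the
    (I*K) x (O*K) matrix as an I x O x K x K tensor in row-major order.
    """
    rng = random.Random(seed)
    r = max(1, round(ratio * min(in_ch, out_ch) * k))
    rows, cols = in_ch * k, out_ch * k

    B = [[0.0] * r for _ in range(rows)]             # (I*K) x r, zeros
    A = [[rng.gauss(0.0, 1.0) for _ in range(cols)]  # r x (O*K), Gaussian
         for _ in range(r)]

    # tau* = B A, shape (I*K) x (O*K)
    tau_star = [[sum(B[p][s] * A[s][q] for s in range(r)) for q in range(cols)]
                for p in range(rows)]

    # Row-major reshape to I x O x K x K
    flat = [v for row in tau_star for v in row]
    tau = [[[[flat[((i * out_ch + o) * k + a) * k + b] for b in range(k)]
             for a in range(k)]
            for o in range(out_ch)]
           for i in range(in_ch)]
    return B, A, tau

# Hypothetical layer: 4 input channels, 8 output channels, 3x3 kernels.
B, A, tau = conv_lowrank_tau(in_ch=4, out_ch=8, k=3, ratio=0.5)

# Because B starts at zero, every entry of tau is zero, so the personalized
# model initially coincides with the shared full-rank part.
tau_is_zero = all(v == 0.0
                  for i_blk in tau for o_blk in i_blk
                  for row in o_blk for v in row)
```

Starting from $\tau_i^k = 0$ means the first forward pass of every client matches the shared model, and the personalized part only departs from zero once $B$ receives gradient updates.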

The hyper-parameters $R_l$ and $R_c$ play crucial roles in controlling the rank of parameters within fully-connected and convolutional layers, respectively. As the rank increases, the learning capacity of personalized parameters within the model gradually improves. However, if the rank is set too low, $\boldsymbol{\tau}_i$ may struggle to effectively capture client-specific knowledge, making $\boldsymbol{\sigma}_i$ highly susceptible to non-IID data distributions. This, in turn, negatively impacts collaboration among clients. In contrast, if the rank is too large, $\boldsymbol{\tau}_i$ may start to absorb some of the general knowledge that should be learned by $\boldsymbol{\sigma}_i$, diminishing the level of collaboration among clients. For simplicity, in the FedDecomp approach, we apply the same $R_c$ to all convolutional layers and the same $R_l$ to all fully-connected layers. This simplification streamlines the model architecture and hyper-parameter tuning process.

3.4. Coordinate Training Between $\sigma$ and $\tau$


Figure 4. A toy example to illustrate the alternating training in FedDecomp.

To better extract general knowledge, in contrast to the common practice where personalized and shared parameters are trained simultaneously, we find it more effective to train the low-rank parameters first, which mitigates the impact of non-IID data, and only then train the full-rank parameters. Formally, in each communication round $t \in [1, T]$, we first optimize the low-rank parameters $\boldsymbol{\tau}_i$ for $E_{\text{lora}}$ epochs by

(5) $\boldsymbol{\tau}_i^{t+1} = \mathop{\arg\min}_{\boldsymbol{\tau}_i} L_i(\boldsymbol{\tau}_i^{t}, \boldsymbol{\sigma}_i^{t}; D_i).$

Then, we optimize the full-rank parameters $\boldsymbol{\sigma}_i$ for $E_{\text{global}}$ epochs by

(6) $\boldsymbol{\sigma}_i^{t+1} = \mathop{\arg\min}_{\boldsymbol{\sigma}_i} L_i(\boldsymbol{\tau}_i^{t+1}, \boldsymbol{\sigma}_i^{t}; D_i).$

We set $E_{\text{lora}} + E_{\text{global}} = E$, where $E$ is the total number of local update epochs in one round. These hyper-parameters play an important role in balancing the learning dynamics between the two key components, $\boldsymbol{\sigma}_i$ and $\boldsymbol{\tau}_i$. When $E_{\text{lora}}$ is set higher, $\boldsymbol{\sigma}_i$ learns less knowledge. Consequently, the degree of knowledge sharing among clients diminishes. In contrast, if $E_{\text{lora}}$ is set too low, $\boldsymbol{\sigma}_i$ ends up acquiring a significant amount of client-specific knowledge. This scenario increases the risk of clients sharing knowledge that is more susceptible to non-IID data. In special cases, when $E_{\text{lora}} = 0$, the FedDecomp framework degenerates into FedAvg. Similarly, when $E_{\text{global}} = 0$, FedDecomp transforms into local training with low-rank parameters, without any collaborative efforts among clients.
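The two-phase local update of Eqs. (5) and (6) can be illustrated on a deliberately tiny problem. The sketch below is not the training code of FedDecomp: it assumes a scalar model $y \approx \theta x$ with $\theta = \sigma + b \cdot a$ (a rank-1 stand-in for $\tau = BA$), a squared-error loss, and one full-batch gradient step per epoch. The names `mse` and `local_update`, the learning rate, and the data are hypothetical.

```python
def mse(theta, data):
    """Mean squared error of the scalar model y ~ theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def local_update(sigma, a, b, data, e_lora, e_global, lr=0.1):
    """One client's local update: personalized part first, shared part second.

    theta = sigma + b * a, where b * a plays the role of tau = B A.
    Each epoch is a single full-batch gradient step (an assumption).
    """
    def grad_theta(theta):
        # Derivative of the mean squared error with respect to theta.
        return sum(2.0 * (theta * x - y) * x for x, y in data) / len(data)

    # Phase 1 (Eq. 5): sigma is frozen, only the low-rank factors move.
    for _ in range(e_lora):
        g = grad_theta(sigma + b * a)
        a, b = a - lr * g * b, b - lr * g * a

    # Phase 2 (Eq. 6): the low-rank factors are frozen, only sigma moves.
    for _ in range(e_global):
        g = grad_theta(sigma + b * a)
        sigma -= lr * g

    return sigma, a, b

# Hypothetical local data generated by y = 2x.
DATA = [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0)]

# b starts at zero, mirroring the zero initialization of B.
sigma, a, b = local_update(0.0, 1.0, 0.0, DATA, e_lora=2, e_global=3)
loss_before = mse(0.0, DATA)
loss_after = mse(sigma + b * a, DATA)
```

Calling `local_update` with `e_lora=0` leaves `a` and `b` at their initial values, so the personalized part stays at zero and only $\sigma$ is trained, which corresponds to the FedAvg special case described above. With `e_global=0`, $\sigma$ is returned unchanged and the client performs purely local low-rank training.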

After local updating, each client $i$ uploads $\boldsymbol{\sigma}_{i}^{t+1}$ to the server while keeping $\boldsymbol{\tau}_{i}^{t+1}$ private. The server computes a global model $\overline{\boldsymbol{\sigma}}^{t+1}$ by aggregating all clients' $\boldsymbol{\sigma}_{i}^{t+1}$ through

(7) $\overline{\boldsymbol{\sigma}}^{t+1}=\frac{1}{N}\sum_{i=1}^{N}\boldsymbol{\sigma}_{i}^{t+1},$

and sends it back to the clients. The detailed training process is summarized in Algorithm 1.
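The server-side step of Eq. (7) is a uniform average of the uploaded shared matrices. The sketch below assumes each client's upload is a dictionary mapping a layer name to a NumPy array; the function name `aggregate` and this dictionary layout are hypothetical and only serve the illustration.

```python
import numpy as np

def aggregate(client_states):
    """Uniformly average the shared matrices uploaded by all clients (Eq. 7).

    client_states: list of dicts, one per client, mapping a layer name to
    that client's shared matrix sigma_i for the layer (assumed layout).
    """
    names = client_states[0].keys()
    return {n: np.mean([state[n] for state in client_states], axis=0) for n in names}
```

The personalized matrices never appear in this step, since they stay on the clients.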

Algorithm 1 FedDecomp
  Input: Each client's initial personalized parameter matrices $\boldsymbol{\tau}_{i}^{1}$; the global shared parameter matrices $\overline{\boldsymbol{\sigma}}^{1}$; number of clients $N$; total number of communication rounds $T$; number of epochs for updating global matrices $E_{\text{global}}$; number of epochs for updating low-rank matrices $E_{\text{lora}}$.
  Output: Personalized model parameter matrices $\boldsymbol{\theta}_{i}^{T}$ for each client.
  for $t=1$ to $T$ do
     Client-side:
     for $i=1$ to $N$ in parallel do
        Initializing $\boldsymbol{\sigma}_{i}^{t}$ with $\overline{\boldsymbol{\sigma}}^{t}$.
        Updating $\boldsymbol{\tau}_{i}^{t}$ by (5) for $E_{\text{lora}}$ epochs to obtain $\boldsymbol{\tau}_{i}^{t+1}$.
        Updating $\boldsymbol{\sigma}_{i}^{t}$ by (6) for $E_{\text{global}}$ epochs to obtain $\boldsymbol{\sigma}_{i}^{t+1}$.
        Sending $\boldsymbol{\sigma}_{i}^{t+1}$ to the server.
     end for
     Server-side:
     Aggregating a global model $\overline{\boldsymbol{\sigma}}^{t+1}$ by (7).
     Sending $\overline{\boldsymbol{\sigma}}^{t+1}$ to each client $i$.
  end for

To explain the intuition behind alternating training, we use a toy example to illustrate the local update phase of each client's personalized model in the parameter space. As shown in Fig. 4, the yellow $\star$ and the red $\triangle$ denote the optimum of the global model on all clients' data (the global loss minimum) and the optimum of the personalized model on the client's data (the local loss minimum), respectively. Under non-IID data, the global knowledge and a client's local knowledge differ substantially, which places the local minimum far from the global minimum. The client's personalized model $\boldsymbol{\theta}$ is decomposed into the sum of a shared part $\boldsymbol{\sigma}$ and a personalized part $\boldsymbol{\tau}$. Since we train the personalized part first, the client-specific knowledge is mostly learned by $\boldsymbol{\tau}$, and the shift of $\boldsymbol{\theta}$ toward the local minimum is mainly carried out by $\boldsymbol{\tau}$. Therefore, when $\boldsymbol{\sigma}$ is subsequently trained, it moves less toward the local minimum (i.e., it is less affected by non-IID data) and can better extract general knowledge. In Section 4.4, we conduct an experiment to further validate this intuition.

3.5. Training Cost Analysis

In this section, we analyze the memory usage, computation cost, and communication cost of FedDecomp on each client $i$ compared to the baseline method FedAvg.

Memory usage: FedAvg needs to maintain a full-rank parameter set $\boldsymbol{\theta}_{i}$. According to Eq. (2), Eq. (3), and Eq. (4), FedDecomp needs to maintain a full-rank parameter set $\boldsymbol{\sigma}_{i}$, whose number of trainable parameters equals that of $\boldsymbol{\theta}_{i}$, and a low-rank parameter set $\boldsymbol{\tau}_{i}$, whose number of trainable parameters is much smaller than that of $\boldsymbol{\theta}_{i}$. Therefore, the memory required by FedDecomp is slightly higher than that of FedAvg.

Computation cost: In one round, FedAvg updates $\boldsymbol{\theta}_{i}$ for $E$ local epochs. According to Eq. (5) and Eq. (6), FedDecomp updates $\boldsymbol{\tau}_{i}$ for $E_{\text{lora}}$ epochs and updates $\boldsymbol{\sigma}_{i}$ for $E-E_{\text{lora}}$ epochs. Because $\boldsymbol{\tau}_{i}$ has fewer trainable parameters than $\boldsymbol{\theta}_{i}$, the computation cost of FedDecomp is lower than that of FedAvg.

Communication cost: In one communication round, FedAvg needs to upload $\boldsymbol{\theta}_{i}$, while FedDecomp needs to upload $\boldsymbol{\sigma}_{i}$. As $\boldsymbol{\sigma}_{i}$ has the same number of trainable parameters as $\boldsymbol{\theta}_{i}$, FedDecomp has the same communication cost as FedAvg.
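To make the memory comparison concrete, the following sketch counts trainable parameters for a single fully-connected layer. It assumes the low-rank part of a $d\times k$ layer is stored as a $d\times r$ factor times an $r\times k$ factor; this storage layout, the function names, and the example sizes are illustrative assumptions and are not taken from the paper.

```python
def full_rank_params(d, k):
    """Trainable parameters of a dense d x k matrix (theta_i or sigma_i)."""
    return d * k

def low_rank_params(d, k, r):
    """Trainable parameters of a rank-r product of d x r and r x k factors (assumed layout)."""
    return r * (d + k)

def overhead(d, k, r):
    """Extra trainable parameters of the decomposed layer relative to the dense one."""
    return low_rank_params(d, k, r) / full_rank_params(d, k)
```

For example, with the hypothetical sizes `d=512`, `k=100`, `r=8`, the dense matrix has 51200 parameters and the low-rank factors add 4896, an overhead of about 9.6%. Only the dense shared matrix is uploaded, so the communication volume per round is unchanged.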

4. Experiments

4.1. Experiment Setup

Dataset. Our main experiments are conducted on three datasets: CIFAR-10 (Krizhevsky et al., 2010), CIFAR-100 (Krizhevsky et al., 2009), and Tiny ImageNet (Le and Yang, 2015). Experiments on larger datasets involving text modalities are included in the supplemental material. To evaluate the effectiveness of our approach in various scenarios, we adopt the Dirichlet non-IID setting, a commonly used framework in current FL research (Hsu et al., 2019; Lin et al., 2020; Wu et al., 2022). In this setup, each client's data is generated from a Dirichlet distribution $Dir(\alpha)$. As the value of $\alpha$ increases, the level of class imbalance in each client's dataset gradually decreases. Consequently, the Dirichlet non-IID setting allows us to test the performance of our methods across a wide range of non-IID scenarios. For a more intuitive understanding of this concept, we offer a visualization of the data partitioning in the supplemental material.
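One common way to realize such a partition is to draw a class-probability vector per client from a symmetric $Dir(\alpha)$ and then sample that client's class counts from it. The sketch below follows this reading; the exact partitioning procedure used in the experiments is not specified here, so the function name `dirichlet_class_counts` and its arguments are assumptions.

```python
import numpy as np

def dirichlet_class_counts(num_clients, num_classes, samples_per_client, alpha, seed=0):
    """Per-client class counts under a Dirichlet non-IID split (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # One class-probability vector per client, drawn from a symmetric Dir(alpha).
    probs = rng.dirichlet(alpha * np.ones(num_classes), size=num_clients)
    # Convert each probability vector into integer per-class sample counts.
    counts = np.stack([rng.multinomial(samples_per_client, p) for p in probs])
    return counts  # shape: (num_clients, num_classes)
```

A small `alpha` concentrates most of a client's samples in a few classes, while a large `alpha` yields nearly balanced clients.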

Baseline methods. To verify the efficacy of FedDecomp, we compare it with eight state-of-the-art (SOTA) methods: FedAMP (Huang et al., 2021), FedRep (Collins et al., 2021), FedBN (Li et al., 2021c), FedPer (Arivazhagan et al., 2019), FedRoD (Chen and Chao, 2022), pFedSD (Jin et al., 2022), pFedGate (Chen et al., 2023), and FedCAC (Wu et al., 2023). Among these methods, FedAMP forces clients with similar data distributions to learn from each other. FedBN, FedPer, FedRep, FedRoD, and FedCAC are parameter-partition-based methods that partially personalize parameters. pFedSD and pFedGate are fine-tuning-based methods, whose goal is to adapt the global model to the client's local data. These methods cover the latest advancements in various directions of PFL.

Selection of hyper-parameters. We use the hyper-parameters specified in the respective papers for each SOTA method. For the general FL hyper-parameters, we set the number of clients $N=40$, the number of local update epochs $E=5$, the batch size $B=100$, and the total number of communication rounds $T=300$. Each client is assigned 500 training samples and 100 test samples with the same data distribution. We select the best mean accuracy across all clients as the performance metric. Each experiment is repeated using three seeds, and the mean and standard deviation are reported. We adopt the ResNet (He et al., 2016) network structure. Specifically, we use ResNet-8 for CIFAR-10 and ResNet-10 for CIFAR-100 and Tiny ImageNet. In FedDecomp, we adopt the SGD optimizer with a learning rate of 0.1.
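The reporting protocol can be sketched as follows. The sketch assumes that "best mean accuracy" means the maximum over communication rounds of the accuracy averaged over clients, and that the standard deviation is taken over seeds; both readings, the array layout, and the function names are assumptions.

```python
import numpy as np

def best_mean_accuracy(acc):
    """acc has shape (rounds, clients); return the best per-round mean over clients."""
    return float(np.max(np.mean(acc, axis=1)))

def summarize_over_seeds(runs):
    """runs: list of (rounds, clients) arrays, one per seed; return (mean, std)."""
    best = np.array([best_mean_accuracy(a) for a in runs])
    # np.std defaults to the population standard deviation (an assumed choice).
    return float(best.mean()), float(best.std())
```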

4.2. Comparison with SOTA methods

Table 1. Comparison results under Dirichlet non-IID on CIFAR-10, CIFAR-100, and Tiny Imagenet.
| Methods | CIFAR-10, $\alpha=0.1$ | CIFAR-10, $\alpha=0.5$ | CIFAR-10, $\alpha=1.0$ | CIFAR-100, $\alpha=0.1$ | CIFAR-100, $\alpha=0.5$ | CIFAR-100, $\alpha=1.0$ | Tiny Imagenet, $\alpha=0.1$ | Tiny Imagenet, $\alpha=0.5$ | Tiny Imagenet, $\alpha=1.0$ |
|---|---|---|---|---|---|---|---|---|---|
| FedAvg | 60.39±1.46 | 60.41±1.36 | 60.91±0.72 | 34.91±0.86 | 32.78±0.23 | 33.94±0.39 | 21.26±1.28 | 20.32±0.91 | 17.20±0.54 |
| Local | 81.91±3.09 | 60.15±0.86 | 52.24±0.41 | 47.61±0.96 | 22.65±0.51 | 18.76±0.63 | 24.07±0.62 | 8.75±0.30 | 6.87±0.28 |
| FedAMP | 84.99±1.82 | 68.26±0.79 | 64.87±0.95 | 46.68±1.06 | 24.74±0.58 | 18.22±0.41 | 27.85±0.71 | 10.70±0.32 | 7.13±0.21 |
| FedRep | 84.59±1.58 | 67.69±0.86 | 60.52±0.72 | 51.25±1.37 | 26.97±0.33 | 20.63±0.42 | 30.83±1.05 | 12.14±0.28 | 8.37±0.25 |
| FedPer | 84.43±0.47 | 68.80±0.49 | 64.92±0.66 | 51.38±0.94 | 28.25±1.03 | 21.53±0.50 | 32.33±0.31 | 12.69±0.42 | 8.67±0.40 |
| FedBN | 83.55±2.32 | 66.79±1.08 | 62.20±0.67 | 54.35±0.63 | 36.94±0.94 | 33.67±0.12 | 33.34±0.71 | 19.61±0.35 | 16.57±0.44 |
| FedRoD | 86.23±2.12 | 72.34±1.77 | 68.45±1.94 | 60.17±0.48 | 39.88±1.18 | 36.80±0.56 | 41.06±0.77 | 25.63±1.11 | 22.32±1.13 |
| pFedSD | 86.34±2.61 | 71.97±2.07 | 67.21±1.89 | 54.14±0.77 | 41.06±0.83 | 38.27±0.20 | 39.31±0.19 | 19.25±1.80 | 15.91±0.33 |
| pFedGate | 87.25±1.91 | 71.98±1.61 | 67.85±0.87 | 48.54±0.39 | 27.47±0.79 | 22.98±0.03 | 37.59±0.39 | 24.09±0.67 | 19.69±0.14 |
| FedCAC | 86.82±1.18 | 69.83±0.46 | 65.39±0.51 | 57.22±1.52 | 38.64±0.63 | 32.59±0.32 | 40.19±1.20 | 23.70±0.28 | 18.58±0.62 |
| FedDecomp | 85.47±2.06 | 72.78±1.23 | 69.09±1.14 | 63.65±0.53 | 45.96±1.19 | 42.98±0.64 | 44.22±0.55 | 28.25±1.24 | 25.55±0.13 |

In this section, we compare FedDecomp with several SOTA methods. To ensure a comprehensive evaluation, we consider three non-IID degrees (i.e., $\alpha\in\{0.1,0.5,1.0\}$) on CIFAR-10, CIFAR-100, and Tiny Imagenet.

The results in Table 1 demonstrate that the performance of FedAMP is comparable to other SOTA methods on the CIFAR-10 dataset, but experiences a notable decline on CIFAR-100 and Tiny Imagenet. This is primarily because of its limited capacity to leverage collaboration among clients with diverse data distributions. In contrast, mainstream model partition methods such as FedRep, FedBN, FedPer, FedRoD, and FedCAC enhance collaboration among clients by personalizing parameters sensitive to non-IID data while sharing others. Among these methods, FedRoD distinguishes itself by introducing a balanced global classifier to facilitate comprehensive knowledge exchange, underscoring the potential for improvements in client collaboration within current model partition strategies. On the other hand, fine-tuning-based approaches like pFedSD and pFedGate enable all clients to collaboratively train a global model, fostering extensive knowledge exchange. However, this approach can lead to performance degradation in certain non-IID scenarios due to mutual interference during joint training.

Notably, FedDecomp significantly outperforms all baseline methods in the majority of scenarios, particularly as $\alpha$ increases. FedDecomp achieves this by effectively decoupling general and client-specific knowledge through parameter decomposition and by mitigating the impact of non-IID data through alternating training of the full-rank and low-rank matrices.

4.3. Ablation Studies

Effect of $R_{l}$ and $R_{c}$. As we discuss in Section 3.3, $R_{c}$ and $R_{l}$ denote the ratio of the low-rank matrix's rank to the full-rank matrix's rank in convolutional and fully-connected layers, respectively. They are two important hyper-parameters that control the learning ability of the low-rank matrices. In this section, we evaluate the effect of $R_{l}$ and $R_{c}$ on model accuracy. We choose $R_{l}$ and $R_{c}$ from $\{20\%,40\%,60\%,80\%,100\%\}$.
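As a small illustration of how a rank ratio could translate into concrete factor shapes for a fully-connected layer, the sketch below takes the maximal rank of a $d\times k$ matrix to be $\min(d,k)$ and rounds the scaled value. The rounding rule, the random initialization, and the names `rank_from_ratio` and `init_low_rank` are assumptions made for this example; how convolutional kernels are reshaped into matrices is not covered.

```python
import numpy as np

def rank_from_ratio(d, k, ratio):
    """Rank of the low-rank part as a fraction of the maximal rank min(d, k)."""
    return max(1, int(round(ratio * min(d, k))))

def init_low_rank(d, k, ratio, seed=0):
    """Create factors B (d x r) and A (r x k) whose product has rank at most r."""
    rng = np.random.default_rng(seed)
    r = rank_from_ratio(d, k, ratio)
    B = rng.normal(size=(d, r))
    A = rng.normal(size=(r, k))
    return B, A
```

A ratio of 100% makes the product `B @ A` as expressive as a dense matrix of the same shape, which corresponds to the last row and column of the ablation grid.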

The experimental results are presented in Table 2. Firstly, we observe that the optimal combinations of $(R_{c},R_{l})$ are $(60\%,60\%)$ for CIFAR-10, $(80\%,40\%)$ for CIFAR-100, and $(80\%,40\%)$ for Tiny Imagenet. This underscores the importance of setting the personalized parameter matrices to low rank. Secondly, taking the optimal combination as the reference point, model accuracy gradually decreases as the rank increases beyond it. This occurs because the personalized matrices gain more learning capacity and begin to acquire some of the general knowledge. As a result, collaboration among clients on the shared matrices diminishes. As the rank decreases below the optimum, model accuracy also gradually declines, because the personalized matrices fail to capture sufficient client-specific knowledge. This aligns with our expectations. Thirdly, the results show that model accuracy is more sensitive to changes in $R_{l}$ than in $R_{c}$. This suggests that the acquisition of client-specific knowledge has a stronger correlation with the classifier than with the feature extractor, consistent with prior research such as FedPer, FedRep, and FedRoD.

Table 2. The effect of $R_{l}$ and $R_{c}$ on CIFAR-10, CIFAR-100, and Tiny Imagenet under Dirichlet non-IID with $\alpha=0.1$.

| Dataset | $R_{l}$ | $R_{c}=20\%$ | $R_{c}=40\%$ | $R_{c}=60\%$ | $R_{c}=80\%$ | $R_{c}=100\%$ |
|---|---|---|---|---|---|---|
| CIFAR-10 | 20% | 84.72 ± 2.07 | 84.97 ± 1.74 | 84.73 ± 2.33 | 84.80 ± 2.03 | 84.99 ± 2.19 |
| CIFAR-10 | 40% | 84.84 ± 2.19 | 84.96 ± 1.87 | 85.27 ± 2.04 | 84.97 ± 1.86 | 85.39 ± 2.01 |
| CIFAR-10 | 60% | 84.92 ± 1.90 | 85.35 ± 1.96 | 85.47 ± 2.06 | 85.07 ± 2.26 | 85.38 ± 1.76 |
| CIFAR-10 | 80% | 84.70 ± 1.98 | 85.05 ± 1.66 | 85.25 ± 2.00 | 85.01 ± 1.90 | 85.13 ± 1.95 |
| CIFAR-10 | 100% | 85.09 ± 1.95 | 85.23 ± 1.89 | 85.15 ± 1.66 | 84.88 ± 1.77 | 85.21 ± 1.62 |
| CIFAR-100 | 20% | 62.00 ± 0.60 | 62.66 ± 0.37 | 61.99 ± 0.97 | 62.48 ± 0.30 | 62.70 ± 0.83 |
| CIFAR-100 | 40% | 61.70 ± 0.28 | 62.49 ± 0.77 | 62.70 ± 0.59 | 63.65 ± 0.53 | 62.73 ± 0.60 |
| CIFAR-100 | 60% | 61.71 ± 0.30 | 62.88 ± 0.33 | 62.46 ± 0.53 | 63.12 ± 0.38 | 63.24 ± 0.66 |
| CIFAR-100 | 80% | 60.76 ± 0.14 | 62.54 ± 0.56 | 62.74 ± 0.54 | 62.15 ± 0.38 | 62.70 ± 0.57 |
| CIFAR-100 | 100% | 59.54 ± 0.98 | 60.97 ± 0.35 | 61.86 ± 0.74 | 61.96 ± 0.55 | 62.58 ± 0.51 |
| Tiny Imagenet | 20% | 40.77 ± 0.10 | 42.71 ± 0.59 | 43.27 ± 0.43 | 43.78 ± 0.70 | 43.88 ± 0.16 |
| Tiny Imagenet | 40% | 40.14 ± 0.36 | 42.74 ± 0.46 | 43.82 ± 0.46 | 44.22 ± 0.55 | 43.72 ± 0.24 |
| Tiny Imagenet | 60% | 39.39 ± 0.26 | 42.75 ± 0.46 | 43.44 ± 0.43 | 43.85 ± 0.90 | 44.16 ± 0.48 |
| Tiny Imagenet | 80% | 36.90 ± 0.30 | 41.94 ± 0.29 | 42.75 ± 0.49 | 43.21 ± 0.41 | 43.28 ± 0.36 |
| Tiny Imagenet | 100% | 33.90 ± 0.91 | 40.55 ± 0.15 | 41.75 ± 0.81 | 42.16 ± 0.41 | 42.75 ± 0.31 |
Figure 5. Effect of $E_{\text{lora}}$ in the Dirichlet non-IID scenario with $\alpha=0.1$. Panels: (a) CIFAR-10, (b) CIFAR-100, (c) Tiny Imagenet.

Effect of $E_{\text{lora}}$ and $E_{\text{global}}$. In this section, we verify the effect of $E_{\text{lora}}$ and $E_{\text{global}}$ on model accuracy. For simplicity, we set $E_{\text{global}}=E-E_{\text{lora}}$ and only adjust the value of $E_{\text{lora}}$. We conduct experiments on three datasets under Dirichlet non-IID with $\alpha=0.1$ and sample $E_{\text{lora}}\in[0,E]$.

The experimental results are depicted in Fig. 5. When $E_{\text{lora}}=0$, FedDecomp essentially degenerates to FedAvg, and the accuracy closely resembles the FedAvg accuracy presented in Table 1, as expected. As $E_{\text{lora}}$ increases, the accuracy initially rises and then declines. When $E_{\text{lora}}=5$, FedDecomp degenerates to local training with low-rank parameter matrices. However, because these low-rank matrices constrain the model's learning capacity, FedDecomp performs worse than the Local baseline reported in Table 1.

Effect of Alternating training. As we discussed in Section 3.4, different from previous work that trains personalized and shared components simultaneously, we propose to train the personalized part first and then the global part to reduce the impact of non-IID and better extract general knowledge. To evaluate this idea, in this experiment, we compare the performance of two training methods.

The experimental results on three datasets are shown in Table 3. We can see that when the learning task is simple (e.g., a 10-classification task on CIFAR-10), the performance of alternating training and simultaneous training of two matrices is similar. As the learning task becomes increasingly difficult, the performance improvement brought about by alternating training becomes more apparent. This is because, in the case of a simple learning task, the variations in tasks among clients are relatively minor, which facilitates the extraction of general knowledge. However, as the learning task complexity increases, the differences in tasks among clients gradually expand, rendering the extraction of general knowledge more susceptible to non-IID effects. In such scenarios, the utilization of our proposed alternating training method becomes increasingly crucial.

Table 3. The effect of alternating training in FedDecomp on three datasets.

| Methods | CIFAR-10 | CIFAR-100 | Tiny Imagenet |
|---|---|---|---|
| Simultaneously | 85.45±1.83 | 61.18±1.05 | 35.37±0.71 |
| Alternatingly | 85.47±2.06 | 63.65±0.53 | 44.22±0.55 |

Table 4. The effect of low-rank matrices on model capacity.

| Methods | CIFAR-10 & ResNet-8 | CIFAR-100 & ResNet-10 |
|---|---|---|
| Local | 81.91 ± 3.09 | 47.61 ± 0.96 |
| Local w/ Low-Rank | 81.97 ± 2.62 | 47.64 ± 0.79 |
| FedAvg | 60.39 ± 1.46 | 34.91 ± 0.86 |
| FedAvg w/ Low-Rank | 60.91 ± 0.53 | 35.91 ± 0.70 |

Effect of Model Capacity. In FedDecomp, we employ an additive decomposition technique on the model. In theory, this approach does not change the model’s capacity. However, in practical implementation, the decomposed model introduces low-rank matrices, thereby increasing the number of trainable parameters. This raises questions about whether the decomposed model genuinely enhances the model’s capacity and whether the observed performance improvement is primarily a result of the increased number of trainable parameters. To address these concerns, we conducted an experiment to assess the impact on model capacity.

We conduct experiments using two configurations: CIFAR-10 with the ResNet-8 model and CIFAR-100 with the ResNet-10 model. We set up two controlled comparisons: 1) ‘Local’ and ‘Local w/ Low-Rank’ denote models without and with low-rank matrices that are trained exclusively locally; 2) ‘FedAvg’ and ‘FedAvg w/ Low-Rank’ denote models without and with low-rank matrices trained using the FedAvg algorithm. The experimental results are shown in Table 4. Compared with the original model, the model augmented with low-rank matrices exhibits only a minimal performance improvement (at most 1.00 percentage point, for FedAvg on CIFAR-100). This indicates that parameter decomposition by itself does not significantly alter the model's capacity. Hence, the performance gains achieved by FedDecomp cannot be attributed to the additional trainable parameters alone.

Table 5. The effect of personalizing low-rank matrices while sharing full-rank matrices on CIFAR-100.
| Methods | $\alpha=0.1$ | $\alpha=0.5$ | $\alpha=1.0$ |
|---|---|---|---|
| FedDecomp | 63.65±0.53 | 45.96±1.19 | 42.98±0.64 |
| FedDecomp_Reverse | 48.80±0.88 | 23.85±0.99 | 18.52±0.86 |

Effect of Personalizing Low-rank Matrices and Sharing Full-rank Matrices. As discussed in Section 3.3, the shared parameters require a high capacity to maintain general knowledge among clients, while the personalized parameters only need to preserve client-specific knowledge as a supplement to the general knowledge. To validate this intuition, we evaluate ‘FedDecomp_Reverse’, which shares the low-rank matrices and personalizes the full-rank matrices. The rest of the training strategy is the same as in FedDecomp, and the comparison results on CIFAR-100 are shown in Table 5.

We can see that the performance of ‘FedDecomp_Reverse’ is not as good as FedDecomp. Moreover, combining the results in Table 1, we can see that the performance of ‘FedDecomp_Reverse’ is similar to local training. This indicates that in ‘FedDecomp_Reverse’, the knowledge is mainly preserved in personalized parameters and the knowledge sharing among clients is very limited. This supports our intuition of personalizing low-rank matrices and sharing full-rank matrices.

4.4. The Effect of Alternating Training on Model Difference

As discussed in Section 3.4 and Fig. 4, the primary objective of alternating training is to mitigate the impact of data heterogeneity on the shared parameters, essentially reducing the deviation of the shared parameters toward each client's local minimum. Consequently, alternating training should reduce the discrepancies among clients' shared parameters during local training. To validate this, we carry out additional experiments comparing the disparities in shared parameters among clients with and without alternating training. Specifically, in each round $t$, we calculate the average model distance between $\boldsymbol{\sigma}_{i}$, $1\leq i\leq N$, and $\overline{\boldsymbol{\sigma}}$ by $\frac{1}{N}\sum_{i=1}^{N}\|\boldsymbol{\sigma}_{i}^{t}-\overline{\boldsymbol{\sigma}}^{t}\|_{2}$. The results are shown in Fig. 6. On both datasets, alternating training significantly reduces the differences in the shared parameters among clients, which is consistent with our intuition and analysis.
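The distance measure used here can be sketched as follows. The sketch assumes each client's shared parameters are flattened into one vector and that $\overline{\boldsymbol{\sigma}}$ is the uniform average of the clients' matrices passed in; the function name `average_model_distance` is hypothetical.

```python
import numpy as np

def average_model_distance(client_sigmas):
    """Mean L2 distance between each client's shared parameters and their average."""
    # Flatten every client's shared parameters into one row: (N, num_params).
    flat = np.stack([np.ravel(s) for s in client_sigmas])
    mean = flat.mean(axis=0)
    return float(np.mean(np.linalg.norm(flat - mean, axis=1)))
```

A smaller value means the clients' shared parameters drift less from one another during local training.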

Figure 6. Effect of training logic on the average model difference between $\boldsymbol{\sigma}_{i}$, $1\leq i\leq N$, and $\overline{\boldsymbol{\sigma}}$ in the Dirichlet non-IID scenario with $\alpha=0.1$. Panels: (a) CIFAR-100, (b) Tiny Imagenet.

4.5. Experiments with pre-trained model

Table 6. The relationship between $\Delta\boldsymbol{\sigma}$ and $\Delta\boldsymbol{\tau}$ when $\boldsymbol{\sigma}$ is initialized with pre-trained weights.

| $E_{\text{lora}}$ | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Accuracy | 39.35 | 50.37 | 71.47 | 72.00 | 72.42 |
| $\Delta\boldsymbol{\sigma}$ | 37.99 | 31.59 | 22.12 | 16.16 | 10.57 |
| $\Delta\boldsymbol{\tau}$ | 0.00 | 26.48 | 89.68 | 98.26 | 99.63 |

In FedDecomp, we suppose that the client-specific knowledge is learned by the low-rank matrices. To validate this assumption, we initialize 𝝈𝝈\boldsymbol{\sigma}bold_italic_σ with pre-trained weights. In this case, the general knowledge is well extracted. If our idea holds, then the 𝝉𝝉\boldsymbol{\tau}bold_italic_τ should be trained more to learn client-specific knowledge and 𝝈𝝈\boldsymbol{\sigma}bold_italic_σ should be trained less (i.e., Δ𝝈Δ𝝈\Delta\boldsymbol{\sigma}roman_Δ bold_italic_σ should be much smaller than Δ𝝉Δ𝝉\Delta\boldsymbol{\tau}roman_Δ bold_italic_τ).

Specifically, we initialize $\boldsymbol{\sigma}_{i}, i\in[1,N]$ with ImageNet pre-trained weights and conduct an experiment on the CIFAR-100 dataset in the Dirichlet non-IID scenario with $\alpha=0.1$. We calculate $\Delta\boldsymbol{\sigma}$ by $||\overline{\boldsymbol{\sigma}}^{T}-\overline{\boldsymbol{\sigma}}^{1}||_{2}$ and $\Delta\boldsymbol{\tau}$ by $\frac{1}{N}\sum_{i=1}^{N}||\boldsymbol{\tau}_{i}^{T}-\boldsymbol{\tau}_{i}^{1}||_{2}$. We vary $E_{\text{lora}}$ across different levels (a higher $E_{\text{lora}}$ means that $\boldsymbol{\tau}$ is updated more often, which results in a larger $\Delta\boldsymbol{\tau}$). The results are summarized in Table 6.
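The two change measures defined above can be sketched as follows. This is a minimal illustration that assumes the aggregated shared parameters and each client's personalized parameters are available as flattened lists of floats at round 1 and round $T$; how $\boldsymbol{\tau}_{i}$ is flattened (for example, as the product of its low-rank factors) is an assumption of this sketch, and the function names are hypothetical.

```python
import math

def delta_sigma(sigma_bar_first, sigma_bar_last):
    """||sigma_bar^T - sigma_bar^1||_2 for the aggregated shared parameters."""
    return math.dist(sigma_bar_last, sigma_bar_first)

def delta_tau(taus_first, taus_last):
    """(1/N) * sum_i ||tau_i^T - tau_i^1||_2 over the N clients."""
    n = len(taus_first)
    return sum(math.dist(last, first)
               for first, last in zip(taus_first, taus_last)) / n

# Toy values: one aggregated sigma and two clients' tau, at round 1 and round T.
sigma_1, sigma_T = [1.0, 2.0, 2.0], [1.0, 5.0, 6.0]
taus_1 = [[0.0, 0.0], [1.0, 1.0]]
taus_T = [[3.0, 4.0], [1.0, 1.0]]
d_sigma = delta_sigma(sigma_1, sigma_T)
d_tau = delta_tau(taus_1, taus_T)
```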

From the table, we can conclude that when $\Delta\boldsymbol{\tau}$ is larger than $\Delta\boldsymbol{\sigma}$, FedDecomp achieves better results. This indicates that with the pre-trained weights, $\boldsymbol{\tau}$ should be updated more often than $\boldsymbol{\sigma}$ to learn client-specific knowledge. This aligns with our expectations.

4.6. Privacy Analysis

Table 7. PSNR (dB, $\downarrow$) values for privacy evaluation on CIFAR-10 in the Dirichlet non-IID setting with $\alpha=0.1$.

| Methods | FedAvg | FedPer | FedRoD | FedDecomp |
| --- | --- | --- | --- | --- |
| PSNR_Avg | 13.91 | 12.43 | 12.00 | 11.34 |
| PSNR_Max | 17.11 | 19.52 | 16.99 | 13.55 |

In this section, we analyze the privacy protection capability of FedDecomp. To this end, we adopt the Deep Leakage from Gradients (DLG) method (Zhu et al., 2019; Geiping et al., 2020) as the attack scheme. DLG is a common attack against FL, and its main idea is as follows: 1) the attacker steals the gradients calculated by each client using local data; 2) the attacker finds the optimal input through iterative optimization, such that the gradient computed with this input is as close as possible to the actual gradient.
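The two steps of the attack can be illustrated on a deliberately tiny model. The sketch below is not the DLG implementation used in the experiments: it assumes a single linear unit with squared loss, a label known to the attacker, a hand-derived gradient instead of automatic differentiation, and plain gradient descent with backtracking. All function names are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def client_gradient(w, b, x, y):
    """Gradient of the loss 0.5 * (w.x + b - y)^2 with respect to (w, b)."""
    r = dot(w, x) + b - y
    return [r * xi for xi in x], r

def matching_loss(w, b, x_guess, y, g_w, g_b):
    """Squared distance between the dummy input's gradient and the observed one."""
    gw_hat, gb_hat = client_gradient(w, b, x_guess, y)
    return sum((a - c) ** 2 for a, c in zip(gw_hat, g_w)) + (gb_hat - g_b) ** 2

def matching_grad(w, b, x_guess, y, g_w, g_b):
    """Analytic gradient of matching_loss with respect to x_guess."""
    r = dot(w, x_guess) + b - y
    u = [r * xi - gi for xi, gi in zip(x_guess, g_w)]
    v = r - g_b
    scale = dot(u, x_guess) + v
    return [2.0 * (scale * wi + r * ui) for wi, ui in zip(w, u)]

def gradient_matching_attack(w, b, y, g_w, g_b, x_init, steps=200):
    """Step 2: optimize a dummy input so that its gradient matches the stolen one."""
    x = list(x_init)
    loss = matching_loss(w, b, x, y, g_w, g_b)
    for _ in range(steps):
        g = matching_grad(w, b, x, y, g_w, g_b)
        step = 1.0
        while step > 1e-12:
            candidate = [xi - step * gi for xi, gi in zip(x, g)]
            candidate_loss = matching_loss(w, b, candidate, y, g_w, g_b)
            if candidate_loss < loss:  # accept only steps that improve the match
                x, loss = candidate, candidate_loss
                break
            step *= 0.5
    return x, loss

rng = random.Random(0)
dim = 4
w = [rng.gauss(0, 1) for _ in range(dim)]
b, y = 0.1, 1.0
x_true = [rng.random() for _ in range(dim)]       # the client's private input
g_w, g_b = client_gradient(w, b, x_true, y)       # step 1: the stolen gradient
x_init = [rng.random() for _ in range(dim)]       # attacker's random starting point
initial_loss = matching_loss(w, b, x_init, y, g_w, g_b)
x_rec, final_loss = gradient_matching_attack(w, b, y, g_w, g_b, x_init)
```

The loop only accepts steps that reduce the gradient mismatch, so the final mismatch is never worse than the initial one. Whether the input is actually recovered depends on the model, the initialization, and how much of the gradient the attacker can observe.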

In our experimental setup, we choose the CIFAR-10 dataset with 20 clients, and the data distribution of each client follows a Dirichlet distribution with $\alpha=0.1$. For each algorithm, we assume that the attacker can obtain the gradients of the shared parameters. At training rounds 10, 20, 30, 40, and 50, we attempt to recover 5 training images of each client using the DLG method. To measure the quality of image recovery, we use the Peak Signal-to-Noise Ratio (PSNR), which is defined as $PSNR=-10\cdot\log_{10}(\frac{||x-x^{*}||_{2}^{2}}{m\cdot n})$, where $x^{*}$ is the target image to be recovered, $x$ is the image being optimized for recovery, and $m,n$ are the width and height of the image, respectively. A higher PSNR indicates that the recovered image is closer to the original image, which in turn implies weaker privacy protection by the algorithm. Table 7 shows our experimental results, including the average PSNR and the maximum PSNR of the attack results. FedDecomp achieves a lower PSNR than FedAvg and the other personalized methods on both metrics. This means that FedDecomp offers better privacy protection than previous methods, both on average and in the most extreme cases. We believe that this is because our method can effectively decouple general knowledge from personalized knowledge and keep personalized knowledge well preserved in the local low-rank branch. As such, it becomes difficult for the DLG method to recover the original image using only the shared part of the gradients.
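The PSNR definition above translates directly into code. The sketch below assumes images flattened into lists of pixel values scaled to $[0,1]$, so that the peak signal is 1, which is consistent with the formula having no explicit peak term. The function name and the toy pixel values are hypothetical.

```python
import math

def psnr(recovered, target, width, height):
    """PSNR = -10 * log10(||x - x*||_2^2 / (m * n)), following the definition above."""
    squared_error = sum((a - b) ** 2 for a, b in zip(recovered, target))
    if squared_error == 0.0:
        return math.inf  # identical images: the ratio is unbounded
    return -10.0 * math.log10(squared_error / (width * height))

# Toy 2x2 image, flattened row by row.
target = [0.0, 0.5, 1.0, 0.5]
recovered = [0.1, 0.5, 0.9, 0.5]
value = psnr(recovered, target, 2, 2)
```

A reconstruction that is closer to the target yields a smaller squared error and therefore a higher PSNR, which is why lower values in Table 7 indicate better privacy protection.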

5. Conclusion

In this paper, we propose a new PFL method named FedDecomp. FedDecomp decomposes each model parameter matrix into a shared full-rank matrix and a personalized low-rank matrix. To further enhance the acquisition of general knowledge, we devise a training strategy that prioritizes the training of the low-rank matrix so that it absorbs the influence of non-IID data during local training. Our extensive experimental evaluations, conducted across multiple datasets with varying degrees of data heterogeneity, demonstrate the superior performance of FedDecomp compared with SOTA methods.

References

  • Acar et al. (2021) Durmus Alp Emre Acar, Yue Zhao, Ruizhao Zhu, Ramon Matas, Matthew Mattina, Paul Whatmough, and Venkatesh Saligrama. 2021. Debiasing model updates for improving personalized federated training. In International Conference on Machine Learning. PMLR, 21–31.
  • Arivazhagan et al. (2019) Manoj Ghuhan Arivazhagan, Vinay Aggarwal, Aaditya Kumar Singh, and Sunav Choudhary. 2019. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818 (2019).
  • Cai et al. (2023) Luxin Cai, Naiyue Chen, Yuanzhouhan Cao, Jiahuan He, and Yidong Li. 2023. FedCE: Personalized Federated Learning Method based on Clustering Ensembles. In Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29 October 2023- 3 November 2023. ACM, 1625–1633. https://doi.org/10.1145/3581783.3612217
  • Che et al. (2023) Liwei Che, Jiaqi Wang, Yao Zhou, and Fenglong Ma. 2023. Multimodal federated learning: A survey. Sensors 23, 15 (2023), 6986.
  • Chen et al. (2023) Daoyuan Chen, Liuyi Yao, Dawei Gao, Bolin Ding, and Yaliang Li. 2023. Efficient Personalized Federated Learning via Sparse Model-Adaptation. arXiv preprint arXiv:2305.02776 (2023).
  • Chen and Chao (2022) Hong-You Chen and Wei-Lun Chao. 2022. On Bridging Generic and Personalized Federated Learning for Image Classification. In International Conference on Learning Representations. https://openreview.net/forum?id=I1hQbx10Kxn
  • Collins et al. (2021) Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. 2021. Exploiting shared representations for personalized federated learning. (2021), 2089–2099.
  • Fallah et al. (2020) Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. 2020. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. Advances in Neural Information Processing Systems 33 (2020), 3557–3568.
  • Geiping et al. (2020) Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, and Michael Moeller. 2020. Inverting gradients-how easy is it to break privacy in federated learning? Advances in Neural Information Processing Systems 33 (2020), 16937–16947.
  • Gong et al. (2021) Xuan Gong, Abhishek Sharma, Srikrishna Karanam, Ziyan Wu, Terrence Chen, David Doermann, and Arun Innanje. 2021. Ensemble attention distillation for privacy-preserving federated learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 15076–15086.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Hsu et al. (2019) Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. 2019. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335 (2019).
  • Huang et al. (2023) Tiansheng Huang, Li Shen, Yan Sun, Weiwei Lin, and Dacheng Tao. 2023. Fusion of global and local knowledge for personalized federated learning. arXiv preprint arXiv:2302.11051 (2023).
  • Huang et al. (2021) Yutao Huang, Lingyang Chu, Zirui Zhou, Lanjun Wang, Jiangchuan Liu, Jian Pei, and Yong Zhang. 2021. Personalized cross-silo federated learning on non-iid data. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Hyeon-Woo et al. (2022) Nam Hyeon-Woo, Moon Ye-Bin, and Tae-Hyun Oh. 2022. FedPara: Low-rank Hadamard Product for Communication-Efficient Federated Learning. In International Conference on Learning Representations. https://openreview.net/forum?id=d71n4ftoCBy
  • Jeong and Hwang (2022) Wonyong Jeong and Sung Ju Hwang. 2022. Factorized-fl: Personalized federated learning with parameter factorization & similarity matching. Advances in Neural Information Processing Systems 35 (2022), 35684–35695.
  • Jin et al. (2022) Hai Jin, Dongshan Bai, Dezhong Yao, Yutong Dai, Lin Gu, Chen Yu, and Lichao Sun. 2022. Personalized edge intelligence via federated self-knowledge distillation. IEEE Transactions on Parallel and Distributed Systems 34, 2 (2022), 567–580.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009).
  • Krizhevsky et al. (2010) Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. 2010. Cifar-10 (canadian institute for advanced research). URL http://www.cs.toronto.edu/~kriz/cifar.html 5 (2010).
  • Lao et al. (2023) Mingrui Lao, Nan Pu, Zhun Zhong, Nicu Sebe, and Michael S. Lew. 2023. FedVQA: Personalized Federated Visual Question Answering over Heterogeneous Scenes. In Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29 October 2023- 3 November 2023. ACM, 7796–7807. https://doi.org/10.1145/3581783.3611958
  • Le and Yang (2015) Ya Le and Xuan Yang. 2015. Tiny imagenet visual recognition challenge. CS 231N 7, 7 (2015), 3.
  • Li et al. (2021a) Qinbin Li, Bingsheng He, and Dawn Song. 2021a. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10713–10722.
  • Li et al. (2021b) Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. 2021b. Ditto: Fair and robust federated learning through personalization. In International Conference on Machine Learning. PMLR, 6357–6368.
  • Li et al. (2021c) Xiaoxiao Li, Meirui JIANG, Xiaofei Zhang, Michael Kamp, and Qi Dou. 2021c. FedBN: Federated Learning on Non-IID Features via Local Batch Normalization. In International Conference on Learning Representations. https://openreview.net/forum?id=6YEQUn0QICG
  • Liang et al. (2020) Paul Pu Liang, Terrance Liu, Liu Ziyin, Nicholas B Allen, Randy P Auerbach, David Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2020. Think locally, act globally: Federated learning with local and global representations. arXiv preprint arXiv:2001.01523 (2020).
  • Lin et al. (2020) Tao Lin, Lingjing Kong, Sebastian U Stich, and Martin Jaggi. 2020. Ensemble Distillation for Robust Model Fusion in Federated Learning. In NeurIPS.
  • Ma et al. (2022) Xiaosong Ma, Jie Zhang, Song Guo, and Wenchao Xu. 2022. Layer-wised model aggregation for personalized federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10092–10101.
  • McMahan et al. (2016) H Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Agüera y Arcas. 2016. Federated learning of deep networks using model averaging. arXiv preprint arXiv:1602.05629 (2016).
  • Mills et al. (2021) Jed Mills, Jia Hu, and Geyong Min. 2021. Multi-task federated learning for personalised deep neural networks in edge computing. IEEE Transactions on Parallel and Distributed Systems 33, 3 (2021), 630–641.
  • Pillutla et al. (2022) Krishna Pillutla, Kshitiz Malik, Abdel-Rahman Mohamed, Mike Rabbat, Maziar Sanjabi, and Lin Xiao. 2022. Federated learning with partial model personalization. In International Conference on Machine Learning. PMLR, 17716–17758.
  • Sattler et al. (2020) Felix Sattler, Klaus-Robert Müller, and Wojciech Samek. 2020. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE Transactions on Neural Networks and Learning Systems (2020).
  • Sun et al. (2021) Benyuan Sun, Hongxing Huo, Yi Yang, and Bo Bai. 2021. Partialfed: Cross-domain personalized federated learning via partial initialization. Advances in Neural Information Processing Systems 34 (2021), 23309–23320.
  • T Dinh et al. (2020) Canh T Dinh, Nguyen Tran, and Josh Nguyen. 2020. Personalized federated learning with moreau envelopes. Advances in Neural Information Processing Systems 33 (2020), 21394–21405.
  • Wu et al. (2023) Xinghao Wu, Xuefeng Liu, Jianwei Niu, Guogang Zhu, and Shaojie Tang. 2023. Bold but Cautious: Unlocking the Potential of Personalized Federated Learning through Cautiously Aggressive Collaboration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 19375–19384.
  • Wu et al. (2022) Xinghao Wu, Jianwei Niu, Xuefeng Liu, Tao Ren, Zhangmin Huang, and Zhetao Li. 2022. pFedGF: Enabling Personalized Federated Learning via Gradient Fusion. In 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 639–649.
  • Xiong et al. (2023) Baochen Xiong, Xiaoshan Yang, Yaguang Song, Yaowei Wang, and Changsheng Xu. 2023. Client-Adaptive Cross-Model Reconstruction Network for Modality-Incomplete Multimodal Federated Learning. In Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29 October 2023- 3 November 2023. ACM, 1241–1249. https://doi.org/10.1145/3581783.3611757
  • Yi et al. (2023) Liping Yi, Gang Wang, Xiaoguang Liu, Zhuan Shi, and Han Yu. 2023. FedGH: Heterogeneous Federated Learning with Generalized Global Header. In Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29 October 2023- 3 November 2023. ACM, 8686–8696. https://doi.org/10.1145/3581783.3611781
  • Zhang et al. (2023a) Jianqing Zhang, Yang Hua, Hao Wang, Tao Song, Zhengui Xue, Ruhui Ma, Jian Cao, and Haibing Guan. 2023a. GPFL: Simultaneously Learning Global and Personalized Feature Information for Personalized Federated Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 5041–5051.
  • Zhang et al. (2021a) Lin Zhang, Yong Luo, Yan Bai, Bo Du, and Ling-Yu Duan. 2021a. Federated learning for non-iid data via unified feature learning and optimization objective alignment. In Proceedings of the IEEE/CVF international conference on computer vision. 4420–4428.
  • Zhang et al. (2021b) Michael Zhang, Karan Sapra, Sanja Fidler, Serena Yeung, and Jose M. Alvarez. 2021b. Personalized Federated Learning with First Order Model Optimization. In International Conference on Learning Representations. https://openreview.net/forum?id=ehJqJQk9cw
  • Zhang et al. (2023b) Pengling Zhang, Huibin Yan, Wenhui Wu, and Shuoyao Wang. 2023b. Improving Federated Person Re-Identification through Feature-Aware Proximity and Aggregation. In Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29 October 2023- 3 November 2023. ACM, 2498–2506. https://doi.org/10.1145/3581783.3612350
  • Zheng et al. (2022) Kaiyu Zheng, Xuefeng Liu, Guogang Zhu, Xinghao Wu, and Jianwei Niu. 2022. ChannelFed: Enabling Personalized Federated Learning via Localized Channel Attention. In GLOBECOM 2022 - 2022 IEEE Global Communications Conference. 2987–2992. https://doi.org/10.1109/GLOBECOM48099.2022.10000892
  • Zhu et al. (2019) Ligeng Zhu, Zhijian Liu, and Song Han. 2019. Deep leakage from gradients. Advances in neural information processing systems 32 (2019).

Appendix A Partial Client Participation

Table 8. The effect of partial client participation.

| Datasets | 100% | 90% | 70% | 50% |
| --- | --- | --- | --- | --- |
| CIFAR-10 | 85.47 ± 2.06 | 85.38 ± 1.62 (-0.09) | 85.25 ± 1.67 (-0.22) | 85.36 ± 1.67 (-0.11) |
| CIFAR-100 | 63.65 ± 0.53 | 63.01 ± 0.10 (-0.64) | 63.21 ± 0.18 (-0.44) | 63.13 ± 1.05 (-0.52) |
| Tiny | 44.22 ± 0.55 | 44.13 ± 0.70 (-0.09) | 44.10 ± 0.26 (-0.12) | 43.99 ± 0.62 (-0.23) |

In our previous experiments, we assume that all clients participate in FL training in each round. However, some clients may be offline due to reasons such as unstable communication links. This is the partial client participation problem that is common in FL. In this section, we evaluate the robustness of FedDecomp to this problem. We consider scenarios where 90%, 70%, and 50% of the clients participate in each round and carry out experiments on CIFAR-10, CIFAR-100, and Tiny Imagenet with $\alpha=0.1$.
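One simple way to simulate this setting is sketched below: in each round a fixed fraction of clients is drawn at random, and only their shared parameters are averaged. Uniform sampling without replacement and unweighted averaging are assumptions of this sketch, since the sampling scheme is not specified here, and the function names are hypothetical.

```python
import random

def sample_participants(num_clients, fraction, rng):
    """Pick the clients that are online in one round (uniformly, without replacement)."""
    k = max(1, round(num_clients * fraction))
    return sorted(rng.sample(range(num_clients), k))

def aggregate_shared(client_sigmas, participants):
    """Average the shared parameters of the participating clients only."""
    dim = len(client_sigmas[0])
    return [sum(client_sigmas[i][d] for i in participants) / len(participants)
            for d in range(dim)]

rng = random.Random(0)
# Ten clients with two shared parameters each (toy values).
sigmas = [[float(i), 2.0 * i] for i in range(10)]
online = sample_participants(10, 0.7, rng)
sigma_bar = aggregate_shared(sigmas, online)
```

Offline clients simply keep their personalized parameters and rejoin the aggregation in a later round.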

The results are shown in Table 8. In all scenarios, partial client participation does not significantly affect accuracy compared with full client participation. We attribute this to FedDecomp effectively separating general knowledge from client-specific knowledge and to alternating training reducing the effect of non-IID data, which ensures that client collaboration is not significantly affected by offline clients in each round.

Appendix B Additional Experiments on Larger Datasets

While current mainstream FL work focuses on algorithm performance on small image datasets, in this section we further verify the performance of FedDecomp on larger datasets as well as on datasets of other modalities.

Specifically, we conduct additional experiments on both a larger image dataset and a natural language processing (NLP) dataset. For the larger image dataset, we select a subset of ImageNet consisting of 400 classes with a total of 80,000 samples. We utilize the ResNet-10 model architecture, with each client having 2,000 training samples generated following the Dirichlet distribution with $\alpha=0.1$. For the NLP dataset, we opt for AG_NEWS, a text 5-classification dataset with 120,000 samples. We employ the Transformer model architecture, with each client having 3,000 training samples generated following the Dirichlet distribution with $\alpha=1.0$. Additionally, for the Transformer model, we apply model decomposition to the weights in the self-attention modules and the fully connected weights in the classifier module.

Table 9 displays the test accuracy results for these two datasets. It’s evident that FedDecomp consistently outperforms other state-of-the-art methods on both datasets.

Table 9. Comparison results on larger datasets.

| Datasets | FedAvg | FedPer | FedRoD | FedDecomp |
| --- | --- | --- | --- | --- |
| AG_NEWS | 89.36 | 90.76 | 91.38 | 91.79 |
| ImageNet-Subset | 18.55 | 29.37 | 32.45 | 35.67 |

Appendix C Whether FedDecomp Sacrifices Some Clients’ Accuracy

Figure 7. Test accuracy of each client in the Dirichlet non-IID scenario with $\alpha=0.1$. Panels: (a) CIFAR-100; (b) Tiny Imagenet.

In previous experiments, we demonstrate the improvement of the averaged accuracy of all clients. In this section, we focus on the individual improvement for each client and verify whether FedDecomp sacrifices some clients’ accuracy. We plot each client’s accuracy in FedDecomp, FedAvg, and Local (i.e., each client trains the model locally without collaboration) methods in the Dirichlet non-IID scenario with $\alpha=0.1$. The results are shown in Fig. 7.

Notice that the accuracy of all clients in the FedDecomp method is higher than that in the FedAvg and Local methods, affirming that the use of FedDecomp does not lead to any deterioration in individual client performance.

Appendix D Visualization of Data Partitioning in Dirichlet non-IID

Figure 8. Visualization of data partitioning in Dirichlet non-IID scenarios with different $\alpha$. Panels: (a) $\alpha=0.1$, 10-class; (b) $\alpha=0.5$, 10-class; (c) $\alpha=1.0$, 10-class; (d) $\alpha=0.1$, 50-class; (e) $\alpha=0.5$, 50-class.

To facilitate intuitive understanding, we use 20 clients on 10-class and 50-class datasets to visualize the data distribution of clients with different $\alpha$ values. As shown in Figure 8, the horizontal axis represents the data class label index, and the vertical axis represents the client ID. Red dots represent the data assigned to clients. The larger the dot is, the more data the client has in this class. When $\alpha$ is small (e.g., $\alpha=0.1$), the overall data distributions of clients vary greatly. However, the variety of client data distributions is low, and it is easy to have clients with very similar data distributions. As $\alpha$ increases, the extent of class imbalance within each client’s dataset gradually diminishes, which leads to more difficult local tasks (i.e., more classes are involved and fewer samples are available for each class). Concurrently, the dissimilarity in data distribution among different clients gradually diminishes, while the diversity in client data distribution widens. Furthermore, comparing the 10-class dataset and the 50-class dataset, it can be seen that under the same $\alpha$ value, when the number of dataset classes increases, the difference between client data distributions becomes larger and the diversity of client data distributions increases, so it becomes more difficult to extract general knowledge among clients.

In summary, the Dirichlet non-IID configuration proves to be a potent approach for assessing the performance of PFL methods across a spectrum of intricate and diverse non-IID scenarios.
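For readers who want to reproduce a partition of the kind visualized in Figure 8, the following is a minimal sketch of one common Dirichlet scheme: each client draws its own class proportions from a symmetric Dirichlet distribution with concentration $\alpha$ and then receives a fixed number of samples according to those proportions. This is an illustration under stated assumptions, not the partitioning code used in the paper. The function names are hypothetical, and for simplicity different clients are allowed to receive overlapping samples.

```python
import random
from collections import Counter

def dirichlet_sample(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) via normalized Gamma draws."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
        total = sum(draws)
        if total > 0.0:  # guard against numerical underflow for very small alpha
            return [d / total for d in draws]

def dirichlet_client_split(labels, num_clients, samples_per_client, alpha, seed=0):
    """Assign sample indices to clients using per-client Dirichlet class proportions."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    pools = {c: [i for i, lab in enumerate(labels) if lab == c] for c in classes}
    clients = []
    for _ in range(num_clients):
        proportions = dirichlet_sample(alpha, len(classes), rng)
        # Draw this client's class labels according to its own proportions.
        counts = Counter(rng.choices(classes, weights=proportions, k=samples_per_client))
        indices = []
        for c, count in counts.items():
            # Sample without replacement inside a client; clients may overlap.
            indices.extend(rng.sample(pools[c], min(count, len(pools[c]))))
        clients.append(indices)
    return clients

# Toy dataset: 10 classes with 500 samples each, split across 20 clients.
labels = [i % 10 for i in range(5000)]
clients = dirichlet_client_split(labels, num_clients=20, samples_per_client=100, alpha=0.1)
```

With a small $\alpha$, most of a client's samples typically come from one or two classes, while a larger $\alpha$ spreads them over more classes, matching the trend described for Figure 8.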