Overcoming Data and Model Heterogeneities in Decentralized Federated Learning via Synthetic Anchors

Chun-Yin Huang    Kartik Srinivas    Xin Zhang    Xiaoxiao Li
Abstract

Conventional Federated Learning (FL) involves collaborative training of a global model while maintaining user data privacy. One of its branches, decentralized FL, is a serverless network that allows clients to own and optimize different local models separately, which saves management and communication resources. Despite these promising advancements, decentralized FL may reduce model generalizability because it lacks a global model. In this scenario, managing data and model heterogeneity among clients becomes a crucial problem and poses a unique challenge: How can every client’s local model learn generalizable representations in a decentralized manner? To address this challenge, we propose a novel Decentralized FL technique that introduces Synthetic Anchors, dubbed DeSA. Based on the theory of domain adaptation and Knowledge Distillation (KD), we theoretically and empirically show that synthesizing global anchors based on the raw data distribution facilitates mutual knowledge transfer. We further design two effective regularization terms for local training: 1) a REG loss that regularizes the distribution of the client’s latent embedding with the anchors and 2) a KD loss that enables clients to learn from others. Through extensive experiments on diverse client data distributions, we showcase the effectiveness of DeSA in enhancing both inter- and intra-domain accuracy of each client.


1 Introduction

Federated learning (FL) has emerged as an important paradigm for performing machine learning on multi-source data in a distributed manner. Conventional FL techniques leverage a large number of clients to learn a global model, coordinated by a central server. However, the central-server design raises concerns about increased vulnerability to system failures and about trustworthiness. An emerging paradigm, called decentralized FL, addresses these issues through its serverless setting. Recent work has shown that decentralized FL frameworks, which relax the use of a central server for model aggregation, can provide more flexibility and scalability (Beltrán et al., 2023; Yuan et al., 2023b). However, removing the server could degrade the generalization capability of each client model. Although most works in decentralized FL focus on model personalization (Huang et al., 2022), we consider it crucial for decentralized FL to be generalizable, since local training data may not align with local testing data in practice.

Client heterogeneity is a common phenomenon in FL that can deteriorate model generalizability. On the one hand, data heterogeneity relaxes the assumption that the data across all clients are independent and identically distributed (i.i.d.). A plethora of methods have been proposed to solve this problem. However, most of them assume that the model architectures are invariant across clients (Li et al., 2020b, a, 2021b; Karimireddy et al., 2020; Tang et al., 2022). On the other hand, many practical FL applications (e.g., Internet-of-Things and mobile device systems) face model heterogeneity, where clients have devices with different computation capabilities and memory constraints. In conventional FL, strategies have been proposed that leverage knowledge transfer to address model heterogeneity, e.g., the server collects labeled data with a distribution similar to the client data, or clients transmit models (Lin et al., 2020; Zhu et al., 2021). However, these operations usually require a server to coordinate the knowledge distillation and assume that global data is available (Li & Wang, 2019; Lin et al., 2020; Tan et al., 2022). Thus, they are not applicable to decentralized FL and may not fit real-world scenarios. Recent works propose test-time adaptation for out-of-federation clients (Jiang et al., 2023), whereas this paper focuses on solutions applied during FL training.

Both heterogeneous FL and decentralized FL leave open the following practical research question: How can every client model perform well on other client domains in a completely decentralized and heterogeneous FL setup? We refer to this problem as decentralized federated mutual learning, which is further detailed in Section 2.2. To the best of our knowledge, we are the first to address both data and model heterogeneity under the serverless decentralized FL setting (see the comparison with related work in Table 1).

Figure 1: The decision boundary before (a) and after (b) applying our proposed $\mathcal{L}_{\rm REG}$ (Eq. 4) and $\mathcal{L}_{\rm KD}$ (Eq. 6) using our synthetic anchor data. $\mathcal{L}_{\rm REG}$ aims to pull the raw features towards the synthetic anchor features, and $\mathcal{L}_{\rm KD}$ twists the local decision boundary towards the generalized decision boundary.

In this work, we tackle this research question through local heterogeneity-harmonized training and knowledge distillation. In particular, we design a lightweight synthetic data generation process based on distribution matching (Zhao & Bilen, 2023), and the synthetic data can be exchanged across clients to augment local datasets. We theoretically and empirically show that the synthetic data can serve as anchor points that improve FL in two ways: 1) reducing the domain gap between the distributions in the latent feature space; and 2) enabling global knowledge distillation for mutual learning. The effects of the two losses are visualized in Figure 1. In summary, we tackle a realistic and challenging setting in decentralized FL, where both data and model heterogeneities exist, without requiring publicly available global real data. Our contributions are listed as follows:

  • To circumvent the heterogeneity in data and models, we propose an innovative algorithm named Decentralized Federated Learning with Synthetic Anchors (DeSA) that utilizes only a small amount of synthetic data.

  • We theoretically and empirically show that the strategic design of synthetic anchor data and our novel FL loss function effectively boost local model generalization in diverse data scenarios.

  • We conduct extensive experiments that demonstrate DeSA’s effectiveness: it surpasses existing decentralized FL algorithms and excels in inter- and intra-client performance across diverse tasks.

2 Preliminaries

2.1 Conventional Federated Learning

Conventional FL aims to learn a single generalized global model that performs optimally on all the clients’ data domains. Mathematically, the learning problem can be formulated as

$$M^{*}=\arg\min_{M\in\mathcal{M}}\sum_{i=1}^{N}\mathbb{E}_{\mathbf{x},y\sim P_{i}}[\mathcal{L}(M(\mathbf{x}),y)] \tag{1}$$

where $M^{*}$ is the optimized global model from the shared model space $\mathcal{M}$, $P_{i}$ is the local data distribution at the $i$th client, $\mathcal{L}$ is the loss function, and $(\mathbf{x},y)$ is the feature-label pair. Inspired by the pioneering work of FedAvg (McMahan et al., 2017), a plethora of methods have tried to close the performance gap of FedAvg in data-heterogeneous scenarios. They fall into two main orthogonal directions: Direction 1 aims to minimize the difference between the local and global model parameters to improve convergence (Li et al., 2020a; Karimireddy et al., 2020; Wang et al., 2020). Direction 2 enforces consistency in local embedded features using anchors and a regularization loss (Tang et al., 2022; Zhou et al., 2022; Ye et al., 2022). This work follows the second direction and aims to leverage anchor points to handle data heterogeneity. We also tackle the more challenging problem of domain shift, unlike other methods that only assume a label shift among the client data distributions.
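For reference, the server-side aggregation step of FedAvg, which the conventional setting of Eq. 1 builds on, can be sketched as a sample-size-weighted average of client parameters. This is only a minimal sketch: parameters are flat Python lists, and the names `fedavg_aggregate`, `client_params`, and `client_sizes` are illustrative assumptions, not identifiers from the paper.

```python
def fedavg_aggregate(client_params, client_sizes):
    """Average flat parameter vectors, weighting each client by its sample count."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    # Every client must share one architecture, hence one parameter length.
    assert all(len(p) == dim for p in client_params)
    return [
        sum(n * p[k] for p, n in zip(client_params, client_sizes)) / total
        for k in range(dim)
    ]


# Two hypothetical clients holding 10 and 30 samples.
global_params = fedavg_aggregate([[0.0, 4.0], [4.0, 0.0]], [10, 30])
print(global_params)
```

The equal-length check is exactly the requirement that breaks under model heterogeneity, which is why parameter averaging is not available in the setting studied here.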

Table 1: Comparison of the settings with other related heterogeneous FL and decentralized FL methods.

Methods | Data Heterogeneity | Model Heterogeneity | Serverless | No Public Data
VHL (Tang et al., 2022) a
FedGen (Zhu et al., 2021)
FedHe (Chan & Ngai, 2021)
FedDF (Lin et al., 2020)
FCCL (Huang et al., 2022)
FedProto (Tan et al., 2022)
FedFTG (Zhang et al., 2022b)
DENSE (Zhang et al., 2022a)
DeSA (ours)
a VHL has a single global model, trained using mutual information from all clients. Therefore we reference it under Mutual Learning.

2.2 Decentralized FL and Mutual Learning

Standard decentralized FL aims to solve a generalization objective similar to that of conventional FL (i.e., Eq. 1), but without a central server (Gao et al., 2022), and the objective applies to each local model $M_{i}$. Here, we focus on clients learning from each other under heterogeneous models and data distributions. This brings in an essential line of work, known as mutual learning, where clients learn from each other to obtain the generalizability their models need. For example, during a pandemic quarantine, hospitals may want to collaboratively train a model for classifying the virus. It is desirable that the models generalize to virus variants from different areas, so that after the quarantine the local models remain effective for incoming tourists.

Although mutual learning with heterogeneous data and models has been studied recently, most existing methods assume the existence of public real data (Lin et al., 2020; Huang et al., 2022; Gong et al., 2022) or a central server that coordinates the generation of synthetic data from the local client data (Zhang et al., 2022a; Zhu et al., 2021; Zhang et al., 2022b). Another line of work relies on a server to aggregate locally generated logits or prototypes and uses them as local training guidance (Jeong et al., 2018; Chan & Ngai, 2021; Tan et al., 2022). In addition, more recent works have suggested that each client train two models, a larger model for local training and a smaller model for mutual information exchange (Wu et al., 2022; Shen et al., 2023). However, none of the above methods simultaneously addresses both non-i.i.d. data and heterogeneous models under a serverless and data-free setting. In this work, we explore mutual learning to optimize both local (intra-client) and global (inter-client) dataset accuracy (see the detailed setup in Sec. 3.1). We list the comparison with other methods in Table 1 and more detailed related works in Appendix G.

3 Method

3.1 Notation and Problem Setup

Suppose there are $N$ clients, with the $i$th client denoted as $C_{i}$. We denote the private dataset on $C_{i}$ as $D_{i}=\{\mathbf{x},y\}$, where $\mathbf{x}$ is the feature and $y\in\{1,\cdots,K\}$ is the label from $K$ classes. Let $\mathcal{L}$ represent a real-valued loss function for classification (e.g., cross-entropy loss). Denote the communication neighboring nodes of client $C_{i}$ in the system as $\mathcal{N}(C_{i})$ and the local models as $\{M_{i}=\rho_{i}\circ\psi_{i}\}_{i=1}^{i=N}$, where $\psi_{i}$ represents the feature encoder and $\rho_{i}$ represents the classification head of the $i$th client’s model $M_{i}$. DeSA returns the trained client models $\{M_{i}\}_{i=1}^{i=N}$.

Our work aims to connect two key areas, heterogeneous FL and decentralized FL, in a setting we term decentralized federated mutual learning, where we train multiple client models in a decentralized way such that they generalize well across all clients’ data domains. Mathematically, our objective is formulated as follows: for every client $i$,

$$M_{i}^{*}=\operatorname*{arg\,min}_{M_{i}\in\mathcal{M}_{i}}\underbrace{\mathbb{E}_{\mathbf{x},y\sim P_{i}}[\mathcal{L}(M_{i}(\mathbf{x}),y)]}_{\text{Intra-client}}+\underbrace{\sum_{j\in\mathcal{N}(C_{i})}\mathbb{E}_{\mathbf{x},y\sim P_{j}}[\mathcal{L}(M_{i}(\mathbf{x}),y)]}_{\text{Inter-client}}, \tag{2}$$

where $M_{i}^{*}$ is the best possible model for client $i$ with respect to the model space $\mathcal{M}_{i}$.
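To make the two terms of Eq. 2 concrete, the sketch below evaluates the objective with the expectations replaced by sample means. It is written as if client $i$ could read its neighbors' data, which is precisely what the decentralized setting forbids; it illustrates the target being optimized, not how DeSA optimizes it. The helper names, the toy logistic model, and the tiny datasets are all assumptions made for illustration.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


def empirical_risk(model, dataset):
    """Mean loss of `model` over a list of (x, y) pairs."""
    return sum(cross_entropy(model(x), y) for x, y in dataset) / len(dataset)


def mutual_learning_objective(model_i, own_data, neighbor_data):
    """Eq. 2 with expectations replaced by sample means."""
    intra = empirical_risk(model_i, own_data)                       # intra-client term
    inter = sum(empirical_risk(model_i, d) for d in neighbor_data)  # inter-client term
    return intra + inter


def toy_model(x):
    # Hypothetical two-class model: a fixed logistic score on a scalar feature.
    p1 = 1.0 / (1.0 + math.exp(-x))
    return [1.0 - p1, p1]


own = [(2.0, 1), (-2.0, 0)]
neighbors = [[(1.0, 1), (-1.0, 0)], [(0.5, 0), (-0.5, 1)]]
print(mutual_learning_objective(toy_model, own, neighbors))
```

With an empty neighbor list the function reduces to the purely local (intra-client) risk, so the inter-client sum is what asks the local model to generalize beyond its own domain.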

Figure 2: Heterogeneous setup and DeSA pipeline. (a) Heterogeneous setup: we assume a realistic FL scenario, where clients have different data distributions and computational powers, which results in different model architectures. (b) DeSA pipeline: it consists of three phases, local data synthesis (top left), global synthetic data aggregation (top right) (Section 3.2), and decentralized training (bottom) using anchor regularization (Section 3.3) and knowledge distillation (Section 3.4).

Overview of DeSA. The overall objective of DeSA is to improve local models’ generalizability in FL training under both model and data heterogeneity in a serverless setting, as shown in Figure 2(a). The pipeline of DeSA is depicted in Figure 2(b). Our algorithm contains three important components: 1) we generate synthetic anchor data by matching the raw data distribution and share them among each client’s neighbors; 2) we train each client model locally with a synthetic anchor-based feature regularizer; and 3) we allow the models to learn from each other via knowledge distillation based on the synthetic anchors. The effectiveness of components 2 and 3 can be observed in Figure 1. The next three subsections delve deeper into these three designs. The full algorithm is given in Algorithm 1.

3.2 Synthetic Anchor Datasets Generation

Recent dataset distillation-based data synthesis techniques have succeeded in generating data with representation power similar to that of the original raw data (Zhao et al., 2020; Zhao & Bilen, 2023). Thus, we propose to leverage such a method to efficiently and effectively generate a synthetic anchor dataset without requiring any additional model pretraining. Inspired by our theoretical analysis in Sec. 4, we utilize distribution matching (Zhao & Bilen, 2023) to distill local synthetic anchor data using the empirical maximum mean discrepancy (MMD) loss (Gretton et al., 2012) as follows,

$$D^{Syn}_{i}=\operatorname*{arg\,min}_{D}\Big\|\frac{1}{|D_{i}|}\sum_{(\mathbf{x},y)\in D_{i}}\psi^{\rm rand}(\mathbf{x}|y)-\frac{1}{|D|}\sum_{(\mathbf{x},y)\in D}\psi^{\rm rand}(\mathbf{x}|y)\Big\|^{2}, \tag{3}$$

where $\psi^{\rm rand}$ is a randomly sampled feature extractor for each iteration, $D_{i}$ is the raw data for client $i$, and $D^{Syn}_{i}$ is its target synthetic data. Following Eq. 3, we can generate the synthetic anchor dataset in a class-balanced manner, which prevents the label prior from being biased towards a subset of classes.
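The optimization in Eq. 3 can be sketched in a few lines of plain Python. The sketch makes strong simplifying assumptions that are not from the paper: the random feature extractor is a freshly sampled linear map, inputs are short lists of floats, and the step size, step count, and feature dimension are arbitrary. With a linear extractor only the class mean is matched, whereas the paper samples neural feature extractors; all function names are hypothetical.

```python
import random


def embed_mean(W, points):
    """Mean of the linear embeddings W @ x over a set of points."""
    n = len(points)
    return [sum(sum(w * v for w, v in zip(row, x)) for x in points) / n for row in W]


def mmd_loss(W, real, syn):
    """Squared distance between mean embeddings (the norm in Eq. 3)."""
    return sum((a - b) ** 2 for a, b in zip(embed_mean(W, real), embed_mean(W, syn)))


def distill_class(real, n_syn, steps=200, lr=0.1, feat_dim=8, seed=0):
    """Learn n_syn synthetic points for one class by gradient descent on Eq. 3."""
    rng = random.Random(seed)
    d = len(real[0])
    syn = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_syn)]
    for _ in range(steps):
        # Fresh random extractor each iteration (a linear map is an assumption).
        W = [[rng.gauss(0, 1) / d ** 0.5 for _ in range(d)] for _ in range(feat_dim)]
        diff = [a - b for a, b in zip(embed_mean(W, real), embed_mean(W, syn))]
        # d loss / d x_syn = -(2 / n_syn) * W^T diff, identical for every point.
        grad = [-2.0 / n_syn * sum(W[r][c] * diff[r] for r in range(feat_dim))
                for c in range(d)]
        syn = [[v - lr * g for v, g in zip(x, grad)] for x in syn]
    return syn


def distill_dataset(data, n_per_class):
    """Class-balanced synthesis: the same number of anchors for every class."""
    anchors = []
    for c in sorted({y for _, y in data}):
        real = [x for x, y in data if y == c]
        anchors += [(x, c) for x in distill_class(real, n_per_class)]
    return anchors


data = [([1.0, 2.0], 0), ([3.0, 2.0], 0), ([-4.0, 0.0], 1), ([-6.0, 2.0], 1)]
anchors = distill_dataset(data, n_per_class=3)
identity = [[1.0, 0.0], [0.0, 1.0]]
print(mmd_loss(identity, [x for x, y in data if y == 0], [x for x, y in anchors if y == 0]))
```

Because `distill_dataset` produces a fixed number of anchors per class regardless of how many raw samples each class has, the synthetic label prior is uniform even when the local data are imbalanced.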

Similar to other FL work that shares distilled synthetic data for efficiency (Song et al., 2023), after local data synthesis we ask each client to share its synthetic data among peers so that all clients possess the same global information. The global synthetic anchor data is denoted as $D^{Syn}=\cup_{i}D^{Syn}_{i}$. By default, we perform simple interpolation (averaging) among clients, as this mixup strategy has been shown to improve model fairness (Chuang & Mroueh, 2021). As shown in Algorithm 1, our method only needs to receive information from neighboring nodes, and DeSA is designed for peer-to-peer decentralized networks. Since the loss requires all nodes’ information, we can leverage the FastMix algorithm to aggregate it, as in (Ye et al., 2020; Luo & Ye, 2022). This method aggregates all nodes’ information through communication between adjacent nodes at a linear rate, and it is very common in fully decentralized optimization. In fact, our method also works if each node can only receive its neighbor nodes’ information during training, and we empirically show its feasibility in our CIFAR10C experiments by sampling neighboring clients.
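The idea of recovering a network-wide average using only neighbor-to-neighbor communication can be illustrated with plain gossip averaging. This is a simpler stand-in and not the FastMix algorithm itself, which is not reproduced here. The sketch assumes a regular graph (every node has the same number of neighbors), so that uniform averaging preserves the global mean; the ring topology and the scalar values are hypothetical.

```python
def gossip_round(values, neighbors):
    """One synchronous round: every node averages its value with its neighbors'."""
    return [
        (values[i] + sum(values[j] for j in neighbors[i])) / (1 + len(neighbors[i]))
        for i in range(len(values))
    ]


# Hypothetical 4-node ring; each node starts with its own local quantity.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
values = [0.0, 4.0, 8.0, 12.0]
for _ in range(40):
    values = gossip_round(values, ring)
print(values)
```

No node ever talks to a non-neighbor, yet every node ends up holding approximately the mean of the four starting values. The same pattern applies element-wise when the exchanged quantity is a vector or an image.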

It is worth noting that, unlike (Song et al., 2023), we further propose novel loss terms and training strategies that help mitigate the distribution discrepancy between clients, detailed in Sec. 3.3 and Sec. 3.4. These improve model performance, since simply incorporating $D^{Syn}$ into training achieves only sub-optimal results (see Figure 3, where both $\lambda_{KD}$ and $\lambda_{REG}$ equal 0).

3.3 REG Loss for Feature Regularization

The synthetic anchor regularization loss enforces the model to learn a client-domain-invariant representation of the data. Tang et al. (2022) and other domain-incremental works (Rostami, 2021) show that adding a distribution-discrepancy-based loss in the latent space enables learning of a domain-invariant encoder $\psi$. However, most domain adaptation works require explicit access to real data from other domains or generate random noise as anchors. We propose using the latent space distribution of the synthetic anchor data $D^{Syn}$ as a synthetic anchor onto which the client-specific encoders $\psi_{i}$ can project their local private data. The loss function is therefore defined as,

$$\mathcal{L}_{REG}(\psi_{i})=\mathbb{E}[d(\psi_{i}(D^{Syn})\,||\,\psi_{i}(D_{i}))], \tag{4}$$

where $K$ is the number of classes and $d$ is the distance computed using the supervised contrastive loss,

$$d(\psi_{i};D^{Syn},D_{i})=\sum_{j\in B}-\frac{1}{|B^{y_{j}}_{\backslash j}|}\sum_{\mathbf{x}_{p}\in B^{y_{j}}_{\backslash j}}\log\frac{\exp(\psi_{i}(\mathbf{x}_{j})\cdot\psi_{i}(\mathbf{x}_{p})/\tau_{temp})}{\sum_{\mathbf{x}_{a}\in B_{\backslash j}}\exp(\psi_{i}(\mathbf{x}_{j})\cdot\psi_{i}(\mathbf{x}_{a})/\tau_{temp})} \tag{5}$$

where $B_{\backslash j}$ represents a batch containing both local raw data $D_{i}$ and global synthetic data $D^{Syn}$ but without sample $j$, $B^{y_{j}}_{\backslash j}$ is the subset of $B_{\backslash j}$ containing only samples of class $y_{j}$, and $\tau_{temp}$ is a scalar temperature parameter. Note that we detach the synthetic anchor data to ensure that local features are pulled towards the global features.
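Eq. 5 can be evaluated directly on a small batch of already-encoded features. The sketch below assumes the features are given as lists of floats (unit-norm in the example), uses an arbitrary temperature of 0.5, and skips any sample that has no same-class partner in the batch, a case Eq. 5 leaves undefined. The function name and the toy batch are hypothetical.

```python
import math


def supcon_distance(features, labels, tau=0.5):
    """Supervised contrastive loss of Eq. 5 over one mixed raw + synthetic batch."""
    n = len(features)
    total = 0.0
    for j in range(n):
        others = [a for a in range(n) if a != j]                 # B without j
        positives = [p for p in others if labels[p] == labels[j]]
        if not positives:
            continue  # assumption: skip samples with no same-class partner
        sims = {a: sum(u * v for u, v in zip(features[j], features[a])) / tau
                for a in others}
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
    return total


labels = [0, 0, 1, 1]
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # classes are clustered
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]     # classes are interleaved
print(supcon_distance(aligned, labels), supcon_distance(mixed, labels))
```

The loss is lower when same-class features, raw or synthetic, sit close together. No gradients are computed here, so the detaching of anchor features described above has no counterpart in this sketch; in an automatic-differentiation framework the anchor features would be excluded from the backward pass.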

3.4 Knowledge Distillation for Information Exchange

This step allows a single client model to learn from all the other models using a common synthetic anchor dataset $D^{Syn}$. Under our setting of model heterogeneity among clients, we cannot aggregate the model parameters by simple averaging as in FedAvg (McMahan et al., 2017). Instead, we propose to utilize knowledge distillation (KD) (Hinton et al., 2015) for decentralized model aggregation. Specifically, because $D^{Syn}$ is representative of the joint distribution of the clients, it is an ideal dataset for knowledge transfer. Thus, to enable the client model to mimic the predictions of the other models, we also incorporate a KD loss using $D^{Syn}$, formulated as

$$\mathcal{L}_{KD}(M_{i})=\mathcal{L}_{KL}(M_{i}(\mathbf{x}^{Syn}),\bar{Z}_{i}),\quad\bar{Z}_{i}=\frac{1}{|\mathcal{N}(C_{i})|}\sum_{j\in\mathcal{N}(C_{i})}M_{j}(\mathbf{x}^{Syn}), \tag{6}$$

where $(\mathbf{x}^{Syn},y^{Syn})\sim D^{Syn}$, $\mathcal{N}(C_{i})$ is the set of neighbor clients of client $C_{i}$, and $\mathcal{L}_{KL}$ is the KL-divergence between the output logits of $\mathbf{x}^{Syn}$ on $M_{i}$ and the averaged output logits of $\mathbf{x}^{Syn}$ on $M_{j},\forall j\in\mathcal{N}(C_{i})$. Finally, we formulate our objective function as

$$\mathcal{L}=\mathcal{L}_{CE}(D_{i}\cup D^{Syn};M_{i})+\lambda_{REG}\mathcal{L}_{REG}(D_{i},D^{Syn};M_{i})+\lambda_{KD}\mathcal{L}_{KD}(D^{Syn};M_{i},\bar{Z}_{i}), \tag{7}$$

where $\mathcal{L}_{CE}(D;M)$ is the $K$-class cross-entropy loss on data $D$ and model $M$, $\lambda_{REG}$ and $\lambda_{KD}$ are the hyperparameters for the regularization and KD losses, $\mathcal{L}_{REG}$ and $\mathcal{L}_{KD}$ are as defined in Eq. 4 and Eq. 6, and $\bar{Z}_{i}$ is the shared logits from neighboring clients $\mathcal{N}(C_{i})$. We also incorporate our class-conditionally generated global synthetic data $D^{Syn}$ in the CE loss to encourage models to perform well on general domains and to benefit from the augmented dataset.
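The KD term of Eq. 6 and its combination with the other terms in Eq. 7 can be sketched as follows. Several choices here are assumptions rather than details given in the text: logits are turned into probabilities with a softmax at temperature 1, the KL direction is teacher-to-student (averaged neighbors first, local model second), and the CE and REG terms enter as precomputed scalars. All function names and numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_logits(neighbor_logits):
    """Z-bar in Eq. 6: element-wise mean of the neighbors' logits on one anchor."""
    return [sum(col) / len(neighbor_logits) for col in zip(*neighbor_logits)]


def kd_loss(student_logits, neighbor_logits):
    """KL(softmax(Z-bar) || softmax(student)); direction and softmax are assumptions."""
    teacher = softmax(average_logits(neighbor_logits))
    student = softmax(student_logits)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)


def desa_objective(ce, reg, kd, lambda_reg, lambda_kd):
    """Eq. 7 on precomputed scalar loss terms."""
    return ce + lambda_reg * reg + lambda_kd * kd


neighbors = [[2.0, 0.0], [0.0, 2.0]]       # logits from two hypothetical neighbors
kd_agree = kd_loss([1.0, 1.0], neighbors)  # local model matches the averaged teacher
kd_differ = kd_loss([3.0, 0.0], neighbors)
print(kd_agree, kd_differ)
print(desa_objective(ce=0.7, reg=0.4, kd=kd_differ, lambda_reg=1.0, lambda_kd=0.0))
```

Only logits on the shared anchors are exchanged, so the neighbors' architectures never need to match the local one. Setting both weights to 0 leaves only the cross-entropy on $D_{i}\cup D^{Syn}$, which corresponds to the plain-augmentation baseline mentioned in Sec. 3.2.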

Algorithm 1 Serverless DeSA (Procedures for Client $i$)
procedure Init($C_i$)
    for all $j \in \mathcal{N}(C_i)$ do
        $D^{Syn} = D^{Syn} \cup$ Get-img($C_j$)
    end for
end procedure
procedure LocalTrain($C_i$, $t$)
    if client $C_i$ is sampled then
        share $Z_i = M_i(D^{Syn})$ to $\mathcal{N}(C_i)$
        get $\bar{Z}_i = \frac{1}{|\mathcal{N}(C_i)|}\sum_{j \in \mathcal{N}(C_i)} Z_j$
        $\mathcal{L}_{CE}$ = classification($D_i \cup D^{Syn}; M_i$)
        $\mathcal{L}_{REG}$ = Feature-REG($D_i, D^{Syn}; M_i$)
        $\mathcal{L}_{KD}$ = KD($D^{Syn}; M_i, \bar{Z}_i$)
        $\mathcal{L} = \mathcal{L}_{CE} + \lambda_{REG}\mathcal{L}_{REG} + \lambda_{KD}\mathcal{L}_{KD}$
        $M_i = M_i - \eta\nabla_{M_i}\mathcal{L}$
    end if
end procedure
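
The loss bookkeeping inside LocalTrain can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the KD term is assumed to be a KL divergence between the softmax of the averaged neighbor logits and the client's own softmax, and the REG term is passed in as a precomputed number because its definition lies outside this section.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax for logits of shape (n_samples, K)."""
    z = np.asarray(z, dtype=float)
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def average_neighbor_logits(neighbor_logits):
    """Z_bar_i = (1 / |N(C_i)|) * sum_j Z_j over the logits shared by neighbors."""
    return np.mean(np.stack([np.asarray(z, dtype=float) for z in neighbor_logits]), axis=0)

def cross_entropy(logits, labels):
    """Mean K-class cross-entropy of integer labels under softmax(logits)."""
    p = softmax(logits)
    labels = np.asarray(labels)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))

def kd_loss(student_logits, teacher_logits):
    """ASSUMPTION: KD as mean KL(softmax(teacher) || softmax(student))."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=1)
    return float(np.mean(kl))

def total_loss(l_ce, l_reg, l_kd, lam_reg=1.0, lam_kd=1.0):
    """L = L_CE + lambda_REG * L_REG + lambda_KD * L_KD (Eq. 7)."""
    return l_ce + lam_reg * l_reg + lam_kd * l_kd
```

In a real training loop the three terms would be differentiable tensors in a deep learning framework, and the update $M_i = M_i - \eta\nabla_{M_i}\mathcal{L}$ would be carried out by the optimizer.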

4 Theoretical Analysis

In this section, we provide a theoretical justification for our algorithm. The technical challenge is to analyze how minimizing the overall loss function (Eq. 7) affects the generalizability of a client model on the global data distribution $P^T$.
Notation: Here, we state the intuitive definitions borrowed from (Ben-David et al., 2010). For the precise definitions, please refer to the notation table in Appendix A. The domain pair $(P, f^P)$ represents the source distribution and its optimal labeling function. The $\epsilon$-error of a hypothesis $M$ on the domain pair $(P, f^P)$ is defined as the probability of a mismatch between the optimal labeling function $f^P$ and the hypothesis $M$. Additionally, the $\mathcal{H}\Delta\mathcal{H}$-divergence ($d_{\mathcal{H}\Delta\mathcal{H}}$) describes a distance measure between two distributions.
Analysis: Our analysis focuses on the generalization on a global data distribution $P^T$ that is the average of the client distributions (Marfoq et al., 2021), with labeling function $f^T$ the same as $f^S$. We assume that minimizing the loss in Eq. (3.1) recovers the optimal labeling function $f^T$ of the global distribution $P^T$, which formalizes the intuition of generalizing over closely related client distributions.

We proceed by defining the distribution of our global synthetic data as $P^{Syn}$ with the corresponding labeling function $f^{Syn}$. As $P^{Syn}$ is also leveraged for the knowledge distillation, inspired by (Feng et al., 2021) we describe $(P_{KD}^{Syn}, f^{Syn}_{KD})$ as follows.

Definition 4.1.

The extended knowledge distillation (KD) domain pair $(P^{Syn}_{KD}, f^{Syn}_{KD})$ of a client $C_i$, originating from the KD dataset $D_{KD}^{Syn}$, is defined as

$$
D_{KD}^{Syn} = \Big\{\mathbf{x}^{Syn}, \frac{1}{|N(C_i)|}\sum_{j\in N(C_i)} M_j(\mathbf{x}^{Syn})\Big\} \sim (P_{KD}^{Syn}, f^{Syn}_{KD})
$$

where $M_j(\mathbf{x}^{Syn})$ is the logit predicted by client $C_j$'s model on global synthetic data $\mathbf{x}^{Syn} \sim P^{Syn}(x)$.

Definition 4.2.

We define the overall source distribution of the client $C_i$ as $P_i^S$, which is a convex combination of the local and synthetic distributions

$$
P_i^S = \alpha P_i + \alpha^{Syn} P^{Syn} + \alpha^{Syn}_{KD} P^{Syn}_{KD} \tag{8}
$$

The positive component weights $\{\alpha, \alpha^{Syn}, \alpha^{Syn}_{KD}\}$ describe the dependence of the client $C_i$ on the local, synthetic, and knowledge distillation data, and $\alpha + \alpha^{Syn} + \alpha^{Syn}_{KD} = 1$.

Theorem 1.

Denote the client $C_i$'s model as $M_i = \rho_i \circ \psi_i \in \bm{\mathcal{P}}_i \circ \Psi_i = \mathcal{M}_i$ and its overall source distribution as $P_i^S$ with component weights ($\bm{\alpha}$). Then the generalization error on the global data distribution $P^T$ can be bounded as follows

$$
\begin{aligned}
\epsilon_{P^T}(M_i) &\leq \epsilon_{P_i^S}(M_i) + \alpha\,\mathbf{C}(P_i, P_T) \\
&\quad + \alpha^{Syn}\epsilon_{P^T}(f^{Syn}) + \alpha^{Syn}_{KD}\epsilon_{P^T}(f_{KD}^{Syn}) \\
&\quad + \frac{(1-\alpha)}{2}\, d_{\mathcal{P}_i\Delta\mathcal{P}_i}(\psi \circ P^{Syn}, \psi \circ P^T)
\end{aligned}
\tag{9}
$$

where $\mathbf{C}(P_i, P_T)$ are small distance terms depending on the distributions $P_i$ and $P_T$.
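
To make the structure of the bound concrete, the sketch below evaluates the right-hand side of Eq. 9 for given numbers. It is only an illustration: the function name and all numeric inputs are hypothetical, and in practice none of these quantities is directly observable.

```python
def desa_bound_rhs(eps_source, c_local, eps_f_syn, eps_f_kd, d_div,
                   alpha, alpha_syn, alpha_kd):
    """Right-hand side of Eq. 9 for hypothetical values of each term.

    eps_source : error of M_i on the overall source distribution P_i^S
    c_local    : the distance term C(P_i, P_T)
    eps_f_syn  : error of f^Syn on P^T
    eps_f_kd   : error of f^Syn_KD on P^T
    d_div      : divergence between psi(P^Syn) and psi(P^T)
    """
    weights = (alpha, alpha_syn, alpha_kd)
    # Definition 4.2 requires positive weights that sum to one.
    if min(weights) <= 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("component weights must be positive and sum to 1")
    return (eps_source
            + alpha * c_local
            + alpha_syn * eps_f_syn
            + alpha_kd * eps_f_kd
            + 0.5 * (1.0 - alpha) * d_div)

# Illustrative numbers only; eps_source is held fixed across both calls.
heavy_local = desa_bound_rhs(0.1, 0.4, 0.05, 0.1, 0.2, 0.5, 0.25, 0.25)
light_local = desa_bound_rhs(0.1, 0.4, 0.05, 0.1, 0.2, 0.2, 0.4, 0.4)
```

With these made-up values, shifting weight from the local data to the synthetic and KD data lowers the right-hand side, because the local distance term is the largest contribution. In reality the first term also changes with the weights, so this only illustrates the trade-off between the terms.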

Below we give a short summary of the interpretation of our theorem in Remark 4.3. For a more detailed interpretation, please refer to Appendix A.

Remark 4.3.

The first term is minimized via the local cross-entropy loss. The second term is a constant given the data distributions. The third and fourth terms measure the discrepancy between the labeling function of the target domain $f_T$ and the labeling function $f^{Syn}$ or $f^{Syn}_{KD}$ of the corresponding synthetic distribution $P^{Syn}$ or $P^{Syn}_{KD}$. With proper dataset distillation (Zhao & Bilen, 2023), the model trained on $D^{Syn}$ is similar to that trained on $D^T$, i.e., $\mathbb{E}_{\mathbf{x}\sim P^T}[l(M^T(\mathbf{x}),y)] \simeq \mathbb{E}_{\mathbf{x}\sim P^{Syn}}[l(M^{Syn}(\mathbf{x}),y)]$, implying $f^{Syn} \rightarrow f^T$, and thus we have a small $\epsilon_{P^T}(f^{Syn})$. A small $\epsilon_{P^T}(f^{Syn}_{KD})$ can be achieved when every client achieves a low $\mathcal{L}_{CE}$, which indicates that the models are able to learn an $f_{KD}^{Syn}$ that approximates $f^T$. The last term $d_{\mathcal{P}_i\Delta\mathcal{P}_i}$ motivates minimizing our domain-invariant regularizer in Eq. 4, so that this divergence remains bounded.

Furthermore, for the domain pair $(D, f^D)$, we denote $\mathcal{J}(D) = |\epsilon_D(M) - \epsilon_{P^T}(M)| + \epsilon_{P^T}(f^D)$. The following proposition implies that our generalization bound in Theorem 1 is tighter than the generalization bound of training with local data in Eq. 11 (Ben-David et al., 2010) under some mild conditions.

Proposition 2.

Under the conditions in Theorem 1, if it further holds that

$$
\begin{aligned}
\sup_{M\in\mathcal{M}_i} &\min\{\mathcal{J}(P^{Syn}), \mathcal{J}(P^{Syn}_{KD})\} \\
\leq &\inf_{M\in\mathcal{M}_i}\big(\epsilon_{P_i}(M) - \epsilon_{P^T}(M)\big) + \mathbf{C}(P_i, P_T)
\end{aligned}
\tag{10}
$$

then we can get a tighter generalization bound on the $i$-th client's model $M_i$ than learning with local data only.

The key idea of Proposition 2 is that the generalization bounds induced by $P^{Syn}$ and $P^{Syn}_{KD}$ are smaller than the generalization bound induced by the local training data distribution $P_i$. When the local data heterogeneity is severe, $\inf_{M\in\mathcal{M}_i}(\epsilon_{P_i}(M) - \epsilon_{P^T}(M))$ and $\mathbf{C}(P_i, P^T)$ would be large. As the synthetic data and the extended KD data approach the global data distribution, the left-hand side of the condition in Proposition 2 would be small. Thus, the proposition points out that, to reach better generalization, the model learning should rely more on the synthetic data and the extended KD data when the local data are highly heterogeneous and the synthetic and extended KD datasets are similar to the global ones.
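
The condition of Proposition 2 can be checked numerically once the supremum and infimum are replaced by a maximum and minimum over a finite set of candidate models. The sketch below does exactly that; the finite candidate set, the dictionary keys, and the function names are assumptions made for illustration, and the error values would have to be estimated in practice.

```python
def j_term(eps_d_m, eps_t_m, eps_t_fd):
    """J(D) = |eps_D(M) - eps_{P^T}(M)| + eps_{P^T}(f^D)."""
    return abs(eps_d_m - eps_t_m) + eps_t_fd

def proposition2_condition(models, eps_t_f_syn, eps_t_f_kd, c_local):
    """Check Eq. 10 over a finite candidate set (a stand-in for sup/inf over M_i).

    Each entry of `models` holds the errors of one candidate model M:
      eps_local  = eps_{P_i}(M),  eps_syn    = eps_{P^Syn}(M),
      eps_kd     = eps_{P^Syn_KD}(M),  eps_target = eps_{P^T}(M).
    """
    lhs = max(
        min(j_term(m["eps_syn"], m["eps_target"], eps_t_f_syn),
            j_term(m["eps_kd"], m["eps_target"], eps_t_f_kd))
        for m in models
    )
    # Right-hand side follows Eq. 10 as written: a signed difference plus C.
    rhs = min(m["eps_local"] - m["eps_target"] for m in models) + c_local
    return lhs <= rhs

# Hypothetical candidates whose local error is far from their global error.
heterogeneous = [
    {"eps_local": 0.50, "eps_syn": 0.22, "eps_kd": 0.25, "eps_target": 0.20},
    {"eps_local": 0.60, "eps_syn": 0.30, "eps_kd": 0.28, "eps_target": 0.25},
]
# Hypothetical candidates whose local error already matches the global error.
homogeneous = [
    {"eps_local": 0.20, "eps_syn": 0.22, "eps_kd": 0.25, "eps_target": 0.20},
]
```

With these made-up numbers the condition holds for the heterogeneous candidates and fails for the homogeneous one, matching the reading that synthetic and KD data help most when local data are far from the global distribution.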

5 Experiment

Table 2: Heterogeneous model experiments. We compare with model-heterogeneous FL methods and report the averaged global accuracy over all client models. The best accuracy is marked in bold. DeSA achieves the best averaged global accuracy on DIGITS and OFFICE. For CIFAR10C, DeSA outperforms all baseline methods except FCCL with CIFAR100 as the public dataset. This is because CIFAR100 and CIFAR10C have a genuine semantic overlap.

| Method | Public data | DIGITS MN(C)^a | DIGITS SV(A) | DIGITS US(C) | DIGITS Syn(A) | DIGITS MM(C) | DIGITS Avg | OFFICE AM(A) | OFFICE CA(C) | OFFICE DS(A) | OFFICE WE(A) | OFFICE Avg | CIFAR10C 0.1 | CIFAR10C 0.2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FedHe | | 59.51 | 66.67 | 49.89 | 75.39 | 71.57 | 64.81 | 33.33 | 47.17 | 36.86 | 52.96 | 42.59 | 42.67 | 48.11 |
| FedDF | Cifar100 | 65.98 | 65.21 | 61.30 | 69.65 | 74.48 | 67.32 | 38.87 | 49.51 | 33.12 | 46.89 | 42.09 | 35.14 | 42.19 |
| FedDF | FMNIST | 43.05 | 69.14 | 44.95 | 74.67 | 71.27 | 60.61 | 39.13 | 46.53 | 40.23 | 43.77 | 42.36 | 21.67 | 27.68 |
| FCCL | Cifar100 | - | - | - | - | - | - | 38.22 | 49.10 | 44.68 | 52.26 | 46.07 | 51.88 | 53.93 |
| FCCL | FMNIST | 46.43 | 61.02 | 42.64 | 63.05 | 66.39 | 55.91 | 27.39 | 46.78 | 38.56 | 48.47 | 40.30 | - | - |
| FedProto | | 62.59 | 71.74 | 58.52 | 81.19 | 74.44 | 69.70 | 38.08 | 25.06 | 26.49 | 47.22 | 34.21 | 15.64 | 19.41 |
| DeSA($D_{\rm VHL}^{Syn}$)^b | | 54.40 | 62.03 | 42.34 | 67.75 | 73.03 | 59.91 | 8.82 | 48.98 | 16.90 | 49.13 | 30.96 | 28.52 | 35.20 |
| DeSA | | 70.12 | 76.17 | 71.17 | 81.10 | 73.83 | 74.47 | 51.35 | 52.80 | 52.17 | 52.31 | 54.46 | 43.00 | 49.45 |

a The letter inside the parentheses is the model architecture used by the client. A and C represent AlexNet and ConvNet, respectively.
b For the VHL baseline, we use the synthetic data sampling strategy in VHL only. The purpose is to show that DeSA can generate better synthetic anchor data for feature regularization and knowledge distillation.

5.1 Training Setup

Datasets and Models. We extensively evaluate DeSA under data heterogeneity in our experiments. Specifically, we consider three classification tasks on three sets of domain-shifted datasets:
1) DIGITS={MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), USPS (Hull, 1994), SynthDigits (Ganin & Lempitsky, 2015), MNIST-M (Ganin & Lempitsky, 2015)} contains digits in different styles, and each dataset represents one client.
2) OFFICE={Amazon (Saenko et al., 2010), Caltech (Griffin et al., 2007), DSLR (Saenko et al., 2010), and WebCam (Saenko et al., 2010)} contains images from different cameras and environments, and, similarly, each dataset represents one client.
3) CIFAR10C consists of 57 subsets with domain- and label-shifted datasets sampled using a Dirichlet distribution with $\beta=2$ from Cifar10-C (Hendrycks & Dietterich, 2019).
More information about datasets and image synthesis can be found in Appendix E. In our model heterogeneity experiments (Sec. 5.2), we randomly assign model architectures from {ConvNet, AlexNet} for each client, while in model homogeneous experiments, we use ConvNet for all clients (see Appendix F for model details).
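
As an illustration of how label-shifted client subsets can be drawn with a Dirichlet distribution, the sketch below implements one common per-class recipe. It is an assumption that this matches the procedure used for CIFAR10C; the function name and arguments are hypothetical, and the domain shift (the corruption types of Cifar10-C) is not modeled here.

```python
import numpy as np

def dirichlet_label_split(labels, n_clients, beta=2.0, seed=0):
    """Split sample indices across clients with Dirichlet-distributed class shares.

    For every class, the share that each client receives is drawn from
    Dir(beta, ..., beta). Smaller beta gives more skewed label distributions.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for k in np.unique(labels):
        # Shuffle the indices of class k, then cut them into n_clients pieces.
        idx = rng.permutation(np.flatnonzero(labels == k))
        shares = rng.dirichlet(beta * np.ones(n_clients))
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Toy usage: 10 classes with 30 samples each, split across 5 clients.
toy_labels = np.repeat(np.arange(10), 30)
toy_parts = dirichlet_label_split(toy_labels, n_clients=5, beta=2.0)
```

Every sample index is assigned to exactly one client, so the subsets form a partition of the dataset.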

Comparison Methods. We compare DeSA with two sets of baseline federated learning methods: one considers heterogeneous models (Sec. 5.2) and the other considers homogeneous models (Sec. 5.3). For heterogeneous model experiments, we compare with FedHe (Chan & Ngai, 2021), FedDF (Lin et al., 2020), FCCL (Huang et al., 2022), and FedProto (Tan et al., 2022), and assume the clients own personalized models [2]. For homogeneous model experiments, we compare with FedAvg (McMahan et al., 2017), FedProx (Li et al., 2020b), MOON (Li et al., 2021b), Scaffold (Karimireddy et al., 2020), FedGen (Zhu et al., 2021), and VHL (Tang et al., 2022), and assume these baseline methods can leverage a server for global model aggregation.

[2] For the purposes of this comparison, we have excluded FedFTG (Zhang et al., 2022b) and DENSE (Zhang et al., 2022a), which address heterogeneities in different learning scenarios. FedFTG focuses on fine-tuning a global model, and DENSE belongs to one-shot FL; both require aggregating local information and training a generator on the server side. Note that none of the data-sharing-based baseline methods employ privacy-preserving techniques.

FL Training Setup. If not otherwise specified, we use the SGD optimizer with a learning rate of $10^{-2}$; our default setting is 1 local update epoch, 100 total update rounds, and a batch size of 32 for local training. Since we only have a few clients for the DIGITS and OFFICE experiments, we select all the clients in each iteration, while we randomly sample $10\%$ and $20\%$ of the clients in each round when performing the CIFAR10C experiments. By default, $\lambda_{\rm REG}$ and $\lambda_{\rm KD}$ are set to 1.
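
The default settings above can be collected in a small configuration object, together with a per-round client sampler. This is a hedged sketch: the class, field, and function names are hypothetical, and rounding the number of sampled clients up is an assumption, since the text only gives the sampling ratios.

```python
import math
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Default hyperparameters as stated in the training setup."""
    lr: float = 1e-2          # SGD learning rate
    local_epochs: int = 1     # local update epochs per round
    rounds: int = 100         # total update rounds
    batch_size: int = 32      # local training batch size
    lambda_reg: float = 1.0   # weight of the REG loss
    lambda_kd: float = 1.0    # weight of the KD loss

def sample_clients(n_clients, ratio, rng):
    """Pick the clients that participate in one round.

    ASSUMPTION: the count is rounded up and at least one client is chosen.
    """
    k = min(n_clients, max(1, math.ceil(ratio * n_clients)))
    return sorted(rng.sample(range(n_clients), k))

config = TrainConfig()
round_clients = sample_clients(57, 0.1, random.Random(0))  # CIFAR10C, 10% ratio
all_clients = sample_clients(5, 1.0, random.Random(0))     # DIGITS, all clients
```

Under this rounding rule, a 10% ratio over the 57 CIFAR10C clients selects 6 clients per round.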

5.2 Heterogeneous Model Experiments

The objective of the experiments is to show that DeSA can effectively leverage and learn generalized information from other clients under data and model heterogeneities. Thus, we report the averaged global accuracy, obtained by testing the $i$-th local model on the test set of every client $j$, $j \in [N]$. Note that FedDF and FCCL require access to publicly available data. To make a fair comparison, we use FMNIST (Xiao et al., 2017) and Cifar100 (Krizhevsky et al., 2009) as public datasets for knowledge distillation. DeSA($D_{\rm VHL}^{Syn}$) uses our training pipeline, but the synthetic data is sampled from an untrained StyleGAN (Karras et al., 2019) as in VHL (Tang et al., 2022).
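
The averaged global accuracy can be computed from a matrix of cross-client test accuracies, as in the sketch below. The function name is hypothetical, and weighting every client's test set equally (rather than by its size) is an assumption.

```python
import numpy as np

def averaged_global_accuracy(acc):
    """Global accuracy per client model and its average over all client models.

    acc[i, j] is the accuracy of client i's local model on client j's test set.
    ASSUMPTION: every test set carries equal weight.
    """
    acc = np.asarray(acc, dtype=float)
    per_model = acc.mean(axis=1)          # one global accuracy per client model
    return per_model, float(per_model.mean())

# Toy example with two clients (accuracies in percent).
toy_acc = [[90.0, 50.0],
           [60.0, 80.0]]
toy_per_model, toy_avg = averaged_global_accuracy(toy_acc)
```

The per-model values correspond to the per-client columns of Table 2, and their mean corresponds to the Avg column.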

Based on our experimental results for heterogeneous models in Table 2, it is evident that DeSA significantly enhances the overall accuracy on DIGITS and OFFICE. In the CIFAR10C experiments, we observe that the performance of FL is notably hindered by a low client sampling ratio, especially in the case of serverless methods. As shown in the table, DeSA improves the global accuracy and outperforms all of the baseline methods except FCCL with CIFAR100 as the public dataset. We believe this is due to three reasons: 1) CIFAR100 has significantly more data points than our synthetic data, 2) CIFAR100 and CIFAR10C have a genuine semantic overlap (Krizhevsky et al., 2009), and 3) FCCL starts from pre-trained models, which naturally biases the comparison in favor of methods using additional CIFAR100 information. Nonetheless, it is worth mentioning that the accuracies of FedDF and FCCL drop significantly when switching from Cifar100 to FMNIST. We also found that training with FCCL easily produces NaN losses, which we mark as '-' in the table. This suggests that these methods strongly rely on the public data, which restricts their utility. In contrast, DeSA does not depend on public data and, furthermore, is completely serverless.

5.3 Homogeneous Model Experiments

Table 3: Homogeneous model experiments. We compare with model homogeneous FL methods and report the averaged local accuracy. The best accuracy is marked in bold.
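| Method | DIGITS | OFFICE | CIFAR10C 0.1 | CIFAR10C 0.2 |
|---|---|---|---|---|
| FedAvg | 94.20 | 76.45 | 65.26 | 66.40 |
| FedProx | 94.19 | 76.45 | 65.33 | 66.36 |
| MOON | 94.37 | 73.64 | 64.74 | 66.70 |
| Scaffold | 94.95 | 77.52 | 65.66 | 67.15 |
| VHL | 94.11 | 75.69 | 64.67 | 66.55 |
| FedGen | 82.62 | 63.60 | 45.77 | 48.10 |
| DeSA | 95.53 | 82.92 | 64.47 | 68.13 |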

The baseline methods for homogeneous model experiments learn a generalizable global model via model aggregation on a server, which is not required in DeSA. As these baseline methods are only evaluated on their single global model, for a fair comparison, we report the averaged local accuracy in the experiments. Specifically, the averaged local accuracy for the baseline methods is the average performance of the global model tested on all the clients, while for DeSA we report the average accuracy of the $i$-th local model tested on client $i$'s test set, over all the clients.
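
The two ways of computing the averaged local accuracy can be written side by side. This is a minimal sketch with hypothetical function names, and it assumes that all clients are weighted equally.

```python
import numpy as np

def local_accuracy_serverless(acc):
    """Serverless case: acc[i, i] is client i's own model on its own test set."""
    return float(np.mean(np.diag(np.asarray(acc, dtype=float))))

def local_accuracy_global_model(acc_per_client):
    """Server-based baselines: one global model evaluated on every client's test set."""
    return float(np.mean(np.asarray(acc_per_client, dtype=float)))

# Toy numbers in percent, for two clients.
toy_local = local_accuracy_serverless([[90.0, 50.0],
                                       [60.0, 80.0]])
toy_baseline = local_accuracy_global_model([75.0, 85.0])
```

Only the diagonal of the cross-client accuracy matrix enters the serverless metric, which is why it measures how well each model fits its own client's data.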

One can observe from Table 3 that DeSA can effectively leverage local information and outperforms the other methods on DIGITS and OFFICE. For CIFAR10C, although DeSA has the highest averaged local accuracy when the client ratio is 0.2, it has lower performance when the client ratio is 0.1. This is because smaller client sampling ratios have a larger impact on decentralized learning, as we do not have a global model, and thus some clients may suffer from low model performance due to insufficient training and scarce global information from the sampled neighbor clients. Overall, despite the serverless setting, DeSA is comparable to the baseline methods that have central servers.

5.4 Ablation studies for DeSA

The effectiveness of DeSA relies on the novel designs of synthetic anchor data and the losses. To evaluate how these designs influence the performance of DeSA, we vary the number of synthetic anchor data (IPC) and the loss coefficients ($\lambda$’s) in the following paragraphs. If not otherwise specified, we use the default hyperparameters and the model heterogeneous setting. We report the averaged global accuracy in this section.

Figure 3: Ablation studies for $\lambda$’s using OFFICE. We report the averaged global accuracy when changing the $\lambda$ values. Panels: (a) $\lambda_{KD}$; (b) $\lambda_{REG}$.
Table 4: Ablation study on the size of synthetic dataset using DIGITS. We use Images-Per-Class (IPC) to show its size.
IPC          5       10      20      50      100     200
Global Acc   70.12   72.74   72.46   74.32   70.29   70.45

Evaluation of $\lambda$ Selections $\lambda_{KD}$ and $\lambda_{REG}$ play an important role in helping the clients learn generalized information while improving the performance on local data. We select OFFICE as the candidate dataset for this set of experiments because it has a larger domain shift. We vary both $\lambda_{KD}$ and $\lambda_{REG}$ between 0 and 2 and report the global accuracy in Figure 3. One can observe that when $\lambda_{KD}$ or $\lambda_{REG}$ increases, the overall global accuracy generally increases. However, in Figure 3(b), when we increase $\lambda_{REG}$ to 2, the performance drops. This happens because the magnitude of the REG loss term dominates the total training loss. Overall, we conclude that $\lambda_{KD}$ helps the local model learn information from other clients’ models, and $\lambda_{REG}$ improves the global performance by encouraging the local model to learn generalized features.
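To make the role of the two coefficients concrete, the following is a minimal sketch of how a weighted objective of this kind can be assembled and how the share of the REG term grows with $\lambda_{REG}$. It rests on several assumptions that are not taken from the text above: the total loss is written as CE + $\lambda_{REG}$ · REG + $\lambda_{KD}$ · KD, the KD teacher is the average of the neighbors' softmax outputs on one anchor, and all helper names (`kd_loss`, `total_loss`, `reg_share`) and numeric values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(own_logits, neighbor_logits):
    """Assumed KD term on one anchor: KL(averaged neighbor prediction || own prediction)."""
    teacher_probs = [softmax(l) for l in neighbor_logits]
    n = len(teacher_probs)
    teacher = [sum(col) / n for col in zip(*teacher_probs)]
    return kl_divergence(teacher, softmax(own_logits))


def total_loss(ce, reg, kd, lam_reg, lam_kd):
    """Weighted sum of the three loss terms (assumed form)."""
    return ce + lam_reg * reg + lam_kd * kd


def reg_share(ce, reg, kd, lam_reg, lam_kd):
    """Fraction of the total loss contributed by the weighted REG term."""
    return lam_reg * reg / total_loss(ce, reg, kd, lam_reg, lam_kd)


# Hypothetical loss magnitudes, for illustration only.
ce, reg = 0.5, 2.0
kd = kd_loss([2.0, 0.5, -1.0], [[1.5, 0.7, -0.5], [2.2, 0.1, -1.2]])
shares = {lam: reg_share(ce, reg, kd, lam, 1.0) for lam in (0.5, 1.0, 2.0)}
```

Under these hypothetical magnitudes, the REG share increases monotonically with $\lambda_{REG}$, which is one way to read the performance drop at $\lambda_{REG} = 2$ in Figure 3(b).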

Evaluation of Size of Synthetic Dataset The size of the synthetic data is a critical hyperparameter for DeSA as it represents the shared local information. Since DeSA synthesizes class-balanced data, we use Images-Per-Class (IPC) to represent the size of the synthetic data. We select DIGITS as the candidate dataset for this set of experiments because it contains a larger number of data points for each client, which allows us to increase IPC up to 200. One can observe in Table 4 that blindly increasing the IPC does not guarantee optimal global accuracy: the accuracy peaks at IPC = 50 and drops for IPC = 100 and 200. A large IPC causes the loss function to be dominated by the last two terms of Eq. 1, i.e., by synthetic data. In addition, synthesizing a larger number of synthetic data may degrade its quality, and the sampled batch for $\mathcal{L}_{REG}$ may fail to capture the distribution.
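The trend in Table 4 and the batch-coverage argument can be illustrated with a short sketch. The accuracy values are copied from Table 4; everything else is an assumption made for illustration: the anchor batch size of 32 is hypothetical, and the synthetic set is treated as IPC × (number of classes) images, ignoring how many clients contribute anchors.

```python
# Table 4 (DIGITS): Images-Per-Class -> averaged global accuracy.
table4 = {5: 70.12, 10: 72.74, 20: 72.46, 50: 74.32, 100: 70.29, 200: 70.45}

# IPC with the highest global accuracy in Table 4.
best_ipc = max(table4, key=table4.get)

NUM_CLASSES = 10  # DIGITS is a 10-class digit recognition task
BATCH_SIZE = 32   # hypothetical anchor batch size, not taken from the paper


def anchor_batch_coverage(ipc, num_classes=NUM_CLASSES, batch_size=BATCH_SIZE):
    """Fraction of a class-balanced synthetic set covered by one sampled batch."""
    return min(1.0, batch_size / (ipc * num_classes))


coverage = {ipc: anchor_batch_coverage(ipc) for ipc in table4}
```

With a fixed batch size, the covered fraction shrinks as IPC grows, so a single batch represents less of the anchor distribution at IPC = 200 than at IPC = 5 under these assumptions.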

5.5 Further Discussion

Communication Overhead Although DeSA requires training and transferring local synthetic data to every client, this process happens before the FL training and can be performed offline. During DeSA training, clients only need to share logits w.r.t. the global synthetic data, resulting in a lightweight in-training communication overhead. We discuss the communication of DeSA compared to standard FL in Appendix D, and show that DeSA has lower overall communication overhead.
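As a rough back-of-the-envelope sketch of why sharing logits is lightweight, the per-round payload of logits on the anchor set can be compared with that of a full set of model parameters. All quantities below (number of anchors, number of classes, parameter count, 32-bit floats) are hypothetical placeholders rather than the settings analyzed in Appendix D, and the one-time cost of exchanging the synthetic data is not included.

```python
BYTES_PER_FLOAT32 = 4


def logits_payload_bytes(num_anchors, num_classes):
    """Bytes needed to send one logit vector per anchor image."""
    return num_anchors * num_classes * BYTES_PER_FLOAT32


def model_payload_bytes(num_parameters):
    """Bytes needed to send all model parameters once."""
    return num_parameters * BYTES_PER_FLOAT32


# Hypothetical sizes, for illustration only.
num_anchors = 500           # e.g., 50 images per class for 10 classes
num_classes = 10
num_parameters = 1_000_000  # a small convolutional network

logits_bytes = logits_payload_bytes(num_anchors, num_classes)
model_bytes = model_payload_bytes(num_parameters)
ratio = model_bytes / logits_bytes
```

Under these placeholder sizes the logits payload is 20 kB per exchange versus 4 MB for the parameters, a factor of 200; the actual ratio depends on the architectures and anchor set used.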

Privacy DeSA requires sharing image-level information among clients in FL, which may raise privacy concerns. Therefore, we empirically show that our distilled synthetic data can protect against some privacy attacks in Appendix B.1. We further discuss DeSA’s potential for a stronger privacy guarantee using Differential Privacy (Abadi et al., 2016) in Appendix B.2. Furthermore, we claim that decentralized FL with both data and model heterogeneities is an extremely challenging setting, where existing solutions require sharing either real public data (Lin et al., 2020; Huang et al., 2022) or synthetic data generated from a GAN-based generator (Zhang et al., 2022a, b).

Theory vs. Practice We note that obtaining the tight bound from our theoretical findings in Theorem 1 requires proper dataset distillation. In Section 3.2, we propose an efficient approximate solution for dataset distillation. Although perfect dataset distillation may not be achievable (as shown in the visualization in Appendix E), we have found through experimentation that using our synthetic data in combination with the proposed $\mathcal{L}_{REG}$ (Eq. 4) and $\mathcal{L}_{KD}$ (Eq. 3.4) can already lead to improved generalization.

6 Conclusion

A novel and effective method, DeSA, is presented that utilizes synthetic data to deal with both data and model heterogeneities in serverless decentralized FL. In particular, DeSA introduces a pipeline that involves synthetic data generation, and we propose a new scheme that incorporates the synthetic data as anchor points in decentralized FL model training. To address heterogeneity issues, we utilize the synthetic anchor data and propose two regularization losses: anchor loss and knowledge distillation loss. We provide theoretical analysis on the generalization bound to justify the effectiveness of DeSA using the synthetic anchor data. Empirically, the resulting client models not only achieve compelling local performance but also generalize well to other clients’ data distributions, boosting cross-domain performance. Through extensive experiments on various classification tasks, we show that DeSA robustly improves the efficacy of collaborative learning when compared with state-of-the-art methods, under both model and data heterogeneous settings.

Acknowledgement

This work is supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), Public Safety Canada, CIFAR Catalyst Grant, and Compute Canada Research Platform.

Impact Statements

This paper represents an effort to progress the field of machine learning and distributed learning. Our work has various potential societal impacts, although we do not believe any specific one must be highlighted here.

References

  • Abadi et al. (2016) Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp.  308–318, 2016.
  • Albuquerque et al. (2019) Albuquerque, I., Monteiro, J., Darvishi, M., Falk, T. H., and Mitliagkas, I. Generalizing to unseen domains via distribution matching. arXiv preprint arXiv:1911.00804, 2019.
  • Assran et al. (2019) Assran, M., Loizou, N., Ballas, N., and Rabbat, M. Stochastic gradient push for distributed deep learning. In International Conference on Machine Learning, pp.  344–353. PMLR, 2019.
  • Beltrán et al. (2023) Beltrán, E. T. M., Pérez, M. Q., Sánchez, P. M. S., Bernal, S. L., Bovet, G., Pérez, M. G., Pérez, G. M., and Celdrán, A. H. Decentralized federated learning: Fundamentals, state of the art, frameworks, trends, and challenges. IEEE Communications Surveys & Tutorials, 2023.
  • Ben-David et al. (2006) Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. Analysis of representations for domain adaptation. Advances in neural information processing systems, 19, 2006.
  • Ben-David et al. (2010) Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine learning, 79:151–175, 2010.
  • Carlini et al. (2022a) Carlini, N., Chien, S., Nasr, M., Song, S., Terzis, A., and Tramer, F. Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP), pp.  1897–1914. IEEE, 2022a.
  • Carlini et al. (2022b) Carlini, N., Feldman, V., and Nasr, M. No free lunch in “privacy for free: How does dataset condensation help privacy”. arXiv preprint arXiv:2209.14987, 2022b.
  • Cazenavette et al. (2022) Cazenavette, G., Wang, T., Torralba, A., Efros, A. A., and Zhu, J.-Y. Dataset distillation by matching training trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  4750–4759, 2022.
  • Chan & Ngai (2021) Chan, Y. H. and Ngai, E. C. Fedhe: Heterogeneous models and communication-efficient federated learning. In 2021 17th International Conference on Mobility, Sensing and Networking (MSN), pp.  207–214. IEEE, 2021.
  • Chang et al. (2018) Chang, K., Balachandar, N., Lam, C., Yi, D., Brown, J., Beers, A., Rosen, B., Rubin, D. L., and Kalpathy-Cramer, J. Distributed deep learning networks among institutions for medical imaging. Journal of the American Medical Informatics Association, 25(8):945–954, 2018.
  • Chuang & Mroueh (2021) Chuang, C.-Y. and Mroueh, Y. Fair mixup: Fairness via interpolation. arXiv preprint arXiv:2103.06503, 2021.
  • Crammer et al. (2008) Crammer, K., Kearns, M., and Wortman, J. Learning from multiple sources. Journal of Machine Learning Research, 9(8), 2008.
  • Cui et al. (2023) Cui, J., Wang, R., Si, S., and Hsieh, C.-J. Scaling up dataset distillation to imagenet-1k with constant memory. In International Conference on Machine Learning, pp.  6565–6590. PMLR, 2023.
  • Donahue & Kleinberg (2021) Donahue, K. and Kleinberg, J. Model-sharing games: Analyzing federated learning under voluntary participation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  5303–5311, 2021.
  • Dong et al. (2022) Dong, T., Zhao, B., and Lyu, L. Privacy for free: How does dataset condensation help privacy? arXiv preprint arXiv:2206.00240, 2022.
  • Fallah et al. (2020) Fallah, A., Mokhtari, A., and Ozdaglar, A. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. Advances in Neural Information Processing Systems, 33:3557–3568, 2020.
  • Feng et al. (2021) Feng, H., You, Z., Chen, M., Zhang, T., Zhu, M., Wu, F., Wu, C., and Chen, W. Kd3a: Unsupervised multi-source decentralized domain adaptation via knowledge distillation. In ICML, pp.  3274–3283, 2021.
  • Ganin & Lempitsky (2015) Ganin, Y. and Lempitsky, V. Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pp.  1180–1189. PMLR, 2015.
  • Gao et al. (2022) Gao, D., Yao, X., and Yang, Q. A survey on heterogeneous federated learning. arXiv preprint arXiv:2210.04505, 2022.
  • Ghosh et al. (2022) Ghosh, A., Chung, J., Yin, D., and Ramchandran, K. An efficient framework for clustered federated learning. IEEE Transactions on Information Theory, 68(12):8076–8091, 2022.
  • Gong et al. (2022) Gong, X., Sharma, A., Karanam, S., Wu, Z., Chen, T., Doermann, D., and Innanje, A. Preserving privacy in federated learning with ensemble cross-domain knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  11891–11899, 2022.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Gretton et al. (2012) Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
  • Griffin et al. (2007) Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.
  • Hendrycks & Dietterich (2019) Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
  • Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Huang et al. (2022) Huang, W., Ye, M., and Du, B. Learn from others and be yourself in heterogeneous federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10143–10153, 2022.
  • Huang et al. (2021a) Huang, Y., Chu, L., Zhou, Z., Wang, L., Liu, J., Pei, J., and Zhang, Y. Personalized cross-silo federated learning on non-iid data. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pp.  7865–7873, 2021a.
  • Huang et al. (2021b) Huang, Y., Gupta, S., Song, Z., Li, K., and Arora, S. Evaluating gradient inversion attacks and defenses in federated learning. Advances in Neural Information Processing Systems, 34:7232–7241, 2021b.
  • Hull (1994) Hull, J. J. A database for handwritten text recognition research. IEEE Transactions on pattern analysis and machine intelligence, 16(5):550–554, 1994.
  • Jeong et al. (2018) Jeong, E., Oh, S., Kim, H., Park, J., Bennis, M., and Kim, S. Federated distillation and augmentation under non-iid private data. NIPS Wksp. MLPCD, 2018.
  • Jiang et al. (2023) Jiang, M., Yang, H., Cheng, C., and Dou, Q. Iop-fl: Inside-outside personalization for federated medical image segmentation. IEEE Transactions on Medical Imaging, 2023.
  • Karimireddy et al. (2020) Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp.  5132–5143. PMLR, 2020.
  • Karras et al. (2019) Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  4401–4410, 2019.
  • Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lee et al. (2022) Lee, S., Chun, S., Jung, S., Yun, S., and Yoon, S. Dataset condensation with contrastive signals. In International Conference on Machine Learning, pp.  12352–12364. PMLR, 2022.
  • Li et al. (2021a) Li, C., Li, G., and Varshney, P. K. Decentralized federated learning via mutual knowledge transfer. IEEE Internet of Things Journal, 9(2):1136–1147, 2021a.
  • Li & Wang (2019) Li, D. and Wang, J. Fedmd: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019.
  • Li et al. (2022) Li, G., Togo, R., Ogawa, T., and Haseyama, M. Dataset distillation for medical dataset sharing. arXiv preprint arXiv:2209.14603, 2022.
  • Li et al. (2021b) Li, Q., He, B., and Song, D. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10713–10722, 2021b.
  • Li et al. (2020a) Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2:429–450, 2020a.
  • Li et al. (2020b) Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of fedavg on non-iid data. International Conference on Learning Representations, 2020b.
  • Lin et al. (2020) Lin, T., Kong, L., Stich, S. U., and Jaggi, M. Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems, 33:2351–2363, 2020.
  • Luo & Ye (2022) Luo, L. and Ye, H. Decentralized stochastic variance reduced extragradient method. arXiv preprint arXiv:2202.00509, 2022.
  • Marfoq et al. (2021) Marfoq, O., Neglia, G., Bellet, A., Kameni, L., and Vidal, R. Federated multi-task learning under a mixture of distributions. Advances in Neural Information Processing Systems, 34:15434–15447, 2021.
  • Matsuda et al. (2022) Matsuda, K., Sasaki, Y., Xiao, C., and Onizuka, M. Fedme: Federated learning via model exchange. In Proceedings of the 2022 SIAM international conference on data mining (SDM), pp.  459–467. SIAM, 2022.
  • McMahan et al. (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp.  1273–1282. PMLR, 2017.
  • Netzer et al. (2011) Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. 2011.
  • Pappas et al. (2021) Pappas, C., Chatzopoulos, D., Lalis, S., and Vavalis, M. Ipls: A framework for decentralized federated learning. In 2021 IFIP Networking Conference (IFIP Networking), pp.  1–6. IEEE, 2021.
  • Rostami (2021) Rostami, M. Lifelong domain adaptation via consolidated internal distribution. Advances in neural information processing systems, 34:11172–11183, 2021.
  • Roy et al. (2019) Roy, A. G., Siddiqui, S., Pölsterl, S., Navab, N., and Wachinger, C. Braintorrent: A peer-to-peer environment for decentralized federated learning. arXiv preprint arXiv:1905.06731, 2019.
  • Saenko et al. (2010) Saenko, K., Kulis, B., Fritz, M., and Darrell, T. Adapting visual category models to new domains. In Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV 11, pp.  213–226. Springer, 2010.
  • Sheller et al. (2019) Sheller, M. J., Reina, G. A., Edwards, B., Martin, J., and Bakas, S. Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Revised Selected Papers, Part I 4, pp.  92–104. Springer, 2019.
  • Sheller et al. (2020) Sheller, M. J., Edwards, B., Reina, G. A., Martin, J., Pati, S., Kotrotsou, A., Milchenko, M., Xu, W., Marcus, D., Colen, R. R., et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Scientific reports, 10(1):12598, 2020.
  • Shen et al. (2023) Shen, T., Zhang, J., Jia, X., Zhang, F., Lv, Z., Kuang, K., Wu, C., and Wu, F. Federated mutual learning: a collaborative machine learning method for heterogeneous data, models, and objectives. Frontiers of Information Technology & Electronic Engineering, 24(10):1390–1402, 2023.
  • Shokri et al. (2017) Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), pp.  3–18. IEEE, 2017.
  • Song et al. (2023) Song, R., Liu, D., Chen, D. Z., Festag, A., Trinitis, C., Schulz, M., and Knoll, A. Federated learning via decentralized dataset distillation in resource-constrained edge environments. In 2023 International Joint Conference on Neural Networks (IJCNN), pp.  1–10. IEEE, 2023.
  • Tan et al. (2022) Tan, Y., Long, G., Liu, L., Zhou, T., Lu, Q., Jiang, J., and Zhang, C. Fedproto: Federated prototype learning across heterogeneous clients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  8432–8440, 2022.
  • Tang et al. (2022) Tang, Z., Zhang, Y., Shi, S., He, X., Han, B., and Chu, X. Virtual homogeneity learning: Defending against data heterogeneity in federated learning. arXiv preprint arXiv:2206.02465, 2022.
  • Wang et al. (2023) Wang, H.-P., Chen, D., Kerkouche, R., and Fritz, M. Fed-gloss-dp: Federated, global learning using synthetic sets with record level differential privacy. arXiv preprint arXiv:2302.01068, 2023.
  • Wang et al. (2020) Wang, J., Liu, Q., Liang, H., Joshi, G., and Poor, H. V. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in neural information processing systems, 33:7611–7623, 2020.
  • Wang et al. (2022) Wang, K., Zhao, B., Peng, X., Zhu, Z., Yang, S., Wang, S., Huang, G., Bilen, H., Wang, X., and You, Y. Cafe: Learning to condense dataset by aligning features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  12196–12205, 2022.
  • Wu et al. (2022) Wu, C., Wu, F., Lyu, L., Huang, Y., and Xie, X. Communication-efficient federated learning via knowledge distillation. Nature communications, 13(1):2032, 2022.
  • Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • Xiong et al. (2023) Xiong, Y., Wang, R., Cheng, M., Yu, F., and Hsieh, C.-J. Feddm: Iterative distribution matching for communication-efficient federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  16323–16332, 2023.
  • Yang et al. (2021) Yang, R., Tian, J., and Zhang, Y. Regularized mutual learning for personalized federated learning. In Asian Conference on Machine Learning, pp.  1521–1536. PMLR, 2021.
  • Ye et al. (2020) Ye, H., Zhou, Z., Luo, L., and Zhang, T. Decentralized accelerated proximal gradient descent. Advances in Neural Information Processing Systems, 33:18308–18317, 2020.
  • Ye et al. (2022) Ye, R., Ni, Z., Xu, C., Wang, J., Chen, S., and Eldar, Y. C. Fedfm: Anchor-based feature matching for data heterogeneity in federated learning. arXiv preprint arXiv:2210.07615, 2022.
  • Yuan et al. (2023a) Yuan, L., Ma, Y., Su, L., and Wang, Z. Peer-to-peer federated continual learning for naturalistic driving action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  5249–5258, 2023a.
  • Yuan et al. (2023b) Yuan, L., Sun, L., Yu, P. S., and Wang, Z. Decentralized federated learning: A survey and perspective. arXiv preprint arXiv:2306.01603, 2023b.
  • Zhang et al. (2022a) Zhang, J., Chen, C., Li, B., Lyu, L., Wu, S., Ding, S., Shen, C., and Wu, C. Dense: Data-free one-shot federated learning. Advances in Neural Information Processing Systems, 35:21414–21428, 2022a.
  • Zhang et al. (2022b) Zhang, L., Shen, L., Ding, L., Tao, D., and Duan, L.-Y. Fine-tuning global model via data-free knowledge distillation for non-iid federated learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10174–10183, 2022b.
  • Zhang et al. (2018) Zhang, Y., Xiang, T., Hospedales, T. M., and Lu, H. Deep mutual learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  4320–4328, 2018.
  • Zhao & Bilen (2021) Zhao, B. and Bilen, H. Dataset condensation with differentiable siamese augmentation. In International Conference on Machine Learning, pp.  12674–12685. PMLR, 2021.
  • Zhao & Bilen (2022) Zhao, B. and Bilen, H. Synthesizing informative training samples with gan. arXiv preprint arXiv:2204.07513, 2022.
  • Zhao & Bilen (2023) Zhao, B. and Bilen, H. Dataset condensation with distribution matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.  6514–6523, 2023.
  • Zhao et al. (2020) Zhao, B., Mopuri, K. R., and Bilen, H. Dataset condensation with gradient matching. In International Conference on Learning Representations, 2020.
  • Zhou et al. (2022) Zhou, T., Zhang, J., and Tsang, D. Fedfa: Federated learning with feature anchors to align feature and classifier for heterogeneous data. arXiv preprint arXiv:2211.09299, 2022.
  • Zhu et al. (2021) Zhu, Z., Hong, J., and Zhou, J. Data-free knowledge distillation for heterogeneous federated learning. In International conference on machine learning, pp.  12878–12889. PMLR, 2021.

Road Map of Appendix Our appendix is organized into five sections. The theoretical analysis and proofs are in Appendix A. Appendix B.1 shows the results of the Membership Inference Attack (MIA) on DeSA-trained models using the DIGITS datasets. Appendix B.2 discusses how we inject a DP mechanism into our data synthesis process, and shows that using DP synthetic anchor data for DeSA can still yield comparable performance. Appendix E introduces the selected datasets and how we synthesize anchor data in detail. Appendix F describes the model architectures (ConvNet and AlexNet) we use in our experiments. Finally, Appendix G provides a detailed literature review of the related works. Our code and model checkpoints are available along with the supplementary materials.

Appendix A Theoretical Analysis and Proofs

A.1 Notation

Table 5: Notations used
$\bm{D_i} = (x_i, y_i)_{i=1}^{i=m}$ $\triangleq$ Local dataset of client $C_i$
$P_i(x, y)$ $\triangleq$ The local joint distribution of client $C_i$
$L_{CE}(a, b)$ $\triangleq$ Cross-entropy loss between distributions $a$ and $b$
$L_{KL}(a, b)$ $\triangleq$ KL divergence loss between distributions $a$ and $b$
$\mathcal{M}_i$ $\triangleq$ Model space of client $C_i$
$\mathcal{P}_i$ $\triangleq$ Space of classifier heads for client $C_i$
$\Psi_i$ $\triangleq$ Space of encoder heads for client $C_i$
$M_i = \rho_i \circ \psi_i$ $\triangleq$ Model for client $C_i$ with classifier and encoder heads sampled from $\mathcal{P}_i$ and $\Psi_i$, respectively
$\bm{\alpha} = [\alpha, \alpha^{Syn}, \alpha^{Syn}_{KD}]$ $\triangleq$ Component weights for the losses, as defined in Eq. 8
Notations from (Ben-David et al., 2010)
$(P, f^P)$ $\triangleq$ $P(x)$ is the source data distribution and $f^P$ is the optimal labelling function
$\epsilon_P(M)$ $\triangleq$ $\Pr_{x \sim P(x)}(M(x) \neq f^P(x))$
$d_{\mathcal{H}\Delta\mathcal{H}}(P_s, P_t)$ $\triangleq$ $2 \sup_{h, h' \in \mathcal{H}} \left| \Pr_{x \sim P_s}(h(x) \neq h'(x)) - \Pr_{x \sim P_t}(h(x) \neq h'(x)) \right|$
$\lambda(P)$ $\triangleq$ Least error of a jointly trained model $= \min_{M \in \mathcal{M}} \epsilon_P(M) + \epsilon_{P^T}(M)$
$\mathbf{C}(P_i, P^T)$ $\triangleq$ A distance term appearing in (Ben-David et al., 2010) $= \frac{1}{2} d_{\mathcal{M}\Delta\mathcal{M}}(P_i, P^T) + \lambda(P_i)$

A.2 Proof for Theorem 1

Proof.

The training data at the $i$-th client come from three distributions: 1) the local source data; 2) the global virtual data; 3) the extended KD data. The data from the first two groups are used for the cross-entropy loss and the distribution divergence, while the third is used for knowledge distillation.

Without loss of generality, at the $i$-th client, we set the weights for $P_i$, $P^{Syn}$ and $P_{KD}^{Syn}$ as $\alpha$, $\alpha^{Syn}$ and $\alpha_{KD}^{Syn}$, respectively. For notational simplicity, we assume $\alpha + \alpha^{Syn} + \alpha_{KD}^{Syn} = 1$. Then the training source data at the $i$-th client is $P_i^S = \alpha P_i + \alpha^{Syn} P^{Syn} + \alpha_{KD}^{Syn} P_{KD}^{Syn}$.
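As a small illustration of the mixture $P_i^S$, the sketch below builds a weighted mixture of three toy discrete distributions and checks that a model's 0-1 error under the mixture equals the weighted sum of its errors under the components. This is plain linearity of expectation, which appears to be the property the proof relies on when it later introduces $\epsilon_{P_i^S}$. The toy distributions, the threshold model, and the helper names `mix` and `error` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def mix(components):
    """Return the mixture of weighted discrete distributions.

    components: list of (weight, dist), where dist maps (x, y) -> probability.
    """
    total = sum(weight for weight, _ in components)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    mixed = defaultdict(float)
    for weight, dist in components:
        for point, prob in dist.items():
            mixed[point] += weight * prob
    return dict(mixed)


def error(model, dist):
    """Expected 0-1 loss of `model` under the discrete distribution `dist`."""
    return sum(prob for (x, y), prob in dist.items() if model(x) != y)


def model(x):
    """Toy classifier (assumption): predict class 1 when x >= 1."""
    return int(x >= 1)


# Toy stand-ins for P_i, P^Syn and P^Syn_KD (hypothetical values).
p_local = {(0, 0): 0.5, (1, 1): 0.5}
p_syn = {(0, 0): 0.25, (2, 1): 0.75}
p_kd = {(1, 0): 0.5, (2, 1): 0.5}
alpha, alpha_syn, alpha_kd = 0.5, 0.25, 0.25

# P_i^S = alpha * P_i + alpha^Syn * P^Syn + alpha^Syn_KD * P^Syn_KD
p_source = mix([(alpha, p_local), (alpha_syn, p_syn), (alpha_kd, p_kd)])

mixture_error = error(model, p_source)
weighted_error = (alpha * error(model, p_local)
                  + alpha_syn * error(model, p_syn)
                  + alpha_kd * error(model, p_kd))
print(mixture_error, weighted_error)
```

The two printed values coincide, so bounding each component error separately and re-weighting is enough to control the error on $P_i^S$.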

From Theorem 2 in (Ben-David et al., 2010), it holds that

$$\epsilon_{P^T}(M_i) \leq \epsilon_{P_i}(M_i) + \mathbf{C}(P_i, P^T) \tag{11}$$

where $\mathbf{C}(P_i, P^T) = \frac{1}{2} d_{\mathcal{M}_i\Delta\mathcal{M}_i}(P_i, P^T) + \lambda(P_i)$ and $\lambda(P_i) = \min_{M\in\mathcal{M}_i} \left[\epsilon_{P_i}(M) + \epsilon_{P^T}(M)\right]$ is a constant. These terms are small unless the data heterogeneity is severe (Ben-David et al., 2010). Then, applying (11) to $P_i$ and Lemma 1 to $P^{Syn}$ and $P^{Syn}_{KD}$, we obtain three inequalities, which we add after multiplying each by its corresponding component weight in $\bm{\alpha}$.
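Written out, the three weighted inequalities mentioned above take the following form. This is a sketch obtained by instantiating (11) for $P_i$ and Lemma 1 (Eq. 18) for the pairs $(P^{Syn}, f^{Syn})$ and $(P^{Syn}_{KD}, f^{Syn}_{KD})$; it uses only those two statements and the component weights defined earlier.

```latex
\begin{align*}
\alpha\,\epsilon_{P^T}(M_i) &\leq \alpha\,\epsilon_{P_i}(M_i) + \alpha\,\mathbf{C}(P_i, P^T), \\
\alpha^{Syn}\,\epsilon_{P^T}(M_i) &\leq \alpha^{Syn}\,\epsilon_{P^{Syn}}(M_i)
  + \frac{\alpha^{Syn}}{2}\, d_{\mathcal{P}\Delta\mathcal{P}}(\psi\circ P^{Syn}, \psi\circ P^T)
  + \alpha^{Syn}\,\epsilon_{P^T}(f^{Syn}), \\
\alpha^{Syn}_{KD}\,\epsilon_{P^T}(M_i) &\leq \alpha^{Syn}_{KD}\,\epsilon_{P^{Syn}_{KD}}(M_i)
  + \frac{\alpha^{Syn}_{KD}}{2}\, d_{\mathcal{P}\Delta\mathcal{P}}(\psi\circ P^{Syn}_{KD}, \psi\circ P^T)
  + \alpha^{Syn}_{KD}\,\epsilon_{P^T}(f^{Syn}_{KD}).
\end{align*}
```

Since the weights sum to one, the left-hand sides add up to $\epsilon_{P^T}(M_i)$.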

Furthermore, note that $P^{Syn}$ and $P^{Syn}_{KD}$ have the same support (4.1). Therefore, their distances from the support of the global distribution are also the same, i.e.,

$$d_{\mathcal{P}_i\Delta\mathcal{P}_i}(\psi\circ P^{Syn}, \psi\circ P^T) = d_{\mathcal{P}_i\Delta\mathcal{P}_i}(\psi\circ P^{Syn}_{KD}, \psi\circ P^T).$$

Adding all three weighted inequalities and using Claim 1 to introduce $\epsilon_{P_i^S}(M)$, we obtain

$$
\begin{aligned}
\epsilon_{P^T}(M_i) \leq{}& \epsilon_{P_i^S}(M_i) + \alpha\,\mathbf{C}(P_i, P^T) + \alpha^{Syn}\Big(\tfrac{1}{2} d_{\mathcal{P}\Delta\mathcal{P}}(\psi\circ P^{Syn}, \psi\circ P^T) + \epsilon_{P^T}(f^{Syn})\Big) \\
& + \alpha^{Syn}_{KD}\Big(\tfrac{1}{2} d_{\mathcal{P}\Delta\mathcal{P}}(\psi\circ P^{Syn}, \psi\circ P^T) + \epsilon_{P^T}(f^{Syn}_{KD})\Big) \\
\leq{}& \epsilon_{P_i^S}(M_i) + \alpha\,\mathbf{C}(P_i, P^T) + \alpha^{Syn}\epsilon_{P^T}(f^{Syn}) + \alpha^{Syn}_{KD}\epsilon_{P^T}(f^{Syn}_{KD}) \\
& + \frac{(1-\alpha)}{2} d_{\mathcal{P}\Delta\mathcal{P}}(\psi\circ P^{Syn}, \psi\circ P^T)
\end{aligned} \tag{12}
$$

where the last step follows from $\alpha + \alpha^{Syn} + \alpha^{Syn}_{KD} = 1$. That is,

$$\epsilon_{P^T}(M_i) \leq \epsilon_{P_i^S}(M_i) + \alpha\,\mathbf{C}(P_i, P^T) + \alpha^{Syn}\epsilon_{P^T}(f^{Syn}) + \alpha^{Syn}_{KD}\epsilon_{P^T}(f^{Syn}_{KD}) + \frac{(1-\alpha)}{2} d_{\mathcal{P}\Delta\mathcal{P}}(\psi\circ P^{Syn}, \psi\circ P^T) \tag{13}$$
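To make the structure of the final bound concrete, the following sketch evaluates its right-hand side for given values of the five terms. The function name `theorem1_bound`, its argument names, and all numeric values are hypothetical; in practice the individual terms are not directly observable, and holding them fixed while varying the weights is a simplification made only to show how $\alpha$ trades the local term against the synthetic-anchor terms.

```python
def theorem1_bound(eps_source, c_local, eps_f_syn, eps_f_kd, d_syn,
                   alpha, alpha_syn, alpha_kd):
    """Evaluate the right-hand side of the final bound for given term values.

    eps_source: error on the training source mixture P_i^S
    c_local:    C(P_i, P^T), the local-vs-global distance term
    eps_f_syn:  error of the synthetic labeling function on P^T
    eps_f_kd:   error of the KD labeling function on P^T
    d_syn:      divergence between encoded P^Syn and encoded P^T
    """
    if abs(alpha + alpha_syn + alpha_kd - 1.0) > 1e-9:
        raise ValueError("component weights must sum to 1")
    return (eps_source
            + alpha * c_local
            + alpha_syn * eps_f_syn
            + alpha_kd * eps_f_kd
            + 0.5 * (1.0 - alpha) * d_syn)


# Hypothetical term values, held fixed across both weightings.
terms = dict(eps_source=0.125, c_local=0.5,
             eps_f_syn=0.0625, eps_f_kd=0.0625, d_syn=0.25)

heavy_local = theorem1_bound(alpha=0.5, alpha_syn=0.25, alpha_kd=0.25, **terms)
light_local = theorem1_bound(alpha=0.25, alpha_syn=0.375, alpha_kd=0.375, **terms)
print(heavy_local, light_local)
```

With these illustrative numbers, where the local distance term dominates, reducing $\alpha$ lowers the value of the bound; with a large divergence `d_syn` the opposite can happen.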

A.3 Interpretation for Theorem 1

From Eq. (1), it can be seen that the generalization bound for $M_i$ consists of five terms.

  • The first term $\epsilon_{P_i^S}(M_i)$ is the error with respect to the training source data distribution. By Claim 1 in the appendix, minimizing this term is equivalent to optimizing the loss $\alpha\,\mathbb{E}_{(\mathbf{x},y)\sim P_i}\mathcal{L}_{\text{CE}} + \alpha^{Syn}\,\mathbb{E}_{(\mathbf{x},y)\sim P^{Syn}}\mathcal{L}_{\text{CE}} + \alpha^{Syn}_{\text{KD}}\,\mathbb{E}_{(\mathbf{x},y)\sim P^{Syn}}\mathcal{L}_{\text{KD}}$. Since this is the form of our loss function in Eq. 3.4, we expect this term to be minimized.

  • The second term is inherited from the original generalization bound in (Ben-David et al., 2010) with the local training data. In our case, it is controlled by the component weight $\alpha$: if we rely less on the local data (i.e., smaller $\alpha$), this term vanishes. Moreover, even if we rely more on local data, the term is essentially a distance measure between the local client distribution $P_i$ and the global data distribution $P^T$. Since the global data distribution is an average of the closely related local client distributions, we expect this term to be small (Tang et al., 2022; Ben-David et al., 2010; Albuquerque et al., 2019).

  • The third term measures the discrepancy between the real and the synthetic data labeling mechanisms. This discrepancy will be low because of our synthetic data generation process. Note that the objective of data distillation is to achieve $\mathbb{E}_{\mathbf{x}\sim P^T}[l(M^T(\mathbf{x}), y)] \simeq \mathbb{E}_{\mathbf{x}\sim P^{Syn}}[l(M^{Syn}(\mathbf{x}), y)]$ (Eq. 1 in (Zhao & Bilen, 2023)). If we take $M$ to be a well-trained deep NN, then the synthetic data labeling $f^{Syn}$ will be similar to the real labeling $f^T$. Here we use distribution matching, which applies an MMD loss to minimize the embedding differences between synthetic and real data in the same class (Zhao & Bilen, 2023), as a proxy to achieve that.

  • The fourth term originates from the knowledge distillation loss in Eq. 3.4. Here, we use the consensus knowledge from neighbour models to improve the local model. The labeling function of the extended KD data, $f^{Syn}_{KD}$, changes as training continues and the neighbour models learn to generalize well. Towards the end of training, predictions from the consensus knowledge should match those of the true labeling function; therefore, $f^{Syn}_{KD}$ will be close to $f^T$.

  • The fifth term is a distribution divergence between the encoded distributions of $P^{Syn}$ and $P^T$. It is minimized by the domain-invariant regularizer in Eq. 4, which acts as an anchor to pull all the encoded distributions together.

Remark: To obtain a tight generalization guarantee, we only need one of the fourth or fifth terms to be small: if either of them is small, we can adjust the component weights $\alpha$ (in practice, $\lambda_{REG}$ and $\lambda_{KD}$) to get a better generalization guarantee.
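The third and fifth terms above both rely on matching embedding distributions, with MMD named as the matching loss. As a minimal sketch, the code below computes the squared MMD under a linear kernel, which reduces to the squared Euclidean distance between the mean embeddings of the two sets. The linear kernel, the function names `mean_embedding` and `linear_mmd_squared`, and the toy embeddings are assumptions for illustration; this section does not specify the kernel or the exact form of the regularizer used in the method.

```python
def mean_embedding(embeddings):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    count = len(embeddings)
    if count == 0:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    return [sum(vec[k] for vec in embeddings) / count for k in range(dim)]


def linear_mmd_squared(emb_a, emb_b):
    """Squared MMD with a linear kernel: ||mean(emb_a) - mean(emb_b)||^2."""
    mean_a = mean_embedding(emb_a)
    mean_b = mean_embedding(emb_b)
    return sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))


# Toy 2-D embeddings of one class (hypothetical values).
real = [[0.0, 0.0], [2.0, 2.0]]
synthetic_close = [[1.0, 1.0]]   # same mean as `real`
synthetic_far = [[3.0, 1.0]]     # mean shifted along the first axis

mmd_close = linear_mmd_squared(real, synthetic_close)
mmd_far = linear_mmd_squared(real, synthetic_far)
print(mmd_close, mmd_far)
```

A linear kernel only compares first moments; richer kernels would also penalize differences in higher-order statistics of the embeddings.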

A.4 Proof for Proposition 2

Proof.

Without loss of generality, let us start with

$$\sup_{M\in\mathcal{M}_i} |\epsilon_{P^{Syn}}(M) - \epsilon_{P^T}(M)| + \epsilon_{P^T}(f^{Syn}) \leq \inf_{M\in\mathcal{M}_i} \big(\epsilon_{P_i}(M) - \epsilon_{P^T}(M)\big) + \underbrace{\frac{1}{2} d_{\mathcal{M}_i\Delta\mathcal{M}_i}(P_i, P^T) + \lambda(P_i)}_{\mathbf{C}(P_i, P^T)}. \tag{14}$$

Then it holds that for any $M\in\mathcal{M}_i$,

$$
\begin{aligned}
\epsilon_{P^{Syn}}(M) - \epsilon_{P^T}(M) + \epsilon_{P^T}(f^{Syn}) &\leq \epsilon_{P_i}(M) - \epsilon_{P^T}(M) + \mathbf{C}(P_i, P^T) \\
\Rightarrow\quad \epsilon_{P^{Syn}}(M) + \epsilon_{P^T}(f^{Syn}) &\leq \epsilon_{P_i}(M) + \mathbf{C}(P_i, P^T)
\end{aligned} \tag{15}
$$

Note that the right side of (15) is the original bound in Theorem 2 in (Ben-David et al., 2010). Similarly, we can obtain

$$\epsilon_{P^{Syn}_{KD}}(M) + \epsilon_{P^T}(f^{Syn}_{KD}) \leq \epsilon_{P_i}(M) + \frac{1}{2} d_{\mathcal{M}_i\Delta\mathcal{M}_i}(P_i, P^T) + \lambda(P_i) \tag{16}$$

Combining (15) and (16) with the component weights $\bm{\alpha}$ and setting $\alpha\rightarrow 0$, we have

$$\alpha^{Syn}_{KD}\,\epsilon_{P^{Syn}_{KD}}(M) + \alpha^{Syn}\,\epsilon_{P^{Syn}}(M) + \epsilon_{P^T}(f^{Syn}_{KD}) \leq \epsilon_{P_i}(M) + \frac{1}{2} d_{\mathcal{M}_i\Delta\mathcal{M}_i}(P_i, P^T) + \lambda(P_i) \tag{17}$$

Therefore, we conclude that our global generalization bound in Theorem 1 (which is the LHS of Eq. 17) is tighter than the original bound in (Ben-David et al., 2010) (RHS), when the condition of Proposition 2 holds.

A.5 Some useful lemmas and claims

Lemma 1.

Denote the model as $M = \rho\circ\psi \in \mathcal{M}$. The global generalization bound holds as

$$\epsilon_{P^T}(M) \leq \epsilon_{P}(M) + \frac{1}{2} d_{\mathcal{P}\Delta\mathcal{P}}(\psi\circ P, \psi\circ P^T) + \epsilon_{P^T}(f), \tag{18}$$

where $(P, f)$ could be either the $(P^{Syn}, f^{Syn})$ or the $(P^{Syn}_{KD}, f^{Syn}_{KD})$ pair.

Proof.

For any model $M = \rho\circ\psi \in \mathcal{M}$, we have the following bound for the global virtual data distribution:

$$
\begin{aligned}
\epsilon_{P^T}(M) - \epsilon_{P^{Syn}}(M) &\stackrel{(a)}{=} \epsilon_{P^T}(M, f^T) - \epsilon_{P^{Syn}}(M, f^{Syn}) \\
&\stackrel{(b)}{\leq} |\epsilon_{P^T}(M, f^{Syn}) + \epsilon_{P^T}(f^{Syn}, f^T) - \epsilon_{P^{Syn}}(M, f^{Syn})| \\
&\leq |\epsilon_{P^T}(M, f^{Syn}) - \epsilon_{P^{Syn}}(M, f^{Syn})| + \epsilon_{P^T}(f^{Syn}) \\
&= |\epsilon_{P^T}(\rho\circ\psi, f^{Syn}) - \epsilon_{P^{Syn}}(\rho\circ\psi, f^{Syn})| + \epsilon_{P^T}(f^{Syn}) \\
&= |\epsilon_{\psi\circ P^T}(\rho, f^{Syn}\circ\psi^{-1}) - \epsilon_{\psi\circ P^{Syn}}(\rho, f^{Syn}\circ\psi^{-1})| + \epsilon_{P^T}(f^{Syn}) \\
&\leq \sup_{\rho,\rho'} |\epsilon_{\psi\circ P^T}(\rho, \rho') - \epsilon_{\psi\circ P^{Syn}}(\rho, \rho')| + \epsilon_{P^T}(f^{Syn})
\end{aligned}
$$
12d𝒫Δ𝒫(ψPSyn,ψPT)+ϵPT(fSyn)absent12subscript𝑑𝒫Δ𝒫𝜓superscript𝑃𝑆𝑦𝑛𝜓superscript𝑃𝑇subscriptitalic-ϵsuperscript𝑃𝑇superscript𝑓𝑆𝑦𝑛\displaystyle\leq\frac{1}{2}d_{\mathcal{P}\Delta\mathcal{P}}(\psi\circ P^{Syn}% ,\psi\circ P^{T})+\epsilon_{P^{T}}(f^{Syn})≤ divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_d start_POSTSUBSCRIPT caligraphic_P roman_Δ caligraphic_P end_POSTSUBSCRIPT ( italic_ψ ∘ italic_P start_POSTSUPERSCRIPT italic_S italic_y italic_n end_POSTSUPERSCRIPT , italic_ψ ∘ italic_P start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) + italic_ϵ start_POSTSUBSCRIPT italic_P start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_f start_POSTSUPERSCRIPT italic_S italic_y italic_n end_POSTSUPERSCRIPT ) (19)

where (a) is by definitions and (b) relies on the triangle inequality for classification error (Ben-David et al., 2006; Crammer et al., 2008). Thus, we have that

$$
\epsilon_{P^{T}}(M)\leq\epsilon_{P^{Syn}}(M)+\frac{1}{2}d_{\mathcal{P}\Delta\mathcal{P}}(\psi\circ P^{Syn},\psi\circ P^{T})+\epsilon_{P^{T}}(f^{Syn}). \tag{20}
$$

Similarly, since the extended KD dataset shares the same feature distribution as the global virtual dataset, the above bound also holds for $f^{Syn}_{KD}$. ∎

Lemma 2 (Appendix A (Feng et al., 2021)).

For the extended source domain $(\mathbf{x}^{Syn},\hat{y}^{Syn})\sim\hat{P}^{Syn}$, training the related model with the knowledge distillation loss $L_{KD}=D_{KD}(\hat{y}^{Syn}\|h(\mathbf{x}))$ is equivalent to optimizing the task risk $\epsilon_{\hat{P}^{Syn}}=\mathbb{P}_{(\mathbf{x}^{Syn},\hat{y}^{Syn})\sim\hat{P}^{Syn}}[h(\mathbf{x})\neq\arg\max\hat{y}^{Syn}]$.

Claim 1.

With the training source data at the $i$th client as $P_{i}^{S}$, with the component weight $\bm{\alpha}=[\alpha,\alpha^{Syn},\alpha_{KD}^{Syn}]^{\top}$ on the local data, virtual data, and extended KD data, $\epsilon_{P_{i}^{S}}(h)$ is minimized by optimizing the loss:

$$
\min_{M\in\mathcal{M}}\ \alpha\,\mathbb{E}_{(\mathbf{x},y)\sim P_{i}}L_{CE}(y,M(\mathbf{x}))+\alpha^{Syn}\,\mathbb{E}_{(\mathbf{x},y)\sim P^{Syn}}L_{CE}(y,M(\mathbf{x}))+\alpha_{KD}^{Syn}\,\mathbb{E}_{(\mathbf{x},y)\sim P_{KD}^{Syn}}L_{KL}(y\|M(\mathbf{x})) \tag{21}
$$
Proof.

Note that

$$
\begin{aligned}
&\min_{M\in\mathcal{M}}\ \mathbb{E}_{(\mathbf{x},y)\sim P_{i}^{(S)}}L_{KL}(y\|M(\mathbf{x})) \\
&\propto \min_{M\in\mathcal{M}}\ \alpha\,\mathbb{E}_{(\mathbf{x},y)\sim P_{i}}L_{KL}(y\|M(\mathbf{x}))+\alpha^{Syn}\,\mathbb{E}_{(\mathbf{x},y)\sim P^{Syn}}L_{KL}(y\|M(\mathbf{x}))+\alpha_{KD}^{Syn}\,\mathbb{E}_{(\mathbf{x},y)\sim P_{KD}^{Syn}}L_{KL}(y\|M(\mathbf{x})) \\
&\stackrel{(a)}{\propto} \min_{M\in\mathcal{M}}\ \alpha\,\mathbb{E}_{(\mathbf{x},y)\sim P_{i}}L_{CE}(y,M(\mathbf{x}))+\alpha^{Syn}\,\mathbb{E}_{(\mathbf{x},y)\sim P^{Syn}}L_{CE}(y,M(\mathbf{x}))+\alpha_{KD}^{Syn}\,\mathbb{E}_{(\mathbf{x},y)\sim P_{KD}^{Syn}}L_{KL}(y\|M(\mathbf{x}))
\end{aligned}
$$

where (a) is because $L_{KL}(y\|h(\mathbf{x}))=L_{CE}(y,h(\mathbf{x}))-H(y)$, where $H(y)=-y\log(y)$ is a constant depending on the data distribution. With Lemma 2 and Pinsker’s inequality, it is easy to show that $\epsilon_{P_{i}^{S}}(h)$ is minimized by optimizing the above loss. ∎
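The identity used in step (a) can be checked numerically. The following is a minimal sketch, not part of the original proof: the helper names (`softmax`, `cross_entropy`, `entropy`, `kl_divergence`) and the example logits and labels are hypothetical, and the entropy is summed over classes. It illustrates that the KL and cross-entropy losses differ only by the label entropy, which does not depend on the model, and that the two coincide for one-hot labels.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y, q):
    # L_CE(y, q) = -sum_c y_c log q_c, with q strictly positive.
    return float(-np.sum(y * np.log(q)))

def entropy(y):
    # H(y) = -sum_c y_c log y_c, using the convention 0 log 0 = 0.
    nz = y > 0
    return float(-np.sum(y[nz] * np.log(y[nz])))

def kl_divergence(y, q):
    # L_KL(y || q) = sum_c y_c (log y_c - log q_c), skipping y_c = 0.
    nz = y > 0
    return float(np.sum(y[nz] * (np.log(y[nz]) - np.log(q[nz]))))

# Hypothetical model output and labels, chosen only for illustration.
q = softmax(np.array([2.0, 0.5, -1.0]))
soft = np.array([0.7, 0.2, 0.1])   # soft (KD-style) label
hard = np.array([1.0, 0.0, 0.0])   # one-hot label, so H(y) = 0

gap_soft = kl_divergence(soft, q) - (cross_entropy(soft, q) - entropy(soft))
gap_hard = kl_divergence(hard, q) - cross_entropy(hard, q)
print(gap_soft, gap_hard)  # both are zero up to floating-point error
```

Because $H(y)$ is independent of the model, minimizing either loss over $M$ yields the same minimizer, which is what step (a) relies on.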

Appendix B Privacy Discussion for DeSA

Sharing image-level information among clients may raise privacy concerns. However, decentralized FL with both data and model heterogeneity is an extremely challenging setting, in which existing solutions require sharing either real public data (Lin et al., 2020; Huang et al., 2022) or synthetic data generated from real data with a GAN-based generator (Zhang et al., 2022a, b). Instead, we propose to use distribution matching to distill data, a simple and less data-greedy strategy for data synthesis. Research has shown that using distilled data can defend against privacy attacks (Dong et al., 2022) such as membership inference attacks (MIA) (Shokri et al., 2017) and gradient inversion attacks (Huang et al., 2021b). We show DeSA’s defense against MIA (Carlini et al., 2022a) in Appendix B.1. In addition, recent papers have successfully applied differential privacy (DP) mechanisms (Abadi et al., 2016) to data distillation (Xiong et al., 2023; Wang et al., 2023) to ensure privacy. We also discuss the feasibility of adding DP to the data distillation process following Xiong et al. (2023), and show in Appendix B.2 that DeSA remains effective when using DP synthetic anchor data. We focus the main text on our proposed methodology and theoretical analysis, since sharing synthetic data is common in the related work mentioned above and we align with their settings in our comparisons. We therefore consider the potential privacy risk of FL with DeSA to be beyond the main scope of our study and discuss it in this appendix.

B.1 Membership Inference Attack

We show that, under the basic setting of DeSA (i.e., without applying differential privacy when generating local synthetic data), DeSA protects the membership information of local real data better than local training or FedAvg (McMahan et al., 2017) on local real data only, when the trained local models face a membership inference attack (MIA). Although we share logits during communication, these logits are computed on synthetic anchor data, not on the real data that needs protection. Therefore, we cannot use MIA methods that rely on logits. Instead, we perform a strong, recently proposed MIA and evaluate it following the approach in (Carlini et al., 2022a).

Figure 4: MIA on the models trained by the (a) SVHN, (b) SynthDigits, and (c) MNIST-M clients. Observe that the synthetic data sharing of DeSA does not reveal other clients’ local data identity information.

The goal of the experiment is to investigate whether our local model is vulnerable to MIA, namely whether it leaks membership information about the local real datasets. To compare and demonstrate the effectiveness of the chosen attack, we also present results from local training and FedAvg training. We conduct MIA experiments using DIGITS. The MIA for local training and FedAvg targets the real local training data. Since we use synthetic anchor data generated from other clients with data distillation, we also provide MIA results for inferring real data of other clients. For example, when attacking SVHN’s local model, local training and FedAvg report the MIA results on SVHN only, while for DeSA we also report MIA results on MNIST, USPS, SynthDigits, and MNIST-M.

Using the metric in (Carlini et al., 2022a), the results are shown in Figure 4. The Ref (diagonal) line indicates that MIA cannot tell training data from testing data. If a curve bends towards True Positive Rate, membership in the training set can be inferred. For DeSA’s model, all the MIA curves of the targeted and other clients lie along the Ref line, which indicates that the membership of each training set is well protected against the applied attack. In contrast, the curves for the MIA on FedAvg and local training with the SVHN dataset are all offset from the Ref (diagonal) line towards True Positive, indicating that they are more vulnerable to MIA and leak training data information.
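To make the reading of such curves concrete, here is a minimal sketch of how an ROC curve can be computed from per-sample attack scores. It is not the attack or evaluation code of (Carlini et al., 2022a); the function names `roc_points` and `auc` and the toy scores are hypothetical, and tied scores are not treated specially.

```python
import numpy as np

def roc_points(scores, is_member):
    """Return (fpr, tpr) arrays obtained by sweeping a threshold over scores.

    scores: higher means the attack believes the sample was a training member.
    is_member: 1 for true training members, 0 for non-members.
    """
    scores = np.asarray(scores, dtype=float)
    members = np.asarray(is_member).astype(bool)
    order = np.argsort(-scores, kind="stable")   # most suspicious samples first
    y = members[order]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(~y) / (~y).sum()])
    return fpr, tpr

def auc(fpr, tpr):
    # Area under the curve via the trapezoidal rule.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy example: an attack that separates members perfectly (AUC = 1).
fpr, tpr = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
print(auc(fpr, tpr))
```

A curve that stays on the diagonal corresponds to an AUC of about 0.5, i.e. an attack no better than random guessing, whereas a curve bending towards the true-positive axis gives a larger area.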

B.2 Differential Privacy for Data Synthesis

To enhance privacy preservation for the shared synthetic anchor data, we apply differentially private stochastic gradient descent (DP-SGD) (Abadi et al., 2016) to the synthetic image generation. DP-SGD protects local data information via noise injection on clipped gradients. In our experiments, we apply the Gaussian mechanism for the injected noise. Specifically, we first sample a class-balanced subset from the raw data to train the objective 3.2. We set the batch size to 256. In each iteration, we clip the gradient so that its $l_{2}$-norm is at most 2. The injected noise is drawn from $\mathcal{N}(0,1.2)$. This step ensures $(\epsilon,\delta)$-DP guarantees with $(\epsilon,\delta)$ values of {(7.622, 0.00015), (10.3605, 0.00021), (8.6677, 0.00017), (7.3174, 0.00014), (7.6221, 0.00015)} for {MNIST, SVHN, USPS, SynthDigits, MNIST-M}, respectively. We visualize the non-DP and DP synthetic images from each client in DIGITS in Figure 5 and Figure 6, respectively. One can observe that the synthetic data generated with the DP mechanism is noisy, which makes it hard to inspect individual information of the raw data.
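The clip-then-add-noise step of DP-SGD can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, gradients are represented as flat per-sample vectors, and the value 1.2 is assumed here to be the standard deviation of the Gaussian noise (the text does not say whether it is a standard deviation or a variance, or whether it is scaled by the clipping norm).

```python
import numpy as np

def clip_per_sample(grads, clip_norm):
    # Rescale each per-sample gradient so that its l2-norm is at most clip_norm.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return grads * scale

def dp_sgd_gradient(grads, clip_norm=2.0, noise_std=1.2, rng=None):
    """Noisy average of clipped per-sample gradients (Gaussian mechanism).

    grads: array of shape (batch_size, num_params).
    noise_std: assumed standard deviation of the injected noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    clipped = clip_per_sample(np.asarray(grads, dtype=float), clip_norm)
    noise = rng.normal(0.0, noise_std, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / clipped.shape[0]

# Toy batch of two per-sample gradients; the first has norm 5 and gets clipped.
batch = np.array([[3.0, 4.0], [0.0, 1.0]])
print(dp_sgd_gradient(batch, rng=np.random.default_rng(0)))
```

The resulting privacy budget $(\epsilon,\delta)$ additionally depends on the sampling rate and the number of iterations, which this sketch does not account for.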

Figure 5: Visualization of the global and local synthetic images from the DIGITS dataset: (a) the MNIST client; (b) the SVHN client; (c) the USPS client; (d) the SynthDigits client; (e) the MNIST-M client; (f) the server synthetic data.
Figure 6: Visualization of the global and local synthetic images from the DIGITS dataset with DP mechanism: (a) the MNIST client; (b) the SVHN client; (c) the USPS client; (d) the SynthDigits client; (e) the MNIST-M client; (f) the server synthetic data.

We replace the synthetic data with DP synthetic data and perform the DIGITS experiments; the results are shown in Table 6. Although DeSA’s performance drops slightly due to the DP mechanism, its averaged inter- and intra-accuracy still ranks second, which indicates that DeSA is robust as long as we can synthesize images that roughly capture the global data distribution.

Table 6: We add the results for DeSA trained with DP synthetic anchor data to our Table 2. The best result is marked as bold, and the second best is marked as blue. The table shows that DeSA with DP synthetic anchor data can still obtain results comparable to DeSA with non-DP synthetic data.

DIGITS

| Method | Public data | MN(C) | SV(A) | US(C) | Syn(A) | MM(C) | Avg |
|---|---|---|---|---|---|---|---|
| FedHe | | 59.51 | 66.67 | 49.89 | 75.39 | 71.57 | 64.81 |
| FedDF | CIFAR100 | 65.98 | 65.21 | 61.30 | 69.65 | 74.48 | 67.32 |
| FedDF | FMNIST | 43.05 | 69.14 | 44.95 | 74.67 | 71.27 | 60.61 |
| FCCL | CIFAR100 | - | - | - | - | - | - |
| FCCL | FMNIST | 46.43 | 61.02 | 42.64 | 63.05 | 66.39 | 55.91 |
| FedProto | | 62.59 | 71.74 | 58.52 | 81.19 | 74.44 | 69.70 |
| DeSA($D_{\rm VHL}^{Syn}$) | | 54.40 | 62.03 | 42.34 | 67.75 | 73.03 | 59.91 |
| DeSA | | 70.12 | 76.17 | 71.17 | 81.10 | 73.83 | 74.47 |
| DeSA(DP) | | 69.06 | 71.54 | 63.92 | 78.93 | 73.16 | 71.12 |

Appendix C Local Epoch

Here we present the effect of local epochs on DeSA. To ensure a fair comparison, we fix the total number of training iterations across the three experiments; e.g., we set the number of FL communication rounds to 50 when the number of local epochs is 2, which matches 100 communication rounds with 1 local epoch. Table 7 shows that DeSA is robust to various local epoch selections. The experiment is run on the DIGITS dataset, and we report the global accuracy.

Table 7: Ablation study on the number of local epochs. The experiment is run on the DIGITS dataset.

| Local Epoch | 1 | 2 | 5 |
|---|---|---|---|
| Global Acc | 74.47 | 74.15 | 74.34 |

Appendix D Communication Overhead

As noted in Section 3.1, DeSA only requires sharing logits w.r.t. global synthetic data during training. Thus, it has a relatively low communication overhead compared to baseline methods, which require sharing model parameters. For a fair comparison, we analyze the communication cost based on the number of parameters Pre-FL and During-FL in Table 8. Note that we show the number of parameters for one communication round for During-FL, and the total communication cost depends on the number of global iterations. One can observe that sharing logits can largely reduce the communication overhead. For example, if we use ConvNet as our model, set IPC=50, and train for 100 global iterations, the total number of parameters communicated by DeSA will be 30.7K $\times$ 50 (Pre-FL) + 10 (number of classes) $\times$ 50 (images/class) $\times$ 10 (logits/image) $\times$ 100 (global iterations) = 2.04M. In comparison, baseline methods need to share 0 (Pre-FL) + 320K (parameters/iteration) $\times$ 100 (global iterations) = 32M, which is much larger than for DeSA. Under the model heterogeneity experimental setting, clients using AlexNet would incur an even higher total communication cost of 0 (Pre-FL) + 1.87M (parameters/iteration) $\times$ 100 (global iterations) = 187M.
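The arithmetic in the worked example above can be reproduced with a few lines of Python. This is only a sketch of the calculation as stated in the text; the function and argument names are hypothetical, and the constants (30.7K, 320K, 1.87M) are taken directly from the example.

```python
def desa_total_cost(pre_fl_per_ipc=30_700, ipc=50, num_classes=10, rounds=100):
    # Pre-FL: cost of sharing the synthetic anchors, 30.7K x IPC as in the text.
    pre_fl = pre_fl_per_ipc * ipc
    # During-FL: one logit vector (num_classes values) per anchor image per round.
    during_fl = num_classes * ipc * num_classes * rounds
    return pre_fl + during_fl

def baseline_total_cost(params_per_round, rounds=100):
    # Baselines share full model parameters every round and nothing Pre-FL.
    return params_per_round * rounds

desa = desa_total_cost()                    # 1,535,000 + 500,000
convnet = baseline_total_cost(320_000)      # ConvNet baseline
alexnet = baseline_total_cost(1_870_000)    # AlexNet baseline
print(desa, convnet, alexnet)
```

Under these numbers, DeSA transfers roughly 2.04M values in total, about 16 times fewer than the 32M of a ConvNet baseline.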

Table 8: Comparison of communication overhead. Note that for DeSA, we only share virtual global anchor logits during training. The total communication cost counts the total parameter transferred for 100 global iterations. IPC is the synthesized images per class, and C is the number of classes.

| | ConvNet | AlexNet | Global Anchor Logits |
|---|---|---|---|
| Pre-FL | 0 | 0 | 30.7K $\times$ IPC $\times$ C |
| During-FL | 320K | 1.87M | 100 $\times$ IPC $\times$ C |
| Total | 32M | 187M | 40.7K $\times$ IPC $\times$ C |

Appendix E Datasets and Synthetic Images

Figure 7: Visualization of the original DIGITS dataset: (a) the MNIST client; (b) the SVHN client; (c) the USPS client; (d) the SynthDigits client; (e) the MNIST-M client.
Figure 8: Visualization of the original OFFICE dataset: (a) the Amazon client; (b) the Caltech client; (c) the DSLR client; (d) the Webcam client.
Figure 9: Visualization of the original CIFAR10C. Panels (a)-(f) show sampled images from the first six clients.
Detailed Information of Selected Datasets

1) DIGITS={MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), USPS (Hull, 1994), SynthDigits (Ganin & Lempitsky, 2015), MNIST-M (Ganin & Lempitsky, 2015)} consists of 5 digit datasets with handwritten, real street, and synthetic digit images of $0,1,\cdots,9$. Thus, we assume 5 clients for this set of experiments. We use DIGITS as a baseline to show that DeSA can handle FL under large domain shift. Example images can be found in Figure 7.
2) OFFICE={Amazon (Saenko et al., 2010), Caltech (Griffin et al., 2007), DSLR (Saenko et al., 2010), and WebCam (Saenko et al., 2010)} consists of four data sources from Office-31 (Saenko et al., 2010) (Amazon, DSLR, and WebCam) and Caltech-256 (Griffin et al., 2007) (Caltech), resulting in four clients. Each client possesses images that were taken using various camera devices in different real-world environments, each featuring diverse backgrounds. We use OFFICE to show that DeSA can handle FL on real-world images with large domain shift. Example images can be found in Figure 8.
3) CIFAR10C consists of subsets extracted from Cifar10-C (Hendrycks & Dietterich, 2019), a collection of augmented Cifar10 (Krizhevsky et al., 2009) datasets obtained by applying 19 different corruptions. We employ a Dirichlet distribution with $\beta=2$ to generate three non-IID partitions within each corrupted dataset. As a result, we have 57 clients with domain- and label-shifted datasets. Example images can be found in Figure 9.
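For readers unfamiliar with Dirichlet-based label partitioning, the sketch below shows one common way to split a labeled dataset into three label-shifted parts with concentration $\beta=2$. The exact procedure used for CIFAR10C is not specified in the text, so the per-class splitting scheme, the function name `dirichlet_partition`, and the toy labels are all assumptions made for illustration.

```python
import numpy as np

def dirichlet_partition(labels, num_parts=3, beta=2.0, seed=0):
    """Split sample indices into num_parts label-shifted partitions.

    For each class, the share of samples assigned to each partition is
    drawn from Dirichlet(beta, ..., beta); smaller beta gives more skew.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    parts = [[] for _ in range(num_parts)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        props = rng.dirichlet(beta * np.ones(num_parts))
        # Convert proportions into split points within this class.
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for part, chunk in zip(parts, np.split(idx, cuts)):
            part.extend(chunk.tolist())
    return parts

# Toy labels: 10 classes with 100 samples each (hypothetical data).
toy_labels = np.repeat(np.arange(10), 100)
toy_parts = dirichlet_partition(toy_labels)
print([len(p) for p in toy_parts])
```

Applying such a split to each of the 19 corrupted datasets yields $19\times 3=57$ clients, matching the client count stated above.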

Synthetic Data Generation

We fix ConvNet as the backbone for data synthesis to avoid additional domain shift caused by different model architectures. We set the learning rate to 1 and use the SGD optimizer with momentum = 0.5. The batch size for DIGITS and CIFAR10C is set to 256, while we use 32 for OFFICE as its clients have fewer data points. For the same reason, we use 500 synthetic data points for DIGITS and CIFAR10C, and 100 synthetic data points for OFFICE. The number of training iterations for DIGITS and OFFICE is 1000, and we use 2000 for CIFAR10C since it contains more complex images.

We show the local synthetic images and global anchor images of DIGITS, OFFICE, and CIFAR10C in Figure 10, Figure 11, and Figure 12, respectively.

Figure 10: Visualization of the sampled global and local synthetic images from the DIGITS dataset: (a) the MNIST client’s synthetic data; (b) the SVHN client’s synthetic data; (c) the USPS client’s synthetic data; (d) the SynthDigits client’s synthetic data; (e) the MNIST-M client’s synthetic data; (f) the server synthetic data (average).
Figure 11: Visualization of the sampled global and local synthetic images from the OFFICE dataset: (a) the Amazon client’s synthetic data; (b) the Caltech client’s synthetic data; (c) the DSLR client’s synthetic data; (d) the Webcam client’s synthetic data; (e) the averaged synthetic data.
Figure 12: Visualization of the sampled global and local synthetic images from the first 5 clients in the CIFAR10C dataset: (a) the first client’s synthetic data (Client0); (b) the second client’s synthetic data (Client1); (c) the third client’s synthetic data (Client2); (d) the fourth client’s synthetic data (Client3); (e) the fifth client’s synthetic data (Client4); (f) the server synthetic data (average).

Appendix F Model Architectures

We use ConvNet to perform data distillation for the best synthesis quality. For model heterogeneity scenarios, we randomly select classification model architectures from {AlexNet, ConvNet}. The detailed setups for both models are given in Table 9 and Table 10.

Table 9: AlexNet architecture. For the convolutional layer (Conv2D), we list parameters with a sequence of input and output dimensions, kernel size, stride, and padding. For the max pooling layer (MaxPool2D), we list kernel and stride. For a fully connected layer (FC), we list input and output dimensions.

| Layer | Details |
|---|---|
| 1 | Conv2D(3, 128, 5, 1, 4), ReLU, MaxPool2D(2, 2) |
| 2 | Conv2D(128, 192, 5, 1, 2), ReLU, MaxPool2D(2, 2) |
| 3 | Conv2D(192, 256, 3, 1, 1), ReLU |
| 4 | Conv2D(256, 192, 3, 1, 1), ReLU |
| 5 | Conv2D(192, 192, 3, 1, 1), ReLU, MaxPool2D(2, 2) |
| 22 | FC(3072, num_class) |
Table 10: ConvNet architecture. For the convolutional layer (Conv2D), we list parameters with a sequence of input and output dimensions, kernel size, stride, and padding. For the max pooling layer (MaxPool2D), we list kernel and stride. For a fully connected layer (FC), we list the input and output dimensions. For the GroupNormalization layer (GN), we list the channel dimension.

| Layer | Details |
|---|---|
| 1 | Conv2D(3, 128, 3, 1, 1), GN(128), ReLU, AvgPool2d(2,2,0) |
| 2 | Conv2D(128, 128, 3, 1, 1), GN(128), ReLU, AvgPool2d(2,2,0) |
| 3 | Conv2D(128, 128, 3, 1, 1), GN(128), ReLU, AvgPool2d(2,2,0) |
| 4 | FC(1152, num_class) |
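As a sanity check on the AlexNet layer list in Table 9, the number of trainable parameters can be counted directly from the layer shapes. This is a small sketch under two assumptions that the table does not state explicitly: every layer has a bias term, and num_class = 10 (as in DIGITS). The helper names are hypothetical.

```python
def conv2d_params(c_in, c_out, kernel):
    # Weights (c_in * c_out * k * k) plus one bias per output channel.
    return c_in * c_out * kernel * kernel + c_out

def fc_params(d_in, d_out):
    # Weights (d_in * d_out) plus one bias per output unit.
    return d_in * d_out + d_out

# (input channels, output channels, kernel size) for layers 1-5 of Table 9.
ALEXNET_CONVS = [(3, 128, 5), (128, 192, 5), (192, 256, 3), (256, 192, 3), (192, 192, 3)]

def alexnet_param_count(num_class=10):
    conv_total = sum(conv2d_params(*cfg) for cfg in ALEXNET_CONVS)
    return conv_total + fc_params(3072, num_class)

print(alexnet_param_count())  # about 1.87M for num_class = 10
```

Under these assumptions the count is about 1.87M, in line with the per-round AlexNet cost quoted in Appendix D.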

Appendix G More Related Work

G.1 Model Homogeneous Federated Learning

We list the model-homogeneous FL approaches in decentralized FL and collaborative methods that are relevant to our setting.

G.1.1 Decentralized Federated Learning

To train a global model without a server, decentralized FL methods communicate a set of models through diverse decentralized client-network topologies (such as a ring (Chang et al., 2018), a mesh (Roy et al., 2019), or a sequential line (Assran et al., 2019)) using different communication protocols such as single-peer (gossip) or multiple-peer (broadcast). Yuan et al. (2023a) and Sheller et al. (2019, 2020) pass a single model from client to client, similar to an incremental learning setup. In this continual setting, only a single model is trained. Pappas et al. (2021), Roy et al. (2019), and Assran et al. (2019) pass models and aggregate their weights, similar to conventional FL. Since these methods use averaged aggregation techniques similar to FedAvg, most of them assume client model homogeneity. DeSA’s client network topology is similar to that of a mesh using the broadcast-gossip protocol, where every client samples certain neighbours in each communication round for sharing logits.

None of the works above aim to train various client model types without a server, which is our goal.

G.1.2 Collaborative Methods

Fallah et al. (2020) use a MAML (model-agnostic meta-learning) framework to explicitly train model-homogeneous client models to personalize well. The objective function of MAML evaluates the personalized performance assuming a one-step gradient descent update on the subsequent task. Huang et al. (2021a) modify the personalized objective by adding an attention-inducing term to the objective function, which promotes collaboration between pairs of clients that have similar data.

Ghosh et al. (2022) capture settings where different groups of users have their own objectives (learning tasks) but, by aggregating their private data with others in the same cluster (same learning task), can leverage strength in numbers to perform more efficient personalized federated learning. Donahue & Kleinberg (2021) use game theory to analyze whether a client should jointly train with other clients in a conventional FL setup [2.1], assuming its primary objective is to minimize the MSE loss on its own private dataset. They also identify cases where it is more beneficial for the clients to form coalitions and train one global model.

All the above works either slightly change the intra-client objective to enable some collaboration between model-homogeneous clients or explicitly create client clusters that learn collaboratively from each other. They do not tackle the general objective function that we do (see 3.1).

G.2 Model Heterogeneous Federated Learning

Model heterogeneous FL approaches relevant to DeSA broadly come under the following two types.

G.2.1 Knowledge Distillation Methods

Gong et al. (2022) propose FedKD, a one-shot centralized knowledge distillation approach that operates on unlabelled public data after the local training stage in order to mitigate the accuracy drop due to label shift among clients. DENSE (Zhang et al., 2022a) is a one-shot federated learning method that generates decision-boundary-aware synthetic data and trains the global model on the server side. FedFTG (Zhang et al., 2022b) fine-tunes the global model by knowledge distillation with hard sample mining. Yang et al. (2021) introduce Personalized Federated Mutual Learning (PFML), which leverages the non-IID properties to create customized models for individual parties. PFML incorporates mutual learning into the local update process within each party, enhancing both the global model and personalized local models. Furthermore, mutual distillation is employed to expedite convergence. The method assumes homogeneity of models for global server aggregation. However, all the above methods are centralized.

G.2.2 Mutual Learning Methods

Papers in this area predominantly use ideas from deep mutual learning (Zhang et al., 2018). Matsuda et al. (2022) use deep mutual learning to train heterogeneous local models for the sole purpose of personalization. The method creates clusters of clients whose local models have similar outputs. Clients within a cluster exchange their local models in order to tackle label shift among the data points. However, the method is centralized, and each client maintains two copies of models: one that is personalized and one that is exchanged. Li et al. (2021a) consider a setting similar to that of Chan & Ngai (2021), but instead solve the problem in a peer-to-peer decentralized manner using soft logit predictions on a client’s own local data. Their baselines assume model homogeneity among clients, and their technique assumes that there is no covariate shift because it uses only local data for the soft predictions. However, their technique can be modified for model heterogeneity. They report personalization (intra) accuracies only.
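As a rough illustration of the deep mutual learning idea these methods build on, the per-sample loss of one model can be sketched as cross-entropy on the true label plus a KL term that pulls its prediction toward a peer's soft prediction. This is a generic sketch under stated assumptions, not the loss of any specific method above: the function names and the `temperature` parameter are hypothetical, and only one peer is considered.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mutual_learning_loss(own_logits, peer_logits, label, temperature=1.0):
    """Cross-entropy on the true label plus KL(peer || own).

    The peer's soft prediction acts as an extra target; only logits are
    needed, so the two models may have different architectures.
    """
    own_soft = softmax(own_logits, temperature)
    peer_soft = softmax(peer_logits, temperature)
    cross_entropy = -math.log(softmax(own_logits)[label] + 1e-12)
    return cross_entropy + kl_divergence(peer_soft, own_soft)
```

Because the loss depends only on output logits, the same form applies whether the peers are model-homogeneous or model-heterogeneous.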

G.3 Dataset Distillation

Data distillation methods aim to create concise data summaries $D_{syn}$ that can effectively substitute the original dataset $D$ in tasks such as model training, inference, and architecture search. Moreover, recent studies have shown that data distillation also preserves privacy (Dong et al., 2022; Carlini et al., 2022b), which is critical in federated learning. In practice, dataset distillation is used in healthcare for privacy-preserving medical data sharing (Li et al., 2022). We briefly describe two types of distillation works below.

G.3.1 Gradient and Trajectory Matching Techniques

Gradient matching (Zhao et al., 2020) makes a deep neural network produce similar gradients for the terse synthetic images and the original large-scale dataset. The objective matches the gradients of the loss w.r.t. the weights (parameters) evaluated on $D$ and $D_{syn}$ at successive parameter values during the optimization on the original dataset $D$. Usually the cosine distance is used to measure the difference in gradient direction. Other works in this area modify the objective slightly, either by adding class-contrastive signals for better stability (Lee et al., 2022) or by applying the same image augmentations (such as crop and rotate) to both $D$ and $D_{syn}$ (Zhao & Bilen, 2021). A similar technique is that of Cazenavette et al. (2022), which tries to match the intermediate parameters in the optimization trajectories on $D$ and $D_{syn}$. It is computationally very expensive because of gradient unrolling in the optimization. TESLA (Cui et al., 2023) uses linear-algebraic manipulations to give better computational guarantees for trajectory matching.
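The cosine-distance comparison of gradients mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: gradients are given as flat lists of floats, one list per layer, and the per-layer distances are simply summed. The function names are hypothetical, and the published method's exact grouping of parameters may differ.

```python
import math

def cosine_distance(g_real, g_syn, eps=1e-12):
    """1 - cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_real, g_syn))
    norm_real = math.sqrt(sum(a * a for a in g_real))
    norm_syn = math.sqrt(sum(b * b for b in g_syn))
    return 1.0 - dot / (norm_real * norm_syn + eps)

def gradient_matching_loss(layer_grads_real, layer_grads_syn):
    """Sum of per-layer cosine distances between gradients computed
    on the real data and on the synthetic data."""
    return sum(
        cosine_distance(g_real, g_syn)
        for g_real, g_syn in zip(layer_grads_real, layer_grads_syn)
    )
```

The distance is 0 when the two gradients point in the same direction and grows to 2 when they point in opposite directions, so minimizing it with respect to the synthetic data aligns the update directions regardless of gradient magnitude.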

G.3.2 Distribution Matching Techniques

Distribution matching (Zhao & Bilen, 2023) solves the distillation task via a single-level optimization, leading to vastly improved scalability. More specifically, instead of matching the quality of models on $D$ vs. $D_{syn}$, distribution-matching techniques directly match the distributions of $D$ and $D_{syn}$ in a latent encoded space. See 3.2 for the objective function. CAFE (Wang et al., 2022) further refines the distribution-matching idea by solving a bilevel optimization problem that jointly optimizes a single encoder and the data summary, rather than using a pre-determined set of encoders. Adversarial techniques that use distribution matching, such as IT-GAN (Zhao & Bilen, 2022) and GAN (Goodfellow et al., 2014), aren’t suitable for a serverless setting. Since we aim to mitigate drifts across client distributions using our synthetic data, distribution matching is a more natural option for our work.
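A minimal sketch of matching distributions in a latent space is given below. It assumes the features have already been produced by some encoder and compares only their means (a linear-kernel form of the discrepancy). The function names are hypothetical, and the objective actually used in this work is the one referenced in 3.2, which may differ from this simplification.

```python
def mean_vector(features):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[j] for f in features) / n for j in range(dim)]

def distribution_matching_loss(real_features, syn_features):
    """Squared Euclidean distance between the mean latent embedding of
    the real data and that of the synthetic data."""
    mu_real = mean_vector(real_features)
    mu_syn = mean_vector(syn_features)
    return sum((a - b) ** 2 for a, b in zip(mu_real, mu_syn))
```

Because the loss needs only a forward pass through the encoder, with no inner training loop, it illustrates why the single-level formulation scales better than gradient or trajectory matching.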