Federated Contrastive Learning for Personalized Semantic Communication

Yining Wang, Wanli Ni, Wenqiang Yi, Xiaodong Xu, ** Zhang, and Arumugam Nallanathan

The work presented in this paper is funded by the National Key R&D Program of China No. 2020YFB1806905, the National Natural Science Foundation of China No. 62201079, the Beijing Natural Science Foundation No. L232051 and the Major Key Project of PCL Department of Broadband Communication. (Corresponding author: Xiaodong Xu.)

Yining Wang is with the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, 100876, China (e-mail: [email protected]).

Wanli Ni is with Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China (e-mail: [email protected]).

Wenqiang Yi is with the School of Computer Science and Electronic Engineering, University of Essex, Colchester CO4 3SQ, U.K. (e-mail: [email protected]).

Xiaodong Xu and ** Zhang are with the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China, and also with Peng Cheng Laboratory, Shenzhen 518066, China (e-mail: [email protected]; [email protected]).

Arumugam Nallanathan is with the School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, U.K. (e-mail: [email protected]).
Abstract

In this letter, we design a federated contrastive learning (FedCL) framework aimed at supporting personalized semantic communication. Our FedCL enables collaborative training of local semantic encoders across multiple clients and a global semantic decoder owned by the base station. This framework supports heterogeneous semantic encoders since it does not require client-side model aggregation. Furthermore, to tackle the semantic imbalance issue arising from heterogeneous datasets across distributed clients, we employ contrastive learning to train a semantic centroid generator (SCG). This generator obtains representative global semantic centroids that exhibit intra-semantic compactness and inter-semantic separability. Consequently, it provides superior supervision for learning discriminative local semantic features. Additionally, we conduct theoretical analysis to quantify the convergence performance of FedCL. Simulation results verify the superiority of the proposed FedCL framework compared to other distributed learning benchmarks in terms of task performance and robustness under different numbers of clients and channel conditions, especially in low signal-to-noise ratio and highly heterogeneous data scenarios.

Index Terms:
Federated semantic learning, contrastive learning, task-oriented communications, data heterogeneity.

I Introduction

Task-oriented semantic communication (SemCom) systems mainly employ sophisticated deep neural network (DNN) models or optimize wireless resource allocation to balance communication efficiency with target performance. However, few of them address how DNN-based semantic models are trained, even though the effectiveness of task-oriented SemCom relies heavily on the semantic models deployed on each transceiver, which require continuous updates as the channel environment and datasets change [1].

Since semantic model learning requires a huge quantity of training samples from dispersed users, most existing works exploited federated learning (FL) approaches. The authors of [2] proposed an FL-based semantic learning system with dynamic model aggregation. However, it relied on local training and central server aggregation, leading to high parameter transmission overhead and underutilization of the server’s computing power. In [3], an FL framework for semantic reconstruction reduced communication costs through partial client model aggregation, but still conducted training locally, neglecting the server’s potential. Wei et al. [4] introduced a client-server collaborative FL framework for knowledge graph generation, which nevertheless still requires model uploading and client-side transmitter aggregation.

Figure 1: (a) Architecture of the proposed FedCL for multi-user semantic learning; (b) Workflow of the proposed FedCL framework.

The previous studies assumed uniformity in user semantic models, limiting their use to homogeneous settings. However, in practice, client-side semantic transmitters should accommodate personalized encoders, which can adapt to diverse data distributions and varying model structures due to local devices’ different computation and storage capabilities. Moreover, existing research ignored the non-independent and identically distributed (non-IID) data among users. This inconsistency in feature spaces across clients degrades the performance of traditional FL [5, 6]. By grouping intra-class samples as positives and distinguishing inter-class samples as negatives, contrastive learning [7] fosters the learning of discriminative features that aid in identifying semantics, even in scenarios with unbalanced data distributions [8]. Applying this principle, contrastive loss can guide the training of semantic models by generalizing knowledge from similar samples while minimizing interference from semantically inconsistent ones [9].

In this work, we propose a federated contrastive learning (FedCL) framework for task-oriented SemCom, where personalized semantic encoders and a global semantic decoder are trained collaboratively between the clients and the base station (BS). The main contributions of this work are summarized as follows:

  • We design a novel FedCL framework for collaborative training of personalized semantic encoders on multiple clients and a global semantic decoder on the BS. Instead of exchanging model parameters or raw data, our approach exchanges features and back-propagation gradients, which not only preserves user privacy but also eliminates client-side model aggregation.

  • To overcome performance degradation from inconsistent semantic distributions in heterogeneous multi-user datasets, we introduce a semantic centroid generator (SCG) at the server. This network leverages contrastive learning to generate global semantic centroids, which are updated in each round to provide a unified semantic space for supervised local semantic feature learning. This approach transforms noisy features from heterogeneous data distributions into regularized features with intra-semantic compactness and inter-semantic separability, thereby enhancing robustness against channel noise.

  • We theoretically analyze the convergence performance of FedCL under the non-convex loss function setting, which provides a convergence guarantee to the proposed framework. Simulation results demonstrate that FedCL surpasses benchmark approaches in task performance, particularly in scenarios with low signal-to-noise ratio (SNR) and significant data heterogeneity.

II System Model

We consider a wireless network comprising one BS with an edge server and a set of devices $\mathcal{K}=\{1,2,\ldots,K\}$. The clients and the BS learn collaboratively to obtain personalized semantic encoders for feature extraction as well as channel encoding on each client, and a global semantic decoder for channel decoding as well as performing downstream tasks among semantic concepts $\mathcal{C}=\{1,2,\ldots,C\}$. In this section, we propose a FedCL framework for personalized semantic communication, which facilitates the training of heterogeneous semantic models among distributed clients.

II-A FedCL Framework

As depicted in Fig. 1(a), a semantic encoder is deployed on client $k$ to facilitate feature extraction from raw data while considering the impact of the wireless channel. The local dataset $\mathcal{D}_k$ with $D_k$ data samples owned by client $k$ is denoted as $\mathcal{D}_k=\{(\bm{x}_{k,i},y_{k,i})\,|\,i=1,2,\ldots,D_k\}$, where $\bm{x}_{k,i}$ represents the source input and $y_{k,i}$ is the corresponding label indicating the semantic concept of $\bm{x}_{k,i}$. Thus, the entire dataset $\mathcal{D}$ of all clients is denoted by $\mathcal{D}=\cup_{k=1}^{K}\mathcal{D}_k$ with $D=\sum_{k=1}^{K}D_k$ data samples. Note that the subscript $i$ is omitted when sample-wise formulation is not required.
We consider a training process for FedCL with $T$ communication rounds, as shown in Fig. 1(b). Specifically, in the $t$-th round, the FedCL process consists of the following stages.

First, each client $k$ performs the forward propagation in parallel on its personalized semantic encoder and extracts feature $\bm{f}_k$ from input data $\bm{x}_k$, which is denoted as

$$\bm{f}_k=\alpha(\bm{x}_k;\bm{\theta}_k^{(t)}),\quad\forall k\in\mathcal{K}, \tag{1}$$

where $\bm{\theta}_k^{(t)}$ is the semantic encoder parameter set of client $k$ in the $t$-th round. After the client-side forward propagation is completed, the encoded feature $\bm{f}_k$ is reshaped into semantic symbols $\bm{s}_k$ and transmitted to the BS over the wireless channel along with the label $\bm{y}_k$. We assume the proposed learning system is operated with orthogonal frequency division multiplexing (OFDM), where the channel is divided into orthogonal subcarriers according to the number of participating users. This ensures that multiple local devices can access the channel without interference. Then, the feature received at the BS can be expressed as

$$\hat{\bm{s}}_k=h_k\bm{s}_k+\bm{n}_k,\quad\forall k\in\mathcal{K}, \tag{2}$$

where $h_k$ denotes the channel coefficient and $\bm{n}_k$ is the IID channel noise vector, which follows the circularly symmetric complex Gaussian distribution $\mathcal{CN}(0,\delta^2\bm{I})$ with zero mean and variance $\delta^2$ [10]. Note that since the data volume of the label $\bm{y}_k$ is small, it can be transmitted accurately to the server.
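The uplink model in (2) can be illustrated with a short NumPy sketch. It is not the authors' implementation: the pairing of consecutive real feature entries into complex symbols, the helper names `feature_to_symbols`, `symbols_to_feature` and `transmit`, and the definition of SNR as received signal power over $\delta^2$ are all assumptions made for illustration.

```python
import numpy as np

def feature_to_symbols(f):
    """Reshape a real feature vector of even length into complex symbols (assumed mapping)."""
    pairs = f.reshape(-1, 2)
    return pairs[:, 0] + 1j * pairs[:, 1]

def symbols_to_feature(s):
    """Inverse of feature_to_symbols: interleave real and imaginary parts."""
    return np.stack([s.real, s.imag], axis=1).reshape(-1)

def transmit(symbols, h, snr_db, rng):
    """Apply s_hat = h * s + n with n ~ CN(0, delta^2 I), cf. (2)."""
    signal_power = np.mean(np.abs(h * symbols) ** 2)
    noise_var = signal_power / (10.0 ** (snr_db / 10.0))  # delta^2 (assumed SNR definition)
    # Circular symmetry: the variance is split evenly between real and imaginary parts.
    noise = np.sqrt(noise_var / 2.0) * (
        rng.standard_normal(symbols.shape) + 1j * rng.standard_normal(symbols.shape)
    )
    return h * symbols + noise

rng = np.random.default_rng(0)
f_k = rng.standard_normal(64)            # stand-in for the encoder output f_k
s_k = feature_to_symbols(f_k)            # semantic symbols s_k
s_hat = transmit(s_k, h=1.0 + 0.0j, snr_db=10.0, rng=rng)
f_hat = symbols_to_feature(s_hat)        # received feature fed to the decoder
```

With `h = 1` this reduces to an additive white Gaussian noise channel; a fading channel would draw `h` from a suitable distribution instead.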

Subsequently, the noised semantic symbols received from all participating clients are reshaped into noised semantic feature $\hat{\bm{f}}_k$, which is utilized as the input for forward propagation of the semantic decoder hosted on the BS. In parallel, the BS trains a separate instance $\beta(\bm{\phi}_k)$ of the semantic decoder for each user, which outputs the task results as

$$r_k=\beta(\hat{\bm{f}}_k;\bm{\phi}_k^{(t)}),\quad\forall k\in\mathcal{K}, \tag{3}$$

where $\bm{\phi}_k^{(t)}$ is the parameter set of client $k$’s semantic decoder in the $t$-th round. The parallel training method is adopted at the server side, where the semantic decoder parameters of each client can be iteratively updated using stochastic gradient descent (SGD). In the $t$-th communication round, the updated parameters of the $k$-th semantic decoder can be obtained as

$$\bm{\phi}_k^{(t+1)}=\bm{\phi}_k^{(t)}-\eta\bm{g}_k^{(t)},\quad\forall k\in\mathcal{K}, \tag{4}$$

where $\eta$ is the learning rate of the semantic decoder network and $\bm{g}_k^{(t)}$ is the obtained gradient in the $t$-th communication round. At the end of each round, model aggregation is performed at the BS to obtain an updated global semantic decoder $\beta(\bm{\phi})$. The decoder parameters for round $t+1$ are aggregated as

$$\bm{\phi}^{(t+1)}=\sum_{k=1}^{K}\frac{D_k}{D}\bm{\phi}_k^{(t+1)}. \tag{5}$$
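As a minimal sketch of the server-side update in (4) and the aggregation in (5), the following NumPy code stores each decoder instance as a dictionary of arrays. The dictionary layout and the function names `sgd_step` and `aggregate_decoders` are hypothetical and chosen only for illustration.

```python
import numpy as np

def sgd_step(params, grads, lr):
    """One SGD step phi_k <- phi_k - eta * g_k for a single decoder instance, cf. (4)."""
    return {name: params[name] - lr * grads[name] for name in params}

def aggregate_decoders(decoder_params, num_samples):
    """Dataset-size-weighted average of the per-client decoder instances, cf. (5)."""
    weights = np.asarray(num_samples, dtype=float)
    weights = weights / weights.sum()  # D_k / D
    names = decoder_params[0].keys()
    return {
        name: sum(w * params[name] for w, params in zip(weights, decoder_params))
        for name in names
    }

# Two toy decoder instances with D_1 = 1 and D_2 = 3 samples.
phi_1 = {"W": np.ones((2, 2))}
phi_2 = {"W": 3.0 * np.ones((2, 2))}
phi_global = aggregate_decoders([phi_1, phi_2], num_samples=[1, 3])
# Each entry of phi_global["W"] equals 0.25 * 1 + 0.75 * 3 = 2.5.
```

Because all decoder instances live on the BS, this aggregation involves no parameter exchange over the air.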

When the back propagation process reaches the first layer of the semantic decoder, the BS sends the updated gradients $\bm{g}_k^{(t)}$ to all participating clients over the wireless channel to guide the back propagation of the personalized semantic encoders. At the client’s side, after receiving the noised gradients $\check{\bm{g}}_k^{(t)}$, which are corrupted by the downlink transmissions between the BS and the devices, each client $k$ performs back propagation on its local model and updates the parameters using the gradient descent method, i.e.,

$$\bm{\theta}_k^{(t+1)}=\bm{\theta}_k^{(t)}-\eta_k\check{\bm{g}}_k^{(t)},\quad\forall k\in\mathcal{K}, \tag{6}$$

where $\eta_k$ denotes the learning rate of client $k$. The local learning rate can vary among different clients with heterogeneous computing power and latency requirements. After completing the back propagation process on the client side, the updated semantic encoder of round $t+1$ is obtained as $\alpha(\bm{\theta}_k^{(t+1)})$.

II-B Contrastive Learning-Based SCG

In real-world scenarios, clients deployed across various locations encounter diverse environments, resulting in semantic heterogeneity among local datasets. This discrepancy, known as the non-IID distribution, arises because each client’s semantic distribution is inconsistent with the global distribution observed at the server. To address this statistical heterogeneity issue, we employ a contrastive learning method to align the inconsistent semantic spaces across clients into a unified global semantic distribution by training a semantic centroid generator (SCG).

Each client has its local semantic centroid $\overline{\bm{f}}_k^c$ for each semantic concept $c$, which is aggregated on the server as

$$\overline{\bm{f}}_k^c=\frac{1}{D_{k,c}}\sum_{i\in\mathcal{D}_{k,c}}\hat{\bm{f}}_{k,i},\quad\forall k\in\mathcal{K},\ c\in\mathcal{C}, \tag{7}$$

where $\mathcal{D}_{k,c}$ denotes the set of samples belonging to the $c$-th category on client $k$, with $D_{k,c}$ data samples.
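The per-concept averaging in (7) can be sketched as follows. The function name `local_centroids`, the zero-based labels, and the `present` mask for concepts that a client does not hold are assumptions for illustration; the text does not specify how absent concepts are treated.

```python
import numpy as np

def local_centroids(features, labels, num_concepts):
    """Mean received feature per semantic concept for one client, cf. (7).

    features: (D_k, d) array of received features f_hat_{k,i}
    labels:   (D_k,) integer array with values in 0..num_concepts-1 (assumed indexing)
    Returns the (C, d) centroid array and a boolean mask of concepts seen on this client.
    """
    dim = features.shape[1]
    centroids = np.zeros((num_concepts, dim))
    present = np.zeros(num_concepts, dtype=bool)
    for c in range(num_concepts):
        mask = labels == c
        if mask.any():  # skip concepts with D_{k,c} = 0 to avoid dividing by zero
            centroids[c] = features[mask].mean(axis=0)
            present[c] = True
    return centroids, present

features = np.array([[1.0, 1.0], [3.0, 3.0], [0.0, 2.0]])
labels = np.array([0, 0, 2])
centroids, present = local_centroids(features, labels, num_concepts=3)
```

In this toy call the client holds no samples of concept 1, which mimics the label skew of a non-IID split.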

However, due to the statistical and model heterogeneity of personalized semantic encoders, the aggregated semantic centroids of different clients are highly diverse even when they correspond to the same semantic concept. Therefore, unlike other FL frameworks with centroid regularization that obtain the global centroids by simply aggregating the local centroids [11, 12], we design the SCG to generate trainable global semantic centroids $\overline{\bm{F}}=\{\overline{\bm{F}}^c\}_{c=1}^{C}$ via contrastive learning. The proposed SCG is constructed from two fully-connected layers with a ReLU activation in the middle, a structure that has proven useful in improving the quality of representations [13].
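A forward pass through such a two-layer generator might look like the sketch below. The layer widths, the weight scaling, and the idea of feeding one randomly initialized seed vector per concept into the network are assumptions; the text only fixes the layer structure (fully-connected, ReLU, fully-connected).

```python
import numpy as np

def init_scg(num_concepts, dim, hidden_dim, rng):
    """Randomly initialise per-concept seed vectors and the two-layer generator (assumed shapes)."""
    return {
        "seeds": rng.standard_normal((num_concepts, dim)),  # one seed vector per concept
        "W1": 0.1 * rng.standard_normal((dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": 0.1 * rng.standard_normal((hidden_dim, dim)),
        "b2": np.zeros(dim),
    }

def scg_forward(p):
    """Fully-connected -> ReLU -> fully-connected; returns one centroid per concept."""
    hidden = np.maximum(p["seeds"] @ p["W1"] + p["b1"], 0.0)
    return hidden @ p["W2"] + p["b2"]

rng = np.random.default_rng(0)
scg = init_scg(num_concepts=10, dim=64, hidden_dim=128, rng=rng)
global_centroids = scg_forward(scg)  # shape (10, 64)
```

In practice the generator parameters would be trained with an automatic-differentiation framework; the NumPy version only illustrates the data flow.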

Specifically, we first randomly initialize each global semantic centroid vector. Then the SCG model $\sigma(\cdot)$ parameterized by $\bm{\varphi}$ is updated in each round to generate better global semantic centroids using the following objective

$$\min_{\bm{\varphi}}\sum_{c=1}^{C}\mathcal{L}_F^c, \tag{8}$$

$$\mathcal{L}_F^c=\sum_{k=1}^{K}\underbrace{\log\Big(\sum_{n\in\mathcal{N}_c}\exp(\overline{\bm{f}}_k^c\cdot\overline{\bm{F}}^n)\Big)}_{\textit{loss for negatives}}-\underbrace{\log\big(\exp(\overline{\bm{f}}_k^c\cdot\overline{\bm{F}}^c)\big)}_{\textit{loss for positives}}, \tag{9}$$

where $\mathcal{N}_c$ denotes the set of semantic concepts other than $c$. Therefore, by maximizing the similarity between each local semantic centroid $\overline{\bm{f}}_k^c$ and the global semantic centroid $\overline{\bm{F}}^c$ of its ground-truth semantic concept $c$ (positives), while simultaneously minimizing the similarity between $\overline{\bm{f}}_k^c$ and the global semantic centroids of other irrelevant semantic concepts (negatives), the SCG can generate representative global semantic centroids that preserve the semantic information while maintaining a certain distance from centroids with different semantics.
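To make the objective in (8) and (9) concrete, the sketch below evaluates it with NumPy. It reads the sum over $k$ in (9) as applying to both the negative and the positive term, skips client-concept pairs for which a client holds no samples, and requires at least two concepts; these are assumptions, as are the function name and array layout.

```python
import numpy as np

def scg_contrastive_loss(local_c, present, global_c):
    """Sum over concepts c and clients k of log sum_{n != c} exp(f_k^c . F^n) - f_k^c . F^c.

    local_c:  (K, C, d) local centroids
    present:  (K, C) boolean mask of concepts each client actually holds (assumed handling)
    global_c: (C, d) global centroids produced by the SCG
    """
    num_concepts = global_c.shape[0]
    sims = np.einsum("kcd,nd->kcn", local_c, global_c)  # sims[k, c, n] = f_k^c . F^n
    pos = np.einsum("kcd,cd->kc", local_c, global_c)    # similarity to the matching centroid
    # Exclude n = c from the negatives, then take a numerically stable log-sum-exp.
    eye = np.eye(num_concepts, dtype=bool)
    neg_sims = np.where(eye[None, :, :], -np.inf, sims)
    m = neg_sims.max(axis=-1, keepdims=True)
    neg = m[..., 0] + np.log(np.exp(neg_sims - m).sum(axis=-1))
    return float((neg - pos)[present].sum())

# One client, two concepts, centroids aligned with the coordinate axes.
local_c = np.array([[[1.0, 0.0], [0.0, 1.0]]])
present = np.ones((1, 2), dtype=bool)
aligned = scg_contrastive_loss(local_c, present, np.eye(2))
swapped = scg_contrastive_loss(local_c, present, np.eye(2)[::-1])
```

Global centroids that match the local ones (`aligned`) give a lower loss than centroids assigned to the wrong concepts (`swapped`), which is the behavior the objective is meant to reward.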

With the SCG model trained on the server, global semantic centroids are generated with better inter-semantic separability and intra-semantic compactness, and they are exploited to regularize the noisy local semantic features. Thus, each semantic model is guided by a regularized loss function as

$$\mathcal{L}_k=\frac{1}{D_k}\sum_{c=1}^{C}\sum_{i\in\mathcal{D}_{k,c}}\mathcal{L}_T(r_{k,i},y_{k,i})+\lambda\|\hat{\bm{f}}_{k,i}-\overline{\bm{F}}^c\|_2^2, \tag{10}$$

where $\mathcal{L}_T$ denotes the task loss and $\lambda$ is the regularization coefficient. Thus, the optimization goal of the entire FedCL framework is expressed as

$$\min_{\{\bm{\theta}_k,\bm{\phi}_k\}_{k=1}^{K},\bm{\varphi}}\frac{1}{K}\sum_{k=1}^{K}\mathcal{L}_k. \tag{11}$$

Under the L2 supervision of SCG-based global semantic centroids, all noisy semantic features from different clients with heterogeneous data distributions and channel conditions are restricted to a consistent global semantic space, thereby integrating the personalized semantic features and preserving the shared semantics in a compact form.
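The regularized loss in (10) can be evaluated as in the following sketch. It assumes a classification task with cross-entropy as the task loss $\mathcal{L}_T$ and zero-based integer labels; the text does not fix either choice, and the function name is hypothetical.

```python
import numpy as np

def regularized_loss(logits, feats_hat, labels, global_c, lam):
    """Mean over samples of CE(r_i, y_i) + lam * ||f_hat_i - F^{y_i}||_2^2, cf. (10).

    logits:    (D_k, C) decoder outputs r_{k,i} (cross-entropy task loss is an assumption)
    feats_hat: (D_k, d) received features
    labels:    (D_k,) integer concept labels in 0..C-1
    global_c:  (C, d) global semantic centroids
    """
    # Numerically stable log-softmax for the cross-entropy term.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    task = -log_probs[np.arange(len(labels)), labels]
    # Squared L2 distance between each feature and the centroid of its own concept.
    reg = np.sum((feats_hat - global_c[labels]) ** 2, axis=1)
    return float(np.mean(task + lam * reg))

global_c = np.eye(2)
labels = np.array([0, 1])
logits = np.zeros((2, 2))
loss_on_centroid = regularized_loss(logits, global_c[labels], labels, global_c, lam=0.5)
loss_off_centroid = regularized_loss(
    logits, global_c[labels] + np.array([1.0, 0.0]), labels, global_c, lam=0.5
)
```

Features that sit exactly on their centroid incur only the task loss, while a unit offset adds $\lambda$ to the per-sample loss.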

Algorithm 1 Training process of the FedCL framework
1:  Input: Dataset $\mathcal{D}_k$ of each client $k$, noisy channel generated from a fixed distribution.
2:  Initialize: Client-side semantic encoder parameters $\bm{\theta}_k^{(0)}$, server-side semantic decoder parameters $\bm{\phi}_k^{(0)}$, global semantic centroids $\overline{\bm{F}}=\{\overline{\bm{F}}^c\}_{c=1}^{C}$.
3:  for communication round $t=0$ to $200$ do
4:     for $k=1$ to $K$ do
5:        Extract $\bm{f}_k$ using semantic encoder $\alpha(\bm{\theta}_k^{(t)})$ by (1).
6:        Reshape $\bm{f}_k$ as $\bm{s}_k$ and transmit over the channel.
7:        Receive and reshape the noisy symbols $\hat{\bm{s}}_k$ at the server and obtain $\hat{\bm{f}}_k$.
8:        Output task result $r_k$ using (3) at the BS.
9:        Aggregate local semantic centroids $\{\overline{\bm{f}}_k^c\}_{c=1}^{C}$ by (7).
10:        Calculate $\mathcal{L}_k$ using (10) at the BS.
11:        Update semantic decoder by (4) and obtain $\bm{\phi}_k^{(t+1)}$.
12:        Transmit gradients $\bm{g}_k^{(t)}$ back to client $k$ over the downlink wireless channel and obtain $\check{\bm{g}}_k^{(t)}$.
13:        Update semantic encoder by (6) and obtain $\bm{\theta}_k^{(t+1)}$.
14:     end for
15:     Update the SCG on the server using (8) and update global semantic centroids $\overline{\bm{F}}$.
16:     Aggregate server-side semantic decoder by (5).
17:  end for
18:  Output: Converged semantic encoder and decoder model.

Note that the SCG is deployed on the server but is independent of the semantic decoder. It is used only to generate global semantic centroids as supervision during semantic model training, and there are no other parameter interactions between the SCG and the semantic decoder. The entire training process of the proposed FedCL framework is described in Algorithm 1.

II-C Convergence Analysis

Denoting the entire semantic model parameters of client $k$ as $\bm{w}_{k}=\{\bm{\theta}_{k},\bm{\phi}_{k}\}$, we analyze the convergence performance of FedCL by introducing the following assumptions:

Assumption 1

(Lipschitz smooth). Each loss function is $L_{1}$-Lipschitz smooth, that is, its gradient is $L_{1}$-Lipschitz continuous. Since this assumption holds for an arbitrary client, we omit the subscript $k$:

$$\|\nabla\mathcal{L}_{t_{1}}-\nabla\mathcal{L}_{t_{2}}\|_{2}\leq L_{1}\|\bm{w}_{t_{1}}-\bm{w}_{t_{2}}\|_{2},\quad\forall t_{1},t_{2}>0. \tag{12}$$

This also implies the following quadratic bound:

$$\mathcal{L}_{t_{1}}-\mathcal{L}_{t_{2}}\leq\langle\nabla\mathcal{L}_{t_{2}},(\bm{w}_{t_{1}}-\bm{w}_{t_{2}})\rangle+\frac{L_{1}}{2}\|\bm{w}_{t_{1}}-\bm{w}_{t_{2}}\|_{2}^{2}. \tag{13}$$
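As a sanity check on how (12) leads to (13), the following minimal sketch evaluates both inequalities numerically on a one-dimensional softplus loss. The softplus function is only a stand-in chosen because its smoothness constant is known ($L_{1}=1/4$); it is not the loss used by FedCL, and all function names are hypothetical.

```python
import math
import random

L1 = 0.25  # smoothness constant of softplus (assumed stand-in loss, not the FedCL loss)


def loss(w):
    """Numerically stable softplus log(1 + exp(w))."""
    return max(w, 0.0) + math.log1p(math.exp(-abs(w)))


def grad(w):
    """Derivative of softplus, i.e. the sigmoid, computed stably."""
    if w >= 0:
        return 1.0 / (1.0 + math.exp(-w))
    e = math.exp(w)
    return e / (1.0 + e)


def quadratic_bound_gap(w1, w2):
    """Right-hand side minus left-hand side of (13); non-negative if (13) holds."""
    d = w1 - w2
    return grad(w2) * d + 0.5 * L1 * d * d - (loss(w1) - loss(w2))


rng = random.Random(0)
pairs = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(1000)]

# Inequality (12): the gradient is L1-Lipschitz on every sampled pair.
grad_ok = all(abs(grad(a) - grad(b)) <= L1 * abs(a - b) + 1e-12 for a, b in pairs)

# Inequality (13): the smallest gap over all pairs should not be negative.
min_gap = min(quadratic_bound_gap(a, b) for a, b in pairs)
```

Any other loss with a known smoothness constant could be substituted for `loss` and `grad` to repeat the check.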
Assumption 2

(Unbiased gradient and bounded variance). The stochastic gradient $g_{t}=\nabla\mathcal{L}(\bm{w}_{t},\xi_{t})$ is an unbiased estimator of the local gradient for each client $k$. Its expectation is formulated as

$$\mathbb{E}_{\xi_{k}\sim\mathcal{D}_{k}}[\bm{g}_{k,t}]=\nabla\mathcal{L}_{k}(\bm{w}_{k,t})=\nabla\mathcal{L}_{t},\;\;\forall k\in\mathcal{K}, \tag{14}$$

and its variance is bounded by $\rho^{2}$:

$$\mathbb{E}[\|\bm{g}_{k,t}-\nabla\mathcal{L}(\bm{w}_{k,t})\|_{2}^{2}]\leq\rho^{2},\;\;\forall k\in\mathcal{K}. \tag{15}$$
Assumption 3

(Lipschitz continuity). The SCG network $\sigma(\bm{\varphi})$ is $L_{2}$-Lipschitz continuous, that is,

$$\|\sigma(\bm{\varphi}_{t_{1}})-\sigma(\bm{\varphi}_{t_{2}})\|\leq L_{2}\|\bm{\varphi}_{t_{1}}-\bm{\varphi}_{t_{2}}\|_{2},\;\;\forall t_{1},t_{2}>0. \tag{16}$$
Assumption 4

(Bounded expectation of Euclidean norm of stochastic gradients). The expected Euclidean norm of the stochastic gradient of the SCG network is bounded by $G$:

$$\mathbb{E}[\|\bm{g}_{t}^{\prime}\|_{2}]\leq G. \tag{17}$$

Then we derive the expected one-round decrease in Theorem 1. We denote $\{1/2,1,2,\ldots,E\}$ as the local iteration indices of the semantic model parameters $\bm{w}_{k}$, $\{1/2,1,2,\ldots,E^{\prime}\}$ as the local iteration indices of the SCG network parameters $\bm{\varphi}$, and $t$ as the global communication round. Additionally, $tE$ denotes the step before global semantic centroid generation, and $tE+1/2$ represents the step between global semantic centroid generation and the first iteration of round $t$.

Theorem 1

(One-round deviation bound). Let Assumptions 1 to 4 hold. For an arbitrary client, the loss after each communication round satisfies

$$\begin{aligned}\mathbb{E}[\mathcal{L}_{(t+1)E+1/2}]&\leq\mathcal{L}_{tE+1/2}-\Big(\eta-\frac{L_{1}\eta^{2}}{2}\Big)\sum_{e=1/2}^{E-1}\|\nabla\mathcal{L}_{tE+e}\|_{2}^{2}\\&\quad+\frac{L_{1}E\eta^{2}}{2}\rho^{2}+\lambda L_{2}\eta E^{\prime}G.\end{aligned} \tag{18}$$

Theorem 1 gives a deviation bound on the loss function of an arbitrary client after each communication round. As observed in (18), the second term on the right-hand side is non-positive whenever $0<\eta<2/L_{1}$. By tuning the values of $\lambda$ and $\eta$ according to the non-IID degree (for example, $\lambda=0$ is equivalent to vanilla FL with IID data distribution), a specific one-round decrease is obtained that guarantees a monotonic decrease over all communication rounds, thereby ensuring the convergence of FedCL.

Footnote 1: The complete proof is available at https://github.com/wangyining98/FedCL
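To make the roles of $\eta$ and $\lambda$ in the one-round bound of Theorem 1 concrete, the sketch below evaluates its right-hand side term by term. It is an illustration only: the function names and every numerical constant ($L_{1}$, $L_{2}$, $\rho^{2}$, $G$, the gradient norms) are hypothetical placeholders, not values reported in this letter.

```python
def descent_coefficient(eta, L1):
    """Coefficient (eta - L1 * eta^2 / 2) of the gradient-norm sum.

    It is positive exactly when 0 < eta < 2 / L1.
    """
    return eta - 0.5 * L1 * eta ** 2


def one_round_bound(loss_prev, grad_sq_norms, eta, L1, rho_sq, lam, L2, E_prime, G):
    """Right-hand side of the one-round bound in Theorem 1.

    grad_sq_norms holds the E squared gradient norms for e = 1/2, 1, ..., E - 1.
    """
    E = len(grad_sq_norms)
    descent = descent_coefficient(eta, L1) * sum(grad_sq_norms)
    variance_term = 0.5 * L1 * E * eta ** 2 * rho_sq   # stochastic-gradient noise
    centroid_term = lam * L2 * eta * E_prime * G       # drift caused by SCG updates
    return loss_prev - descent + variance_term + centroid_term


# Hypothetical constants, chosen only to show how the terms trade off.
grad_sq_norms = [4.0] * 5
common = dict(eta=0.1, L1=10.0, rho_sq=0.2, L2=1.0, E_prime=2, G=0.5)

bound_without_centroids = one_round_bound(2.0, grad_sq_norms, lam=0.0, **common)
bound_with_centroids = one_round_bound(2.0, grad_sq_norms, lam=1.0, **common)
```

With these placeholder values the bound stays below the previous loss in both cases. A larger $\lambda$ or $E^{\prime}$ adds to the centroid term, so the gradient-norm sum must be larger for the bound to certify a decrease.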

III Simulation Results

Figure 2: (a) Learning performance of different schemes vs. $K$; (b) Task performance under different SNR with 20 clients.
Figure 3: (a) Learning performance under varying data semantic heterogeneity $m$ with 5 clients; (b) Task performance of different schemes with varying $m$.

To verify the effectiveness of the FedCL framework, we compare the proposed scheme with the conventional federated learning frameworks FedAvg [14] and FedProto [12] on the CIFAR-10 dataset. The semantic encoders are designed as DCGAN-like encoders [15] with an output dimension of 64. The SCG is constructed from two fully connected layers with 64 outputs, and the semantic decoder is designed as a fully connected image classification network with 10 output units. The learning rates $\eta_{k}$ and $\eta$ are set to $0.001$ for both clients and server.

We divide the client dataset as $m$-way $q$-shot, where $m$ determines the number of classes on each client and $q$ determines the number of data samples per class. We randomly set the classes possessed by each client and change the value of $m$ to adjust the heterogeneity degree of data distribution.
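The $m$-way $q$-shot split described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the partitioning code used for the reported results: the function name `m_way_q_shot_split` is hypothetical, classes are drawn uniformly at random per client, and different clients are allowed to draw the same sample.

```python
import random
from collections import defaultdict


def m_way_q_shot_split(labels, num_clients, m, q, seed=0):
    """Give each client q samples from each of m randomly chosen classes.

    labels: list of integer class labels, one per dataset sample.
    Returns a dict mapping client id -> list of sample indices.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    classes = sorted(by_class)
    if m > len(classes):
        raise ValueError("m exceeds the number of available classes")

    split = {}
    for client in range(num_clients):
        chosen_classes = rng.sample(classes, m)   # m distinct classes per client
        indices = []
        for c in chosen_classes:
            if q > len(by_class[c]):
                raise ValueError("q exceeds the number of samples in a class")
            indices.extend(rng.sample(by_class[c], q))  # q distinct samples per class
        split[client] = indices
    return split


# Toy labels standing in for a 10-class dataset such as CIFAR-10.
toy_labels = [i % 10 for i in range(1000)]
toy_split = m_way_q_shot_split(toy_labels, num_clients=5, m=3, q=20)
```

Keeping the product of $m$ and $q$ fixed while varying $m$ holds the per-client sample count constant, so only the class heterogeneity changes.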

Figure 4: The t-SNE visualization of semantic representations obtained by (a) the proposed FedCL framework; (b) FedProto; (c) FedAvg.

Fig. 2(a) compares the learning performance of FedCL with other distributed learning benchmarks. With 20 clients, FedCL outperforms FedProto and FedAvg by $12.75\%$ and $38.55\%$, respectively. FedCL also remains stable with fewer clients, unlike FedProto, which degrades with only 5 clients due to insufficient training samples. Fig. 2(b) examines the impact of channel SNR on model performance across different schemes. At SNR $=9$ dB, FedCL achieves $25.09\%$ and $66.24\%$ higher accuracy than FedProto and FedAvg, respectively. Unlike FedProto and FedAvg, which suffer significant degradation at low SNR, FedCL degrades by only approximately $10\%$. This indicates that trainable global semantic centroids based on contrastive learning provide better regularization for noisy semantic features, which improves robustness against wireless channel noise.
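For readers who want to reproduce an SNR sweep of this kind, the sketch below perturbs a feature vector with additive white Gaussian noise at a target SNR. It assumes a real-valued AWGN channel with noise power set from the measured feature power, which may differ from the channel model defined earlier in this letter; `add_awgn` and `empirical_snr_db` are hypothetical helper names.

```python
import math
import random


def add_awgn(features, snr_db, rng):
    """Add zero-mean Gaussian noise so the signal-to-noise ratio is about snr_db."""
    signal_power = sum(x * x for x in features) / len(features)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise_std = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, noise_std) for x in features]


def empirical_snr_db(clean, noisy):
    """Measure the SNR in dB from a clean vector and its noisy version."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


# Synthetic features standing in for encoder outputs.
feature_rng = random.Random(0)
clean_features = [feature_rng.gauss(0.0, 1.0) for _ in range(20000)]
noisy_features = add_awgn(clean_features, snr_db=9.0, rng=random.Random(1))
measured_snr_db = empirical_snr_db(clean_features, noisy_features)
```

Sweeping `snr_db` over the range of interest and feeding the noisy features to the decoder would give an accuracy-versus-SNR curve of the kind shown in Fig. 2(b).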

Fig. 3(a) illustrates FedCL’s learning performance across varying $m$, reflecting the impact of data semantic heterogeneity on task-oriented SemCom. Note that the total amount of client data remains constant for fair comparison. A smaller $m$ indicates greater semantic heterogeneity, and FedCL demonstrates superior adaptability to such scenarios. The decrease in performance with increasing $m$ may result from insufficient training samples in each individual category. Fig. 3(b) compares task performance across different schemes with varying $m$ when $K=5$ and SNR $=10$ dB. It demonstrates that the proposed scheme notably outperforms the baselines, particularly under highly heterogeneous data distributions.

Fig. 4 displays the distribution of noisy semantic features from models trained under the three schemes, after dimensionality reduction with t-SNE at SNR $=5$ dB. The proposed FedCL yields a unified semantic space in which the noisy semantic features exhibit intra-semantic compactness and inter-semantic discriminability. It preserves clearer semantic boundaries under significant channel noise than the benchmarks without trainable global semantic centroids, suggesting that the contrastive learning method gives it a superior separation capability.

IV Conclusion

In this letter, we proposed the FedCL framework for task-oriented communications, where personalized semantic encoders from multiple clients and a global semantic decoder at the BS were collaboratively learned. Unlike existing strategies that required structurally consistent models for aggregation, the proposed FedCL framework supported heterogeneous client-side semantic encoders. Additionally, we utilized contrastive learning to train the SCG for global semantic centroid generation, which regularized heterogeneous local semantic features into a discriminative global semantic space. A convergence analysis was also provided. Simulation results demonstrated that the proposed FedCL framework enhanced task performance and robustness compared to baseline schemes.

References

  • [1] Z. Lu, R. Li, K. Lu, X. Chen, E. Hossain, Z. Zhao, and H. Zhang, “Semantics-empowered communications: A tutorial-cum-survey,” IEEE Commun. Surv. Tutor., Nov. 2023, early access.
  • [2] H. Xing, H. Zhang, X. Wang, L. Xu, Z. Xiao, B. Zhao, S. Luo, L. Feng, and Y. Dai, “A multi-user deep semantic communication system based on federated learning with dynamic model aggregation,” in IEEE ICC Workshops, 2023, pp. 1612–1616.
  • [3] L. X. Nguyen, H. Q. Le, Y. L. Tun, P. Sone Aung, Y. Kyaw Tun, Z. Han, and C. S. Hong, “An efficient federated learning framework for training semantic communication system,” arXiv e-prints, Oct. 2023.
  • [4] H. Wei, W. Ni, W. Xu, F. Wang, D. Niyato, and P. Zhang, “Federated semantic learning driven by information bottleneck for task-oriented communications,” IEEE Commun. Lett., vol. 27, no. 10, pp. 2652–2656, Aug. 2023.
  • [5] R. Ye, Z. Ni, C. Xu, J. Wang, S. Chen, and Y. C. Eldar, “FedFM: Anchor-based feature matching for data heterogeneity in federated learning,” IEEE Trans. Signal Process., vol. 71, pp. 4224–4239, Oct. 2023.
  • [6] C. T. Dinh, N. Tran, and J. Nguyen, “Personalized federated learning with Moreau envelopes,” Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, pp. 21 394–21 405, 2020.
  • [7] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, pp. 18 661–18 673, 2020.
  • [8] C. Jing, Y. Huang, Y. Zhuang, L. Sun, Z. Xiao, Y. Huang, and X. Ding, “Exploring personalization via federated representation learning on non-IID data,” Neural Netw., vol. 163, pp. 354–366, Jun. 2023.
  • [9] Y. Tan, G. Long, J. Ma, L. Liu, T. Zhou, and J. Jiang, “Federated learning from pre-trained models: A contrastive learning approach,” Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 35, pp. 19 332–19 344, 2022.
  • [10] Z. Lyu, G. Zhu, J. Xu, B. Ai, and S. Cui, “Semantic communications for image recovery and classification via deep joint source and channel coding,” IEEE Trans. Wirel. Commun., Jan. 2024, early access.
  • [11] Z. Chen, W. Yi, Y. Liu, and A. Nallanathan, “Knowledge-aided federated learning for energy-limited wireless networks,” IEEE Trans. Commun., Mar. 2023.
  • [12] Y. Tan, G. Long, L. Liu, T. Zhou, Q. Lu, J. Jiang, and C. Zhang, “FedProto: Federated prototype learning across heterogeneous clients,” in Proc. AAAI Conf. Artif. Intell., vol. 36, no. 8, 2022, pp. 8432–8440.
  • [13] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 1597–1607.
  • [14] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. Int. Conf. Artif. Intell. Stat., 2017, pp. 1273–1282.
  • [15] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, Nov. 2015.