Visual Language Model based Cross-modal Semantic Communication Systems

Feibo Jiang, Member, IEEE, Chuanguo Tang, Li Dong, Kezhi Wang, Senior Member, IEEE, Kun Yang, Fellow, IEEE, and Cunhua Pan, Senior Member, IEEE

Feibo Jiang is with Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, China. Chuanguo Tang is with School of Information Science and Engineering, Hunan Normal University, Changsha, China. Li Dong is with Changsha Social Laboratory of Artificial Intelligence, Hunan University of Technology and Business, Changsha, China. Kezhi Wang is with the Department of Computer Science, Brunel University London, UK. Kun Yang is with the School of Computer Science and Electronic Engineering, University of Essex, Colchester, CO4 3SQ, U.K., also with Changchun Institute of Technology. Cunhua Pan is with the National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China.

Abstract

Semantic Communication (SC) has emerged as a novel communication paradigm in recent years, successfully transcending the Shannon physical capacity limits through innovative semantic transmission concepts. Nevertheless, extant Image Semantic Communication (ISC) systems face several challenges in dynamic environments, including low semantic density, catastrophic forgetting, and uncertain Signal-to-Noise Ratio (SNR). To address these challenges, we propose a novel Vision-Language Model-based Cross-modal Semantic Communication (VLM-CSC) system. The VLM-CSC comprises three novel components: (1) Cross-modal Knowledge Base (CKB) is used to extract high-density textual semantics from the semantically sparse image at the transmitter and reconstruct the original image based on textual semantics at the receiver. The transmission of high-density semantics contributes to alleviating bandwidth pressure. (2) Memory-assisted Encoder and Decoder (MED) employ a hybrid long/short-term memory mechanism, enabling the semantic encoder and decoder to overcome catastrophic forgetting in dynamic environments when there is a drift in the distribution of semantic features. (3) Noise Attention Module (NAM) employs attention mechanisms to adaptively adjust the semantic coding and the channel coding based on SNR, ensuring the robustness of the CSC system. The experimental simulations validate the effectiveness, adaptability, and robustness of the CSC system.

Index Terms:
Semantic communication, knowledge base, vision language model, large language model, continual learning.

I Introduction

As mobile communication technology has evolved from the first generation to the fifth generation, transmission rates have increased significantly, and system capacities are approaching their limits [1]. In recent years, various emerging applications, such as the metaverse and virtual reality, have introduced substantial data streams [2]. Furthermore, these applications necessitate extensive connectivity over limited spectrum resources while demanding lower latency, posing significant challenges to conventional source-channel coding. Semantic Communication (SC) operates in the semantic domain by extracting the inherent meaning of data, eliminating redundant information, and achieving data compression while preserving its essential semantic content [3].

With the rapid development of deep learning, many researchers have begun to explore end-to-end Image Semantic Communication (ISC) systems based on deep neural networks. For instance, ISC systems constructed using deep learning approaches such as Convolutional Neural Networks (CNN), Vision Transformers (ViT), and others have surpassed traditional solutions. Despite the significant achievements in the research of ISC based on deep learning, there remain some challenges:

1) Low semantic density

Images are natural signals with heavy spatial redundancy [4]. Traditional ISC systems directly encode the entire image, focusing on extracting low-level semantic information at the pixel level. However, text is a human-invented signal that possesses high semantic and information density. Summarizing image information through text can surpass the low-level pixel-level semantics and achieve a more sophisticated high-level semantic understanding of objects and scenarios. Moreover, traditional ISC systems lack the ability to leverage the interpretability of knowledge bases (KBs), resulting in a black-box model based on deep learning for the semantic encoder and decoder with limited explainability of semantics.

2) Catastrophic forgetting

ISC systems often operate in dynamic environments, leading to a drift in the feature distribution of transmitted image data and channel state over time. Consequently, the real data distribution becomes inconsistent with the distribution during training, resulting in a decline in the performance of the semantic encoder and decoder. Continual learning of the semantic encoder and decoder is necessary to improve the performance of the ISC system. However, during continual learning, the existing knowledge of the encoder and decoder may be disrupted or overwritten by new knowledge, leading to catastrophic forgetting in the learning process [5]. As a result, the ISC system becomes unable to adapt to semantic transmission in dynamic environments.

3) Uncertain Signal-to-Noise Ratio (SNR)

In wireless communications, traditional deep learning-based ISC systems typically consider a few discrete SNR conditions during the training phase, which cannot cover all possible SNR scenarios. As a result, the performance may severely degrade when there is a mismatch between the channel conditions during training and inference phases [6]. Training the semantic/channel encoder and decoder with consideration for multiple SNR conditions and performing switching based on specific SNR values during the inference phase can lead to substantial storage and computational overhead [7].

Vision Language Models (VLMs) with billions of parameters represent the latest advancements in the field of large AI models. Through extensive pre-training on vast amounts of data, these VLMs acquire rich language and visual knowledge, leading to significant breakthroughs in areas such as natural language processing and computer vision [8]. In ISC systems, VLMs demonstrate immense potential. Leveraging their capabilities in understanding and generating textual and visual content, VLMs enable more accurate semantic comprehension and semantic feature extraction, thereby offering a more intelligent and efficient ISC experience. Therefore, we propose a novel VLM-based Cross-modal Semantic Communication (VLM-CSC) system to address the aforementioned challenges in ISC systems. Our contributions can be summarized as follows:

1) Cross-modal Knowledge Base (CKB)

We introduce a CKB, which consists of a Bootstrapping Language-Image Pre-Training (BLIP)-based KB at the transmitter for generating high-quality text descriptions consistent with images, and a Stable Diffusion (SD)-based KB at the receiver for reconstructing images matching the text descriptions. The text descriptions can be regarded as the extraction of high-level semantics from the images with low-level pixels, thereby enhancing the semantic density of the transmitted information. Additionally, these descriptions enable users to understand the extracted semantic content, thereby enhancing the explainability of the CSC system.

2) Memory-assisted Encoder and Decoder (MED)

We employ a MED to track changes in dynamic environments while avoiding catastrophic forgetting during the learning process. Specifically, we design a storage pool consisting of two types of memory: Short-Term Memory (STM) and Long-Term Memory (LTM). The STM is used to store the new data from the current environment, while the LTM stores historically significant data from previously encountered distributions. When training the CSC system, we input data from both the STM and LTM. This enables the semantic encoder and decoder to review all the knowledge from previously trained data with different distributions while learning from the new data. As a result, the CSC system can acquire encoding and decoding capabilities for the new data distribution without significantly compromising its performance on the previously trained data distribution, thus avoiding catastrophic forgetting.

3) Noise Attention Module (NAM)

We present a NAM to dynamically adjust semantic coding and channel coding based on different SNR conditions. Specifically, after each encoder and decoder layer, we employ an attention module to adjust the weights for different encoders and decoders according to the SNR values provided by the channel feedback. When the SNR is high, the NAM allocates higher weights to the semantic encoder and decoder to improve the encoding and decoding quality of the semantic features. Conversely, when the SNR is low, the NAM assigns higher weights to the channel encoder and decoder, strengthening the channel coding to combat the intense channel noise. This design ensures that the semantic features maintain high robustness under varying SNR conditions.
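
To make the idea of SNR-dependent weighting concrete, the following is a minimal sketch and not the NAM architecture itself. It assumes a simple sigmoid gate computed from the mean of a feature vector and the SNR fed back by the channel. The function name `nam_gate` and all weight values are hypothetical and hand-picked for illustration; in a trained system such weights would be learned.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nam_gate(features, snr_db, w_ctx, w_snr, bias):
    """Scale each feature by a gate in (0, 1) that depends on the SNR.

    Assumed form (not taken from the paper): gate_i = sigmoid(
    w_ctx[i] * mean(features) + w_snr[i] * snr_db + bias[i]).
    """
    context = sum(features) / len(features)
    gates = [sigmoid(wc * context + ws * snr_db + b)
             for wc, ws, b in zip(w_ctx, w_snr, bias)]
    return [g * f for g, f in zip(gates, features)]


features = [1.0, 2.0, 3.0]

# Hypothetical semantic-side gate: a positive SNR weight raises the gate
# as the channel gets cleaner.
semantic_low = nam_gate(features, 0.0, [0.1] * 3, [0.2] * 3, [0.0] * 3)
semantic_high = nam_gate(features, 20.0, [0.1] * 3, [0.2] * 3, [0.0] * 3)

# Hypothetical channel-side gate: a negative SNR weight raises the gate
# as the channel gets noisier.
channel_low = nam_gate(features, 0.0, [0.1] * 3, [-0.2] * 3, [0.0] * 3)
channel_high = nam_gate(features, 20.0, [0.1] * 3, [-0.2] * 3, [0.0] * 3)
```

With these illustrative weights, the semantic-side features are emphasized at high SNR and the channel-side features at low SNR, which mirrors the qualitative behavior described above.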

The rest of this paper is structured as follows. Section II presents the related work, Section III introduces the system model, Section IV provides a detailed description of the proposed VLM-CSC system, Section V outlines the experimental setup and results, and Section VI concludes the paper.

II Related work

II-A Deep learning enabled ISC systems

Deep learning techniques are commonly employed in the construction of encoders and decoders for ISC systems. In [9], a comprehensive SC system based on CNNs was initially introduced, showcasing superior performance in Peak Signal-to-Noise Ratio (PSNR) when compared to traditional compression algorithms. In [10], a novel Nonlinear Transform Source-Channel Coding (NTSCC) for SC systems was proposed, which leveraged a Variational AutoEncoder (VAE) to map the source signal to the latent space, and executed nonlinear transformation and channel coding in the space. Additionally, [11] presented an innovative SC system incorporating Semantic Slice-Models (SeSM) to facilitate adaptable model resemblance under diverse requirements. Furthermore, [12] introduced a Reinforcement Learning-based Adaptive Semantic Coding (RL-ASC) for image data. RL-ASC utilized a combination of VAE, RL, and generative adversarial networks (GANs) to encode, allocate, and decode semantic concepts.

Although convolutional and ViT-based autoencoders have shown promising results, their feature extraction capabilities are limited compared to state-of-the-art VLMs. This limitation arises from constraints posed by model parameters and the availability of training data.

II-B Vision language models

VLMs are a class of large AI models capable of simultaneously processing both image and text information [13]. They find extensive application across various visual language tasks, encompassing image description, visual question answering, text-to-image generation, and other multimodal tasks. In [14], a contrastive loss function was utilized to train both image encoders and text encoders. This loss function aimed to minimize the feature space distance between matching image-text pairs, enabling the learning of semantically relevant visual language features while reducing the dependence on large amounts of annotated data. In [15], images were treated as prefixes in language models. They were decomposed into multiple blocks, concatenated with text sequences as input, and used to predict the subsequent parts of the text sequences. Furthermore, in [16], a cross-attention mechanism was employed to integrate visual and language features. This mechanism allowed the two modalities to reference and enhance each other, facilitating the learning of more comprehensive and refined visual language features. The approach demonstrated applicability to various downstream tasks.

VLMs aim to understand the correlation between images and text, enabling accurate visual description or image generation. Future research involves deep integration of self-supervised pre-training techniques and VLMs. This integration will help extract cross-modal relationships between visual and language features, providing a stronger foundation for downstream tasks.

II-C Continual learning

Continual learning can effectively mitigate the problem of catastrophic forgetting in dynamic environments. In [17], the authors discuss continual learning in Mobile Edge Computing (MEC) networks, focusing on age-aware optimization for data selection and aggregator placement. They also present a prototype implementation involving diverse user equipment and cloudlets. In [18], the authors propose a continual learning digital predistortion algorithm for linearizing radio frequency power amplifiers in 6G wireless communications. The algorithm demonstrates effectiveness in adapting to both new and known operating states with low long-term complexity. In [19], the authors address the challenge of forgetting tasks in cross-edge federated learning by preserving past knowledge through continual learning. They achieve enhanced accuracy across various tasks with minimal storage cost. Furthermore, in [20], the authors employ continual learning to enable adaptive downlink beamforming optimization in dynamic environments. The proposed approach addresses task mismatch and exhibits good adaptability with low complexity.

Recent advancements in continual learning have been directed towards more challenging scenarios, specifically those where task boundaries are unknown. In these contexts, researchers have focused on developing sample selection strategies to identify which samples should be stored in the buffer for model training. This approach aims to improve the efficiency and effectiveness of continual learning in handling unknown task boundaries.

Figure 1: The system model of the CSC.

III System model and problem formulation

The considered CSC system consists of three components: a transmitter, a receiver, and a physical channel, as illustrated in Fig. 1. The physical channel ensures the correct exchange of semantic information over the transmission medium with dynamic SNR.

III-A Transmitter

The input to the transmitter is an image represented by the matrix $\bm{\mathrm{x}}\in\mathbb{R}^{H\times W\times C}$, whose size is $H$ (height) $\times$ $W$ (width) $\times$ $C$ (channels). In the transmitter, the input image $\bm{\mathrm{x}}$ is mapped to symbols $\bm{\mathrm{y}}$ for transmission over the physical channel. The transmitter consists of three independent components: a CKB for cross-modal semantic extraction, a semantic encoder, and a channel encoder. The CKB is used to extract semantic information from the image and represent it as the corresponding textual information. The semantic encoder performs semantic coding, and the channel encoder performs channel coding and modulation, ensuring that the encoded semantic information can be transmitted reliably over the physical channel. The encoded symbol sequence $\bm{\mathrm{y}}$ can be represented as:

$$\bm{\mathrm{y}}=C_{\beta}(S_{\alpha}(K_{\theta}(\bm{\mathrm{x}}),\mu),\mu) \qquad (1)$$

where $K_{\theta}(\cdot)$ is the CKB with the parameter set $\theta$, $S_{\alpha}(\cdot)$ is the semantic encoder with the parameter set $\alpha$, $C_{\beta}(\cdot)$ is the channel encoder with the parameter set $\beta$, and $\mu$ is the channel SNR, which can be estimated and fed back to the semantic encoder and channel encoder.

III-B Wireless channel

The transmitter sends the encoded symbols $\bm{\mathrm{y}}$, which are transmitted through the physical channel to the receiver. The channel output sequence $\bm{\mathrm{\hat{y}}}$ at the receiver can be expressed as:

$$\bm{\mathrm{\hat{y}}}=\bm{\mathrm{h}}\bm{\mathrm{y}}+\bm{\mathrm{n}} \qquad (2)$$

where $\bm{\mathrm{h}}$ represents the channel gain, and $\bm{\mathrm{n}}$ is Additive White Gaussian Noise (AWGN).
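
The channel model in Eq. (2) can be simulated directly. The sketch below is illustrative only. It assumes complex baseband symbols, a single scalar channel gain `h` shared by all symbols, and an SNR defined as received signal power over noise power; the function name `awgn_channel` and the QPSK test symbols are our own choices and are not specified in the text.

```python
import math
import random


def awgn_channel(y, snr_db, h=1.0 + 0.0j, rng=None):
    """Return h * y + n for a list of complex symbols y.

    Assumption: SNR (in dB) is the ratio of the received signal power
    mean(|h * y|^2) to the total complex noise power.
    """
    rng = rng or random.Random()
    signal_power = sum(abs(h * s) ** 2 for s in y) / len(y)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power / 2.0)  # std per real dimension
    return [h * s + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for s in y]


# Usage: pass unit-power QPSK symbols through a channel at 10 dB SNR.
rng = random.Random(0)
symbols = [complex(rng.choice((-1, 1)), rng.choice((-1, 1))) / math.sqrt(2)
           for _ in range(20000)]
h = 0.8 + 0.6j  # hypothetical channel gain with |h| = 1
received = awgn_channel(symbols, snr_db=10.0, h=h, rng=rng)
```

Sweeping `snr_db` over a range of values during training or evaluation is one simple way to expose an encoder and decoder to the dynamic SNR assumed in the system model.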

III-C Receiver

Similar to the transmitter, the receiver consists of three components: a channel decoder, a semantic decoder, and a cross-modal knowledge base for semantic reconstruction. The semantic decoder and channel decoder are used to decode textual information from received symbols, while the cross-modal knowledge base is employed for image reconstruction based on the corresponding textual information. The decoded image can be represented as:

$$\bm{\mathrm{\hat{x}}}=K_{\theta^{\prime}}^{-1}(S_{\delta}^{-1}(C_{\gamma}^{-1}(\bm{\mathrm{\hat{y}}},\mu),\mu)) \qquad (3)$$

where $C_{\gamma}^{-1}(\cdot)$ is the channel decoder with the parameter set $\gamma$, $S_{\delta}^{-1}(\cdot)$ is the semantic decoder with the parameter set $\delta$, and $K_{\theta^{\prime}}^{-1}(\cdot)$ is the cross-modal knowledge base with the parameter set $\theta^{\prime}$.

For the purpose of reconstructing image information from the semantic level, maintaining the consistency of textual semantics between $\bm{\mathrm{s}}$ and $\bm{\mathrm{\hat{s}}}$ is crucial. Here, $\bm{\mathrm{s}}=K_{\theta}(\bm{\mathrm{x}})$ represents the extracted textual semantic information from the image, and $\bm{\mathrm{\hat{s}}}=S_{\delta}^{-1}(C_{\gamma}^{-1}(\bm{\mathrm{\hat{y}}},\mu),\mu)$ represents the recovered textual semantic information after decoding. We utilize Cross-Entropy (CE) as the loss function:

$$L_{CE}(\mathbf{s},\bm{\mathrm{\hat{s}}})=-\sum_{l=1}^{L}\left[q(w_{l})\log(p(w_{l}))+(1-q(w_{l}))\log(1-p(w_{l}))\right] \qquad (4)$$

where $L$ is the number of words in the sentence, $q(w_{l})$ denotes the real probability of the appearance of the $l$-th word $w_{l}$ in the sentence $\mathbf{s}$, and $p(w_{l})$ represents the predicted probability of the appearance of the $l$-th word $w_{l}$ in the sentence $\bm{\mathrm{\hat{s}}}$. CE is employed to measure the difference between two probability distributions. By minimizing the CE loss, the semantic encoder and decoder can learn the word distribution $q(w_{l})$ in the source sentence $\bm{\mathrm{s}}$, which represents the meaning of words in terms of grammar, phrases, and contextual information. Hence, the goal of the CSC system is to determine the parameters of the semantic/channel encoder and decoder ${\alpha}^{\ast}$, ${\beta}^{\ast}$, ${\delta}^{\ast}$ and ${\gamma}^{\ast}$ that minimize the expected distortion as follows:

$$({\alpha}^{\ast},{\beta}^{\ast},{\delta}^{\ast},{\gamma}^{\ast})=\mathop{\arg\min}\limits_{{\alpha},{\beta},{\delta},{\gamma}}\mathbb{E}_{p(\mu)}\mathbb{E}_{p(\mathbf{s},\bm{\mathrm{\hat{s}}})}[L_{CE}(\mathbf{s},\bm{\mathrm{\hat{s}}})] \qquad (5)$$

where ${\alpha}^{\ast}$ denotes the optimal semantic encoder parameters, ${\beta}^{\ast}$ the optimal channel encoder parameters, ${\gamma}^{\ast}$ the optimal channel decoder parameters, and ${\delta}^{\ast}$ the optimal semantic decoder parameters. $p(\mathbf{s},\bm{\mathrm{\hat{s}}})$ represents the joint probability distribution of $\bm{\mathrm{s}}$ and $\bm{\mathrm{\hat{s}}}$, and $p(\mu)$ represents the probability distribution of the SNR.
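
As a small worked example of the CE loss in Eq. (4), the sketch below evaluates the sum for given lists of real probabilities $q(w_l)$ and predicted probabilities $p(w_l)$. The function name `ce_loss` and the clamping constant `eps` are assumptions added for numerical safety; they are not part of the formulation above.

```python
import math


def ce_loss(q, p, eps=1e-12):
    """Evaluate Eq. (4): the negative sum over words of
    q_l * log(p_l) + (1 - q_l) * log(1 - p_l).

    p is clamped to [eps, 1 - eps] so that log(0) is never taken.
    """
    total = 0.0
    for q_l, p_l in zip(q, p):
        p_l = min(max(p_l, eps), 1.0 - eps)
        total += q_l * math.log(p_l) + (1.0 - q_l) * math.log(1.0 - p_l)
    return -total


# An uninformative prediction (p = 0.5 for every word) gives log(2) per word.
uninformative = ce_loss([1.0, 0.0], [0.5, 0.5])

# A confident, correct prediction drives the loss towards zero.
confident = ce_loss([1.0, 0.0], [0.99, 0.01])
```

The outer expectations in Eq. (5) would then be approximated in practice by averaging this loss over training sentences and over SNR values drawn from $p(\mu)$.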

IV The VLM-CSC system

Compared to traditional KBs based on Deep Neural Networks (DNNs), Knowledge Graphs (KGs), and other approaches, utilizing VLMs to construct KBs has several advantages: (1) VLMs are large AI models with billions of parameters and powerful cognitive abilities concerning world knowledge. They excel in tasks related to understanding, expressing, and generating both visual and natural language data from the semantic level. (2) Unlike traditional methods that rely on manual rules or structure definitions to describe knowledge, VLMs have the ability to automatically learn and extract knowledge from data. This enables them to generate appropriate semantic information, reducing the risk of information loss or ambiguity. (3) In SC systems, the process of understanding and interpreting the generated results is crucial. VLMs have the ability to generate semantic information in a manner that is understandable to humans, enabling both parties in communication to have a more accurate understanding and interpretation of each other’s intentions and expressions.

In this section, we will provide the implementation details of the proposed VLM-CSC system, which is illustrated in Fig. 2 as follows:

Figure 2: The proposed VLM-CSC system.

IV-1 Textual semantic extraction

To enhance the semantic density and interpretability of SC, a VLM called BLIP is employed at the transmitter to construct the CKB. The CKB encompasses a series of visual and language-related knowledge components. We employ the image encoder and text decoder from this CKB to perform cross-modal semantic extraction, thereby transforming the original image with low semantic density into a corresponding textual description with high semantic density. For example, through cross-modal semantic extraction, the original image in Fig. 2 is transformed to the textual description "A fire is burning on a beach near the water".

IV-2 Semantic encoder and decoder

The generated textual information from the CKB then proceeds to the semantic encoder. The semantic encoder consists of alternating transformer encoder layers and NAMs. The transformer encoder layers analyze and transform the textual information into a compact semantic representation. NAMs allow the semantic encoder to optimize the encoding process and maintain reliable semantic transmission, even in the presence of varying channel conditions. At the receiver, the semantic decoder is composed of alternating transformer decoder layers and NAMs, with a structure opposite to that of the semantic encoder, aimed at reversing the semantic encoding process to recover the original textual information.

IV-3 Channel encoder and decoder

The encoded semantic features are passed through the channel encoder to undergo channel encoding and modulation, ensuring the effective transmission of semantic information over the physical channel. Like the semantic encoder, the channel encoder alternates between layers and NAMs; here the layers are FeedForward (FF) layers. At the receiver, the information transmitted through the physical channel is received and decoded by the channel decoder. To maintain information consistency, the channel decoder employs a structure opposite to that of the channel encoder.

IV-4 Image reconstruction

To facilitate a better understanding of the received textual information, we design a CKB for image reconstruction using a VLM called SD. The CKB encompasses a series of visual and language-related knowledge components. We employ the text encoder, the denoising U-Net and the image decoder from this CKB to perform image reconstruction. Specifically, the textual information is first transformed into a conditional vector by the text encoder. Then, the denoising U-Net transforms the noisy image to a latent image feature vector aligning with the conditional vector. Finally, the latent image feature vector is processed by the image decoder to generate the final reconstructed image.

IV-5 Memory-assisted continual learning

During the training phase of the VLM-CSC system, the latest samples are stored in an STM. When the STM becomes full, a kernel method is employed to select representative short-term samples to be transferred to an LTM. Then, the STM is emptied to buffer new samples in the next round. The encoder and decoder sample from both STM and LTM during the training stage, thereby avoiding catastrophic forgetting. This approach ensures that the semantic encoder and decoder can access both recent and past information, allowing for continual learning and retention of previously learned knowledge.
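
The STM/LTM mechanism described above can be sketched as a small buffer class. This is a minimal illustration under stated assumptions: samples are numeric feature vectors, and the unspecified "kernel method" is replaced by a simple rule that ranks STM samples by their mean RBF-kernel similarity to the rest of the STM. The class name `MemoryPool`, its parameters, and this selection rule are hypothetical and should not be read as the selection procedure used in the VLM-CSC system.

```python
import math
import random


def rbf(a, b, gamma=1.0):
    """RBF kernel between two feature vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


class MemoryPool:
    def __init__(self, stm_capacity, keep_per_flush, gamma=1.0):
        self.stm, self.ltm = [], []
        self.stm_capacity = stm_capacity
        self.keep_per_flush = keep_per_flush
        self.gamma = gamma

    def add(self, sample):
        """Buffer a new sample; flush STM into LTM once STM is full."""
        self.stm.append(sample)
        if len(self.stm) >= self.stm_capacity:
            self._flush()

    def _flush(self):
        # Assumed rule: a sample is "representative" if it is, on average,
        # similar to the other samples currently held in STM.
        n = len(self.stm)
        scores = [sum(rbf(s, t, self.gamma) for t in self.stm) / n
                  for s in self.stm]
        ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
        self.ltm.extend(self.stm[i] for i in ranked[:self.keep_per_flush])
        self.stm = []  # empty STM for the next round

    def training_batch(self, k, rng):
        """Draw a batch from the union of STM and LTM."""
        pool = self.stm + self.ltm
        return rng.sample(pool, min(k, len(pool)))


pool = MemoryPool(stm_capacity=4, keep_per_flush=2)
for sample in [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]:
    pool.add(sample)       # the fourth sample triggers a flush
pool.add((1.0, 1.0))       # goes into the freshly emptied STM
batch = pool.training_batch(3, random.Random(0))
```

Because each training batch mixes recent STM samples with retained LTM samples, the encoder and decoder keep seeing data from earlier distributions while adapting to the current one.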

IV-6 Training process of the VLM-CSC system

Remarkably, BLIP and SD-based CKBs are pretrained VLMs that do not need to be trained specifically for the CSC system. The training process unfolds as follows:

  • Joint training of channel encoder and decoder with NAMs: The channel encoder/decoder and NAMs are first trained together using the MED. Their parameters are optimized by maximizing the mutual information between the channel input and output, which mitigates the effects of noise and fading during transmission and limits signal distortion [21]. The parameters of the channel encoder/decoder and NAMs are then frozen so that their learned representations are preserved in subsequent training steps.

  • Joint training of semantic encoder and decoder with NAMs: The semantic encoder/decoder and NAMs are then trained by MED. The focus is on optimizing the parameters of these modules to minimize the loss between the original textual information and the reconstructed textual information. Eq. (4) can be applied as the loss function. Then, the parameters of the semantic encoder/decoder and NAMs are frozen to maintain the learned semantic representations.

  • Crossover-based iterative training: The training process alternates between the channel encoder/decoder with their NAMs and the semantic encoder/decoder with their NAMs. This alternation continues until the entire VLM-CSC system converges.
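
The alternating schedule in the steps above can be illustrated with a toy example. The sketch below is only an analogy: the two "modules" are single scalar parameters, the loss is a hypothetical coupled quadratic, and each stage takes plain gradient steps on its own parameter while the other is held fixed. None of these choices come from the VLM-CSC system; only the freeze-and-alternate structure is being illustrated.

```python
def alternating_training(train_channel, train_semantic, loss_fn,
                         tol=1e-12, max_rounds=100):
    """Alternate two training stages until the loss stops changing."""
    previous = loss_fn()
    for round_idx in range(1, max_rounds + 1):
        train_channel()    # semantic parameters are held fixed
        train_semantic()   # channel parameters are held fixed
        current = loss_fn()
        if abs(previous - current) < tol:
            return round_idx, current
        previous = current
    return max_rounds, previous


# Toy stand-ins for the two parameter groups (hypothetical).
params = {"channel": 0.0, "semantic": 0.0}


def toy_loss():
    c, s = params["channel"], params["semantic"]
    return (c - 1.0) ** 2 + (s + 2.0) ** 2 + 0.5 * c * s


def train_channel(steps=20, lr=0.25):
    for _ in range(steps):
        grad = 2.0 * (params["channel"] - 1.0) + 0.5 * params["semantic"]
        params["channel"] -= lr * grad


def train_semantic(steps=20, lr=0.25):
    for _ in range(steps):
        grad = 2.0 * (params["semantic"] + 2.0) + 0.5 * params["channel"]
        params["semantic"] -= lr * grad


rounds, final_loss = alternating_training(train_channel, train_semantic, toy_loss)
```

In this toy problem the coupling term makes each parameter's optimum depend on the other, which is why a single pass over the two stages is not enough and the stages are repeated until the loss settles.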

Next, we will provide a detailed explanation of each contribution in this paper.

IV-A BLIP-based CKB for semantic extraction

The BLIP model, introduced by Salesforce AI Research, is a sophisticated VLM designed for understanding and generating content that involves both visual and textual elements [22]. The BLIP model possesses rich visual-linguistic knowledge and utilizes multiple knowledge components, such as text encoders, image encoders, and image-grounded text encoders and decoders, to perform various visual-linguistic tasks, such as image captioning, visual question answering, and multimodal classification. At the transmitter, we employ the BLIP model to construct the CKB and utilize the image encoder and image-grounded text decoder (abbreviated as text decoder) in the CKB to transform original image data into detailed textual descriptions containing image semantic information. The workflow of the BLIP-based CKB is illustrated in Fig. 3.

Figure 3: The architecture of BLIP-based CKB.

For a given image $\bm{\mathrm{x}}$, the process of extracting semantic information from image data and generating textual representation $\bm{\mathrm{s}}$ is as follows:

IV-A1 Image encoder

The image encoder incorporates a feature extraction module based on the ViT. This module divides the input image into smaller patches and encodes each patch. Through multiple encoder layers with Multi-head Self-Attention (MSA) and FF sublayers [23], these patch vectors are processed to generate a feature representation of the image, i.e., the image features.
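
The patch division step can be illustrated in a few lines. The sketch below assumes a square, non-overlapping patch grid and an image stored as nested lists of shape $H\times W\times C$; the function name `patchify` is hypothetical, and the learned linear projection and position embeddings that a ViT applies to each flattened patch are omitted.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches.

    Patches are non-overlapping, patch x patch in size, and returned in
    row-major order. H and W must be divisible by the patch size.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = [value
                   for r in range(top, top + patch)
                   for c in range(left, left + patch)
                   for value in image[r][c]]
            patches.append(vec)
    return patches


# Usage: a 4 x 4 single-channel image split into four 2 x 2 patches.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
```

Each flattened patch has `patch * patch * C` entries, so the sequence length fed to the encoder layers is `(H / patch) * (W / patch)`.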

Initially, the image $\mathbf{x}$ is segmented into a patch sequence $\mathbf{x}_{p}$. Each patch represents a fixed-size image region, as shown in Fig. 3. Subsequently, the patch sequence is fed into the image encoder to extract visual features from the image. The specific workflow of the image encoder is as follows:

  • MSA sublayer: The MSA layer allows the vector of each patch to interact with the vectors of all other patches, capturing both global and local information in the image. The output of the MSA layer in the first image encoder layer can be calculated as follows:

    $$\mathbf{m}_{msa,1}=\mathrm{MSA}(\mathrm{LN}(\mathbf{x}_{p}))+\mathbf{x}_{p} \tag{6}$$

    where $\mathbf{x}_{p}$ is the patch sequence, $\mathrm{MSA}$ is the multi-head self-attention operator [23], and $\mathrm{LN}$ is the layer normalization operator in ViT [23].

  • FF sublayer: The FF layer comprises linear layers and activation functions, facilitating non-linear transformations of the vector of each patch to enhance the model's adaptability. The output of the FF layer in the first image encoder layer is

    $$\mathbf{m}_{ff,1}=\mathrm{GeLU}(\mathbf{W}_{b,f}\cdot\mathrm{LN}(\mathbf{m}_{msa,1})+\mathbf{b}_{b,f})+\mathbf{m}_{msa,1} \tag{7}$$

    where $\mathbf{W}_{b,f}$ and $\mathbf{b}_{b,f}$ are the weights and biases of the FF layer in the image encoder of the BLIP model, and GeLU denotes the activation function.

Finally, the output of the image encoder with $L$ encoder layers is

$$\mathbf{m}_{L}=\mathrm{LN}(\mathbf{m}_{ff,L}) \tag{8}$$

where $\mathbf{m}_{ff,L}$ denotes the output of the $L$-th encoder layer.
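The pre-normalization residual pattern of Eqs. (6)-(8) can be sketched in plain Python as follows. This is a minimal illustration, not the BLIP implementation: the attention is a single head with identity projections (a simplification of MSA), the FF sublayer is a single linear map, and the function names and toy shapes are assumptions made for the example.

```python
import math


def layer_norm(x, eps=1e-5):
    """LN over one token vector (no learned scale or shift in this sketch)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(v):
    # Exact GeLU expressed with the Gaussian error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def self_attention(tokens):
    """Single-head self-attention with identity projections (toy stand-in for MSA)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)
        weights = [math.exp(sc - top) for sc in scores]
        total = sum(weights)
        out.append([sum(w / total * tok[j] for w, tok in zip(weights, tokens))
                    for j in range(d)])
    return out


def encoder_layer(x_p, W, b):
    # Eq. (6): m_msa = MSA(LN(x_p)) + x_p
    attn = self_attention([layer_norm(tok) for tok in x_p])
    m_msa = [[a + r for a, r in zip(at, res)] for at, res in zip(attn, x_p)]
    # Eq. (7): m_ff = GeLU(W . LN(m_msa) + b) + m_msa
    m_ff = []
    for tok in m_msa:
        normed = layer_norm(tok)
        lin = [sum(w * n for w, n in zip(row, normed)) + bi for row, bi in zip(W, b)]
        m_ff.append([gelu(l) + r for l, r in zip(lin, tok)])
    return m_ff


def image_encoder(x_p, layers):
    """Stack of encoder layers followed by the final LN of Eq. (8)."""
    h = x_p
    for W, b in layers:
        h = encoder_layer(h, W, b)
    return [layer_norm(tok) for tok in h]


# Toy usage: 3 patch vectors of dimension 4, two layers with identity weights.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
patches = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 1.0, 0.0, -2.0]]
features = image_encoder(patches, [(identity, [0.0] * 4), (identity, [0.0] * 4)])
```

The residual additions keep the patch dimension unchanged from layer to layer, which is why `features` has the same shape as `patches`.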

IV-A2 Text decoder

The text decoder of the BLIP model adopts a BERT structure and is capable of generating image-related textual content, such as descriptions, titles, and dialogues, based on features extracted from images. The text decoder is composed of multiple stacked decoder layers, each comprising three sublayers: Causal Self-Attention (CSA), Cross Attention (CA), and FF sublayers. The specific workflow of the text decoder is as follows:

  • CSA sublayer: CSA is a type of self-attention mechanism that only allows the attention model to access the current and previous inputs, but not the future inputs [24]. To ensure the causality of the textual generation process, the CSA sublayer utilizes a mask matrix to prevent the current token from accessing information from future tokens. Here, a token refers to the basic unit in the text, typically a word or a subword. The output of the CSA sublayer in the first text decoder layer is

    $$\mathbf{k}_{csa,1}=\mathrm{CSA}(\mathrm{LN}(D_{0}))+D_{0} \tag{9}$$

    where $\mathrm{CSA}$ is the causal self-attention operator [24], and $D_{0}$ is the initial token, which is typically set as "[Decoder]" by default.

  • CA sublayer: CA allows the vector of each token to interact with the feature vectors of visual information from the input image [25]. The output of the CA sublayer in the first text decoder layer can be calculated as follows:

    $$\mathbf{k}_{ca,1}=\mathrm{CA}(\mathrm{LN}(\mathbf{k}_{csa,1}),\mathbf{m}_{L})+\mathbf{k}_{csa,1} \tag{10}$$

    where $\mathrm{CA}$ is the cross attention operator [25].

  • FF sublayer: The FF layer comprises linear layers and activation functions. The output of the FF layer in the first text decoder layer is

    $$\mathbf{k}_{ff,1}=\mathrm{ReLU}(\mathbf{W}^{\prime}_{b,f}\cdot\mathrm{LN}(\mathbf{k}_{ca,1})+\mathbf{b}^{\prime}_{b,f})+\mathbf{k}_{ca,1} \tag{11}$$

    where $\mathbf{W}^{\prime}_{b,f}$ and $\mathbf{b}^{\prime}_{b,f}$ are the weights and biases of the FF layer in the text decoder of the BLIP model, and ReLU denotes the activation function.

The final layer of the decoder transforms the output (via a linear projection and a softmax function) to predict the next token in the sequence. The predicted token is then used as an input for the next time step of the generation process, until the final textual description $\mathbf{s}$ of the image is produced.
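The causal mask and the token-by-token generation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: greedy selection of the highest-scoring token is assumed for simplicity (the text does not specify the decoding strategy), and `toy_next_token_scores`, `CAPTION`, and the `[EOS]` end token are hypothetical stand-ins for the real text decoder and its vocabulary.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j, i.e. j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def greedy_decode(next_token_scores, start_token, end_token, max_len=20):
    """Generate tokens one at a time, feeding each prediction back as input."""
    tokens = [start_token]
    while len(tokens) < max_len:
        scores = next_token_scores(tokens)   # maps candidate token -> score
        nxt = max(scores, key=scores.get)    # greedy choice (assumption)
        tokens.append(nxt)
        if nxt == end_token:
            break
    return tokens


# Hypothetical scorer that always prefers the next word of a fixed caption.
CAPTION = ["a", "dog", "on", "grass", "[EOS]"]


def toy_next_token_scores(tokens):
    target = CAPTION[min(len(tokens) - 1, len(CAPTION) - 1)]
    return {tok: (1.0 if tok == target else 0.0) for tok in CAPTION}


mask = causal_mask(3)
description = greedy_decode(toy_next_token_scores, "[Decoder]", "[EOS]")
```

In an actual decoder the scores would come from the softmax output of the final layer, and the mask would be applied inside every CSA sublayer so that position $i$ never sees positions after $i$.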

IV-B SD-based CKB for image reconstruction

The SD model is an elaborate VLM collaboratively developed by Stability AI, which possesses rich visual-linguistic knowledge and is applicable to diverse tasks such as text-to-image and image-to-image generation [26]. At the receiver, we use the SD to construct the CKB and utilize the text-to-image components in the CKB to reconstruct images. The semantic reconstructor is composed of a text encoder, a feature generator, and an image decoder.

Figure 4: The architecture of SD-based CKB.

For a given semantic text $\hat{\mathbf{s}}$, the image reconstruction process through the SD model is illustrated in Fig. 4 and is described as follows:

IV-B1 Text encoder

The text encoder transforms the input text sequence into a semantic vector of fixed dimensions, which serves as a control condition for the image feature generator. The text encoder is composed of multiple stacked encoder layers, each containing two sublayers: MSA and FF. Layer normalization is applied before each sublayer, and a residual connection is added around it. This structure is similar to that of the image encoder in the BLIP model.

The input to the text encoder is the word sequence $\hat{\mathbf{s}}$. Initially, each word is mapped to a fixed-length vector by word embeddings. These word embeddings serve as the input to the text encoder. The encoder iteratively performs MSA and FF operations, ultimately producing a sequence of textual feature vectors.

IV-B2 Feature generator

An initial image feature vector composed of pure noise is input into the image feature generator. Textual feature vectors are injected into the noisy feature vector to guide noise removal. Through multiple iterations, noise is progressively removed, and an image feature vector containing textual information is obtained. The denoising step employs a U-Net structure, which adopts a CNN-based encoder-decoder structure to preserve spatial information while generating image semantic information. The iterative process of the image feature generator can be described by the following formula:

$$\mathbf{Z}_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{Z}_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\overline{\alpha}_{t}}}f_{\theta}(\mathbf{Z}_{t},t,\mathbf{d})\right)+\sigma_{t}\mathbf{Y} \tag{12}$$

where $\mathbf{Z}_{t}$ represents the image feature vector at time step $t$, and $\alpha_{t}$ is a hyperparameter of the forward diffusion process (the variance of the noise added at step $t$ is $1-\alpha_{t}$), with $\overline{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$. $f_{\theta}$ represents the pre-trained noise prediction U-Net, and $\mathbf{d}$ is the textual semantic vector. $\sigma_{t}\mathbf{Y}$ is the random noise term of the reverse diffusion process, where $\sigma_{t}=\sqrt{1-\alpha_{t}}$ and $\mathbf{Y}\sim\mathcal{N}(0,\mathbf{I})$, with $\mathbf{I}$ being the identity matrix.
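A single reverse step of Eq. (12), and the loop that applies it from pure noise down to $t=1$, can be sketched as below. This is an illustrative sketch only: `predict_noise` is a hypothetical placeholder for the pre-trained U-Net $f_{\theta}$, the feature vector is a flat Python list, and the schedule values passed in are arbitrary assumptions.

```python
import math
import random


def reverse_step(z_t, t, alphas, predict_noise, cond, rng):
    """One application of Eq. (12); t is 1-indexed and alphas[t - 1] is alpha_t."""
    alpha_t = alphas[t - 1]
    alpha_bar_t = math.prod(alphas[:t])          # cumulative product up to step t
    sigma_t = math.sqrt(1.0 - alpha_t)
    eps_hat = predict_noise(z_t, t, cond)        # stand-in for f_theta(Z_t, t, d)
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    return [
        (z - coef * e) / math.sqrt(alpha_t) + sigma_t * rng.gauss(0.0, 1.0)
        for z, e in zip(z_t, eps_hat)
    ]


def sample(dim, alphas, predict_noise, cond, seed=0):
    """Start from pure noise Z_T and iterate Eq. (12) down to t = 1."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(len(alphas), 0, -1):
        z = reverse_step(z, t, alphas, predict_noise, cond, rng)
    return z


def zero_noise_predictor(z_t, t, cond):
    # Hypothetical predictor that always returns zero noise.
    return [0.0] * len(z_t)


latent = sample(4, [0.99] * 10, zero_noise_predictor, cond=None)
```

The text condition enters only through `cond`, mirroring how $\mathbf{d}$ appears as an argument of $f_{\theta}$ in Eq. (12).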

IV-B3 Image decoder

Because running the diffusion operation directly in the pixel space is computationally inefficient, the iterative denoising process is performed in a compressed semantic (feature) space, which significantly improves the efficiency of image processing. Finally, we utilize the decoder of a Variational Autoencoder (VAE) to map the feature data in the semantic space back to the pixel space, reconstructing images that adhere to semantic consistency. As the VAE learns the latent structure of the distribution of a large amount of image data, the decoder can provide more detailed information consistent with the key semantics in the image by employing upsampling and interpolation during the decoding process, thereby enhancing the image quality in the pixel space.

IV-C Memory-assisted encoder and decoder

In dynamic environments, both the distribution of the transmitted contents and the channel states change over time. This requires the CSC system to adjust continuously to new input data and channel states in order to adapt to the evolving data distribution. However, such adjustments update the parameters of the encoder and decoder of the CSC system, potentially causing catastrophic forgetting, in which previously learned parameter values are overwritten or ignored [5]. Hence, naive continual learning can diminish the robustness of the encoder and decoder in the CSC system.

The memory-based learning strategy addresses the catastrophic forgetting problem in continual learning by diversifying the memorized content [27]. We design a MED method with STM and LTM for both the semantic encoder and decoder. The workflow of the MED is as follows:

Figure 5: Memory-assisted encoder and decoder.

We denote $\mathcal{M}_{stm}=\{\mathbf{s}_{i}^{stm}\}_{i=1}^{n_{stm}}$ and $\mathcal{M}_{ltm}=\{\mathbf{s}_{j}^{ltm}\}_{j=1}^{n_{ltm}}$ as the sets of dynamic samples stored in STM and LTM, respectively. $\mathbf{s}_{i}^{stm}$ denotes the $i$-th sample in STM, and $\mathbf{s}_{j}^{ltm}$ denotes the $j$-th sample in LTM. $n_{stm}$ and $n_{ltm}$ denote the current numbers of samples in STM and LTM, respectively. When the STM pool becomes full, it is necessary to select representative samples from it and transfer them to the LTM. Hence, let $n_{stm}^{Max}$ represent the maximum number of samples that can be stored in $\mathcal{M}_{stm}$. The sample selection process is illustrated in Fig. 5 and described as follows:

IV-C1 Relevance evaluation

During the inference phase of the CSC system, new samples being processed are continuously added to the STM. When the number of samples in the STM exceeds the specified maximum, an evaluation action is executed. The primary objective of this stage is to assess the relevance of samples. We evaluate the distance between two samples stored in STM and LTM using a Radial Basis Function (RBF) kernel:

$$\mathrm{RBF}(\mathbf{s}_{i}^{stm},\mathbf{s}_{j}^{ltm})=\exp\left(-\frac{\|\mathbf{v}_{i}^{stm}-\mathbf{v}_{j}^{ltm}\|^{2}}{2\tau^{2}}\right) \tag{13}$$

where $\mathbf{v}_{i}^{stm}$ and $\mathbf{v}_{j}^{ltm}$ are feature vectors extracted by the semantic encoder from samples $\mathbf{s}_{i}^{stm}$ and $\mathbf{s}_{j}^{ltm}$, respectively. $\tau$ is the scale hyperparameter for the kernel function, and we set $\tau=10$ to ensure that the output of $\mathrm{RBF}(\cdot,\cdot)$ is within $[0,1]$. Eq. (13) can be further accelerated through matrix operations, expressed as:

$$\mathbf{S}=\mathrm{F_{exp}}\left(-\left(\mathbf{B}^{stm}(-\mathbf{B}^{ltm})^{\mathrm{T}}\right)\odot\left(\mathbf{B}^{stm}(-\mathbf{B}^{ltm})^{\mathrm{T}}\right)/2\tau^{2}\right) \tag{14}$$

where $\mathbf{B}^{stm}$ and $\mathbf{B}^{ltm}$ are feature matrices corresponding to $\mathcal{M}_{stm}$ and $\mathcal{M}_{ltm}$, respectively. $(\cdot)^{\mathrm{T}}$ and $\odot$ represent the transpose and the Hadamard product, respectively. $\mathrm{F_{exp}}(\cdot)$ is the exponential function applied element-wise to the matrix [26].
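The kernel of Eq. (13) can be evaluated entry by entry as in the sketch below, which also builds the full STM-by-LTM matrix of kernel values. Treating $\mathbf{S}$ as the table of pairwise values of Eq. (13) is an assumption made for this illustration, and the function names and toy feature vectors are hypothetical.

```python
import math


def rbf(v_i, v_j, tau=10.0):
    """Eq. (13): exp(-||v_i - v_j||^2 / (2 * tau^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(v_i, v_j))
    return math.exp(-sq_dist / (2.0 * tau ** 2))


def similarity_matrix(B_stm, B_ltm, tau=10.0):
    """Row i, column j holds the kernel value between STM feature i and LTM feature j."""
    return [[rbf(v_i, v_j, tau) for v_j in B_ltm] for v_i in B_stm]


# Toy features: two STM vectors and three LTM vectors of dimension 2.
B_stm = [[0.0, 0.0], [10.0, 0.0]]
B_ltm = [[0.0, 0.0], [10.0, 0.0], [0.0, 20.0]]
S = similarity_matrix(B_stm, B_ltm)
```

Identical feature vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart, at a rate controlled by $\tau$.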

IV-C2 Sample selection

The primary objective of this stage is to select samples from STM that are significantly different from those in LTM, ensuring diversity in the memory. We calculate the average similarity score between sample $\mathbf{s}_{i}^{stm}$ and each sample in LTM using the RBF kernel:

$$\mathrm{R}(\mathbf{s}_{i}^{stm})=\frac{1}{n_{ltm}}\sum_{k=1}^{n_{ltm}}\mathrm{RBF}(\mathbf{s}_{i}^{stm},\mathbf{s}_{k}^{ltm}). \tag{15}$$

When the computed similarity score is greater than a given threshold $\lambda$, we transfer the sample from STM to LTM:

$$\mathrm{R}(\mathbf{s}_{i}^{stm})>\lambda\Rightarrow\mathcal{M}_{ltm}=\mathcal{M}_{ltm}\cup\mathbf{s}_{i}^{stm}. \tag{16}$$

After the selection is complete, $\mathcal{M}_{stm}$ is emptied to buffer new samples in the next round. Then, both the STM and LTM are used to train the semantic encoder and decoder through continual learning. The workflow of MED for the semantic encoder and decoder is summarized in Algorithm 1.

Algorithm 1 Memory-assisted Encoder and Decoder
Input: $\mathbf{s}$, $\mathcal{M}_{stm}$
Output: $\mathcal{M}_{ltm}$
1:  if $n_{stm}\geq n_{stm}^{Max}$ then
2:     Calculate the kernel distance $\mathrm{RBF}(\mathbf{s}_{i}^{stm},\mathbf{s}_{j}^{ltm})$ between samples in STM and LTM according to Eq. (13).
3:     Calculate the average similarity score $\mathrm{R}(\mathbf{s}_{i}^{stm})$ between sample $\mathbf{s}_{i}^{stm}$ and each sample in LTM according to Eq. (15).
4:  else
5:     Feed current $\mathbf{s}$ into $\mathcal{M}_{stm}$.
6:  end if
7:  if $\mathrm{R}(\mathbf{s}_{i}^{stm})>\lambda$ then
8:     Transfer the $i$-th sample from $\mathcal{M}_{stm}$ to $\mathcal{M}_{ltm}$ according to Eq. (16).
9:  end if
10:  Clear $\mathcal{M}_{stm}$.
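One possible reading of Algorithm 1 is sketched below as a small memory class. Several details are assumptions not stated in the text: a new sample is appended to STM before the fullness check, scores are computed against a snapshot of LTM taken when selection starts, and when LTM is still empty (so Eq. (15) is undefined) every STM sample is transferred. The class name `Memory`, the `encode` callable standing in for the semantic encoder, and the toy parameter values are hypothetical.

```python
import math


def rbf(v_i, v_j, tau):
    """Eq. (13) on two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(v_i, v_j))
    return math.exp(-sq_dist / (2.0 * tau ** 2))


class Memory:
    def __init__(self, encode, stm_max=500, lam=0.05, tau=10.0):
        self.encode = encode      # stand-in for the semantic encoder
        self.stm_max = stm_max    # n_stm^Max
        self.lam = lam            # threshold lambda of Eq. (16)
        self.tau = tau
        self.stm = []
        self.ltm = []

    def score(self, v, ltm_feats):
        """Eq. (15): average kernel value against all LTM features."""
        return sum(rbf(v, u, self.tau) for u in ltm_feats) / len(ltm_feats)

    def observe(self, s):
        """Buffer s in STM; once STM is full, run selection and clear STM."""
        self.stm.append(s)
        if len(self.stm) < self.stm_max:
            return []
        ltm_feats = [self.encode(x) for x in self.ltm]   # snapshot (assumption)
        moved = [
            x for x in self.stm
            # Empty LTM: transfer everything (assumption); otherwise Eq. (16).
            if not ltm_feats or self.score(self.encode(x), ltm_feats) > self.lam
        ]
        self.ltm.extend(moved)
        self.stm.clear()
        return moved


# Toy usage: samples are already feature vectors, so encode is the identity.
mem = Memory(encode=lambda s: s, stm_max=2, lam=0.5, tau=1.0)
first = mem.observe([0.0, 0.0])
second = mem.observe([1.0, 0.0])      # STM full, LTM empty: both are moved
third = mem.observe([0.0, 0.1])
fourth = mem.observe([10.0, 10.0])    # STM full again: selection by Eq. (16)
```

After selection, both memories would be replayed to train the semantic encoder and decoder; that training step is outside the scope of this sketch.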

IV-D Noise attention module

Inspired by the feature attention module in [7], we propose a NAM based on SNR values. The NAM leverages a new noise attention network to determine the importance of each feature vector during the process of encoding and decoding, assigning weights to semantic coding and channel coding. This allows for achieving integrated encoding of both semantic and channel information according to the current SNR.

Specifically, in unfavorable channel conditions, higher weights are allocated to the channel encoder and lower weights are allocated to the semantic encoder for the same source information. This allocation strategy enhances robustness in the channel encoder to mitigate the effects of severe channel noise. Conversely, in favorable channel conditions, lower weights are assigned to the channel encoder and higher weights are assigned to the semantic encoder for the same source information. This increased allocation of weights to the semantic encoder aims to enhance semantic quality.

Figure 6: Noise attention module.

The structure of the NAM is illustrated in Fig. 6, and a detailed description of the workflow is provided below:

IV-D1 SNR projection

Firstly, the SNR projection module extends the SNR value to the same dimension as the feature vectors in the encoder and decoder. The module is a fully connected network comprising three FF layers. The first two FF layers employ the ReLU activation function, while the third FF layer utilizes the Sigmoid activation function. It transforms the input SNR value $r$ into a vector $\mathbf{v}$. The mapping process from $r$ to $\mathbf{v}$ is as follows:

$$\mathbf{v}^{\prime}=\mathrm{ReLU}(\mathbf{W}_{n_{2}}\cdot\mathrm{ReLU}(\mathbf{W}_{n_{1}}\cdot r+b_{n_{1}})+b_{n_{2}}) \tag{17}$$

$$\mathbf{v}=\mathrm{Sigmoid}(\mathbf{W}_{n_{3}}\cdot\mathbf{v}^{\prime}+b_{n_{3}}) \tag{18}$$

where ReLU and Sigmoid denote the activation functions, and $\mathbf{W}_{n_{i}}$ and $b_{n_{i}}$ are the weights and biases of the FF layers, respectively.

IV-D2 Feature scaling

Subsequently, we combine the input features with the projected SNR to obtain a scaling factor $\mathbf{K}$, which records the importance of each intermediate feature vector for the semantic/channel encoder and decoder as follows:

$$\mathbf{K}=\mathrm{Sigmoid}(\mathbf{e}\cdot\mathbf{v}) \tag{19}$$

where the Sigmoid activation function is used to constrain the output to the interval $(0,1)$. $\mathbf{e}$ is the output obtained by passing the intermediate feature vector $\mathbf{G}$ through the fourth FF layer:

$$\mathbf{e}=\mathbf{W}_{n_{4}}\cdot\mathbf{G}+b_{n_{4}} \tag{20}$$

where $\mathbf{W}_{n_{4}}$ and $b_{n_{4}}$ are the weights and biases of the fourth FF layer.

Finally, the intermediate feature vector $\mathbf{G}$ is multiplied by the scaling factor $\mathbf{K}$ to obtain the calibrated vector $\mathbf{A}$ as follows:

$$A_{i}=K_{i}\cdot G_{i} \tag{21}$$

where $A_{i}$, $G_{i}$, and $K_{i}$ represent the $i$-th elements of $\mathbf{A}$, $\mathbf{G}$, and $\mathbf{K}$, respectively.

Algorithm 2 Noise Attention Module
Input: $r$, $\mathbf{G}$
Output: $\mathbf{A}$
1:  Transform the SNR value $r$ for projection and obtain $\mathbf{v}$ according to Eqs. (17)-(18).
2:  Transform the intermediate feature vector $\mathbf{G}$ to the vector $\mathbf{e}$ according to Eq. (20).
3:  Calculate the scaling factor $\mathbf{K}$ according to Eq. (19).
4:  Calculate the calibrated vector $\mathbf{A}$ according to Eq. (21).
5:  Return $\mathbf{A}$

The NAM is embedded into both the semantic/channel encoders and decoders, where it rescales their intermediate feature vectors to enhance the robustness of the CSC system. The workflow of NAM is summarized in Algorithm 2.

V Numerical results

In this section, we evaluate the performance of the proposed VLM-CSC system by comparing it with other SC systems.

V-A Simulation settings

The datasets employed in this study include publicly available Kaggle datasets such as CIFAR, BIRDS, CATSvsDOGS, and EPPs [28]. The configuration of the experiments is detailed as follows:

The pretrained BLIP has 129MB parameters, and the pretrained SD model has 1.99GB parameters. The semantic encoder comprises three transformer encoder layers alternated with NAMs. Each transformer encoder layer has 8 heads, and the feature dimension is 128. The channel encoder is composed of two FF hidden layers alternating with NAMs, where the first hidden layer has 256 neurons and the second hidden layer has 128 neurons. To maintain information consistency, the semantic and channel decoders employ structures that mirror those of the corresponding encoders. In NAM, the four FF layers have 56, 128, 56, and 56 neurons, respectively. Additionally, the maximum sample size for STM is 500, and the threshold for sample selection is 0.05.
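For reference, the hyperparameters listed above can be collected in a single configuration object, as in the following sketch. The class and field names are hypothetical; the values are those stated in the text, with `rbf_tau` taken from the setting $\tau=10$ in Section IV-C.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VLMCSCConfig:
    """Illustrative container for the stated settings (names are assumptions)."""
    semantic_layers: int = 3                                   # transformer encoder layers
    attention_heads: int = 8
    feature_dim: int = 128
    channel_hidden: Tuple[int, int] = (256, 128)               # channel encoder FF layers
    nam_layer_sizes: Tuple[int, int, int, int] = (56, 128, 56, 56)
    stm_max_samples: int = 500                                 # n_stm^Max
    selection_threshold: float = 0.05                          # lambda in Eq. (16)
    rbf_tau: float = 10.0                                      # tau in Eq. (13)

    def head_dim(self) -> int:
        """Per-head dimension, assuming the feature dimension is split evenly."""
        if self.feature_dim % self.attention_heads != 0:
            raise ValueError("feature_dim must be divisible by attention_heads")
        return self.feature_dim // self.attention_heads


config = VLMCSCConfig()
```

Freezing the dataclass keeps the settings immutable across training and evaluation runs, which is a design choice of this sketch rather than something prescribed by the text.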

The training and testing environment is Windows Server 2016 with Python 3.8, PyTorch 1.8.0, and CUDA 11.6. Computational resources are provided by an Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz and an NVIDIA Tesla T4 GPU.

V-B Evaluation metrics

The proposed VLM-CSC system transforms image data to textual semantic data through the BLIP-based knowledge base, encodes it using a semantic encoder, decodes it at the receiver, and finally reconstructs the image through the SD-based knowledge base. To assess the performance of the VLM-CSC system, two corresponding metrics are designed: (1) Image-level, examining the accuracy of semantic reconstruction for image data; (2) Text-level, examining the accuracy of semantic recovery for text data.

V-B1 Image-level: Semantic Service Quality (SSQ)

When assessing an SC system, the key question for semantic-layer transmission is whether the information, after semantic recovery, can meet the expectations of subsequent tasks. The general quality metric for semantic services is given by [29]:

$$SSQ=\frac{ST(\hat{S})}{ST(S)} \tag{22}$$

where $S$ represents the unprocessed source information at the transmitter, $\hat{S}$ represents the information recovered at the semantic level by the receiver, and $ST(\cdot)$ signifies the performance of the source information or recovered information when executing subsequent tasks, which is the classification accuracy in our study.
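A minimal sketch of Eq. (22) follows, assuming $ST(\cdot)$ is the classification accuracy of some downstream classifier, as stated above. The function names and the toy labels in the usage note are hypothetical; the classifier itself is outside the scope of the sketch.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def ssq(recovered_predictions, source_predictions, labels):
    """Eq. (22): task performance on recovered data over that on source data."""
    st_source = accuracy(source_predictions, labels)
    if st_source == 0:
        raise ValueError("ST(S) is zero, so SSQ is undefined")
    return accuracy(recovered_predictions, labels) / st_source
```

For example, if a classifier labels all four source images correctly but only three of the four reconstructed images, `ssq` returns 0.75.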

V-B2 Text-level: Bilingual Evaluation Understudy (BLEU)

The BLEU score outputs a number between 0 and 1, indicating how similar the decoded text is to the transmitted text, with 1 representing the highest similarity. For a transmitted sentence $\bm{\mathrm{s}}$ with length $l_{s}$ and a decoded sentence $\bm{\mathrm{\hat{s}}}$ with length $l_{\hat{s}}$, BLEU can be expressed as [30]:

$$\log{\mathrm{BLEU}}=\min\left(1-\frac{l_{\hat{s}}}{l_{s}},0\right)+\sum_{n=1}^{N}u_{n}\log p_{n} \tag{23}$$

where an $n$-gram is a contiguous sequence of $n$ words from a given sample of text or speech, $u_{n}$ is the weight of the $n$-grams, and $p_{n}$ is the $n$-gram score, defined as:

$$p_{n}=\frac{\sum_{k}\min\left(C_{k}(\bm{\mathrm{\hat{s}}}),C_{k}(\bm{\mathrm{s}})\right)}{\sum_{k}\min\left(C_{k}(\bm{\mathrm{\hat{s}}})\right)} \tag{24}$$

where $C_{k}(\cdot)$ is the frequency count function for the $k$-th element in the $n$-grams.
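Eqs. (23) and (24) can be sketched in a few lines of Python. The sketch assumes whitespace-tokenized sentences and implements the length term exactly as written in Eq. (23); the single-argument $\min$ in the denominator of Eq. (24) is taken to be the count itself. Returning 0 when any $p_{n}$ is zero is an assumption made here to avoid $\log 0$, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(sent, decoded, n):
    """Eq. (24): clipped n-gram matches over n-grams in the decoded sentence."""
    ref = ngram_counts(sent, n)
    hyp = ngram_counts(decoded, n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


def bleu(sent, decoded, weights):
    """Eq. (23): length term plus weighted log n-gram scores, exponentiated."""
    log_score = min(1.0 - len(decoded) / len(sent), 0.0)
    for n, u_n in enumerate(weights, start=1):
        p_n = ngram_precision(sent, decoded, n)
        if p_n == 0.0:
            return 0.0  # assumption: treat log(0) as a zero score
        log_score += u_n * math.log(p_n)
    return math.exp(log_score)
```

Passing `weights=(1.0,)` gives the 1-gram score and `weights=(0.0, 1.0)` gives the 2-gram score, which corresponds to the BLEU (1-grams) and BLEU (2-grams) variants reported in the experiments.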

V-C Performance comparison of VLM-based KBs

To evaluate how well the KBs extract semantic information from images, we employ three VLMs (BLIP, LEMON [31], and RAM [32]) to construct the sender-side KBs in the CSC system. The receiver-side KB is uniformly implemented using the SD model. Subsequently, we assess the CSC system’s performance on the AWGN channel. SSQ is utilized as the evaluation metric on the CATSvsDOGS dataset [28]. The experimental outcomes are illustrated in Fig. 7.

Figure 7: SSQ of CSC systems based on different VLMs.

From Fig. 7, it is evident that the CSC system based on BLIP exhibits the highest SSQ, followed by the one based on LEMON, while the CSC system based on RAM performs the poorest, significantly lower than the CSC systems based on BLIP and LEMON. Furthermore, the CSC system based on BLIP maintains robust performance even at low SNR values. The experimental results indicate that the CSC system constructed based on BLIP accurately extracts image semantics and sustains commendable performance across different SNR levels.

V-D Performance evaluation for MED

To demonstrate the performance of the proposed MED, we compare VLM-CSC with the MED module against VLM-CSC without it. The evaluation is performed across different image datasets: CIFAR, BIRDS, and CATSvsDOGS [28]. BLEU scores for semantic similarity serve as the evaluation metric. Additionally, when assessing the performance of VLM-CSC on image datasets with different distributions, the channel is fixed to Rayleigh. The continual learning map, originally proposed by Google, is employed to visualize the performance changes of existing tasks when a new task is introduced. The experimental results are illustrated by the continual learning map in Fig. 8.

Figure 8: Continual learning maps of BLEU scores across different image datasets: (a) BLEU (1-grams) of VLM-CSC without MED. (b) BLEU (2-grams) of VLM-CSC without MED. (c) BLEU (1-grams) of VLM-CSC. (d) BLEU (2-grams) of VLM-CSC.

Fig. 8 (a) and (b) illustrate a significant performance drop of the VLM-CSC system without the MED module on the earlier CIFAR dataset after it learns the subsequent BIRDS and CATSvsDOGS datasets. In contrast, Fig. 8 (c) and (d) reveal that the VLM-CSC system with the MED module exhibits only a marginal decline in performance on CIFAR after learning the same subsequent datasets.

The experimental results from Fig. 8 underscore that the proposed MED module enables the CSC system to overcome catastrophic forgetting during the continual learning process. This facilitates knowledge learning from multiple image datasets, enhancing the generalization of the CSC system in dynamic environments.
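One way to quantify what a continual learning map shows is to store it as a matrix and measure the score drop on each earlier dataset. The sketch below assumes a layout in which row $i$ is the training stage and column $j$ is the evaluated dataset; this layout, the function name `forgetting`, and the numbers in `example_map` are hypothetical placeholders, not values read from Fig. 8.

```python
def forgetting(cl_map, task):
    """Score drop on `task` between the stage that learned it and the last stage."""
    learned = cl_map[task][task]   # score right after learning this dataset
    final = cl_map[-1][task]       # score after all datasets have been learned
    return learned - final


# Placeholder scores only (NOT results from the paper): row = training stage,
# column = evaluated dataset, None = dataset not seen yet at that stage.
example_map = [
    [0.75, None, None],
    [0.50, 0.75, None],
    [0.25, 0.50, 0.75],
]
```

A system that overcomes catastrophic forgetting keeps `forgetting(cl_map, task)` close to zero for every earlier dataset, while a large positive value indicates that learning new datasets has overwritten earlier knowledge.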

V-E Performance evaluation for NAM

To demonstrate the performance of the proposed NAM, we conduct an experimental comparison between VLM-CSC with and without NAM. Semantic similarity, measured by BLEU score, serves as the evaluation metric. Specifically, the proposed VLM-CSC system is trained under a uniform distribution of $SNR_{train}$ ranging from 0 dB to 10 dB, while the VLM-CSC system without NAM is trained at specific $SNR_{train}$ values of 1 dB, 4 dB, 7 dB, and 10 dB. Subsequently, the performance of the VLM-CSC system is evaluated at specific $SNR_{test}$ values ranging from 0 dB to 10 dB. The experimental results are depicted in Fig. 9.

Figure 9: The performance of NAM in the VLM-CSC system. NNA represents the VLM-CSC system without NAM.

The results in Fig. 9 show that the proposed VLM-CSC system outperforms every VLM-CSC system without NAM that is trained at a single $SNR_{train}$ value. This observation highlights the capability of the VLM-CSC system, equipped with NAM, to address the performance degradation caused by the mismatch between the SNR in the training and deployment stages of conventional ISC systems. This improvement contributes to the robustness of the VLM-CSC system across different SNR values.
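The SNR protocol used in this comparison can be sketched as follows. The sketch assumes a real-valued AWGN channel for simplicity (the channel used in this particular experiment is not restated here), and the function names are hypothetical. It draws a training SNR uniformly from 0 dB to 10 dB and adds Gaussian noise whose power is set by the standard relation $P_{noise}=P_{signal}/10^{SNR/10}$.

```python
import math
import random


def noise_std(signal_power, snr_db):
    """Noise standard deviation that yields the requested SNR in dB."""
    return math.sqrt(signal_power / (10.0 ** (snr_db / 10.0)))


def awgn(signal, snr_db, rng):
    """Add white Gaussian noise to a real-valued signal at the given SNR."""
    power = sum(x * x for x in signal) / len(signal)
    sigma = noise_std(power, snr_db)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def sample_training_snr(rng, low_db=0.0, high_db=10.0):
    """Draw one training SNR uniformly from [low_db, high_db] dB."""
    return rng.uniform(low_db, high_db)
```

In a training loop, a fresh value from `sample_training_snr` would be drawn per batch and passed both to the channel and to the NAM, whereas the baselines without NAM would call `awgn` with a single fixed SNR throughout training.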

V-F Semantic communication performance evaluation

To evaluate the performance of the VLM-CSC system in image classification tasks, we compare it with JSCC based on CNN [33] and WITT based on ViT [34]. The metric used for performance evaluation is classification accuracy. Additionally, we assess the bandwidth-saving capabilities of VLM-CSC by considering the compression ratio between transmitted data and original images as the evaluation metric. The experimental results are presented in Fig. 10.

Figure 10: Performance comparison of VLM-CSC with other ISC systems. (a) SSQ. (b) Compression ratio and trainable parameters. (c) Semantic alignment.

Fig. 10 (a) shows that, at low SNR levels, VLM-CSC achieves the best performance in the classification task on the CATSvsDOGS dataset, while WITT performs slightly worse. At high SNR levels, WITT and JSCC exhibit superior SSQ compared to VLM-CSC due to their direct transmission of images. Fig. 10 (b) depicts the compression ratio and the number of trainable parameters: VLM-CSC achieves the lowest values of both, followed by JSCC, while WITT has the highest compression ratio and the most trainable parameters. Fig. 10 (c) illustrates that the reconstructed image closely aligns with the original image and the image description, validating the VLM-CSC system’s ability to ensure semantic consistency across modalities.

The experimental results in Fig. 10 demonstrate that the proposed VLM-CSC outperforms the other ISC systems in image classification tasks at low SNR levels. In addition, the compression ratio of transmitted data is significantly lower for VLM-CSC than for the other ISC systems, indicating that VLM-CSC can effectively conserve transmission bandwidth while preserving high-quality semantic transmission. Moreover, because the VLMs are not trained, the VLM-CSC system has the fewest trainable parameters, resulting in the lowest training complexity.
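To make the notion of compression ratio concrete, the following sketch computes the size of the transmitted data relative to the original image. It rests on assumptions that are not taken from the experiments: the transmitted data is approximated by the UTF-8 bytes of the text description, the image is treated as uncompressed 8-bit pixels, and the function name is hypothetical. The actual system transmits encoded semantic features, so the numbers it reports may be measured differently.

```python
def compression_ratio(description, height, width, channels=3):
    """Transmitted bytes divided by the bytes of an uncompressed 8-bit image."""
    transmitted = len(description.encode("utf-8"))  # assumed payload size
    original = height * width * channels            # one byte per pixel value
    if original <= 0:
        raise ValueError("image dimensions must be positive")
    return transmitted / original
```

Under these assumptions, a lower ratio means that less data is sent per image, which is the sense in which a lower compression ratio conserves bandwidth.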

VI Conclusion

This paper introduces a novel VLM-CSC system capable of converting images into text descriptions for transmission over wireless channels, and reconstructing the image at the receiver. The system includes three main contributions: CKB for image-to-text and text-to-image conversion, MED for continual learning in dynamic environments, and NAM for joint semantic and channel encoding based on SNR. Corresponding performance metrics are designed to evaluate the VLM-CSC system from both image and text perspectives. Experimental validations are conducted under various image datasets. Results demonstrate the effectiveness and robustness of the VLM-CSC system in preserving semantic similarity between the image and text, as well as its adaptability to dynamic environments.

References

  • [1] R. Li, Z. Zhao, X. Zhou, G. Ding, Y. Chen, Z. Wang, and H. Zhang, “Intelligent 5g: When cellular networks meet artificial intelligence,” IEEE Wireless communications, vol. 24, no. 5, pp. 175–183, 2017.
  • [2] M. Xu, W. C. Ng, W. Y. B. Lim, J. Kang, Z. Xiong, D. Niyato, Q. Yang, X. S. Shen, and C. Miao, “A full dive into realizing the edge-enabled metaverse: Visions, enabling technologies, and challenges,” IEEE Communications Surveys & Tutorials, 2022.
  • [3] W. Yang, H. Du, Z. Q. Liew, W. Y. B. Lim, Z. Xiong, D. Niyato, X. Chi, X. S. Shen, and C. Miao, “Semantic communications for future internet: Fundamentals, applications, and challenges,” IEEE Communications Surveys & Tutorials, 2022.
  • [4] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009.
  • [5] H. Zhang, S. Shao, M. Tao, X. Bi, and K. B. Letaief, “Deep learning-enabled semantic communication systems with task-unaware transmitter and dynamic data,” IEEE Journal on Selected Areas in Communications, vol. 41, no. 1, pp. 170–185, 2022.
  • [6] E. Bourtsoulatze, D. B. Kurka, and D. Gündüz, “Deep joint source-channel coding for wireless image transmission,” IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 3, pp. 567–579, 2019.
  • [7] J. Xu, B. Ai, W. Chen, A. Yang, P. Sun, and M. Rodrigues, “Wireless image transmission using deep source channel coding with attention modules,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 4, pp. 2315–2328, 2021.
  • [8] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
  • [9] E. Bourtsoulatze, D. B. Kurka, and D. Gündüz, “Deep joint source-channel coding for wireless image transmission,” IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 3, pp. 567–579, 2019.
  • [10] J. Dai, S. Wang, K. Tan, Z. Si, X. Qin, K. Niu, and P. Zhang, “Nonlinear transform source-channel coding for semantic communications,” IEEE Journal on Selected Areas in Communications, vol. 40, no. 8, pp. 2300–2316, 2022.
  • [11] C. Dong, H. Liang, X. Xu, S. Han, B. Wang, and P. Zhang, “Semantic communication system based on semantic slice models propagation,” IEEE Journal on Selected Areas in Communications, vol. 41, no. 1, pp. 202–213, 2022.
  • [12] D. Huang, F. Gao, X. Tao, Q. Du, and J. Lu, “Toward semantic communications: Deep learning-based image semantic coding,” IEEE Journal on Selected Areas in Communications, vol. 41, no. 1, pp. 55–71, 2022.
  • [13] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. Yu, and L. Sun, “A comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to chatgpt,” arXiv preprint arXiv:2303.04226, 2023.
  • [14] A. Fürst, E. Rumetshofer, J. Lehner, V. T. Tran, F. Tang, H. Ramsauer, D. Kreil, M. Kopp, G. Klambauer, A. Bitto et al., “Cloob: Modern hopfield networks with infoloob outperform clip,” Advances in neural information processing systems, vol. 35, pp. 20 450–20 468, 2022.
  • [15] Z. Wang, J. Yu, A. W. Yu, Z. Dai, Y. Tsvetkov, and Y. Cao, “Simvlm: Simple visual language model pretraining with weak supervision,” arXiv preprint arXiv:2108.10904, 2021.
  • [16] H. Tan and M. Bansal, “Lxmert: Learning cross-modality encoder representations from transformers,” arXiv preprint arXiv:1908.07490, 2019.
  • [17] Z. Xu, L. Wang, W. Liang, Q. Xia, W. Xu, P. Zhou, and O. F. Rana, “Age-aware data selection and aggregator placement for timely federated continual learning in mobile edge computing,” IEEE Transactions on Computers, 2023.
  • [18] Y. Yu, P. Chen, X.-W. Zhu, J. Zhai, and C. Yu, “Continual learning digital predistortion of rf power amplifier for 6g ai-empowered wireless communication,” IEEE Transactions on Microwave Theory and Techniques, vol. 70, no. 11, pp. 4916–4927, 2022.
  • [19] Z. Zhang, B. Guo, W. Sun, Y. Liu, and Z. Yu, “Cross-fcl: Toward a cross-edge federated continual learning framework in mobile edge computing systems,” IEEE Transactions on Mobile Computing, 2022.
  • [20] H. Zhou, W. Xia, H. Zhao, J. Zhang, Y. Ni, and H. Zhu, “Continual learning-based fast beamforming adaptation in downlink miso systems,” IEEE Wireless Communications Letters, vol. 12, no. 1, pp. 36–39, 2022.
  • [21] F. Jiang, Y. Peng, L. Dong, K. Wang, K. Yang, C. Pan, and X. You, “Large ai model-based semantic communications,” arXiv preprint arXiv:2307.03492, 2023.
  • [22] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning.   PMLR, 2022, pp. 12 888–12 900.
  • [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [24] X. Yang, H. Zhang, G. Qi, and J. Cai, “Causal attention for vision-language tasks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 9847–9857.
  • [25] C.-F. R. Chen, Q. Fan, and R. Panda, “Crossvit: Cross-attention multi-scale vision transformer for image classification,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 357–366.
  • [26] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
  • [27] F. Ye and A. G. Bors, “Continual variational autoencoder learning via online cooperative memorization,” in European Conference on Computer Vision.   Springer, 2022, pp. 531–549.
  • [28] P. Koehn, “Europarl: A parallel corpus for statistical machine translation,” in Proceedings of machine translation summit x: papers, 2005, pp. 79–86.
  • [29] C. Dong, H. Liang, X. Xu, S. Han, B. Wang, and P. Zhang, “Semantic communication system based on semantic slice models propagation,” IEEE Journal on Selected Areas in Communications, vol. 41, no. 1, pp. 202–213, 2022.
  • [30] H. Xie, Z. Qin, G. Y. Li, and B.-H. Juang, “Deep learning enabled semantic communication systems,” IEEE Transactions on Signal Processing, vol. 69, pp. 2663–2675, 2021.
  • [31] X. Hu, Z. Gan, J. Wang, Z. Yang, Z. Liu, Y. Lu, and L. Wang, “Scaling up vision-language pre-training for image captioning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 17 980–17 989.
  • [32] Y. Zhang, X. Huang, J. Ma, Z. Li, Z. Luo, Y. Xie, Y. Qin, T. Luo, Y. Li, S. Liu et al., “Recognize anything: A strong image tagging model,” arXiv preprint arXiv:2306.03514, 2023.
  • [33] D. B. Kurka and D. Gündüz, “Deepjscc-f: Deep joint source-channel coding of images with feedback,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 178–193, 2020.
  • [34] K. Yang, S. Wang, J. Dai, K. Tan, K. Niu, and P. Zhang, “Witt: A wireless image transmission transformer for semantic communications,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023, pp. 1–5.